Loading BoingBoing into MongoDB with Scala
==========================================
I want to play around with Rogue_ by the Foursquare folks, but first I needed a
decent sized collections of items in a MongoDB_. I recalled that BoingBoing_
had just released `all their posts in a single file`_, so I downloaded that and
put together a little Scala to convert from XML to JSON_. The built-in XML
support in Scala and the excellent lift-json_ DSL turned the whole thing into no
work at all:
.. _Rogue: http://engineering.foursquare.com/2011/01/21/rogue-a-type-safe-scala-dsl-for-querying-mongodb/
.. _MongoDB: http://www.mongodb.org/
.. _BoingBoing: http://boingboing.net
.. _all their posts in a single file: http://www.boingboing.net/2011/01/25/eleven-years-worth-o.html
.. _JSON: http://www.json.org/
.. _lift-json: https://github.com/lift/lift/tree/master/framework/lift-base/lift-json/
.. read_more
.. code:: scala
val dateFmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
val xml = scala.xml.XML.loadFile("/home/ry4an/bbpostsdump.xml")
for (postXml <- (xml \ "row")) {
val created_on = dateFmt.parse((postXml \ "created_on").text).getTime
val body_more = (postXml \ "body_more").text.trim match {
case "NULL" => ""
case something => " " + something
}
val postJson =
("_id" -> (postXml \ "permalink").text) ~
("basename" -> (postXml \ "basename").text) ~
("author" -> (postXml \ "author").text) ~
("title" -> (postXml \ "title").text) ~
("comment_count" -> (postXml \ "comment_count").text) ~
("categories" -> (postXml \ "categories").text.split(',')
.toList.filterNot(_ == "NULL")) ~
("body" -> ((postXml \ "body").text + body_more)) ~
("created_on" -> ("$date" -> created_on))
println(compact(JsonAST.render(postJson)))
}
Certainly nothing earth shattering, but pleasantly readable and certainly quick.
In the real world you'd want to use the scala.xml.pull_ package to avoid
pulling all the XML into memory, but all eleven years fit in a gig of ram. The
output lines loaded up directly with ``mongoimport``.
All in all it took an evening, including spinning up a micro instance at Amazon
EC2 to host the DB and provide tomcat for the next steps. The code, along with
the sbt_ project can be found in a `repository at BitBucket`_. Next: Rogue.
.. _scala.xml.pull: http://www.scala-lang.org/api/current/scala/xml/pull/package.html
.. _sbt: http://code.google.com/p/simple-build-tool/
.. _repository at BitBucket: https://bitbucket.org/Ry4an/boingboing-json-mongo
.. raw:: html
.. tags: scala,software,mongodb,ideas-built