I want to play around with Rogue by the Foursquare folks, but first I needed a decent-sized collection of items in MongoDB. I recalled that BoingBoing had just released all their posts in a single file, so I downloaded that and put together a little Scala to convert it from XML to JSON. The built-in XML support in Scala and the excellent lift-json DSL turned the whole thing into no work at all:
    // Imports assumed for lift-json 2.x; adjust to your version.
    import java.text.SimpleDateFormat
    import net.liftweb.json._          // JsonAST, compact
    import net.liftweb.json.JsonDSL._  // the -> and ~ DSL

    val dateFmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
    val xml = scala.xml.XML.loadFile("/home/ry4an/bbpostsdump.xml")

    for (postXml <- (xml \ "row")) {
      // timestamps become epoch milliseconds for MongoDB's $date
      val created_on = dateFmt.parse((postXml \ "created_on").text).getTime
      // the dump writes missing values as the literal string "NULL"
      val body_more = (postXml \ "body_more").text.trim match {
        case "NULL" => ""
        case something => " " + something
      }
      val postJson =
        ("_id" -> (postXml \ "permalink").text) ~
        ("basename" -> (postXml \ "basename").text) ~
        ("author" -> (postXml \ "author").text) ~
        ("title" -> (postXml \ "title").text) ~
        ("comment_count" -> (postXml \ "comment_count").text) ~
        ("categories" -> (postXml \ "categories").text.split(',')
          .toList.filterNot(_ == "NULL")) ~
        ("body" -> ((postXml \ "body").text + body_more)) ~
        ("created_on" -> ("$date" -> created_on))
      // one compact JSON document per line, ready for mongoimport
      println(compact(JsonAST.render(postJson)))
    }
Nothing earth-shattering, but pleasantly readable and certainly quick. In the real world you'd want to use the scala.xml.pull package to avoid pulling all the XML into memory, but all eleven years of posts fit in a gig of RAM. The output lines loaded directly with mongoimport.
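For a dump that doesn't fit in memory, a streaming pass might look roughly like the sketch below. It is not what the script above does, and it swaps scala.xml.pull for the JDK's StAX API (javax.xml.stream) so that it depends on nothing but the JDK. The `StreamRows` object and `foreachRow` method are made-up names for illustration, and the sketch assumes each `<row>` holds only flat child elements containing text.

```scala
import java.io.Reader
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}
import scala.collection.mutable

// Hypothetical helper, not part of the original script.
object StreamRows {
  /** Calls `handle` once per <row>, passing its child elements as name -> text. */
  def foreachRow(in: Reader)(handle: Map[String, String] => Unit): Unit = {
    val xml = XMLInputFactory.newInstance().createXMLStreamReader(in)
    // Some(...) only while the parser is inside a <row>
    var fields: Option[mutable.Map[String, String]] = None
    // name of the field element currently being read, if any
    var current: Option[String] = None
    val text = new StringBuilder
    try {
      while (xml.hasNext) {
        xml.next() match {
          case XMLStreamConstants.START_ELEMENT =>
            if (xml.getLocalName == "row") {
              fields = Some(mutable.Map.empty[String, String])
            } else if (fields.isDefined) {
              current = Some(xml.getLocalName)
              text.clear()
            }
          case XMLStreamConstants.CHARACTERS | XMLStreamConstants.CDATA =>
            // text can arrive in several chunks, so accumulate it
            if (current.isDefined) text.append(xml.getText)
          case XMLStreamConstants.END_ELEMENT =>
            if (xml.getLocalName == "row") {
              fields.foreach(f => handle(f.toMap))
              fields = None
            } else if (current.contains(xml.getLocalName)) {
              for (f <- fields; name <- current) f.update(name, text.toString)
              current = None
            }
          case _ => // ignore comments, processing instructions, etc.
        }
      }
    } finally xml.close()
  }
}
```

Only one row's fields are held at a time, so memory use stays flat regardless of the size of the dump. Each map could then go through the same lift-json expression as above and be printed, one line per post.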
All in all it took an evening, including spinning up a micro instance at Amazon EC2 to host the DB and provide Tomcat for the next steps. The code, along with the sbt project, can be found in a repository at BitBucket. Next: Rogue.
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Generic License.
© Ry4an Brase