CouchDB, XML, and E4X

4 March 2008

Not that long ago, CouchDB moved from XML document representations and a custom query language (dubbed “Fabric”) to JSON for documents and JavaScript for views. Apparently, that move attracted a lot of new people to the project, myself included.

Not long after the switch, some people started thinking about defining JSON encodings of common XML formats. Others asked about using XML in CouchDB. So should we simply add back the XML backend and let people choose what they prefer? Hell, no.

Turns out there’s a much better way to support XML data in CouchDB: ECMA-357, also known as “ECMAScript for XML”, also known as E4X. And Mozilla’s SpiderMonkey JavaScript engine, which CouchDB uses as the default view server, conveniently implements E4X. So it’s just a matter of enabling that support. Which means that, all of a sudden, and without any changes to the core, CouchDB is pretty well positioned for storing and querying XML data in addition to JSON.

For example:

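// by_lang
function(doc) {
  // Parse the stored markup into an E4X XML object
  var html = new XML(doc.content);
  // Key each row by the root element's lang attribute
  map(html.@lang, {title: html.head.title.text(), …});
}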

To be fair, this is already possible if you use other view servers (such as the Ruby or Python ones), where you have access to the XML support provided by the respective standard libraries. Given CouchDB’s incremental view update model, you usually don’t care so much about the performance of view functions as you care about the data they produce. So if your view function can somehow parse the XML and put some data into the view index, that's usually all you need. Actually querying the view is going to be really fast.
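
To make that concrete, here is a rough sketch of the same by_lang idea using nothing but Python's standard library. Treat it as an illustration under assumptions, not as the API of any particular view server: the "generator that yields key/value pairs" shape, the names by_lang, docs, and index, and the sample documents are all made up for this example, and it assumes doc["content"] is well-formed markup without XML namespaces.

```python
import xml.etree.ElementTree as ET


def by_lang(doc):
    """Hypothetical map function: yield (lang, {"title": ...}) for one document.

    Assumes doc["content"] is a well-formed XML string with no namespaces.
    """
    root = ET.fromstring(doc["content"])
    lang = root.get("lang")               # lang attribute on the root element
    title = root.findtext("head/title")   # text of <head><title>, or None
    yield lang, {"title": title}


# Made-up sample documents, standing in for what the database would store.
docs = [
    {"_id": "a", "content": '<html lang="en"><head><title>Hello</title></head></html>'},
    {"_id": "b", "content": '<html lang="de"><head><title>Hallo</title></head></html>'},
]

# Toy stand-in for a view index: run the map function over every document
# once, then keep the resulting rows sorted by key.
rows = [(key, value, doc["_id"]) for doc in docs for key, value in by_lang(doc)]
index = sorted(rows, key=lambda row: row[0])
```

The XML parsing happens only while the rows are being built. Looking something up afterwards touches the sorted rows and never the XML, which is why the cost of the parse matters so little.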

But E4X is an exceptionally convenient API for XML. I think using E4X is going to be a pretty good approach for those who want to use CouchDB to store and query XML content.