CouchDB "Joins"

25 October 2007
00:34

CouchDB - relax

I've been playing more and more with CouchDB lately. After putting together a Python library, I worked on a brand new included HTML/AJAX interface. Now I'm starting to dive into the Erlang code, which is my first serious encounter with Erlang. In particular, I started a branch that aims to replace the HTTP server underpinnings with Bob Ippolito´s MochiWeb library.

Despite all that activity (and past experience with the conceptually similar Lotus Notes), the correct approach to designing applications “the CouchDB way” isn't always obvious to me at this point. Just today, there was a discussion on IRC how you'd go about modeling a simple blogging system with “post” and “comment” entities, where any blog post might have N comments. If you'd be using an SQL database, you'd obviously have two tables with foreign keys and you'd be using joins. (At least until you needed to add some denormalization.)

But what would the “obvious” approach in CouchDB look like?

Note: I've updated this post to clarify the role of view collation, and to stress that all three approaches are equally valid for different kinds of applications.

Approach #1: Comments Inlined

A simple approach would be to have one document per blog post, and store the comments inside that document:

{
  "_id": "myslug",
  "_rev": "123456",
  "author": "john",
  "title": "My blog post",
  "content": "Bla bla bla …",
  "comments": [
    {"author": "jack", "content": "…"},
    {"author": "jane", "content": "…"}
  ]
}

(Note: Of course the model of an actual blogging system would be more extensive, you'd have tags, timestamps, etc etc. This is just to demonstrate the basics.)

The obvious advantage of this approach is that the data that belongs together is stored in one place. Delete the post, and you automatically delete the corresponding comments, and so on.

You may be thinking that putting the comments inside the blog post document would not allow us to query for the comments themselves, but you'd be wrong. You could trivially write a CouchDB view that would return all comments across all blog posts, keyed by author:

function(doc) {
  for (var i in doc.comments) {
    map(doc.comments[i].author, doc.comments[i].content);
  }
}

Now you could list all comments by a particular user by invoking the view and passing it a ?key="username" query string parameter.

However, this approach has a drawback that can be quite significant for many applications: To add a comment to a post, you need to:

  1. fetch the blog post document
  2. add the new comment to the JSON structure
  3. send the updated document to the server

Now if you have multiple client processes adding comments at roughly the same time, some of them will get a 409 Conflict error on step 3 (that's optimistic concurrency in action). For some applications this makes sense, but in many other apps, you'd want to append new related data regardless of whether other data has been added in the meantime.

The only way to allow non-conflicting addition of related data is by putting that related data into separate documents.

Approach #2: Comments Separate

Using this approach you'd have one document per blog post, and one document per comment. The comment documents would have a “backlink” to the post they belong to.

The blog post document would look similar to the above, minus the comments property. Also, we'd now have a type property on all our documents so that we can tell the difference between posts and comments:

{
  "_id": "myslug",
  "_rev": "123456",
  "type": "post",
  "author": "john",
  "title": "My blog post",
  "content": "Bla bla bla …"
}

The comments themselves are stored in separate documents, which also have a type property (this time with the value “comment”), and in addition feature a post property containing the ID of the post document they belong to:

{
  "_id": "ABCDEF",
  "_rev": "123456",
  "type": "comment",
  "post": "myslug",
  "author": "jack",
  "content": "…"}
}, {
  "_id": "DEFABC",
  "_rev": "123456",
  "type": "comment",
  "post": "myslug",
  "author": "jane",
  "content": "…"
}

To list all comments per blog post, you'd add a simple view, keyed by blog post ID:

function(doc) {
  if (doc.type == "comment") {
    map(doc.post, {author: doc.author, content: doc.content});
  }
}

And you'd invoke that view passing it a ?key="post_id" query string parameter.

Viewing all comments by author is just as easy as before:

function(doc) {
  if (doc.type == "comment") {
    map(doc.author, {post: doc.post, content: doc.content});
  }
}

So this is better in some ways, but it also has a disadvantage: Imagine you want to display a blog post with all the associated comments on the same web page. With our first approach, we needed just a single request to the CouchDB server, namely a GET request to the document. With this second approach, we need two requests: a GET request to the post document, and a GET request to the view that returns all comments for the post.

That is okay, but not quite satisfactory. Just imagine you wanted to added threaded comments: you'd now need an additional fetch per comment. What we'd probably want then would be a way to join the blog post and the various comments together to be able to retrieve them with a single HTTP request.

This was when Damien Katz, the author of CouchDB, chimed in to the discussion on IRC to show us the way.

Optimization: Using the Power of View Collation

Obvious to Damien, but not at all obvious to the rest of us: it's fairly simple to make a view that includes both the content of the blog post document, and the content of all the comments associated with that post. The way you do that is by using complex keys. Until now we've been using simple string values for the view keys, but in fact they can be arbitrary JSON values, so let's make some use of that:

function(doc) {
  if (doc.type == "post") {
    map([doc._id, 0], doc);
  } else if (doc.type == "comment") {
    map([doc.post, 1], doc);
  }
}

Okay, this may be confusing at first. Let's take a step back and look at what views in CouchDB are really about:

CouchDB views are basically highly efficient on-disk dictionaries that map keys to values, where the key is automatically indexed and can be used to filter and/or sort the results you get back from your views. When you “invoke” a view, you can say that you're only interested in a subset of the view rows by specifying a ?key=foo query string parameter. Or you can specify ?startkey=foo and/or ?endkey=bar query string parameters to fetch rows over a range of keys.

It's also important to note that keys are always used for collating (i.e. sorting) the rows. CouchDB has well defined (but as of yet undocumented) rules for comparing arbitrary JSON objects for collation. For example, the JSON value ["foo", 2] is sorted after (considered “greater than”) the values ["foo"] or ["foo", 1, "bar"], but before e.g. ["foo", 2, "bar"]. This feature enables a whole class of tricks that are rather non-obvious…

With that in mind, let's return to the view function above. First note that, unlike the previous view functions we've used here, this view handles both "post" and "comment" documents, and both of them end up as rows in the same view. Also, the key in this view is not just a simple string, but an array. The first element in that array is always the ID of the post, regardless of whether we're processing an actual post document, or a comment associated with a post. The second element is 0 for post documents, and 1 for comment documents.

Let's assume we have two blog posts in our database. Without limiting the view results via key, startkey, or endkey, we'd get back something like the following:

{
  "total_rows": 5, "offset": 0, "rows": [{
      "id": "myslug",
      "key": ["myslug", 0],
      "value": {...}
    }, {
      "id": "ABCDEF",
      "key": ["myslug", 1],
      "value": {...}
    }, {
      "id": "DEFABC",
      "key": ["myslug", 1],
      "value": {...}
    }, {
      "id": "other_slug",
      "key": ["other_slug", 0],
      "value": {...}
    }, {
      "id": "CDEFAB",
      "key": ["other_slug", 1],
      "value": {...}
    },
  ]
}

(The “...” placeholder here would contain the complete JSON encoding of the corresponding document)

Now, to get a specific blog post and all associated comments, we'd invoke that view with the query string:

?startkey=["myslug"]&endkey;=["myslug", 2]

We'd get back the first three rows, those that belong to the “ myslug” post, but not the others. Et voila, we now have the data we need to display a post with all associated comments, retrieved via a single GET request.

You may be asking what the 0 and 1 parts of the keys are for. They're simply to ensure that the post document is always sorted before the the associated comment documents. So when you get back the results from this view for a specific post, you'll know that the first row contains the data for the blog post itself, and the remaining rows contain the comment data.

One remaining problem with this model is that comments are not ordered, but that's simply because we don't have date/time information associated with them. If we had, we'd add the timestamp as third element of the key array, probably as ISO date/time strings. Now we would continue using the query string ?startkey=["myslug"]&endkey;=["myslug", 2] to fetch the blog post and all associated comments, only now they'd be in chronological order.

Epilogue

Please note that each of the approaches described above is completely valid. Which one you choose depends on the nature of the application at hand, and maybe even on your infrastructure or personal preferences.

I personally think that the last approach I described is ideal because it achieves conflict-free adding of comments, while still allowing to fetch all the data related to a post in a single HTTP request. If either of those goals is not important for your application, you can just as well use one of the other approaches.