Last week, I had the pleasure of attending an excellent CouchDB presentation by Sam Bisbee.
CouchDB just hit the release 1.0 milestone, and I believe it’s a really compelling option for persistence – especially if your data relationships are relatively light (i.e., not a ton of many-to-many relationships, and not a ton of queries that join many tables).
Going into it, I knew the following about CouchDB:
- It’s a document database (as opposed to a relational, SQL-based database)
- The documents it stores are JSON strings
- Individual record manipulation is handled via HTTP verbs
- Create = POST, Read = GET, Update = PUT, Delete = DELETE
- Multi-record queries are handled via Views, which use the MapReduce approach popularized by Google (a very distributable way to compute results).
- JavaScript is the most popular and likely way to write MapReduce functions
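To make the verb mapping above concrete, here is a minimal sketch that builds (but does not send) the HTTP request for each operation. The helper name `couchRequest`, the database name `blog`, and the document ids are all made up for illustration. The sketch assumes the usual CouchDB conventions: a POST to the database URL lets the server assign the id, an update carries the current `_rev` in the body, and a DELETE passes the current revision as a `rev` query parameter.

```javascript
// Hypothetical helper (not part of CouchDB): describes the HTTP request
// for each CRUD action without touching the network.
function couchRequest(action, db, doc) {
  const base = "/" + encodeURIComponent(db);
  const docPath = () => base + "/" + encodeURIComponent(doc._id);
  switch (action) {
    case "create":
      // POST to the database URL; the server picks the document id.
      return { method: "POST", path: base, body: JSON.stringify(doc) };
    case "read":
      return { method: "GET", path: docPath() };
    case "update":
      // The body should include the current _rev, otherwise the write is rejected.
      return { method: "PUT", path: docPath(), body: JSON.stringify(doc) };
    case "delete":
      // The current revision goes in the query string.
      return { method: "DELETE", path: docPath() + "?rev=" + encodeURIComponent(doc._rev) };
    default:
      throw new Error("unknown action: " + action);
  }
}

// Example usage with made-up data.
console.log(couchRequest("create", "blog", { type: "blogpost", title: "Hello" }));
console.log(couchRequest("delete", "blog", { _id: "post-1", _rev: "1-abc" }));
```

Any HTTP client can send the resulting requests; write requests with a body would also need a `Content-Type: application/json` header.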
Here’s what I learned in Sam’s presentation:
- View definition is handled via “Design Documents” – which are just like any other CouchDB document, except that they sit at a reserved URL. Futon (the built-in CouchDB web interface) has a nice way of presenting Design Documents that lets you edit / copy / paste in your JavaScript code.
- Views work in the following way:
- The Map function is run over every document in the database, and emits key/value rows as output
- The Reduce function (optional) aggregates multiple key/value output rows
- The resulting rows are filtered by key via CouchDB querystring options (exact-match filters, range filters, etc.)
- (for a more detailed explanation, check out this excellent presentation by Kore Nordmann)
- Put another way: MapReduce doesn’t handle “query arguments” – if you’re looking for blog posts by “gbarnett”, neither Map nor Reduce receives “gbarnett” as an argument. Those two functions transform and aggregate the source data to present the author field as the “key” to the resulting rows; CouchDB querystring options will filter those rows based on a runtime parameter of key=gbarnett.
- The views in each Design Document are backed by a B+Tree index – a high-performance lookup index that is transparently updated the first time a view is read after the underlying documents change
- Transactions are handled via a revision id check – if two clients post updates to the same document, the first attempt will be accepted (since its revision id matches the doc in the DB) and the internal revision id will be updated. The second attempt will be rejected, because its revision id is out of date – the client must handle this scenario.
- Because of this, it makes sense to model your data in a way that minimizes the number of potential transaction conflicts. For example – modeling comments as an array within your blogpost document is a bad idea, since many comments are likely to be posted at the same time – yielding many conflicts. Better to have two types of documents (blogpost and comment), and allow each comment to live independently (referencing the blogpost to which it refers by string doc id) without creating transaction conflict problems.
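The view pipeline described above (Map, optional Reduce, then filtering by key) can be sketched in plain JavaScript. This is a toy simulation, not CouchDB's view engine: `buildView` and `queryView` are hypothetical stand-ins, the sample documents are made up, and `emit` is defined locally to mimic the function CouchDB provides to map code. For simplicity the sketch reduces only the selected rows, which gives the same number a key-filtered, reduced view query would return.

```javascript
// Sample documents (made up for illustration).
const docs = [
  { _id: "post-1", type: "blogpost", author: "gbarnett", title: "Hello CouchDB" },
  { _id: "post-2", type: "blogpost", author: "sbisbee", title: "Views in depth" },
  { _id: "post-3", type: "blogpost", author: "gbarnett", title: "MapReduce notes" },
  { _id: "c-1", type: "comment", post: "post-1", text: "Nice post" },
];

// Stand-in for the emit() function that CouchDB supplies to map code.
let emit = null;

// Map: called once per document. Note that it never sees "gbarnett".
function map(doc) {
  if (doc.type === "blogpost") {
    emit(doc.author, 1);
  }
}

// Reduce: sums the emitted values (here, a count of posts).
function reduce(keys, values, rereduce) {
  return values.reduce((total, v) => total + v, 0);
}

// Hypothetical view builder: runs map over every doc and sorts rows by key.
function buildView(allDocs, mapFn) {
  const rows = [];
  for (const doc of allDocs) {
    emit = (key, value) => rows.push({ id: doc._id, key, value });
    mapFn(doc);
  }
  rows.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
  return rows;
}

// Hypothetical query step: the runtime key is applied to the finished rows.
function queryView(rows, options) {
  return rows.filter((row) => row.key === options.key);
}

const matching = queryView(buildView(docs, map), { key: "gbarnett" });
const postCount = reduce(null, matching.map((row) => row.value), false);
console.log(matching, postCount);
```

Against a real server, the map and reduce sources would be stored as strings in a design document, and the filter would arrive as a querystring option whose value is JSON-encoded, for example `?key="gbarnett"` with the quotes included.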
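The revision id check and the modeling advice above can be illustrated with a small in-memory simulation. `TinyStore` is a hypothetical stand-in for a database, not a CouchDB client, and its revision strings (`1-x`, `2-x`) are simplified: real revisions pair a generation number with a hash. The 409 status mirrors the Conflict response CouchDB returns for a stale revision.

```javascript
// Hypothetical in-memory store that mimics CouchDB's revision check.
class TinyStore {
  constructor() {
    this.docs = new Map();
  }

  get(id) {
    const doc = this.docs.get(id);
    return doc ? { ...doc } : null; // shallow copy, like a fresh read
  }

  put(doc) {
    const current = this.docs.get(doc._id);
    const currentRev = current ? current._rev : undefined;
    // Reject the write unless the caller holds the latest revision.
    if (doc._rev !== currentRev) {
      return { ok: false, status: 409, error: "conflict" };
    }
    const generation = current ? parseInt(current._rev, 10) + 1 : 1;
    const rev = generation + "-x"; // simplified revision id
    this.docs.set(doc._id, { ...doc, _rev: rev });
    return { ok: true, status: 201, id: doc._id, rev };
  }
}

const store = new TinyStore();
store.put({ _id: "post-1", type: "blogpost", comments: [] });

// Two clients read the same revision, then both try to append a comment.
const clientA = store.get("post-1");
const clientB = store.get("post-1");
const first = store.put({ ...clientA, comments: [...clientA.comments, "First!"] });
const second = store.put({ ...clientB, comments: [...clientB.comments, "Me too"] });

// The losing client has to re-read and retry.
const fresh = store.get("post-1");
const retry = store.put({ ...fresh, comments: [...fresh.comments, "Me too"] });

// Separate comment documents never touch the blogpost's revision.
const c1 = store.put({ _id: "comment-1", type: "comment", post: "post-1", text: "First!" });
const c2 = store.put({ _id: "comment-2", type: "comment", post: "post-1", text: "Me too" });

console.log(first, second, retry, c1, c2);
```

With comments embedded in the blogpost, every simultaneous comment forces a read-and-retry cycle; with one document per comment, both writes succeed on the first attempt.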
I’m already using CouchDB in production (it provides the analytics store for my new iOS musical instrument – Hexaphone), and I’m eager to apply some of this new knowledge to put together some interesting analytics queries.
I also believe that CouchDB provides a great alternative to SQL databases, and will soon be the de facto standard for basic application persistence (with SQL still seeing use in military-grade apps that require up-to-the-nanosecond transactional behavior or use heavy relational queries – financial transaction processing, for example).
Learn more: