Storing Hierarchical Data in CouchDB
Much to my surprise, my last post generated more traffic in a single day than my blog has ever gotten in a single month. Apparently people are quite interested in making web applications with Python. I’ve started on part two, but since so many people showed interest I want to spend more time on it than I spent on the last one. So instead, you get this post.
So I’ve been fiddling around with CouchDB lately. Since it’s common to store tree-based data, and it’s kind of a pain to do so in your standard relational DB, I thought it would be a good exercise to see how hard it is to store hierarchical data in CouchDB.
Turns out it’s pretty easy.
For comparison, you might want to check out this article on storing trees in a relational database. It covers how to store tree-like data using both Adjacency Lists and Modified Preorder Tree Traversal (that’s a mouthful). I’ll cover how I put the data into CouchDB and some of the ways you might want to pull it out.
Storing the Tree
Rather than keeping track of parents as in the Adjacency List method or ‘left’ and ‘right’ as in the Modified Preorder Tree Traversal method, I store the full path to each node as an attribute in that node’s document. I then use this data in the views to organize the data as I need it.
The test data I used is as follows:
[
  {"_id": "Food", "path": ["Food"]},
  {"_id": "Fruit", "path": ["Food", "Fruit"]},
  {"_id": "Red", "path": ["Food", "Fruit", "Red"]},
  {"_id": "Cherry", "path": ["Food", "Fruit", "Red", "Cherry"]},
  {"_id": "Tomato", "path": ["Food", "Fruit", "Red", "Tomato"]},
  {"_id": "Yellow", "path": ["Food", "Fruit", "Yellow"]},
  {"_id": "Banana", "path": ["Food", "Fruit", "Yellow", "Banana"]},
  {"_id": "Meat", "path": ["Food", "Meat"]},
  {"_id": "Beef", "path": ["Food", "Meat", "Beef"]},
  {"_id": "Pork", "path": ["Food", "Meat", "Pork"]}
]
In a real system you’d probably want to use some sort of UUID instead of descriptive strings, since conflicts between node names could be bad. In fact, it’d probably be much faster to just use numbers, since comparisons on numbers are generally much faster. For the purposes of this post, however, it’s much easier to understand if it’s descriptive text.
Once that data is in your DB, it’s time to get it out again!
Retrieving the whole tree
The CouchDB map function to retrieve the whole tree is nice and simple:
function(doc) { emit(doc.path, doc) }
Using the path as the key, the documents will be sorted as above, with each parent immediately followed by its children.
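To see why that ordering is handy, here is a minimal Python sketch that sorts the example paths and prints them as an indented outline without building a tree at all. It assumes plain Python list comparison is a fair stand-in for CouchDB’s key collation, which holds for this simple ASCII data but is not guaranteed in general. The name outline is a hypothetical helper of mine, not part of any library.

```python
# Paths from the test data, deliberately shuffled.
paths = [
    ["Food", "Meat", "Pork"],
    ["Food", "Fruit", "Red", "Tomato"],
    ["Food"],
    ["Food", "Fruit", "Yellow"],
    ["Food", "Fruit"],
    ["Food", "Meat"],
    ["Food", "Fruit", "Red", "Cherry"],
    ["Food", "Fruit", "Yellow", "Banana"],
    ["Food", "Fruit", "Red"],
    ["Food", "Meat", "Beef"],
]


def outline(paths):
    """Return one indented line per node, each parent before its children."""
    lines = []
    for path in sorted(paths):  # lists compare element by element
        depth = len(path) - 1   # the root sits at depth 0
        lines.append("  " * depth + path[-1])
    return lines


print("\n".join(outline(paths)))
```

Because a shorter path sorts before any longer path that starts with it, the depth of each row is all you need to draw the hierarchy.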
One option to get the data into an actual tree would be to add a reduce function to the view:
function(keys, vals) {
  var tree = {};
  for (var i in vals) {
    // Walk down from the root, creating missing nodes along the way.
    var current = tree;
    for (var j in vals[i].path) {
      var child = vals[i].path[j];
      if (current[child] === undefined) current[child] = {};
      current = current[child];
    }
    current['_data'] = vals[i];
  }
  return tree;
}
Note: don’t use this reduce function, since it doesn’t take the rereduce parameter into account, and would most likely not work correctly if a rereduce was done.
I chose to write a similar function in Python and use that to generate my tree on the client side:
class TreeNode(dict):
    pass

def tree_from_rows(rows):
    tree = {}
    for row in rows:
        current = tree
        # Walk down the path, creating any missing nodes on the way.
        for child in row.value['path']:
            current = current.setdefault(child, TreeNode())
        current.data = row.value
    return tree
This code does the job nicely and allows me to use the same function to build a tree from several different views without duplicating code.
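Here is a self-contained sketch of how that function might be exercised without a running database. Row is a hypothetical namedtuple standing in for a view result row; the only thing assumed about real rows is that they expose the document as a value attribute.

```python
from collections import namedtuple

# Hypothetical stand-in for a view result row; only .value is used here.
Row = namedtuple("Row", ["key", "value"])


class TreeNode(dict):
    """A dict of child nodes that can also carry a .data attribute."""


def tree_from_rows(rows):
    tree = {}
    for row in rows:
        current = tree
        # Walk down the path, creating any missing nodes on the way.
        for child in row.value["path"]:
            current = current.setdefault(child, TreeNode())
        current.data = row.value
    return tree


docs = [
    {"_id": "Food", "path": ["Food"]},
    {"_id": "Fruit", "path": ["Food", "Fruit"]},
    {"_id": "Meat", "path": ["Food", "Meat"]},
]
rows = [Row(key=doc["path"], value=doc) for doc in docs]
tree = tree_from_rows(rows)

print(sorted(tree["Food"]))               # names of the children of Food
print(tree["Food"]["Fruit"].data["_id"])  # document stored on the Fruit node
```

Each node is a dict keyed by child name, with the document itself hanging off the node’s data attribute.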
Getting a subtree
To get all the nodes which are underneath a specific node, I implemented the view’s map function as follows:
function(doc) { for (var i in doc.path) { emit([doc.path[i], doc.path], doc) } }
Again, this is pretty simple. The only difference from the last view is that I can now query this view with a startkey and endkey (see the CouchDB HttpViewApi) to get only nodes under a certain node. I could actually do that with the previous view, except I’d have to include the full path to the node in my startkey, which is a bit too much.
For example, if you had CouchDB running on your machine right now with my example data loaded and went to http://localhost:5984/tree/_view/tree/descendants?startkey=["Fruit"]&endkey=["Fruit",{}] you would get back the Fruit node and every node beneath it.
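To make the startkey and endkey trick a bit more concrete, here is a rough Python emulation of the descendants view and that range query. It is only a sketch: High is a hypothetical stand-in for the {} in the endkey, which works as an upper bound because objects sort after strings and arrays in view keys, and Python list comparison is assumed to approximate CouchDB’s collation for this simple data. The docs list is just a compact way of rebuilding the test data.

```python
import functools


@functools.total_ordering
class High:
    """Stand-in for {} in an endkey: compares greater than any string or list."""

    def __eq__(self, other):
        return isinstance(other, High)

    def __gt__(self, other):
        return not isinstance(other, High)


# Compact shorthand for the test data: "a/b/c" becomes the path ["a", "b", "c"].
names = ["Food", "Food/Fruit", "Food/Fruit/Red", "Food/Fruit/Red/Cherry",
         "Food/Fruit/Red/Tomato", "Food/Fruit/Yellow",
         "Food/Fruit/Yellow/Banana", "Food/Meat", "Food/Meat/Beef",
         "Food/Meat/Pork"]
docs = [{"_id": n.split("/")[-1], "path": n.split("/")} for n in names]


def descendants_map(doc):
    """Python mirror of the map function: one row per ancestor in the path."""
    for ancestor in doc["path"]:
        yield [ancestor, doc["path"]], doc


def query(docs, startkey, endkey):
    """Collect every emitted row, sort by key, keep keys inside the range."""
    rows = [row for doc in docs for row in descendants_map(doc)]
    rows.sort(key=lambda row: row[0])
    return [value for key, value in rows if startkey <= key <= endkey]


subtree = query(docs, ["Fruit"], ["Fruit", High()])
print([doc["_id"] for doc in subtree])
```

Every key that starts with "Fruit" falls between ["Fruit"] and ["Fruit", {}], so the range picks out Fruit itself plus everything that has Fruit somewhere in its path.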
How Many Descendants
Getting the number of descendants for a given node is simple. The view is as follows:
'descendant_count': {
    'map': 'function(doc) { for (var i in doc.path) { emit(doc.path[i], 1) } }',
    'reduce': 'function(keys, values) { return sum(values) }'
}
This will count the parent node as well, so you will probably want to subtract one from it at some point. To use this view simply call it with the key parameter set to the id of the desired root node.
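The counting logic can be sketched in plain Python to check the off-by-one. This is an emulation of what the map and reduce steps do with the test data, not code that talks to CouchDB, and descendant_counts is a hypothetical name.

```python
from collections import Counter

# Compact shorthand for the test data: "a/b/c" becomes the path ["a", "b", "c"].
names = ["Food", "Food/Fruit", "Food/Fruit/Red", "Food/Fruit/Red/Cherry",
         "Food/Fruit/Red/Tomato", "Food/Fruit/Yellow",
         "Food/Fruit/Yellow/Banana", "Food/Meat", "Food/Meat/Beef",
         "Food/Meat/Pork"]
docs = [{"_id": n.split("/")[-1], "path": n.split("/")} for n in names]


def descendant_counts(docs):
    counts = Counter()
    for doc in docs:
        for ancestor in doc["path"]:  # map step: emit(ancestor, 1)
            counts[ancestor] += 1     # reduce step: sum the 1s per key
    return counts


counts = descendant_counts(docs)
print(counts["Fruit"] - 1)  # subtract one so Fruit does not count itself
```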
Getting the immediate children of a node
Sometimes you just want to get a list of nodes which are immediately under a given node. This can be done by using a view with the following map function:
function(doc) { if (doc.path && doc.path.length > 1) { emit([doc.path.slice(-2,-1)[0], doc.path], doc) } }
This map function simply takes the second-to-last element from the path and uses that as the first element in the key. You can query this view in the same way as the “getting a subtree” view above.
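As a quick sanity check of the idea, here is the same grouping done in plain Python over the test data. It is a sketch of the logic only: children_by_parent is a hypothetical helper, and it skips the root node because a one-element path has no second-to-last element.

```python
from collections import defaultdict

# Compact shorthand for the test data: "a/b/c" becomes the path ["a", "b", "c"].
names = ["Food", "Food/Fruit", "Food/Fruit/Red", "Food/Fruit/Red/Cherry",
         "Food/Fruit/Red/Tomato", "Food/Fruit/Yellow",
         "Food/Fruit/Yellow/Banana", "Food/Meat", "Food/Meat/Beef",
         "Food/Meat/Pork"]
docs = [{"_id": n.split("/")[-1], "path": n.split("/")} for n in names]


def children_by_parent(docs):
    """Group node ids under their immediate parent (second-to-last path item)."""
    children = defaultdict(list)
    for doc in docs:
        path = doc["path"]
        if len(path) > 1:  # the root has no parent to group under
            children[path[-2]].append(doc["_id"])
    return children


children = children_by_parent(docs)
print(sorted(children["Fruit"]))
```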
Adding a node
Adding a node to the tree is fairly simple. Set the new node’s path to be the path of the desired parent node with the new node’s ID appended to the end. That’s it.
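In Python that might look like the following sketch. new_child is a hypothetical helper, the "Lemon" node is made up for illustration, and saving the resulting document to the database is left out.

```python
def new_child(parent_doc, node_id):
    """Build a document whose path is the parent's path plus its own id."""
    return {"_id": node_id, "path": parent_doc["path"] + [node_id]}


yellow = {"_id": "Yellow", "path": ["Food", "Fruit", "Yellow"]}
lemon = new_child(yellow, "Lemon")
print(lemon["path"])
```

Using + builds a fresh list, so the parent document’s own path is left untouched.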
Deleting a node
Deleting a node is a bit trickier since any given node may have some number of children. You can get the list of nodes in the subtree as outlined above and then do a bulk update to delete each of them.
Depending on the data being stored, deleting the whole sub-tree might not ever be something you want to do. In a discussion forum, for example, you might want to simply delete a single offensive post, leaving any replies which might have been posted. Even in this case, it’s more likely that you’d want to set a flag indicating the deletion rather than actually deleting the post.
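For the delete-everything case, here is a sketch of building the bulk update body. It assumes the documents fetched from the subtree view carry their _id and _rev, and it only builds the JSON that a bulk docs request would take (documents marked with "_deleted": true); actually sending it is not shown. deletion_payload is a hypothetical name and the revision strings are placeholders.

```python
import json


def deletion_payload(subtree_docs):
    """Mark every document in the subtree as deleted for one bulk update."""
    return {
        "docs": [
            {"_id": doc["_id"], "_rev": doc["_rev"], "_deleted": True}
            for doc in subtree_docs
        ]
    }


# Placeholder revisions; real ones come back with the view results.
meat_subtree = [
    {"_id": "Meat", "_rev": "1-aaa", "path": ["Food", "Meat"]},
    {"_id": "Beef", "_rev": "1-bbb", "path": ["Food", "Meat", "Beef"]},
    {"_id": "Pork", "_rev": "1-ccc", "path": ["Food", "Meat", "Pork"]},
]
payload = deletion_payload(meat_subtree)
print(json.dumps(payload, indent=2))
```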
Moving a node to another parent
This is an instance where being able to update just certain fields in a document would be handy, since bulk-updating a large chunk of documents could start to kill performance.
Either way, if something needs to be reparented, it’s just a matter of getting all the nodes in the sub-tree under a certain node, then doing a bulk update to change their paths to wherever they need to be.
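The path rewriting itself could be sketched like this. reparent is a hypothetical helper that assumes node ids are unique within a path; it only computes the new documents and does nothing about the bulk update or about other writers touching the tree at the same time.

```python
def reparent(subtree_docs, node_id, new_parent_path):
    """Return copies of the subtree's docs with paths rooted at the new parent."""
    moved = []
    for doc in subtree_docs:
        index = doc["path"].index(node_id)  # where the moved node sits
        updated = dict(doc)                 # leave the original doc alone
        updated["path"] = list(new_parent_path) + doc["path"][index:]
        moved.append(updated)
    return moved


red_subtree = [
    {"_id": "Red", "path": ["Food", "Fruit", "Red"]},
    {"_id": "Cherry", "path": ["Food", "Fruit", "Red", "Cherry"]},
    {"_id": "Tomato", "path": ["Food", "Fruit", "Red", "Tomato"]},
]
# Purely for illustration: move Red (and its children) under Meat.
moved = reparent(red_subtree, "Red", ["Food", "Meat"])
print([doc["path"] for doc in moved])
```

Everything in each path before the moved node is replaced by the new parent’s path, and everything from the moved node down is kept.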
This part worries me a bit, because there’s a chance that somebody else could add a new child node while you are in the process of moving the sub-tree, leaving that new node dangling by itself in a sub-tree which no longer exists. I’m not sure of the best approach to avoid such a problem.
Conclusion
After my initial experimentation, it seems that CouchDB could potentially do a good job handling hierarchical data. It’s simpler to understand and implement than Modified Preorder Tree Traversal, but still has the advantage of being able to get a whole tree in a single query, unlike the Adjacency List model.
I wrote some Python code to load in my test data and query the various views I created. It requires CouchDB-Python, which you can get via that link or from EasyInstall by running easy_install CouchDB.
My code can be found in the appropriate spot on my Gitweb.

July 5th, 2008 at 4:58 am
Heya Paul, this is a great article, thanks a lot! Could we persuade you to put a copy into our documentation wiki (http://wiki.apache.org/couchdb/FrontPage) somewhere?
Cheers kam
July 5th, 2008 at 4:58 am
Cheers
Jan
I can not even type my own name :/
July 7th, 2008 at 1:29 pm
I’m having a problem getting your git code to work. It seems like it creates the view ok (thank you for being the first example of that I’ve seen) but when I navigate to the view it throws an alertbox and says;
“Error: error
{{nocatch,{bad_value,”Cannot encode ‘undefined’ values as JSON”}}, [{couch_query_servers,promprt,2}, {couch_query_servers,'-map_docs/2-fun-0-',2), {lists,map,2}, {lists,map,2}, {couch_query_servers,map_docs,2}, {couch_view,view_compute,2}, {couch_view,update_group,1}, {couch_view,update_loop,5}]}
I hand typed that so forgive me if I made a mistake.
July 7th, 2008 at 4:28 pm
Jay: Are there perhaps some documents other than the ones which my script created in the database you’re using? If so, that’s probably the issue, since they perhaps wouldn’t have a ‘path’ value set.
Other than that, what version of CouchDB are you using? I suppose I should have specified that I’m using version 0.8 (now that I think about it, it’s a pre-release version. I should probably upgrade to the latest)
July 9th, 2008 at 2:44 pm
Hey pib I think my post yesterday got lost (I’m a king at crashing FF). I do indeed have other databases & documents and indeed they don’t have a ‘path’ value.
Luckily I had to post today cause I finally figured out how to create a view on the HTTP interface (i.e. filling the JSON in manually) and it means that I just noticed the “View Code” drop down!
So I changed the all view’s map function to;
function(doc) { if (doc.path) emit(doc.path, doc) }
i.e. I added the if clause and it’s running fine now! It’s a little strange to me that views work across databases so it’s probably good to have some sort of conditional in there (or run one couch instance per “task”).
Thanks again, now if you can explore the python library some more that’d be another big help for me! For my example I got;
for row in db.view("all_stocks/all"):
and
for row in db.view("all_stocks/all", key=GOOG):
But I can’t figure out how to issue the other view query arguments…. yet! :>
July 9th, 2008 at 2:45 pm
Sorry to anyone else learning from examples as I am, that last line of code should have “GOOG” as being quoted.
July 9th, 2008 at 2:48 pm
Ok, I kinda “embellished” (w/o knowing it). It did work but then I browsed away and came back and it throws the error again… Close though!
January 29th, 2009 at 4:06 am
Hey, thanks for the great example! The part about selecting descendants was kinda tricky and I didn’t really figure out what it was doing until I ran it myself on Futon (I was thinking the start/end keys would select the doc before passing it to the map function). I ended up noticing that what you get indeed is the sub-tree, including the sub-tree root. So if you really only want to get the descendants then you have to exclude the self, like this (forgive my poor, unoptimized js):
function(doc) { for (var i in doc.path) { if (i < doc.path.length -1) { emit([doc.path[i], doc.path], doc) } } }
May 19th, 2009 at 2:29 pm
[...] had seen this blog post about how to store hierarchical data in CouchDB and decided to play with the example data and views the guy provided (big thanks to Paul Bonser!). [...]
December 8th, 2009 at 7:35 pm
It took me a little bit to apply the information in this post – however having played a bit it all makes sense and has saved me a bit of time.
The issue with using the doc uuid’s is they are too long for the reduce functions offered above. I’ve managed to do everything I need from the map functions so that is good.
Thanks a lot for the information
Nick
December 27th, 2009 at 3:06 am
Hey,
Nice article. I really would like to apply this, but the trouble is how would we get this data into CouchDB in the first place. Let’s say we have a lot of data in a self-referential table of a legacy system. I was wondering what would be the most efficient way to pull this information into CouchDB.
Thanks, HariKrishnan
December 27th, 2009 at 4:32 am
I’d say write up a quick script that loads as many documents as possible into memory (within reason), and then does bulk inserts to put them in CouchDB. Since it will be your initial import, I’d suggest using batch mode, too, since you don’t need data safety until you’ve actually got something up and running.
Check out the sections here on bulk inserts and batch mode: http://books.couchdb.org/relax/reference/high-performance
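For the path-building half of such a script, a rough sketch might look like this. It assumes the self-referential table has already been read into a dict mapping each node id to its parent id (None for the root) and that the data contains no cycles. paths_from_parents and the sample rows are hypothetical, and the bulk insert itself is not shown.

```python
def paths_from_parents(parent_of):
    """Turn {node_id: parent_id} rows into documents carrying full paths."""

    def path_to(node_id):
        path = []
        while node_id is not None:  # climb until we pass the root
            path.append(node_id)
            node_id = parent_of[node_id]
        return path[::-1]           # root first, node last

    return [{"_id": node_id, "path": path_to(node_id)} for node_id in parent_of]


# Made-up rows standing in for the legacy table.
legacy_rows = {"Food": None, "Fruit": "Food", "Red": "Fruit", "Cherry": "Red"}
new_docs = paths_from_parents(legacy_rows)
print(new_docs[-1])
```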
March 4th, 2010 at 8:05 am
Thanks for the good example. I tried to figure it out on my own before and the missing bit was the endkey syntax of ["foo", {}]. What exactly does it mean? Where can I get more information about how startkey and endkey are compared to the actual keys?
Thanks, Michael