Storing Hierarchical Data in CouchDB

Posted Fri, Jul 4, 2008

Much to my surprise, my last post generated more traffic in a single day than my blog has ever gotten in a single month. Apparently people are quite interested in making web applications with Python. I’ve started on part two, but since so many people showed interest I want to spend more time on it than I spent on the last one. So instead, you get this post.

So I’ve been fiddling around with CouchDB lately. Since it’s common to store tree-based data, and it’s kind of a pain to do so in your standard relational DB, I thought it would be a good exercise to see how hard it is to store hierarchical data in CouchDB.

Turns out it’s pretty easy.

For comparison, you might want to check out this article on storing trees in a relational database. It covers how to store tree-like data using both Adjacency Lists and Modified Preorder Tree Traversal (that’s a mouthful). I’ll cover how I put the data into CouchDB and some of the ways you might want to pull it out.

Storing the Tree

Rather than keeping track of parents as in the Adjacency List method or ‘left’ and ‘right’ as in the Modified Preorder Tree Traversal method, I store the full path to each node as an attribute in that node’s document. I then use this data in the views to organize the data as I need it.

The test data I used is as follows:

    [
        {"_id":"Food",   "path":["Food"]},
        {"_id":"Fruit",  "path":["Food","Fruit"]},
        {"_id":"Red",    "path":["Food","Fruit","Red"]},
        {"_id":"Cherry", "path":["Food","Fruit","Red","Cherry"]},
        {"_id":"Tomato", "path":["Food","Fruit","Red","Tomato"]},
        {"_id":"Yellow", "path":["Food","Fruit","Yellow"]},
        {"_id":"Banana", "path":["Food","Fruit","Yellow","Banana"]},
        {"_id":"Meat",   "path":["Food","Meat"]},
        {"_id":"Beef",   "path":["Food","Meat","Beef"]},
        {"_id":"Pork",   "path":["Food","Meat","Pork"]}
    ]

In a real system you’d probably want to use some sort of UUID instead of descriptive strings, since conflicts between node names could be bad. In fact, it’d probably be faster to just use numbers, since comparing numbers is generally cheaper than comparing strings. For the purposes of this post, however, it’s much easier to understand if it’s descriptive text.

Once that data is in your DB, it’s time to get it out again!

Retrieving the whole tree

The CouchDB map function to retrieve the whole tree is nice and simple:

    function(doc) {
        emit(doc.path, doc)
    }

Because the path is the key, the documents come back sorted as above, with each parent immediately followed by its descendants.
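To see why that ordering falls out of using the path as the key, here is a small Python sketch that sorts a shuffled copy of the test paths. Python's list comparison is only a stand-in for CouchDB's view collation; the two agree for these plain ASCII strings, but that is an assumption of the sketch and not something to rely on for every kind of key.

```python
import random

# Paths from the test data, written as slash-separated strings for brevity.
PATHS = [p.split("/") for p in (
    "Food", "Food/Fruit", "Food/Fruit/Red", "Food/Fruit/Red/Cherry",
    "Food/Fruit/Red/Tomato", "Food/Fruit/Yellow", "Food/Fruit/Yellow/Banana",
    "Food/Meat", "Food/Meat/Beef", "Food/Meat/Pork",
)]

shuffled = PATHS[:]
random.Random(0).shuffle(shuffled)

# Lists compare element by element, and a shorter list sorts before any
# longer list that starts with it, so each parent lands right before
# its own subtree.
in_key_order = sorted(shuffled)

for path in in_key_order:
    # Indent each node by its depth to make the tree shape visible.
    print("  " * (len(path) - 1) + path[-1])
```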

One option to get the data into an actual tree would be to add a reduce function to the view:

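    function(keys, vals) {
        // Nested object keyed by node ID, built from every row at once.
        var tree = {};
        for (var i = 0; i < vals.length; i++)
        {
            var current = tree;
            var path = vals[i].path;
            for (var j = 0; j < path.length; j++)
            {
                var child = path[j];
                if (current[child] === undefined)
                    current[child] = {};
                current = current[child];
            }
            // Attach the document itself at its own position in the tree.
            current['_data'] = vals[i];
        }
        return tree;
    }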

Note: don’t use this reduce function, since it doesn’t take the rereduce parameter into account, and would most likely not work correctly if a rereduce was done.

I chose to write a similar function in Python and use that to generate my tree on the client side:

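    class TreeNode(dict):
        """A dict of child nodes that can also carry a 'data' attribute."""

    def tree_from_rows(rows):
        # 'rows' is any iterable of view rows whose value is the document.
        tree = TreeNode()
        for row in rows:
            current = tree
            for child in row.value['path']:
                current = current.setdefault(child, TreeNode())
            current.data = row.value
        return tree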

This code does the job nicely and allows me to use the same function to build a tree from several different views without duplicating code.

Getting a subtree

To get all the nodes which are underneath a specific node, I implemented the view’s map function as follows:

    function(doc) { 
        for (var i in doc.path) { 
            emit([doc.path[i], doc.path], doc) 
        } 
    }

Again, this is pretty simple. The only difference from the last view is that I can now query this view with a startkey and endkey (see the CouchDB HttpViewApi) to get only nodes under a certain node. I could actually do that with the previous view, except I’d have to include the full path to the node in my startkey, which is a bit too much.

For example, if you had CouchDB running on your machine right now with my example data loaded, you could go to http://localhost:5984/tree/_view/tree/descendants?startkey=["Fruit"]&endkey=["Fruit",{}] to get Fruit and every node underneath it.
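Typing those keys into a URL by hand gets fiddly, since each key is a JSON value that also needs URL-encoding. Here is a minimal Python sketch of building the query string; the helper name descendants_url is made up for illustration, and the base URL is simply the one from the example above.

```python
import json
from urllib.parse import urlencode

def descendants_url(node_id,
                    base="http://localhost:5984/tree/_view/tree/descendants"):
    """Build the query URL for node_id and everything below it."""
    params = {
        # View keys are JSON values, so JSON-encode them first and let
        # urlencode take care of the brackets and quotes.
        "startkey": json.dumps([node_id]),
        # An object sorts after any string or array in CouchDB's view
        # collation, so [node_id, {}] lies past every [node_id, path] key.
        "endkey": json.dumps([node_id, {}]),
    }
    return base + "?" + urlencode(params)

print(descendants_url("Fruit"))
```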

How Many Descendants

Getting the number of descendants for a given node is simple. The view is as follows:

    'descendant_count': {
        'map':    'function(doc) { for (var i in doc.path) { emit(doc.path[i], 1) } }',
        'reduce': 'function(keys, values) { return sum(values) }'
    }

This will count the parent node as well, so you will probably want to subtract one from it at some point. To use this view simply call it with the key parameter set to the id of the desired root node.
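To check the arithmetic, here is an in-memory Python sketch of the same map and reduce steps run against the test data. It does not talk to CouchDB at all; the names rows, counts and descendant_count are just for this illustration.

```python
from collections import Counter

# Paths from the test data, written as slash-separated strings for brevity.
PATHS = [p.split("/") for p in (
    "Food", "Food/Fruit", "Food/Fruit/Red", "Food/Fruit/Red/Cherry",
    "Food/Fruit/Red/Tomato", "Food/Fruit/Yellow", "Food/Fruit/Yellow/Banana",
    "Food/Meat", "Food/Meat/Beef", "Food/Meat/Pork",
)]

# Map step: one (node_id, 1) row for every element of every path.
rows = [(node_id, 1) for path in PATHS for node_id in path]

# Reduce step, grouped by key: sum the 1s emitted for each node ID.
counts = Counter()
for key, value in rows:
    counts[key] += value

def descendant_count(node_id):
    # The node's own document contributes one row, so subtract it.
    return counts[node_id] - 1

print(descendant_count("Fruit"))  # Red, Cherry, Tomato, Yellow, Banana
```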

Getting the immediate children of a node

Sometimes you just want to get a list of nodes which are immediately under a given node. This can be done by using a view with the following map function:

    function(doc) { 
        emit([doc.path.slice(-2,-1)[0], doc.path], doc) 
    }

This map function simply takes the second-to-last element from the path and uses that as the first element in the key. You can query this view in the same way as the “getting a subtree” view above.
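Here is the same keying idea as an in-memory Python sketch, to show which documents end up under which parent. The function names are hypothetical, and the filter at the end stands in for the startkey/endkey query against the real view.

```python
# Paths from the test data, written as slash-separated strings for brevity.
PATHS = [p.split("/") for p in (
    "Food", "Food/Fruit", "Food/Fruit/Red", "Food/Fruit/Red/Cherry",
    "Food/Fruit/Red/Tomato", "Food/Fruit/Yellow", "Food/Fruit/Yellow/Banana",
    "Food/Meat", "Food/Meat/Beef", "Food/Meat/Pork",
)]

def children_key(path):
    # Mirrors path.slice(-2, -1)[0]: the second-to-last element, or
    # None here for the root, whose path has only one element.
    parent = path[-2:-1]
    return [parent[0] if parent else None, path]

def immediate_children(parent_id):
    # Stand-in for querying the view with keys that start with parent_id.
    keys = [children_key(path) for path in PATHS]
    return [path[-1] for parent, path in keys if parent == parent_id]

print(immediate_children("Fruit"))
```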

Adding a node

Adding a node to the tree is fairly simple. Set the new node’s path to be the path of the desired parent node with the new node’s ID appended to the end. That’s it.
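As a quick sketch in Python, with new_child being a made-up helper name and "Lemon" a made-up node:

```python
def new_child(parent_doc, node_id):
    """Return a new document that sits directly under parent_doc."""
    # Build a fresh list so the parent's own path is left untouched.
    return {"_id": node_id, "path": parent_doc["path"] + [node_id]}

yellow = {"_id": "Yellow", "path": ["Food", "Fruit", "Yellow"]}
lemon = new_child(yellow, "Lemon")
print(lemon)
```

The resulting document can then be saved like any other.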

Deleting a node

Deleting a node is a bit trickier since any given node may have some number of children. You can get the list of nodes in the subtree as outlined above and then do a bulk update to delete each of them.
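One way to do that bulk update, sketched in Python: build a single body for CouchDB's _bulk_docs endpoint that marks each document in the subtree as deleted. The helper name is hypothetical, the _rev values below are placeholders, and actually sending the request is left out.

```python
import json

def bulk_delete_payload(subtree_docs):
    """Build a _bulk_docs body that deletes every given document."""
    return {
        "docs": [
            # Each entry needs the current _rev, or CouchDB will report
            # a conflict for that document.
            {"_id": doc["_id"], "_rev": doc["_rev"], "_deleted": True}
            for doc in subtree_docs
        ]
    }

# Rows as they might come back from the subtree view for "Meat";
# the revision strings are placeholders, not real values.
subtree = [
    {"_id": "Meat", "_rev": "rev-a", "path": ["Food", "Meat"]},
    {"_id": "Beef", "_rev": "rev-b", "path": ["Food", "Meat", "Beef"]},
    {"_id": "Pork", "_rev": "rev-c", "path": ["Food", "Meat", "Pork"]},
]

payload = bulk_delete_payload(subtree)
print(json.dumps(payload, indent=2))
```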

Depending on the data being stored, deleting the whole sub-tree might not ever be something you want to do. In a discussion forum, for example, you might want to simply delete a single offensive post, leaving any replies which might have been posted. Even in this case, it’s more likely that you’d want to set a flag indicating the deletion rather than actually deleting the post.

Moving a node to another parent

This is an instance where being able to update just certain fields in a document would be handy, since bulk-updating a large chunk of documents could start to kill performance.

Either way, if something needs to be reparented, it’s just a matter of getting all nodes which are descendants of a certain node (along with the node itself), then doing a bulk update to change their paths to wherever they need to be.
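The path rewrite itself is small. Here is a Python sketch, assuming the documents have already been fetched from the subtree view; reparent is a made-up name, and saving the results back is left out.

```python
def reparent(docs, node_id, new_parent_path):
    """Return updated copies of every doc at or below node_id."""
    if node_id in new_parent_path:
        # Moving a node underneath itself would create a loop.
        raise ValueError("cannot move a node into its own subtree")
    moved = []
    for doc in docs:
        path = doc["path"]
        if node_id in path:
            # Keep everything from the moved node downward and graft it
            # onto the new parent's path.
            tail = path[path.index(node_id):]
            moved.append(dict(doc, path=new_parent_path + tail))
    return moved

docs = [
    {"_id": "Red", "path": ["Food", "Fruit", "Red"]},
    {"_id": "Cherry", "path": ["Food", "Fruit", "Red", "Cherry"]},
    {"_id": "Tomato", "path": ["Food", "Fruit", "Red", "Tomato"]},
    {"_id": "Yellow", "path": ["Food", "Fruit", "Yellow"]},
]

# Move "Red" (and its children) from under "Fruit" to under "Meat".
moved = reparent(docs, "Red", ["Food", "Meat"])
for doc in moved:
    print(doc["_id"], doc["path"])
```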

This part worries me a bit, because there’s a chance that somebody else could add a new child node while you are in the process of moving the sub-tree, leaving that new node dangling by itself in a sub-tree which no longer exists. I’m not sure of the best approach to avoid such a problem.

Conclusion

After my initial experimentation, it seems that CouchDB could potentially do a good job handling hierarchical data. It’s simpler to understand and implement than Modified Preorder Tree Traversal, but still has the advantage of being able to get a whole tree in a single query, unlike the Adjacency List model.

I wrote some Python code to load in my test data and query the various views I created. It requires CouchDB-Python, which is available via that link or from EasyInstall by running easy_install CouchDB.

My code can be found in the appropriate spot on my Gitweb.

Have some questions? Have an idea for a better way of storing hierarchical data in CouchDB? Any other comments? Then leave a comment below!

Probably Programming