Friday, April 27, 2012

Picking a NoSQL Database

Relational databases and I never got along very well. I think this is caused by the impedance mismatch between objects and the relational model. The fact that the child (e.g. the order line) refers to the parent (the order) has always seemed bizarre to me; the natural order of traversal is parent to child, the direction we know from the object and network models. This reversal causes huge accidental complexity in object oriented languages. Add to this the required type conversions and the need to spread what is basically one entity, with some fully owned children, over many different tables, each with its associated maintenance nightmares. And spending 14.5 Mb of code memory and untold megabytes of dynamic memory for each of your customers just to run an ORM like JPA has always struck me as a bit, well, not right. The embedded world goes out of its way to minimize runtime costs.

Now I am old enough to remember CODASYL and its networked database model, but I largely witnessed its demise against the relational model. I have also witnessed the attempts of object oriented databases like Gemstone and others. It is clear that these models failed, while it is hard to deny that relational databases are widely used and very successful. I do realize that the discipline and the type checks that are part of the relational model have an advantage, and I also see the advantage of its maturity. That said, I really think ORMs suck to program with.

So in my new project I decided to start with a NoSQL database. This is a bit of an unfortunate name, because one of the things I really like about the relational model is the query language, which happens to be called SQL. Many NoSQL databases do not have a query language, and that is a bit too sparse for me.

So what I am looking for is a store for documents with a query language. These documents will have lots of little data items that will likely vary in type over time. The (obvious) model is basically a JSON store. It should be easy to create a collection of JSON documents, store them, retrieve them efficiently on different aspects (preferably with a query language), and apply changes, preferably as partial updates. On top of that I expect horizontal scalability and full fault tolerance.

Though I would strongly prefer to have such a store in Java, since that integrates more easily with OSGi, I found a database that on paper looks to fit the bill: MongoDB. Except for the awful name (sometimes you wonder if there is actually a need for marketing people), it offers exactly what I need. The alternatives do not look bad, but I really like the 100% focus on JavaScript.

Clearly JavaScript is now the only language available in the browser, and it is geared to play a much larger role on the server. If you have not looked at JavaScript in the last two years, look again. It is incredibly impressive what people are doing nowadays in the browser and also on the server. It seems obvious that we are moving back to fat clients, with the server only providing access to the underlying data; I really fail to see any presentation code on the server in the future. And since JSON is native to JavaScript, it is quickly becoming the lingua franca of software.

Unfortunately, the MongoDB Java API does not play very nicely. Since MongoDB uses JavaScript as its interface, it is necessary to interact with special maps (DBObject). This creates really awkward code in Java. Something that looks like the following in JavaScript:

> db.coll.insert( { type: "circle", center: { x: 30, y: 40 }, r: 10 } )
> db.coll.find( { "center.x": { $gt: 2 } } )
Looks like the following in Java:

  BasicDBObject doc = new BasicDBObject();
  doc.put("type", "circle");
  doc.put("center", new BasicDBObject("x", 30).append("y", 40));
  doc.put("r", 10);
  db.getCollection("coll").insert(doc);

  BasicDBObject filter = new BasicDBObject();
  filter.put("center.x", new BasicDBObject("$gt", 2));

  for ( DBObject o : db.getCollection("coll").find(filter) ) {
    // ...
  }

Obviously this kind of code is not what you want to write for a living. The JavaScript is more than twice as concise and therefore more readable. And in this case we do not get bonus points for type safety, since the Java code reverts to strings for the attributes. Not good!
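MongoDB also covers the partial updates I listed as a requirement, through update operators like `$set`. A minimal sketch of the shape of such an update document, using plain `java.util` maps instead of the driver's `BasicDBObject` so it stands alone (the helper name is my own):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the documents involved in a MongoDB partial update,
// using plain maps in place of the driver's BasicDBObject.
public class PartialUpdate {

    // Builds { "$set" : { field : value } } -- this touches only the
    // named field; the rest of the stored document is left alone.
    static Map<String, Object> set(String field, Object value) {
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put(field, value);
        Map<String, Object> update = new LinkedHashMap<>();
        update.put("$set", fields);
        return update;
    }

    public static void main(String[] args) {
        // With the real driver this document would be the second
        // argument of collection.update(query, update)
        System.out.println(set("r", 12)); // prints {$set={r=12}}
    }
}
```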

So to get to know MongoDB better, I have been playing with a (for me) better Java API, based on my Data objects. Using the Java Data objects as the schema enforces consistency throughout an application and helps the developers use the right fields. The previous example then looks like:

public class Shape {
  public String type;
  public Point  center;
  public int    r;

  // ... toString, etc
}

Store store = new Store(Shape.class, db.getCollection("coll"));

Shape s = new Shape();
s.type = "circle";
s.center = new Point();
s.center.x = 30;
s.center.y = 40;
s.r = 10;
store.insert(s);

for ( Shape i : store.find("center.x>2") ) {

Though maybe not as small as the JavaScript example, it at least provides proper data types and completion in the IDE. It also provides much more safety, since the Store class can do a lot of verification against the type information from the given class.
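The kind of verification the Store class can do is plain reflection. A minimal sketch, under my own assumptions about the Store internals (the class and method names here are illustrative, not MongoDB API): walk a dotted field path like center.x through the declared public fields and refuse queries that name a field the class does not have.

```java
import java.lang.reflect.Field;

// Sketch of the check a Store(Shape.class, ...) wrapper can do: walk
// a dotted path like "center.x" through the public fields and reject
// queries that name a field the class does not declare.
public class FieldCheck {

    static boolean hasPath(Class<?> type, String path) {
        Class<?> current = type;
        for (String part : path.split("\\.")) {
            Field found = null;
            for (Field candidate : current.getFields())
                if (candidate.getName().equals(part))
                    found = candidate;
            if (found == null)
                return false;            // unknown field: refuse the query
            current = found.getType();   // descend into the embedded document
        }
        return true;
    }

    public static class Point { public int x, y; }

    public static class Shape {
        public String type;
        public Point  center;
        public int    r;
    }

    public static void main(String[] args) {
        System.out.println(hasPath(Shape.class, "center.x")); // prints true
        System.out.println(hasPath(Shape.class, "centre.x")); // prints false: typo caught early
    }
}
```

The same reflective walk also yields the field's Java type, which is what lets the Store coerce the value in a query string to the right type.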

So after spending two days on MongoDB I am obviously just getting started, but I like the programming model (so far). The key issue is of course: how does it work in practice? I already got some mails from friends pointing to disgruntled MongoDB users. Among the reviewers there are many Pollyannas not hindered by their lack of knowledge, but there are also some pretty decent references that back up their claims with solid experience. Alas, the proof of the pudding is in the eating.

Peter Kriens


  2. Hello Peter,

I think MongoDB is a great solution, but there are significant object-to-store mapping issues that surface at deployment time when optimizing large object graphs for insert and query (index) performance.

Designing how big objects are stored, either as deep documents or normalized across collections, is the challenge. Since Mongo has no join concept, objects that are normalized across Mongo collection boundaries must be recombined using metadata references to child or parent documents in other collections, and this must be done with multiple distinct queries.

    Some of the problems are solved when a model driven approach (EMF) is used on top of MongoDB.

    See the MongoEMF project

There is an initial simple query language, and a new query project is being considered as well.

    I am in the process of working up an ODA driver so one can use Eclipse BIRT with Mongo EMF.

The weak link I have encountered is with the MongoEMF query languages. While these prove to be a better solution than the Java drivers, especially since they can be encoded as URL queries, they are still object/model unaware, which leaves the business reports designer somewhat blind when drilling into the object graphs at report design time.

The current plan is to supplement the MongoEMF query language with a model aware OCL interpreter to aid the report designer when drilling into objects.

    Perhaps your query language may solve these problems?

I look forward to seeing what you come up with.


3. @John: I do recognize the issues you raise, but in a way they are similar to those of SQL or any database. The choice of when something is a document (table) and when it is embedded (column) is the same old containment problem. Design is, and will always remain, hard, because it is so hard to predict what will grow and what will remain simple.

I hope that by using the Data (entity) classes I at least have a good anchor point to document the schema. Though I understand the model driven approach and always try to write table driven software, EMF has always been too big a step for me. But who knows, this project might teach me.

Interestingly, the query string in the find() method is the good old OSGi LDAP filter; this was trivially easy to convert to MongoDB's query language. Since the Store class knows the Java type of the Data class, I can adjust the query string types accordingly and check for existing fields.
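The conversion really is mostly mechanical. A minimal sketch, under my own assumptions (names are illustrative; only three comparison operators are handled, while a real converter would also cover &, |, ! and substring filters): an LDAP comparison like (r>=5) maps almost one-to-one onto the MongoDB query document { "r" : { "$gte" : 5 } }, built here with plain maps instead of BasicDBObject so it stands alone.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: convert a single OSGi LDAP comparison filter such as
// (r>=5) or (type=circle) into a MongoDB-style query document.
public class LdapToMongo {

    static Map<String, Object> convert(String filter) {
        String body = filter.substring(1, filter.length() - 1); // strip ( )
        Map<String, Object> q = new LinkedHashMap<>();
        int ge = body.indexOf(">=");
        int le = body.indexOf("<=");
        if (ge < 0 && le < 0) {
            // plain equality needs no operator document in MongoDB
            int eq = body.indexOf('=');
            q.put(body.substring(0, eq), body.substring(eq + 1));
            return q;
        }
        int pos = ge >= 0 ? ge : le;
        String op = ge >= 0 ? "$gte" : "$lte";
        Map<String, Object> cmp = new LinkedHashMap<>();
        // knowing from the Data class that the field is an int, the
        // string value from the filter can be coerced to the right type
        cmp.put(op, Integer.parseInt(body.substring(pos + 2)));
        q.put(body.substring(0, pos), cmp);
        return q;
    }

    public static void main(String[] args) {
        System.out.println(convert("(r>=5)"));        // prints {r={$gte=5}}
        System.out.println(convert("(type=circle)")); // prints {type=circle}
    }
}
```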