Tuesday, May 18, 2010

Comparing PNUTS, HBase and Cassandra


A lot of NoSQL systems have been sprouting up recently and an increasing number of people are using NoSQL data stores and moving away from RDBMS systems. There's nothing wrong with relational database systems but they are optimized for certain use cases, which they handle very well. NoSQL systems (Bigtable, Dynamo, PNUTS, CouchDB, MongoDB, Keyspace, to name a few) solve different sets of problems, for which they are best suited for.

Recently, in a course that I'm taking at UC Santa Cruz, I got a chance to present the PNUTS paper and compare the system with Bigtable and Dynamo.

At a high level, here's how these systems compare:
(This post talks about HBase instead of Bigtable and Cassandra instead of Dynamo)




HBase

Cassandra

PNUTS

Consistency Model

Fully consistent.

Eventual consistency. Divergent version trees of the same row can exist. Client can trade off between latency and consistency.

Timeline consistency. All versions of a row honor a timeline and there are no divergent version trees.

ACID Semantics

put() call is atomic at a row level. There is no concept of transactions and no notion of consistency between rows.

Scans don't give a consistent view of the table. However, any row returned by a scan will be a consistent view of that given row. Any row updated with a timestamp older than the scanner initialization timestamp may show up in the results.

(Details available in the HBASE- 2294 jira)

None specified

Write call gives the same ACID semantics as a transaction involving a single row

Data Model

Tabular, column oriented.

Table consists of column families and each family has multiple columns. The schemas are flexible and there can be arbitrary columns in any given family. However, the families are specified on table creation. Different versions can be stored for each cell.

Storage model is columnar and is strictly ordered on the rows.

Similar to HBase. Dynamo on the other hand is a key value store. Cassandra has the Bigtable data model over the Dynamo
P2P architecture. Cassandra also has super columns, which are like columns families within a column family.

Tabular, row oriented.

The schemas are flexible and a row can have arbitrary columns, with some being empty as well. Each node stores only a single version of any given row but different versions can exist across the cluster.

Storage model is row oriented.

Underlying Storage

HDFS or any other distributed file system

Node's local storage

Node's local storage (Choice of Hash table or Ordered table)

Replication

Asynchronous.
Data is replicated by the file system when it is persisted

Choice between Synchronous and Asynchronous for each update.

Asynchronous.
Data is written to the master copy, which propagates it to the message broker, which takes care of replication

Fault Tolerance

Regions are restarted (on the same node or any other) if they crash.  If a region server dies, its regions
are distributed to the other servers that are functional. No re-allocation of data takes place.

Updates are first logged into a Write Ahead Log before they go to the  memstore. One WAL is maintained per region server.

Data from the failing node is re-assigned to the next node in the consistent hashing
circle.

Updates are first written to a Write Ahead Log before committed to the table.

If the master for a given record fails, either a new master is elected or the write fails. It is never the case that a write will go to a node that is not the
master for a record.

The message broker does logging when it receives the update from the master. There is no logging done at the individual nodes.

Scalability

1000s of nodes.
Each table
can have millions of columns and billions of rows

10-100s of nodes

10s of sites with 1000s of nodes. Since it is row oriented storage, the number of columns would not be very high (no numbers reported in the papers)

Optimized for

Writes, Scans. The writes are kept in memory (in the memstore) and flushed to disk in chunks, which gives good write performance.

Writes.

Poor scan performance as compared to the other two systems. [5]

Reads. The client has the option to read from a copy which is geographically close and
can define the level of consistency desired.

Where does it fit?

If you want a scalable system that is deployed in a single data center and you don't care about network partitions, HBase is your friend. If you cannot tolerate loose consistency in data, this is the best option.

Update:
If you already have Hadoop and want to be able to read/write small objects quickly and also run some analytics over the data, HBase is the way to go.

If you can deal with eventual consistency and want a highly available system that can span across data centers, Cassandra is your friend. As of now, Cassandra is easier to get off the ground than HBase and has lesser components that you need to get running to begin with. Contrary to the popular belief, Cassandra has more moving parts than HBase but they are managed by the framework and user does not need to worry much.

Update:
If you dont have Hadoop and dont need it but you just need a database that scales beyond a single node, Cassandra is your friend. Keep in mind that there is no SQL here. For HBase, you need to also deploy Hadoop (HDFS atleast), which is not required by Cassandra.

If you want the system to be geographically distributed in order to serve large number of reads with low latency from across the globe, PNUTS is your friend.  You also get fine-grained control over consistency and can trade it with low latency for reads.
Access options

Native Java API, Jython, Groovy DSL, Scala, REST, Thrift

Native Java API, Ruby, Perl, Python, Scala, PHP, Clojure, Grails, C++, C#



If you are out there looking for a scalable database solution, you've got quite a few choices. These are just some of the more popular ones. PNUTS is not open source but is a nice system from an architectural stand point and thats why I talk about it in this post.

References:
[1] HBase: http://hadoop.apache.org/hbase/
[2] Cassandra: http://cassandra.apache.org/
[3] Bigtable: http://labs.google.com/papers/bigtable.html
[4] Dynamo: http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html
[5] Yahoo! YCSB Benchmark: http://research.yahoo.com/node/3202
[6] Lars George's description of the HBase Architecture: http://www.larsgeorge.com/2010/01/hbase-architecture-101-write-ahead-log.html

Thursday, May 06, 2010

My thoughts on the CAP theorem

There has been buzz around the CAP theorem and its validity in the recent past and so it was brought up with Data Management in the Cloud class that I'm taking at UC Santa Cruz this quarter. I was asked to talk about it in one of the lectures and here's the essence of the discussion.

CAP is a good theoretical way of looking at systems and it definitely makes sense, but its limiting. After reading the proof of the CAP theorem by Gilbert and Lynch, I was left with questions on how I could map it to a practical system. Its a good theoretical paper but it did not tell me much on how to build systems and trade off between C, A and P. From a practical standpoint, the choice really is between CA and AP. A CP system does not make sense. Its like saying "I will give you a consistent answer whenever I give one, but I dont guarantee an answer". There is little practical use for such a system. So, essentially, the choice gets down to the following: Either the system can try to give you consistency all the time and when there is a failure, it gives up availability, which means it does not return requests; or it can give you availability even in case of failures, in which case you cant ask for consistency.

To overcome the shortcomings of the theorem, Daniel Abadi proposed PACELC some time back. I think what he says makes sense. I agree with Abadi's idea of including latency as a parameter, since its not required to constrain the system in all scenarios. Saying that the system is going to give up consistency in every situation does not necessarily make sense. If there is no network partitioning, I have a choice between giving a consistent view of the data (by either reading all replicas and reconciling, or writing all replicas in the write phase and read only one at read time) and answering a request with low latency. This means that the system can be designed to have a low latency write operation (write only one replica - this is what Bigtable does in some sense) or low latency read operation (read from only one replica) or give an always consistent view when the data is read (the Bigtable approach). Note that there is no partitioning of the network involved here and I can choose between the above 3 options. However, when a partition happens, the choice is between the system being available or consistent. This gives more flexibility in terms of the design choices that one needs to make while building a scalable system.

Talking about PACELC and CAP, I quote Ryan Rawson "Its computer science v/s software engineering". I'm still thinking about PACELC and CAP and still not fully convinced if any of these are comprehensive enough to cover all scenarios.