Tuesday, May 18, 2010

Comparing PNUTS, HBase and Cassandra


A lot of NoSQL systems have been sprouting up recently and an increasing number of people are using NoSQL data stores and moving away from RDBMS systems. There's nothing wrong with relational database systems but they are optimized for certain use cases, which they handle very well. NoSQL systems (Bigtable, Dynamo, PNUTS, CouchDB, MongoDB, Keyspace, to name a few) solve different sets of problems, for which they are best suited for.

Recently, in a course that I'm taking at UC Santa Cruz, I got a chance to present the PNUTS paper and compare the system with Bigtable and Dynamo.

At a high level, here's how these systems compare:
(This post talks about HBase instead of Bigtable and Cassandra instead of Dynamo)




HBase

Cassandra

PNUTS

Consistency Model

Fully consistent.

Eventual consistency. Divergent version trees of the same row can exist. Client can trade off between latency and consistency.

Timeline consistency. All versions of a row honor a timeline and there are no divergent version trees.

ACID Semantics

put() call is atomic at a row level. There is no concept of transactions and no notion of consistency between rows.

Scans don't give a consistent view of the table. However, any row returned by a scan will be a consistent view of that given row. Any row updated with a timestamp older than the scanner initialization timestamp may show up in the results.

(Details available in the HBASE- 2294 jira)

None specified

Write call gives the same ACID semantics as a transaction involving a single row

Data Model

Tabular, column oriented.

Table consists of column families and each family has multiple columns. The schemas are flexible and there can be arbitrary columns in any given family. However, the families are specified on table creation. Different versions can be stored for each cell.

Storage model is columnar and is strictly ordered on the rows.

Similar to HBase. Dynamo on the other hand is a key value store. Cassandra has the Bigtable data model over the Dynamo
P2P architecture. Cassandra also has super columns, which are like columns families within a column family.

Tabular, row oriented.

The schemas are flexible and a row can have arbitrary columns, with some being empty as well. Each node stores only a single version of any given row but different versions can exist across the cluster.

Storage model is row oriented.

Underlying Storage

HDFS or any other distributed file system

Node's local storage

Node's local storage (Choice of Hash table or Ordered table)

Replication

Asynchronous.
Data is replicated by the file system when it is persisted

Choice between Synchronous and Asynchronous for each update.

Asynchronous.
Data is written to the master copy, which propagates it to the message broker, which takes care of replication

Fault Tolerance

Regions are restarted (on the same node or any other) if they crash.  If a region server dies, its regions
are distributed to the other servers that are functional. No re-allocation of data takes place.

Updates are first logged into a Write Ahead Log before they go to the  memstore. One WAL is maintained per region server.

Data from the failing node is re-assigned to the next node in the consistent hashing
circle.

Updates are first written to a Write Ahead Log before committed to the table.

If the master for a given record fails, either a new master is elected or the write fails. It is never the case that a write will go to a node that is not the
master for a record.

The message broker does logging when it receives the update from the master. There is no logging done at the individual nodes.

Scalability

1000s of nodes.
Each table
can have millions of columns and billions of rows

10-100s of nodes

10s of sites with 1000s of nodes. Since it is row oriented storage, the number of columns would not be very high (no numbers reported in the papers)

Optimized for

Writes, Scans. The writes are kept in memory (in the memstore) and flushed to disk in chunks, which gives good write performance.

Writes.

Poor scan performance as compared to the other two systems. [5]

Reads. The client has the option to read from a copy which is geographically close and
can define the level of consistency desired.

Where does it fit?

If you want a scalable system that is deployed in a single data center and you don't care about network partitions, HBase is your friend. If you cannot tolerate loose consistency in data, this is the best option.

Update:
If you already have Hadoop and want to be able to read/write small objects quickly and also run some analytics over the data, HBase is the way to go.

If you can deal with eventual consistency and want a highly available system that can span across data centers, Cassandra is your friend. As of now, Cassandra is easier to get off the ground than HBase and has lesser components that you need to get running to begin with. Contrary to the popular belief, Cassandra has more moving parts than HBase but they are managed by the framework and user does not need to worry much.

Update:
If you dont have Hadoop and dont need it but you just need a database that scales beyond a single node, Cassandra is your friend. Keep in mind that there is no SQL here. For HBase, you need to also deploy Hadoop (HDFS atleast), which is not required by Cassandra.

If you want the system to be geographically distributed in order to serve large number of reads with low latency from across the globe, PNUTS is your friend.  You also get fine-grained control over consistency and can trade it with low latency for reads.
Access options

Native Java API, Jython, Groovy DSL, Scala, REST, Thrift

Native Java API, Ruby, Perl, Python, Scala, PHP, Clojure, Grails, C++, C#



If you are out there looking for a scalable database solution, you've got quite a few choices. These are just some of the more popular ones. PNUTS is not open source but is a nice system from an architectural stand point and thats why I talk about it in this post.

References:
[1] HBase: http://hadoop.apache.org/hbase/
[2] Cassandra: http://cassandra.apache.org/
[3] Bigtable: http://labs.google.com/papers/bigtable.html
[4] Dynamo: http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html
[5] Yahoo! YCSB Benchmark: http://research.yahoo.com/node/3202
[6] Lars George's description of the HBase Architecture: http://www.larsgeorge.com/2010/01/hbase-architecture-101-write-ahead-log.html

7 comments:

  1. Most of the comments about cassandra is incorrect, Consistency is a choice from the client and there is always ways to make it consistent. Datacenter replication is not specified. Facebook installation is 200 nodes and i dont see any Hbase installation that big. Scans are slow was a old story. you might really need to check out the latency and the scalability and compare with numbers... i dont see much of a feature difference currently.

    ReplyDelete
  2. Vijay, I havent heard of a 200 node Cassandra instance. Do you have any stats/write up/paper to back up the claim?

    Yes, Cassandra and HBase are in a way converging to provide similar features.

    The scan performance was compared in the Yahoo! YCSB paper. Its being presented in SoCC10.

    Consistency is the client's choice and I mention that the client can trade off between latency and consistency. Is that incorrect?

    ReplyDelete
  3. AK, you might want to read wikipedia.... http://en.wikipedia.org/wiki/Apache_Cassandra Facebook claims that....

    Yahoo paper was little old... and we asked them to use the latest release
    http://www.mail-archive.com/cassandra-user@incubator.apache.org/msg02315.html 0.5 to 0.6 there was atleast 2x performance gain.

    ReplyDelete
  4. In addition @cassandra writes are atomic within a columnfamily (within a row)
    http://stackoverflow.com/questions/2055037/cassandra-atomic-reads-writes-within-a-single-columnfamily

    About consistency: i was commenting about "If you can deal with eventual consistency and want a highly available system that can span across data centers" --> take a look at DCQuorum and DCQuorumSync consistency levels in cassandra

    ReplyDelete
  5. Thanks for giving pointers to more detailed information about Cassandra.
    Regarding atomic writes, from the link you have posted:

    "It's correct, if understood correctly. We should probably just remove it since it's confusing as written.

    What it means is, if a write for a given row is acked, eventually, all the data updated in that row will be available for reads. So no, it's not atomic at the batch_mutate level but at the list level."

    As far as ACID guarantees go, the semantics of each term are very clearly defined. The above description does not clearly describe what will happen on a write. Can you explain?

    I remember we discussed about DCQuorum when we met earlier. But I'm still not clear how my statement is incorrect. Can you explain?

    ReplyDelete
  6. If your consistency level is Quorum then it means when you read as Quorum you will see a consistent and an atomic read which will be similar to that of the hbase.... Datacenter Quorum will provide you the ability in which you will be able to see the data consistently within a Datacenter, which makes it even better for some use cases. hope that clarifies...

    ReplyDelete
  7. Getting a consistent view by reading from a quorum is the basic principal of this architecture. And like I said earlier, thats a tradeoff between latency (governed by the number of replicas needed to be read) and consistency, which the client can make. The statement about atomicity that you pointed me to does not say this, by the way. It clearly says that if a write is acked, "eventually" (for no definition of term eventual), all data updated in that row will be available. That does not "strongly guarantee" anything.

    DCQuorum, from what I understand (by reading [1]), gives you the ability to read replicas from across or within data centers in order to establish consistency, if the instance spans clusters. Anyways, a quorum needs to be read. Yes, it gives you consistency within a data center. And like I said earlier - its a trade off between latency (performance) and consistency, which is the choice of the client.

    [1] http://www.mail-archive.com/user@cassandra.apache.org/msg02494.html

    ReplyDelete