Tuesday, February 28, 2012

Introduction to Cassandra

Book Review and Excerpt
Cassandra: The Definitive Guide by Eben Hewitt

Eben Hewitt is a software architect, the author of Cassandra: The Definitive Guide, Java SOA Cookbook, and an international speaker on technology.

The book he wrote on Cassandra is well structured, offers hands-on examples, and is a helpful introduction to Cassandra even for readers new to NoSQL databases.
Let's start our journey with Eben Hewitt.

Chapter 1 - Introduction to Cassandra

Eben opens the book with a discussion of relational databases.

Relational databases are very good at solving certain data storage problems, but because of their focus, they also can create problems of their own when it’s time to scale. 
Then, you often need to find a way to get rid of your joins, which means denormalizing the data, which means maintaining multiple copies of data and seriously disrupting your design, both in the database and in your application. 
Further, you almost certainly need to find a way around distributed transactions, which will quickly become a bottleneck. These compensatory actions are not directly supported in any
but the most expensive RDBMS. And even if you can write such a huge check, you still need to carefully choose partitioning keys to the point where you can never entirely ignore the limitation.

As we see some of the limitations of RDBMS and consequently some of the strategies that architects have used to mitigate their scaling issues, a picture slowly starts to emerge. It’s a picture that makes some NoSQL solutions seem perhaps less radical and less scary than we may have thought at first, and more like a natural expression and encapsulation of some of the work that was already being done to manage very large databases.

The question becomes what kind of application would you want to have if durability, elastic scalability, vast storage, and blazing-fast writes weren’t a problem? In a world now working at web scale, Apache Cassandra might be one part of the answer.

Apache Cassandra is an open source, distributed, decentralized, elastically scalable,
highly available, fault-tolerant, tuneably consistent, column-oriented database that bases its distribution design on Amazon’s Dynamo and its data model on Google’s Bigtable. Created at Facebook, it is now used at some of the most popular sites on the Web.

Wow! Quite a statement! 
Let's zoom in for a better understanding of those promises!

Distributed Database
Much of its design and code base is specifically engineered toward not only making it work across many different machines, but also for optimizing performance across multiple data center racks, and even for a single Cassandra cluster running across geographically dispersed data centers. 
You can confidently write data to anywhere in the cluster and Cassandra will get it.
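To make this concrete, here is a minimal sketch (not from the book) using the modern DataStax Python driver, which postdates the Thrift-based clients described in the first edition. The node addresses and the demo.users table are hypothetical; the point is that any node can accept the write and coordinate it.

    from cassandra.cluster import Cluster

    # Any one of these contact points is enough; the driver discovers the
    # rest of the ring from the cluster itself. Addresses are hypothetical.
    cluster = Cluster(['10.0.0.1', '10.0.0.2', '10.0.0.3'])
    session = cluster.connect()

    # The write can be sent to any node in the cluster; that node acts as
    # coordinator and forwards the data to the replicas that own it.
    session.execute(
        "INSERT INTO demo.users (user_id, name) VALUES (%s, %s)",
        ('alice', 'Alice'))

    cluster.shutdown()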

Decentralized Database
Once you start to scale many other data stores (MySQL, Bigtable), some nodes need to be set up as masters in order to organize other nodes, which are set up as slaves.
Cassandra, however, is decentralized, meaning that every node is identical.
It also features a peer-to-peer protocol and uses gossip to maintain and keep in sync a list of nodes that are alive or dead.
The fact that Cassandra is decentralized means that there is no single point of failure.
All of the nodes in a Cassandra cluster function exactly the same.
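As a small illustration (again a sketch, not the book's own example), the same Python driver exposes the peer list it learns after connecting through any single node; there is no master to ask, because every node carries the same picture of the cluster.

    from cassandra.cluster import Cluster

    cluster = Cluster(['10.0.0.1'])   # hypothetical address; any node will do
    session = cluster.connect()

    # Every member is a peer; the driver learned this list from the node it
    # connected to, which in turn keeps it current via gossip.
    for host in cluster.metadata.all_hosts():
        print(host.address, host.datacenter, host.rack, host.is_up)

    cluster.shutdown()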

Elastic Scalability
Scalability is an architectural feature of a system that can continue serving a greater number of requests with little degradation in performance. Vertical scaling—simply adding more hardware capacity and memory to your existing machine—is the easiest way to achieve this. Horizontal scaling means adding more machines that have all or some of the data on them so that no one machine has to bear the entire burden of serving requests.
Elastic scalability refers to a special property of horizontal scalability. It means that your cluster can seamlessly scale up and scale back down. To do this, the cluster must be able to accept new nodes that can begin participating by getting a copy of some or all of the data and start serving new user requests without major disruption or reconfiguration of the entire cluster. You don’t have to restart your process. You don’t have to change your application queries. You don’t have to manually rebalance the data yourself. Just add another machine:  Cassandra will find it and start sending it work.
Scaling down, of course, means removing some of the processing capacity from your
cluster. 

High Availability and Fault Tolerance
Cassandra is highly available. You can replace failed nodes in the cluster with no downtime, and you can replicate data to multiple data centers to offer improved local performance and prevent downtime if one data center experiences a catastrophe such as fire or flood.
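A hedged sketch of what such multi-data-center replication looks like in practice: in current Cassandra versions you declare a replica count per data center when creating a keyspace. The keyspace and data center names below are hypothetical and would have to match what your snitch reports.

    from cassandra.cluster import Cluster

    cluster = Cluster(['10.0.0.1'])
    session = cluster.connect()

    # NetworkTopologyStrategy assigns a replica count to each data center,
    # so losing an entire data center still leaves live replicas elsewhere.
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS demo
        WITH replication = {
            'class': 'NetworkTopologyStrategy',
            'us_east': 3,
            'eu_west': 2
        }
    """)

    cluster.shutdown()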

Tuneable Consistency
Consistency essentially means that a read always returns the most recently written value.
Cassandra is frequently called “eventually consistent,” which is a bit misleading. Out of the box, Cassandra trades some consistency in order to achieve total availability. But Cassandra is more accurately termed “tuneably consistent,” which means it allows you to easily decide the level of consistency you require, in balance with the level of availability.

When considering consistency, availability, and partition tolerance, we can achieve only two of these goals in a given distributed system. This is the CAP Theorem stated by Eric Brewer.
At the center of the problem is data update replication. To achieve a strict consistency, all update operations will be performed synchronously, meaning that they must block, locking all replicas until the operation is complete, and forcing competing clients to wait. A side effect of such a design is that during a failure, some of the data will be entirely unavailable.
We could alternatively take an optimistic approach to replication, propagating updates to all replicas in the background in order to avoid blowing up on the client. The difficulty this approach presents is that now we are forced into the situation of detecting and resolving conflicts. A design approach must decide whether to resolve these conflicts at one of two possible times: during reads or during writes. That is, a distributed database designer must choose to make the system either always readable or always writable.
Dynamo and Cassandra choose to be always writable, opting to defer the complexity of reconciliation to read operations, and realize tremendous performance gains.

In Cassandra, consistency is not an all-or-nothing proposition, so we might more accurately term it “tuneable consistency” because the client can control the number of replicas to block on for each operation. To adjust consistency in your own system, look into how the consistency level is set in relation to the replication factor.
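As a quick sketch of what “tuneable” means to a client (all names are hypothetical, and this uses the current Python driver rather than the book's Thrift-era examples), each statement can carry its own consistency level:

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(['10.0.0.1'])
    session = cluster.connect('demo')

    # Availability-leaning write: one replica acknowledging is enough.
    fast_write = SimpleStatement(
        "INSERT INTO users (user_id, name) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.ONE)
    session.execute(fast_write, ('bob', 'Bob'))

    # Consistency-leaning read: a majority of replicas must answer.
    quorum_read = SimpleStatement(
        "SELECT name FROM users WHERE user_id = %s",
        consistency_level=ConsistencyLevel.QUORUM)
    row = session.execute(quorum_read, ('bob',)).one()

    cluster.shutdown()

A common rule of thumb: if the number of replicas that must acknowledge a write plus the number consulted on a read exceeds the replication factor, reads will see the latest write (for example, QUORUM writes with QUORUM reads at a replication factor of 3).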

Column-Oriented Database
Cassandra is frequently referred to as a “column-oriented” database, which is not incorrect. It’s not relational, and it does represent its data structures in sparse multidimensional hashtables. “Sparse” means that for any given row you can have one or more columns, but each row doesn’t need to have all the same columns as other rows like it (as in a relational model). Each row has a unique key, which makes its data accessible. So although it’s not wrong to say that Cassandra is columnar or column-oriented, it might be more helpful to think of it as an indexed, row-oriented store.
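A rough mental model (purely illustrative, not how Cassandra stores bytes on disk): plain nested dictionaries keyed by row key, where each row carries only the columns it actually has.

    # Rows are reached by key; each row has its own, possibly different,
    # set of columns -- that is what "sparse" means here.
    users = {
        'alice': {'name': 'Alice', 'email': 'alice@example.com'},
        'bob':   {'name': 'Bob', 'city': 'Tucson', 'phone': '555-0100'},
        'carol': {'name': 'Carol'},
    }

    print(users['bob'].get('email'))   # None: this row simply lacks that column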

Schema-Free

Instead of modeling data up front using expensive data modeling tools and then writing queries with complex join statements, Cassandra asks you to model the queries you want, and then provide the data around them.
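For example (a sketch in today's CQL, which postdates the book's Thrift-era interface; all names are hypothetical), if the query you need is “the latest posts by a given author”, you shape the table around exactly that access path:

    from cassandra.cluster import Cluster

    cluster = Cluster(['10.0.0.1'])
    session = cluster.connect('demo')

    # Partition by author and order by time within the partition, so the
    # target query becomes a single contiguous read.
    session.execute("""
        CREATE TABLE IF NOT EXISTS posts_by_author (
            author     text,
            posted_at  timestamp,
            title      text,
            body       text,
            PRIMARY KEY (author, posted_at)
        ) WITH CLUSTERING ORDER BY (posted_at DESC)
    """)

    rows = session.execute(
        "SELECT title, posted_at FROM posts_by_author WHERE author = %s LIMIT 10",
        ('alice',))

    cluster.shutdown()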

High Performance

Cassandra was designed specifically from the ground up to take full advantage of  multiprocessor / multicore machines, and to run across many dozens of these machines housed in multiple data centers. It scales consistently and seamlessly to hundreds of terabytes. Cassandra has been shown to perform exceptionally well under heavy load.

It consistently can show very fast throughput for writes per second on a basic commodity workstation. As you add more servers, you can maintain all of Cassandra’s desirable properties without sacrificing performance.

All this said, you should by now have a big picture of Cassandra's strong points.

If this post has whetted your appetite for the subject, you can find much more information in the book itself: Cassandra: The Definitive Guide.

The excerpt is from the book 'Cassandra: The Definitive Guide', authored by Eben Hewitt, published November 2010 by O’Reilly Media, Copyright 2011 Eben Hewitt.
