Cassandra: The Definitive Guide by Eben Hewitt
Eben Hewitt is a software architect, the author of Cassandra: The Definitive Guide , Java SOA Cookbook, and an international speaker on technology.
The book he wrote on Cassandra is well structured, brings hands-on examples and it's a helpful introduction to Cassandra even for beginners in NoSQL databases.
Let's start our journey with Ebin Hewitt.
Chapter 1 - Introduction to Cassandra
Ebin opens the book with a discussion on relational
databases
Relational databases are very good at solving certain data storage problems, but because
of their focus, they also can create problems of their own when it’s time to scale.
Then, you often need to find a way to get rid of your joins, which means denormalizing
the data, which means maintaining multiple copies of data and seriously disrupting
your design, both in the database and in your application.
Further, you almost
certainly need to find a way around distributed transactions, which will
quickly become a bottleneck.
These compensatory actions are not directly supported in any
but the most
expensive RDBMS. And even if you can write such a huge check, you still need to
carefully choose partitioning keys to the point where you can never entirely ignore the
limitation.
As we see some of the limitations of RDBMS and consequently some of the
strategies that architects have used to mitigate their scaling issues, a picture
slowly starts to emerge. It’s a picture that makes some NoSQL solutions seem perhaps less
radical and less scary than we may have thought at first, and more like a natural
expression and encapsulation of some of the work that was already being done to manage very large databases.
The question becomes what
kind of application would you want to have if durability, elastic
scalability, vast storage, and blazing-fast writes weren’t a problem? In a world now working at
web scale, Apache Cassandra might
be one part of the answer.
Apache Cassandra is an
open source, distributed, decentralized, elastically scalable,
highly available,
fault-tolerant, tuneably consistent, column-oriented database that bases its distribution
design on Amazon’s Dynamo and its data model on Google’s Bigtable. Created at
Facebook, it is now used at some of the most popular sites on the Web.
Wow! Quite a statement!
Let's zoom in for a better understanding of those promises!
Distributed Database
Much of its design and code
base is specifically engineered toward not only making it work across many
different machines, but also for optimizing performance across multiple data
center racks, and even for a single
Cassandra cluster running across geographically dispersed data centers.
You can
confidently write data to anywhere in the cluster and Cassandra will get it.
Decentralised Database
Once you start to scale
many other data stores (MySQL, Bigtable), some nodes need to be set up as masters in
order to organize other nodes, which are set up as slaves.
Cassandra, however, is
decentralized, meaning that every node is identical.
It also features a peer-to-peer protocol and uses gossip to maintain and keep in sync a list of
nodes that are alive or dead.
The fact that Cassandra is
decentralized means that there is no single point of failure.
All of
the nodes in a Cassandra cluster function exactly the same.
Elastic
Scalability
Scalability is an
architectural feature of a system that can continue serving a greater number of requests with
little degradation in performance. Vertical scaling—simply adding more hardware
capacity and memory to your existing machine—is the easiest way to achieve this.
Horizontal scaling means adding more machines that have all or some of the data on them
so that no one machine has to bear the entire burden of serving
requests.
Elastic
scalability refers to a special property of horizontal
scalability. It means that your cluster can
seamlessly scale up and scale back down. To do this, the cluster must be able to accept new
nodes that can begin participating by getting a copy of some or all of the data and start
serving new user requests without major disruption or reconfiguration of the entire cluster. You
don’t have to restart your process. You don’t have to change your application
queries. You don’t have to manually rebalance the data yourself. Just add another
machine: Cassandra will find it and start sending it work.
Scaling down, of course,
means removing some of the processing capacity from your
cluster.
High
Availability and Fault Tolerance
Cassandra is highly
available. You can replace failed nodes in the cluster with no downtime, and you can
replicate data to multiple data centers to offer improved local performance and prevent
downtime if one data center experiences a catastrophe such as fire or flood.
Tuneable
Consistency
Consistency
essentially means that a read always returns the most recently written value.
Cassandra is frequently
called “eventually consistent,” which is a bit misleading. Out of the box, Cassandra trades
some consistency in order to achieve total availability. But Cassandra is more
accurately termed “tuneably consistent,” which means it allows you to easily decide the
level of consistency you require, in balance with the level of availability.
When
considering consistency, availability, and partition tolerance, we can achieve only two of
these goals in a given distributed system. This is the CAP Theorem stated by Eric
Brewer.
At the center of the
problem is data update replication. To achieve a strict consistency, all update
operations will be performed synchronously, meaning that they must block,
locking all replicas until the operation is complete, and
forcing competing clients to wait. A side effect of such a design is that
during a failure, some of the data will be entirely unavailable.
We could alternatively
take an optimistic approach to replication, propagating updates to all replicas in the
background in order to avoid blowing up on the client. The difficulty this approach presents is
that now we are forced into the situation of detecting and resolving conflicts. A
design approach must decide whether to resolve these conflicts at one of two possible
times: during reads or during writes. That is, a distributed database designer must
choose to make the system either always readable or always writable.
Dynamo and
Cassandra choose to be always writable, opting to defer the complexity of
reconciliation to read operations, and realize tremendous performance gains.
In Cassandra, consistency
is not an all-or-nothing proposition, so we might more accurately
term it “tuneable
consistency” because the client can control the number of replicas to block on for
all updates. For adjusting consistency in your system look for more information on setting the consistency level against the
replication factor.
Column-Oriented Database
Cassandra is
frequently referred to as a “column-oriented” database, which is not incorrect.
It’s not relational, and it does represent its data structures in sparse
multidimensional hashtables. “Sparse” means that for any given row you can have
one or more columns, but each row doesn’t need to have all the same columns as
other rows like it (as in a relational model). Each row has a unique key, which
makes its data accessible. So although it’s not wrong to say that Cassandra is
columnar or columnoriented, it might be more helpful to think of it as an
indexed, row-oriented store
Schema-Free
Instead of
modeling data up front using expensive data modeling tools and then
writing queries with complex join statements, Cassandra asks you to model the
queries you want, and then provide the data around them.
High
Performance
Cassandra was
designed specifically from the ground up to take full advantage of multiprocessor / multicore
machines, and to run across many dozens of these machines housed in
multiple data centers. It scales consistently and seamlessly to hundreds of terabytes.
Cassandra has been shown to perform exceptionally well under heavy load.
It
consistently can show very fast throughput for writes per second on a basic
commodity workstation.
As you add more servers, you can maintain all of Cassandra’s desirable
properties without sacrificing performance.
All this said, you should have by now a big picture on Cassandra's strong points.
If this post has open your apatite for the subject, you can find much more information in the book itself: Cassandra: The Definitive Guide
The excerpt is from the book, 'Cassandra: The Definitive Guide', authored by Eben Hewit, published November 2010 by O’Reilly Media, Copyright 2011 Eben Hewitt.
No comments:
Post a Comment