Wednesday, February 29, 2012

Installing Cassandra

Book Review and Excerpt
Cassandra: The Definitive Guide by Eben Hewitt
Chapter 2

My previous post on Cassandra gave you an overview of Cassandra's benefits.
So, when is Cassandra a good choice?

Lots of Writes, Statistics, and Analysis
Consider your application from the perspective of the ratio of reads to writes. Cassandra
is optimized for excellent throughput on writes.
Many of the early production deployments of Cassandra involve storing user activity updates, social network usage, recommendations/reviews, and application statistics.
These are strong use cases for Cassandra because they involve lots of writing with less
predictable read operations, and because updates can occur unevenly with sudden spikes. In fact, the ability to handle application workloads that require high performance at significant write volumes with many concurrent client threads is one of the primary features of Cassandra.
According to the project wiki, Cassandra has been used to create a variety of applications,
including a windowed time-series store, an inverted index for document searching, and a distributed job priority queue.

Geographical Distribution
If you have a globally deployed application that could see a performance benefit from putting the data near the user, Cassandra could be a great fit.

Evolving Applications
If your application is evolving rapidly and you’re in “startup mode,” Cassandra might
be a good fit given its schema-free data model. This makes it easy to keep your
database in step with application changes as you rapidly deploy.

Who is Using Cassandra?
• Twitter is using Cassandra for analytics. Twitter decided against using Cassandra as its primary store for tweets, as originally planned, but instead uses it in production for several different things: real-time analytics, geolocation and places-of-interest data, and data mining over the entire user store.
• Mahalo uses it for its primary near-time data store.
• Facebook still uses it for inbox search, though they are using a proprietary fork.
• Digg uses it for its primary near-time data store.
• Rackspace uses it for its cloud service, monitoring, and logging.
• Reddit uses it as a persistent cache.
• Cloudkick uses it for monitoring statistics and analytics.
• Ooyala uses it to store and serve near real-time video analytics data.
• SimpleGeo uses it as the main data store for its real-time location infrastructure.
• Onespot uses it for a subset of its main data store.
As of this writing, the largest known Cassandra installation is at Facebook, where they have more than 150TB of data on more than 100 machines.

Chapter 2 - Installing Cassandra

Download and extract archive.apache.org/dist/cassandra/0.7.0/apache-cassandra-0.7.0-beta1-bin.tar.gz, which is the version used in this book.
Please note that Cassandra underwent considerable changes between 0.7.0-beta1 (the latest release at the time the book was written) and 1.1.0-beta1 (the latest release at the time this post was written).
So, if you want to follow the exercises in the book without additional customization, you need to download Cassandra 0.7.0-beta1.
You also need ant and the complete JDK, version 1.6.0_20 or better, not just the JRE.
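On a Unix-like system, these first steps might look like the following (a minimal sketch, assuming wget and GNU tar are available and that the JDK and ant are already on your PATH):

> wget http://archive.apache.org/dist/cassandra/0.7.0/apache-cassandra-0.7.0-beta1-bin.tar.gz
> tar -zxvf apache-cassandra-0.7.0-beta1-bin.tar.gz
> cd apache-cassandra-0.7.0-beta1
> java -version     # should report 1.6.0_20 or later
> ant -version      # confirms ant is installed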

For building Cassandra just make sure you’re in the root directory of your source download and execute: > ant -v

You can check out the unit test sources themselves for some useful examples of how to interact with Cassandra.

You can use the gen-thrift-java target from build.xml to generate the Apache Thrift client interface for interacting with the database from Java.
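Invoking the target is just another ant call from the source root (a sketch; check build.xml for where the generated sources are placed):

> ant gen-thrift-java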

To create a Java Archive (JAR) file for distribution, execute the command >ant jar. This will perform a complete build and output a file into the build directory called apache-cassandra-x.x.x.jar.

On Windows
Once you have the binary or the source downloaded and compiled, you’re ready to start the database server. You also might need to set your JAVA_HOME environment variable.
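For example, on Windows you might set it like this before starting the server (the JDK path below is illustrative; point it at your actual JDK install):

> set JAVA_HOME=C:\Program Files\Java\jdk1.6.0_20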

On Linux
The process on Linux is similar to that on Windows: 
Make sure that your JAVA_HOME variable is properly set to version 1.6.0_20 or better. Then, you need to extract the Cassandra gzipped tarball using gunzip. Finally, create a couple of directories for Cassandra to store its data and logs, and give them the proper permissions, as shown here:
ehewitt@morpheus$ cd /home/eben/books/cassandra/dist/apache-cassandra-0.7.0-beta1
ehewitt@morpheus$ sudo mkdir -p /var/log/cassandra
ehewitt@morpheus$ sudo chown -R ehewitt /var/log/cassandra
ehewitt@morpheus$ sudo mkdir -p /var/lib/cassandra
ehewitt@morpheus$ sudo chown -R ehewitt /var/lib/cassandra
Instead of ehewitt, of course, substitute your own username.

Starting the Server
To start the Cassandra server on any OS, open a command prompt or terminal window,
navigate to the <cassandra-directory>/bin where you unpacked Cassandra, and run the
following command to start your server:
eben@morpheus$ bin/cassandra -f

Congratulations! Now your Cassandra server should be up and running with a new
single-node cluster called Test Cluster listening on port 9160.
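To sanity-check that the node is really running, you can ask it for its ring view from a second terminal using nodetool, which ships in the same bin directory (a sketch; flags and output details vary by version):

> bin/nodetool -h localhost ring

You should see your single node listed with status Up.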

Running the Command-Line Client Interface
On Linux, running the command-line interface just works: >bin/cassandra-cli
On Windows, navigate to the Cassandra home directory and open a new terminal in which to run the client process:
>bin\cassandra-cli
It’s possible that on Windows you will see an error like this when starting the client:
Starting Cassandra Client
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/cassandra/cli/CliMain
This probably means that you started Cassandra directly from within the bin directory, so it sets up its Java classpath incorrectly and can’t find the CliMain class to start the client. You can define an environment variable called CASSANDRA_HOME that points to the top-level directory where you have placed or built Cassandra, so you don’t have to pay as much attention to where you’re starting Cassandra from.
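For example (the paths below are illustrative; point the variable at your actual top-level Cassandra directory):

On Windows:  > set CASSANDRA_HOME=C:\cassandra\apache-cassandra-0.7.0-beta1
On Linux:    > export CASSANDRA_HOME=/home/eben/cassandra/apache-cassandra-0.7.0-beta1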

You now have an interactive shell at which you can issue commands:
eben@morpheus$ bin/cassandra-cli
Welcome to cassandra CLI.
Type 'help' or '?' for help. Type 'quit' or 'exit' to quit.
[default@unknown]

Connecting to a Server
To connect to a particular server after you have started Cassandra this way, use the connect command:
eben@morpheus:~/books/cassandra/dist/apache-cassandra-0.7.0-beta1$ bin/cassandra-cli localhost/9160
Welcome to cassandra CLI.
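Alternatively, if you started the client with no arguments, you can connect from inside the shell with the connect command (a sketch; the exact confirmation line may vary by version):

[default@unknown] connect localhost/9160
Connected to: "Test Cluster" on localhost/9160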
Note: In a production environment, be sure to remove the Test Cluster from the configuration.

Creating a Keyspace and Column Family
A Cassandra keyspace is sort of like a relational database.
[default@unknown] create keyspace MyKeyspace with replication_factor=1
ab67bad0-ae2c-11df-b642-e700f669bcfc
[default@unknown] use MyKeyspace
Authenticated to keyspace: MyKeyspace
[default@MyKeyspace] create column family User
991590d3-ae2e-11df-b642-e700f669bcfc
This creates a new column family called “User” in our current keyspace.
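You can double-check what you just created with describe keyspace, which prints the keyspace definition, including its column families (a sketch; the output format varies by version):

[default@MyKeyspace] describe keyspace MyKeyspace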

Writing and Reading Data
Now that we have a keyspace and a column family, we’ll write some data to the database
and read it back out again.
For our purposes here, it’s enough to think of a column family as a multidimensional ordered map that you don’t have to define further ahead of time. Column families hold columns, and columns are the atomic unit of data storage.
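Informally, the row we are about to create can be pictured as a nested map: the column family maps row keys to rows, and each row maps column names to values (the notation below is purely illustrative, not CLI syntax):

User = {                          // column family
    'ehewitt': {                  // row key
        'fname': 'Eben',          // column name and value
        'email': 'me@example.com'
    }
}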
To write a value, use the set command:
[default@MyKeyspace] set User['ehewitt']['fname']='Eben'
Value inserted.
[default@MyKeyspace] set User['ehewitt']['email']='me@example.com'
Value inserted.
Here we have created two columns for the key ehewitt, to store a set of related values.
Now that we know the data is there, let’s read it, using the get command:
[default@MyKeyspace] get User['ehewitt']
=> (column=666e616d65, value=Eben, timestamp=1282510290343000)
=> (column=656d61696c, value=me@example.com, timestamp=1282510313429000)
Returned 2 results.
Don’t be puzzled by the column names in the output: the CLI displays them as raw bytes in hex, and 666e616d65 and 656d61696c are simply the hex encodings of 'fname' and 'email'.
You can delete a column using the del command:
[default@MyKeyspace] del User['ehewitt']['email']
column removed.
We’ll clean up after ourselves by deleting the entire row:
[default@MyKeyspace] del User['ehewitt']
row removed.

By now, you should have some hands-on experience with Cassandra.

If this post has whetted your appetite for the subject, you can find much more information in the book itself: Cassandra: The Definitive Guide.

The excerpt is from the book 'Cassandra: The Definitive Guide', authored by Eben Hewitt, published November 2010 by O’Reilly Media, Copyright 2011 Eben Hewitt.

Tuesday, February 28, 2012

Introduction to Cassandra

Book Review and Excerpt
Cassandra: The Definitive Guide by Eben Hewitt

Eben Hewitt is a software architect, the author of Cassandra: The Definitive Guide, Java SOA Cookbook, and an international speaker on technology.

The book he wrote on Cassandra is well structured, offers hands-on examples, and is a helpful introduction to Cassandra even for beginners in NoSQL databases.
Let's start our journey with Eben Hewitt.

Chapter 1 - Introduction to Cassandra

Eben opens the book with a discussion of relational databases.

Relational databases are very good at solving certain data storage problems, but because of their focus, they also can create problems of their own when it’s time to scale. 
Then, you often need to find a way to get rid of your joins, which means denormalizing the data, which means maintaining multiple copies of data and seriously disrupting your design, both in the database and in your application. 
Further, you almost certainly need to find a way around distributed transactions, which will quickly become a bottleneck. These compensatory actions are not directly supported in any
but the most expensive RDBMS. And even if you can write such a huge check, you still need to carefully choose partitioning keys to the point where you can never entirely ignore the limitation.

As we see some of the limitations of RDBMS and consequently some of the strategies that architects have used to mitigate their scaling issues, a picture slowly starts to emerge. It’s a picture that makes some NoSQL solutions seem perhaps less radical and less scary than we may have thought at first, and more like a natural expression and encapsulation of some of the work that was already being done to manage very large databases.

The question becomes what kind of application would you want to have if durability, elastic scalability, vast storage, and blazing-fast writes weren’t a problem? In a world now working at web scale, Apache Cassandra might be one part of the answer.

Apache Cassandra is an open source, distributed, decentralized, elastically scalable,
highly available, fault-tolerant, tuneably consistent, column-oriented database that bases its distribution design on Amazon’s Dynamo and its data model on Google’s Bigtable. Created at Facebook, it is now used at some of the most popular sites on the Web.

Wow! Quite a statement! 
Let's zoom in for a better understanding of those promises!

Distributed Database
Much of its design and code base is specifically engineered not only to work across many different machines, but also to optimize performance across multiple data center racks, and even to run a single Cassandra cluster across geographically dispersed data centers.
You can confidently write data anywhere in the cluster and Cassandra will get it.

Decentralized Database
Once you start to scale many other data stores (MySQL, Bigtable), some nodes need to be set up as masters in order to organize other nodes, which are set up as slaves.
Cassandra, however, is decentralized, meaning that every node is identical.
It also features a peer-to-peer protocol and uses gossip to maintain and keep in sync a list of nodes that are alive or dead.
The fact that Cassandra is decentralized means that there is no single point of failure.
All of the nodes in a Cassandra cluster function exactly the same.

Elastic Scalability
Scalability is an architectural feature of a system that can continue serving a greater number of requests with little degradation in performance. Vertical scaling—simply adding more hardware capacity and memory to your existing machine—is the easiest way to achieve this. Horizontal scaling means adding more machines that have all or some of the data on them so that no one machine has to bear the entire burden of serving requests.
Elastic scalability refers to a special property of horizontal scalability. It means that your cluster can seamlessly scale up and scale back down. To do this, the cluster must be able to accept new nodes that can begin participating by getting a copy of some or all of the data and start serving new user requests without major disruption or reconfiguration of the entire cluster. You don’t have to restart your process. You don’t have to change your application queries. You don’t have to manually rebalance the data yourself. Just add another machine:  Cassandra will find it and start sending it work.
Scaling down, of course, means removing some of the processing capacity from your
cluster. 

High Availability and Fault Tolerance
Cassandra is highly available. You can replace failed nodes in the cluster with no downtime, and you can replicate data to multiple data centers to offer improved local performance and prevent downtime if one data center experiences a catastrophe such as fire or flood.

Tuneable Consistency
Consistency essentially means that a read always returns the most recently written value.
Cassandra is frequently called “eventually consistent,” which is a bit misleading. Out of the box, Cassandra trades some consistency in order to achieve total availability. But Cassandra is more accurately termed “tuneably consistent,” which means it allows you to easily decide the level of consistency you require, in balance with the level of availability.

When considering consistency, availability, and partition tolerance, we can achieve only two of these goals in a given distributed system. This is the CAP Theorem stated by Eric Brewer.
At the center of the problem is data update replication. To achieve strict consistency, all update operations must be performed synchronously, meaning that they block, locking all replicas until the operation is complete, and forcing competing clients to wait. A side effect of such a design is that during a failure, some of the data will be entirely unavailable.
We could alternatively take an optimistic approach to replication, propagating updates to all replicas in the background in order to avoid blowing up on the client. The difficulty this approach presents is that now we are forced into the situation of detecting and resolving conflicts. A design approach must decide whether to resolve these conflicts at one of two possible times: during reads or during writes. That is, a distributed database designer must choose to make the system either always readable or always writable.
Dynamo and Cassandra choose to be always writable, opting to defer the complexity of reconciliation to read operations, and realize tremendous performance gains.

In Cassandra, consistency is not an all-or-nothing proposition, so we might more accurately term it “tuneable consistency”: the client can control the number of replicas to block on for all updates. To adjust consistency in your own system, look into setting the consistency level against the replication factor.
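As a rough worked example (the mechanics of setting these levels are client-specific and covered later in the book): with a replication factor of N = 3, a QUORUM operation involves floor(3/2) + 1 = 2 replicas. Reading and writing at QUORUM gives R + W = 2 + 2 = 4 > N = 3, so every read is guaranteed to overlap at least one replica that saw the latest write; dropping the write level to ONE trades that guarantee for lower latency and higher availability.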

Column-Oriented Database
Cassandra is frequently referred to as a “column-oriented” database, which is not incorrect. It’s not relational, and it does represent its data structures in sparse multidimensional hashtables. “Sparse” means that for any given row you can have one or more columns, but each row doesn’t need to have all the same columns as other rows like it (as in a relational model). Each row has a unique key, which makes its data accessible. So although it’s not wrong to say that Cassandra is columnar or column-oriented, it might be more helpful to think of it as an indexed, row-oriented store.

Schema-Free

Instead of modeling data up front using expensive data modeling tools and then writing queries with complex join statements, Cassandra asks you to model the queries you want, and then provide the data around them.

High Performance

Cassandra was designed specifically from the ground up to take full advantage of  multiprocessor / multicore machines, and to run across many dozens of these machines housed in multiple data centers. It scales consistently and seamlessly to hundreds of terabytes. Cassandra has been shown to perform exceptionally well under heavy load.

It can consistently show very fast throughput for writes per second on a basic commodity workstation. As you add more servers, you can maintain all of Cassandra’s desirable properties without sacrificing performance.

All this said, you should by now have a big picture of Cassandra's strong points.

If this post has whetted your appetite for the subject, you can find much more information in the book itself: Cassandra: The Definitive Guide.

The excerpt is from the book 'Cassandra: The Definitive Guide', authored by Eben Hewitt, published November 2010 by O’Reilly Media, Copyright 2011 Eben Hewitt.