Wednesday, February 29, 2012

Installing Cassandra

Book Review and Excerpt
Cassandra: The Definitive Guide by Eben Hewitt
Chapter 2

My previous post on Cassandra gave you an overview of Cassandra's benefits.
So, when is Cassandra a good choice?

Lots of Writes, Statistics, and Analysis
Consider your application from the perspective of the ratio of reads to writes. Cassandra
is optimized for excellent throughput on writes.
Many of the early production deployments of Cassandra involve storing user activity updates, social network usage, recommendations/reviews, and application statistics.
These are strong use cases for Cassandra because they involve lots of writing with less
predictable read operations, and because updates can occur unevenly with sudden spikes. In fact, the ability to handle application workloads that require high performance at significant write volumes with many concurrent client threads is one of the primary features of Cassandra.
According to the project wiki, Cassandra has been used to create a variety of applications,
including a windowed time-series store, an inverted index for document searching, and a distributed job priority queue.

Geographical Distribution
If you have a globally deployed application that could see a performance benefit from putting the data near the user, Cassandra could be a great fit.

Evolving Applications
If your application is evolving rapidly and you’re in “startup mode,” Cassandra might
be a good fit given its schema-free data model. This makes it easy to keep your
database in step with application changes as you rapidly deploy.

Who is Using Cassandra?
• Twitter uses Cassandra for analytics. Twitter decided against using Cassandra as its primary store for tweets, as originally planned, and instead uses it in production for several different things: real-time analytics, geolocation and places-of-interest data, and data mining over the entire user store.
• Mahalo uses it for its primary near-time data store.
• Facebook still uses it for inbox search, though they are using a proprietary fork.
• Digg uses it for its primary near-time data store.
• Rackspace uses it for its cloud service, monitoring, and logging.
• Reddit uses it as a persistent cache.
• Cloudkick uses it for monitoring statistics and analytics.
• Ooyala uses it to store and serve near real-time video analytics data.
• SimpleGeo uses it as the main data store for its real-time location infrastructure.
• Onespot uses it for a subset of its main data store.
As of this writing, the largest known Cassandra installation is at Facebook, where they have more than 150TB of data on more than 100 machines.

Chapter 02 – Installing Cassandra

Download and unzip archive.apache.org/dist/cassandra/0.7.0/apache-cassandra-0.7.0-beta1-bin.tar.gz, which is the version used for this book.
Please note that Cassandra changed considerably between 0.7.0-beta1 (the latest release at the time the book was written) and 1.1.0-beta1 (the latest release at the time this post was written).
So, if you want to follow the exercises in this book without additional customizations, you need to download Cassandra 0.7.0-beta1.
You also need ant and the complete JDK, version 1.6.0_20 or better, not just the JRE.
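
The download step above could be scripted roughly as follows. This is only a sketch and not part of the book: it assumes curl and tar are installed, and the https:// scheme and the fetch_cassandra function name are my own additions.

```shell
# Version used throughout the book's exercises.
CASSANDRA_VERSION="0.7.0-beta1"
ARCHIVE="apache-cassandra-${CASSANDRA_VERSION}-bin.tar.gz"
# Assumption: the archive host is reachable over https.
URL="https://archive.apache.org/dist/cassandra/0.7.0/${ARCHIVE}"

# Hypothetical helper: download the tarball into the current directory
# and unpack it there. Nothing happens until the function is called.
fetch_cassandra() {
    curl -fLO "$URL" && tar -xzf "$ARCHIVE"
}
```

Calling fetch_cassandra from the directory where you want Cassandra to live performs the download and the extraction in one go.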

To build Cassandra, make sure you’re in the root directory of your source download and execute: > ant -v
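
Before running the build, it may help to confirm that the required tools are actually on your PATH. The check below is a minimal sketch; check_tool is a name I made up for illustration. Looking for javac rather than java is deliberate, since the compiler ships with the JDK but not with a bare JRE.

```shell
# Hypothetical helper: report whether a command is available on the PATH.
# Returns 0 when the command is found and 1 when it is missing.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}
```

Running check_tool javac and check_tool ant before the build shows which prerequisite, if any, is missing. It does not check the JDK version, so you still need to confirm that yourself.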

You can check out the unit test sources themselves for some useful examples of how to interact with Cassandra.

You can use the gen-thrift-java target from build.xml to generate the Apache Thrift client interface for interacting with the database in Java.

To create a Java Archive (JAR) file for distribution, execute the command >ant jar. This will perform a complete build and output a file into the build directory called apache-cassandra-x.x.x.jar.

On Windows
Once you have the binary or the source downloaded and compiled, you’re ready to start the database server. You also might need to set your JAVA_HOME environment variable.

On Linux
The process on Linux is similar to that on Windows: 
Make sure that your JAVA_HOME variable is properly set to version 1.6.0_20 or better. Then, you need to extract the Cassandra gzipped tarball using gunzip. Finally, create a couple of directories for Cassandra to store its data and logs, and give them the proper permissions, as shown here:
ehewitt@morpheus$ cd /home/eben/books/cassandra/dist/apache-cassandra-0.7.0-beta1
ehewitt@morpheus$ sudo mkdir -p /var/log/cassandra
ehewitt@morpheus$ sudo chown -R ehewitt /var/log/cassandra
ehewitt@morpheus$ sudo mkdir -p /var/lib/cassandra
ehewitt@morpheus$ sudo chown -R ehewitt /var/lib/cassandra
Instead of ehewitt, of course, substitute your own username.
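
If the server later complains about permissions, a quick check like the one below can confirm that the directories exist and are writable by the current user. This is a sketch of my own, not from the book, and check_writable is a hypothetical name.

```shell
# Hypothetical helper: verify that each given path is an existing
# directory the current user can write to.
# Returns 0 only if every directory passes.
check_writable() {
    status=0
    for dir in "$@"; do
        if [ -d "$dir" ] && [ -w "$dir" ]; then
            echo "ok: $dir"
        else
            echo "not writable: $dir"
            status=1
        fi
    done
    return "$status"
}
```

For the directories created above, you would run check_writable /var/log/cassandra /var/lib/cassandra as the user that will start Cassandra.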

Starting the Server
To start the Cassandra server on any OS, open a command prompt or terminal window,
navigate to the <cassandra-directory> where you unpacked Cassandra, and run the
following command to start your server:
eben@morpheus$ bin/cassandra -f

Congratulations! Now your Cassandra server should be up and running with a new
single node cluster called Test Cluster listening on port 9160.

Running the Command-Line Client Interface
On Linux, running the command-line interface just works: >bin/cassandra-cli
On Windows, navigate to the Cassandra home directory and open a new terminal in which to run our client process:
>bin\cassandra-cli
It’s possible that on Windows you will see an error like this when starting the client:
Starting Cassandra Client
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/cassandra/cli/CliMain
This probably means that you started Cassandra directly from within the bin directory, and it therefore sets up its Java classpath incorrectly and can’t find the CliMain file to start the client. You can define an environment variable called CASSANDRA_HOME that points to the top-level directory where you have placed or built Cassandra, so you don’t have to pay as much attention to where you’re starting Cassandra from.

You now have an interactive shell at which you can issue commands:
eben@morpheus$ bin/cassandra-cli
Welcome to cassandra CLI.
Type 'help' or '?' for help. Type 'quit' or 'exit' to quit.
[default@unknown]

Connecting to a Server
To connect to a particular server after you have started Cassandra this way, use the connect command:
eben@morpheus:~/books/cassandra/dist/apache-cassandra-0.7.0-beta1$ bin/cassandra-cli localhost/9160
Welcome to cassandra CLI.
Note: In a production environment, be sure to remove the Test Cluster from the configuration.

Creating a Keyspace and Column Family
A Cassandra keyspace is sort of like a relational database.
[default@unknown] create keyspace MyKeyspace with replication_factor=1
ab67bad0-ae2c-11df-b642-e700f669bcfc
[default@unknown] use MyKeyspace
Authenticated to keyspace: MyKeyspace
[default@MyKeyspace] create column family User
991590d3-ae2e-11df-b642-e700f669bcfc
This creates a new column family called “User” in our current keyspace.

Writing and Reading Data
Now that we have a keyspace and a column family, we’ll write some data to the database
and read it back out again.
For our purposes here, it’s enough to think of a column family as a multidimensional ordered map that you don’t have to define further ahead of time. Column families hold columns, and columns are the atomic unit of data storage.
To write a value, use the set command:
[default@MyKeyspace] set User['ehewitt']['fname']='Eben'
Value inserted.
[default@MyKeyspace] set User['ehewitt']['email']='me@example.com'
Value inserted.
Here we have created two columns for the key ehewitt, to store a set of related values.
Now that we know the data is there, let’s read it, using the get command:
[default@MyKeyspace] get User['ehewitt']
=> (column=666e616d65, value=Eben, timestamp=1282510290343000)
=> (column=656d61696c, value=me@example.com, timestamp=1282510313429000)
Returned 2 results.
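
The column names in this output (666e616d65 and 656d61696c) look unfamiliar because the client prints them as hexadecimal bytes rather than as text. To check that they really are the names we set, a small helper like the one below can decode them. It is a sketch of my own in portable shell, hex_to_text is a hypothetical name, and it assumes the names are plain ASCII.

```shell
# Hypothetical helper: decode a string of hex digit pairs into text.
# Variables are not function-local, because POSIX sh has no "local".
hex_to_text() {
    hex="$1"
    # Reject anything that is not hex digits, and odd-length input.
    case "$hex" in
        *[!0-9a-fA-F]*) return 1 ;;
    esac
    [ $(( ${#hex} % 2 )) -eq 0 ] || return 1
    out=""
    while [ -n "$hex" ]; do
        rest="${hex#??}"        # everything after the first two digits
        pair="${hex%"$rest"}"   # the first two digits
        # Turn the hex pair into an octal escape that printf understands.
        out="${out}$(printf "\\$(printf '%03o' "0x$pair")")"
        hex="$rest"
    done
    printf '%s\n' "$out"
}
```

Running hex_to_text 666e616d65 prints fname, and hex_to_text 656d61696c prints email, which matches the two columns written with the set command.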
You can delete a column using the del command:
[default@MyKeyspace] del User['ehewitt']['email']
column removed.
We’ll clean up after ourselves by deleting the entire row:
[default@MyKeyspace] del User['ehewitt']
row removed.

By now, you should have some hands-on experience with Cassandra.

If this post has whetted your appetite for the subject, you can find much more information in the book itself: Cassandra: The Definitive Guide

The excerpt is from the book 'Cassandra: The Definitive Guide', authored by Eben Hewitt, published November 2010 by O’Reilly Media, Copyright 2011 Eben Hewitt.
