Book Review and Excerpt
Cassandra: The Definitive Guide by Eben Hewitt
Chapter 2
My previous post on Cassandra gave you an overview of Cassandra's benefits. So, when is Cassandra a good choice?
Lots of Writes, Statistics, and Analysis
Consider your application from the perspective of the ratio of reads to writes. Cassandra is optimized for excellent throughput on writes.
Many of the early production deployments of Cassandra involve storing user activity updates, social network usage, recommendations/reviews, and application statistics. These are strong use cases for Cassandra because they involve lots of writing with less predictable read operations, and because updates can occur unevenly with sudden spikes. In fact, the ability to handle application workloads that require high performance at significant write volumes with many concurrent client threads is one of the primary features of Cassandra.
According to the project wiki, Cassandra has been used to create a variety of applications, including a windowed time-series store, an inverted index for document searching, and a distributed job priority queue.
Geographical Distribution
If you have a globally deployed application that could see a performance benefit from putting the data near the user, Cassandra could be a great fit.
Evolving Applications
If your application is evolving rapidly and you’re in “startup mode,” Cassandra might be a good fit given its schema-free data model. This makes it easy to keep your database in step with application changes as you rapidly deploy.
Who is Using Cassandra?
• Twitter is using Cassandra for analytics. Twitter decided against using Cassandra as its primary store for tweets, as originally planned, but would instead use it in production for several different things: real-time analytics, geolocation and places-of-interest data, and data mining over the entire user store.
• Mahalo uses it for its primary near-time data store.
• Facebook still uses it for inbox search, though they are using a proprietary fork.
• Digg uses it for its primary near-time data store.
• Rackspace uses it for its cloud service, monitoring, and logging.
• Reddit uses it as a persistent cache.
• Cloudkick uses it for monitoring statistics and analytics.
• Ooyala uses it to store and serve near real-time video analytics data.
• SimpleGeo uses it as the main data store for its real-time location infrastructure.
• Onespot uses it for a subset of its main data store.
As of this writing, the largest known Cassandra installation is at Facebook, where they have more than 150TB of data on more than 100 machines.
Chapter 02 – Installing Cassandra
Download and unzip archive.apache.org/dist/cassandra/0.7.0/apache-cassandra-0.7.0-beta1-bin.tar.gz, which is the version used for this book.
Please note that Cassandra underwent considerable changes between 0.7.0-beta1 (the latest release at the time the book was written) and 1.1.0-beta1 (the latest release at the time this post was written). So, if you want to follow the exercises in this book without additional customizations, you need to download Cassandra 0.7.0-beta1.
You also need Ant and the complete JDK, version 1.6.0_20 or better, not just the JRE.
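Before building, it can help to confirm that the build tools are actually on your PATH. The following is a minimal sketch, not part of the book: the function name missing_tools is hypothetical, and it assumes that finding javac (which ships with the JDK but not the JRE) and ant on the PATH is a good enough check. It does not verify the JDK version.

```python
import shutil


def missing_tools(tools=("javac", "ant")):
    """Return the names of the given executables that are not found on PATH.

    javac is used as a stand-in for "full JDK installed", since the JRE
    alone does not ship a compiler. This is an assumption of the sketch.
    """
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("javac and ant were both found on PATH.")
```

An empty list means both tools were found; anything else names what still needs to be installed or added to the PATH.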
To build Cassandra, make sure you’re in the root directory of your source download and execute: > ant -v
You can check out the unit test sources themselves for some useful examples of how to interact with Cassandra.
You can use the gen-thrift-java target from build.xml to generate the Apache Thrift client interface for interacting with the database in Java.
To create a Java Archive (JAR) file for distribution, execute the command > ant jar. This will perform a complete build and output a file into the build directory called apache-cassandra-x.x.x.jar.
On Windows
Once you have the binary or the source downloaded and compiled, you’re ready to start the database server. You also might need to set your JAVA_HOME environment variable.
On Linux
The process on Linux is similar to that on Windows. Make sure that your JAVA_HOME variable is properly set to version 1.6.0_20 or better. Then, extract the Cassandra gzipped tarball using gunzip and tar. Finally, create a couple of directories for Cassandra to store its data and logs, and give them the proper permissions, as shown here:
ehewitt@morpheus$ cd /home/eben/books/cassandra/dist/apache-cassandra-0.7.0-beta1
ehewitt@morpheus$ sudo mkdir -p /var/log/cassandra
ehewitt@morpheus$ sudo chown -R ehewitt /var/log/cassandra
ehewitt@morpheus$ sudo mkdir -p /var/lib/cassandra
ehewitt@morpheus$ sudo chown -R ehewitt /var/lib/cassandra
Instead of ehewitt, of course, substitute your own username.
Starting the Server
To start the Cassandra server on any OS, open a command prompt or terminal window, navigate to the <cassandra-directory> where you unpacked Cassandra, and run the following command to start your server:
eben@morpheus$ bin/cassandra -f
Congratulations! Now your Cassandra server should be up and running with a new single node cluster called Test Cluster listening on port 9160.
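If you want to confirm that something is really listening on that port before moving on, a plain TCP connection attempt is enough. This is only a sketch: the function name is_port_open is hypothetical, and the host localhost is an assumption for a single-node setup on your own machine. It checks that the port accepts connections, not that the service behind it is Cassandra.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and connects; the context
        # manager closes the socket again as soon as we know it worked.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


if __name__ == "__main__":
    # 9160 is the port named in the text above; localhost is assumed.
    print(is_port_open("localhost", 9160))
```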
Running the Command-Line Client Interface
On Linux, running the command-line interface just works: > bin/cassandra-cli
On Windows, navigate to the Cassandra home directory and open a new terminal in which to run our client process:
> bin\cassandra-cli
It’s possible that on Windows you will see an error like this when starting the client:
Starting Cassandra Client
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/cassandra/cli/CliMain
This probably means that you started Cassandra directly from within the bin directory, and it therefore sets up its Java classpath incorrectly and can’t find the CliMain file to start the client. You can define an environment variable called CASSANDRA_HOME that points to the top-level directory where you have placed or built Cassandra, so you don’t have to pay as much attention to where you’re starting Cassandra from.
You now have an interactive shell at which you can issue commands:
eben@morpheus$ bin/cassandra-cli
Welcome to cassandra CLI.
Type 'help' or '?' for help. Type 'quit' or 'exit' to quit.
[default@unknown]
Connecting to a Server
To connect to a particular server after you have started Cassandra this way, use the connect command:
eben@morpheus:~/books/cassandra/dist/apache-cassandra-0.7.0-beta1$ bin/cassandra-cli localhost/9160
Welcome to cassandra CLI.
Note: In a production environment, be sure to remove the Test Cluster from the configuration.
Creating a Keyspace and Column Family
A Cassandra keyspace is sort of like a relational database.
[default@unknown] create keyspace MyKeyspace with replication_factor=1
ab67bad0-ae2c-11df-b642-e700f669bcfc
[default@unknown] use MyKeyspace
Authenticated to keyspace: MyKeyspace
[default@MyKeyspace] create column family User
991590d3-ae2e-11df-b642-e700f669bcfc
This creates a new column family called “User” in our current keyspace.
Writing and Reading Data
Now that we have a keyspace and a column family, we’ll write some data to the database and read it back out again.
For our purposes here, it’s enough to think of a column family as a multidimensional ordered map that you don’t have to define further ahead of time. Column families hold columns, and columns are the atomic unit of data storage.
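To make the "multidimensional map" picture concrete, here is a toy in-memory model. It is purely illustrative and is not Cassandra's API: the class ColumnFamily and its methods are hypothetical names, and sorting columns by name on read is an assumption of the sketch that may not match the order the CLI prints.

```python
import time


class ColumnFamily:
    """Toy model: row key -> {column name -> (value, timestamp)}."""

    def __init__(self, name):
        self.name = name
        self.rows = {}

    def set(self, key, column, value):
        # Rows and columns spring into existence on first write;
        # nothing has to be declared ahead of time.
        timestamp = int(time.time() * 1_000_000)  # microseconds since the epoch
        self.rows.setdefault(key, {})[column] = (value, timestamp)

    def get(self, key):
        # Return (column, value, timestamp) triples, sorted by column name.
        row = self.rows.get(key, {})
        return [(col, *row[col]) for col in sorted(row)]

    def delete(self, key, column=None):
        # With a column name, remove that one column; without, the whole row.
        if column is None:
            self.rows.pop(key, None)
        else:
            self.rows.get(key, {}).pop(column, None)


user = ColumnFamily("User")
user.set("ehewitt", "fname", "Eben")
user.set("ehewitt", "email", "me@example.com")
for column, value, timestamp in user.get("ehewitt"):
    print(column, value, timestamp)
```

Each column carries its own timestamp, which is why a timestamp appears next to every value when you read a row back from the real database.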
To write a value, use the set command:
[default@MyKeyspace] set User['ehewitt']['fname']='Eben'
Value inserted.
[default@MyKeyspace] set User['ehewitt']['email']='me@example.com'
Value inserted.
Here we have created two columns for the key ehewitt, to store a set of related values.
Now that we know the data is there, let’s read it, using the get command:
[default@MyKeyspace] get User['ehewitt']
=> (column=666e616d65, value=Eben, timestamp=1282510290343000)
=> (column=656d61696c, value=me@example.com, timestamp=1282510313429000)
Returned 2 results.
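The column names in this output are printed as hexadecimal bytes rather than as the strings we typed. A short sketch shows how to read them. The helper names decode_column_name and timestamp_to_utc are hypothetical, and treating the timestamps as microseconds since the Unix epoch is an interpretation of the values shown, not something the excerpt states.

```python
from datetime import datetime, timedelta, timezone


def decode_column_name(hex_name):
    """Turn a hex-encoded column name from the CLI output back into text."""
    return bytes.fromhex(hex_name).decode("utf-8")


def timestamp_to_utc(micros):
    """Interpret a timestamp as microseconds since the Unix epoch (assumed)."""
    seconds, remainder = divmod(micros, 1_000_000)
    whole = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return whole + timedelta(microseconds=remainder)


print(decode_column_name("666e616d65"))  # fname
print(decode_column_name("656d61696c"))  # email
print(timestamp_to_utc(1282510290343000).isoformat())
```

Read this way, the two hex strings are simply the fname and email column names from the set commands above.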
You can delete a column using the del command:
[default@MyKeyspace] del User['ehewitt']['email']
column removed.
We’ll clean up after ourselves by deleting the entire row:
[default@MyKeyspace] del User['ehewitt']
row removed.
By now, you should have some hands-on experience with Cassandra.
If this post has whetted your appetite for the subject, you can find much more information in the book itself: Cassandra: The Definitive Guide
The excerpt is from the book 'Cassandra: The Definitive Guide', authored by Eben Hewitt, published November 2010 by O’Reilly Media, Copyright 2011 Eben Hewitt.