Book Review and Excerpt
Cassandra: The Definitive Guide by Eben Hewitt
Chapter 5
But Cassandra's architecture is no more interesting than learning how to work with its configuration to improve performance.
Keyspaces
Keyspaces used to be defined statically in an XML configuration file, but as of 0.7, you can use the API to create and manage keyspaces and column families: system_add_keyspace, system_rename_keyspace, system_drop_keyspace, system_add_column_family, system_drop_column_family, system_rename_column_family.
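For example, a minimal sketch of creating a keyspace over the Thrift API might look like the following; it assumes a connected 0.7 Cassandra.Client, the 0.7 strategy class names, and a hypothetical keyspace called MyKeyspace:

import java.util.ArrayList;
import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.CfDef;
import org.apache.cassandra.thrift.KsDef;

public class CreateKeyspaceSketch {
    // "client" is assumed to be an already-connected Cassandra.Client,
    // opened the same way as in the authentication example later in this excerpt.
    static void createKeyspace(Cassandra.Client client) throws Exception {
        KsDef ksDef = new KsDef("MyKeyspace",                       // hypothetical keyspace name
                "org.apache.cassandra.locator.SimpleStrategy",      // replica placement strategy
                1,                                                  // replication factor
                new ArrayList<CfDef>());                            // no column families yet
        client.system_add_keyspace(ksDef);
    }
}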
Creating a Column Family
To create a column family, you also use the CLI or the API. Here are the options available
when creating a column family: column_type (Super or Standard), clock_type (Timestamp),
comparator (AsciiType, BytesType, LexicalUUIDType, LongType, TimeUUIDType, and UTF8Type), subcomparator (for subcolumns), reconciler (Timestamp), comment, rows_cached, preload_row_cache, key_cache_size, read_repair_chance.
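As an illustrative sketch rather than an authoritative recipe, a Standard column family with a UTF8Type comparator could be defined over the same API; the keyspace and column family names here are hypothetical, and the exact CfDef field names vary slightly between 0.7 betas:

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.CfDef;

public class CreateColumnFamilySketch {
    // "client" is assumed to be a connected Cassandra.Client.
    static void createColumnFamily(Cassandra.Client client) throws Exception {
        CfDef cfDef = new CfDef("MyKeyspace", "Hotel");  // hypothetical keyspace and column family
        cfDef.setColumn_type("Standard");                // Standard or Super
        cfDef.setComparator_type("UTF8Type");            // how column names are sorted
        cfDef.setComment("hotel records keyed by hotel id");
        cfDef.setRead_repair_chance(1.0);                // chance of read repair on each read
        client.system_add_column_family(cfDef);
    }
}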
Replicas
Every Cassandra node is a replica for something. If your replication factor is set to N, then each node will act as a replica for N ranges, even if N is set to 1.
A Cassandra cluster, or collection of hosts, is typically referred to as a ring. Each node in the ring is assigned a single, unique token. Each node claims ownership of the range of values from its token to the token of the previous node.
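To make the token arithmetic concrete, here is a small illustrative sketch (not Cassandra code) that assigns evenly spaced initial tokens in a 0 to 2^127 token space, the space used by the default partitioner discussed later, and prints the range each node claims:

import java.math.BigInteger;

public class TokenRangesSketch {
    // Illustrative only: evenly spaced initial tokens, with each node
    // owning the range (previous node's token, its own token].
    public static void main(String[] args) {
        int nodes = 4;
        BigInteger ringSize = BigInteger.valueOf(2).pow(127);
        BigInteger[] tokens = new BigInteger[nodes];
        for (int i = 0; i < nodes; i++) {
            tokens[i] = ringSize.multiply(BigInteger.valueOf(i))
                                .divide(BigInteger.valueOf(nodes));
        }
        for (int i = 0; i < nodes; i++) {
            BigInteger previous = tokens[(i + nodes - 1) % nodes];
            System.out.printf("node %d owns (%s, %s]%n", i, previous, tokens[i]);
        }
    }
}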
When a replica is created, the first replica is always placed on the node claiming the key range of its token. All remaining replicas are distributed according to a configurable replication strategy, which we'll look at now.
Replica Placement Strategies
Choosing the right replication strategy is important because the strategy determines which nodes are responsible for which key ranges. The implication is that you're also determining which nodes should receive which write operations, which can have a big impact on efficiency in different scenarios. If you set up your cluster such that all writes go to two data centers, one in Australia and one in Reston, Virginia, you will see a corresponding performance degradation.
Simple Strategy (or Rack-Unaware Strategy)
This strategy places replicas in a single data center, in a manner that is not aware of their placement on a data center rack. This means that the implementation is theoretically fast, but not necessarily so if the next node that holds the given key is in a different rack than the others. What happens here is that the next N nodes on the ring are chosen to hold replicas, and the strategy has no notion of data centers.
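The ring walk itself is simple enough to sketch; this is an illustration of the idea rather than the actual SimpleStrategy implementation:

import java.util.ArrayList;
import java.util.List;

public class SimpleStrategySketch {
    // Illustrative only: starting at the node that owns the key's token,
    // take the next replicationFactor distinct nodes walking clockwise around the ring.
    static List<String> replicasFor(int ownerIndex, List<String> ring, int replicationFactor) {
        List<String> replicas = new ArrayList<String>();
        for (int i = 0; i < replicationFactor && i < ring.size(); i++) {
            replicas.add(ring.get((ownerIndex + i) % ring.size()));
        }
        return replicas;
    }
}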
Old Network Topology Strategy (or Rack-Aware Strategy)
Say you have two data centers, DC1 and DC2, and a set of Cassandra servers. This strategy will place some replicas in DC1, putting each in a different available rack. It will also ensure that another replica is placed in DC2. The Rack-Aware Strategy is specifically for when you have nodes in the same Cassandra cluster spread over two data centers and you are using a replication factor of 3.
Use this strategy when you want to ensure higher availability at the potential cost of some additional latency while the third node in the alternate data center is contacted.
There is no point in using Rack-Aware Strategy if you are only running Cassandra in a single data center. Cassandra is optimized for running in different data centers, however, and taking advantage of its high availability may be an important consideration for you.
If you use the Rack-Aware Strategy, you must also use the Rack-Aware Snitch (explained in a moment).
Network Topology Strategy
This strategy allows you to specify how replicas should be placed across data centers. To use it, you supply parameters in which you indicate the desired replication strategy for each data center.
This strategy used to employ a file called datacenter.properties. But as of 0.7, this metadata is attached directly to the keyspace and strategy as a map of configuration options.
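As a hedged sketch of what that map of options can look like, assuming the 0.7 Thrift KsDef exposes a strategy_options field (the keyspace and data center names below are hypothetical):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;
import org.apache.cassandra.thrift.CfDef;
import org.apache.cassandra.thrift.KsDef;

public class NetworkTopologyKeyspaceSketch {
    static KsDef buildKsDef() {
        Map<String, String> options = new HashMap<String, String>();
        options.put("DC1", "2");   // two replicas in data center DC1
        options.put("DC2", "1");   // one replica in data center DC2
        KsDef ksDef = new KsDef("MyKeyspace",
                "org.apache.cassandra.locator.NetworkTopologyStrategy",
                3,                              // total replication factor
                new ArrayList<CfDef>());
        ksDef.setStrategy_options(options);     // per-data-center configuration map
        return ksDef;
    }
}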
Replication Factor
The replication factor specifies how many copies of each piece of data will be stored and distributed throughout the Cassandra cluster. It is specified by the replication_factor setting.
It may seem intuitive that the more nodes you have in your cluster, the more you want to increase the replication factor. However, you don’t want to follow this rule of thumb alone; instead, increase the replication factor in order to satisfy your required service level.
With a replication factor of one, your data will exist only in a single node in the cluster. Losing that node means that data becomes unavailable. It also means that Cassandra will have to do more work as coordinator among nodes; if all the data for a given key is on node B, every client request for that key that enters through node A will need to be forwarded.
Cassandra will achieve high consistency when the read replica count plus the write
replica count is greater than the replication factor.
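For example, with a replication factor of 3, quorum writes (2 replicas) combined with quorum reads (2 replicas) satisfy 2 + 2 > 3, while writing and reading at ONE does not. The following illustrative check simply restates that arithmetic:

public class ConsistencyCheck {
    // Illustrative arithmetic only: reads are guaranteed to see the latest write
    // when read replicas plus write replicas exceed the replication factor.
    static boolean stronglyConsistent(int readReplicas, int writeReplicas, int replicationFactor) {
        return readReplicas + writeReplicas > replicationFactor;
    }

    public static void main(String[] args) {
        // RF = 3, quorum writes (2) and quorum reads (2): 2 + 2 > 3, so consistent.
        System.out.println(stronglyConsistent(2, 2, 3));   // true
        // RF = 3, ONE writes (1) and ONE reads (1): 1 + 1 <= 3, so possibly stale.
        System.out.println(stronglyConsistent(1, 1, 3));   // false
    }
}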
So if you have 10 nodes in your cluster, you could set the replication factor to 10 as the maximum. But you might not want to do that, as it robs Cassandra of what it’s good at and stymies its availability, because if a single node goes down, you won’t be able to have high consistency. Instead, set the replication factor to a reasonable number and then tune the consistency level up or down. The consistency level never allows you to write to more than the number of nodes specified by the replication factor. A “reasonable number” here is probably fairly low. ONE seems like the lowest number; however, ANY is similar to ONE but even less consistent, since you might not see the written value until a long time after you wrote it. If any node in the cluster is alive, ANY should succeed.
If you’re new to Cassandra, the replication factor can sometimes be confused with the consistency level. The replication factor is set per keyspace, and is specified in the server’s config file. The consistency level is specified per query, by the client. The replication factor indicates how many nodes you want to use to store a value during each write operation. The consistency level specifies how many nodes the client has decided must respond in order to feel confident of a successful read or write operation. The confusion arises because the consistency level is based on the replication factor, not on the number of nodes in the system.
Increasing the Replication Factor
As your application grows and you need to add nodes, you can increase the replication factor. There are some simple guidelines to follow when you do this. First, keep in mind that you'll have to restart the nodes after a simple increase in the replication factor value. A repair (replica synchronization) will then be performed after you restart the nodes, as Cassandra will have to redistribute some data in order to account for the increased replication factor. For as long as the repair takes, it is possible that some clients will receive a notice that data does not exist if they connect to a replica that doesn't have the data yet.
A faster way of increasing the replication factor from 1 to 2 is to use nodetool. First, execute a drain on the original node to ensure that all the data is flushed to the SSTables. Then, stop that node so it doesn't accept any more writes. Next, copy the datafiles from your keyspaces (the files under the directory that is your value for the DataFileDirectory element in the config). Make sure not to copy the values in the internal Cassandra keyspace. Paste those datafiles to the new node. Change the settings in the configuration of both nodes so that the replication factor is set to 2. Make sure that autobootstrap is set to false on both nodes. Then, restart both nodes and run nodetool repair. These steps reduce the time during which clients might see false empty reads.
Partitioners
The purpose of the partitioner is to allow you to specify how row keys should be sorted, which has a significant impact on how data will be distributed across your nodes. It also has an effect on the options available for querying ranges of rows. There are a few different partitioners you can use.
Note: The choice of partitioner does not apply to column sorting, only row key sorting.
You set the partitioner by updating the value of the Partitioner element in the config file.
Note: Keep in mind that the partitioner can modify the on-disk SSTable representations. So if you change the partitioner type, you will have to delete your data directories.
Random Partitioner
The Random Partitioner uses a BigIntegerToken with an MD5 hash applied to the key to determine where to place the keys on the node ring. This has the advantage of spreading your keys evenly across your cluster, because the distribution is random. It has the disadvantage of causing inefficient range queries, because keys within a specified range might be placed in a variety of disparate locations in the ring, and key range queries will return data in an essentially random order.
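The idea can be sketched as follows; this is an illustration rather than Cassandra's actual implementation, and the row key is hypothetical. Hash the row key with MD5 and interpret the digest as a large integer, which becomes the key's position on the ring:

import java.math.BigInteger;
import java.security.MessageDigest;

public class Md5TokenSketch {
    // Illustrative only: an MD5 digest of the row key interpreted as a
    // non-negative BigInteger, giving an effectively random position on the ring.
    static BigInteger tokenFor(String rowKey) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest(rowKey.getBytes("UTF-8"));
        return new BigInteger(1, digest);   // 1 = treat the digest bytes as positive
    }

    public static void main(String[] args) throws Exception {
        System.out.println(tokenFor("hotel:AZC_043"));  // hypothetical row key
    }
}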
Order-Preserving Partitioner
Using this type of partitioner, the token is a UTF-8 string, based on a key. Rows are therefore stored by key order, aligning the physical structure of the data with your sort order.
This allows you to perform range slices.
It’s worth noting that OPP isn’t more efficient for range queries than random partitioning—
it just provides ordering. It has the disadvantage of creating a ring that is potentially very lopsided, because real-world data typically is not written to evenly. As an example, consider the value assigned to letters in a Scrabble game. Q and Z are rarely used, so they get a high value. With OPP, you’ll likely eventually end up with lots of data on some nodes and much less data on other nodes. The nodes on which lots of data is stored, making the ring lopsided, are often referred to as “hot spots.” Because of the ordering aspect, users are commonly attracted to OPP early on. However, using OPP means that your operations team will need to manually rebalance nodes periodically using Nodetool’s loadbalance or move operations.
If you want to perform range queries from your clients, you must use an order-preserving partitioner or a collating order-preserving partitioner.
Collating Order-Preserving Partitioner
This partitioner orders keys according to a United States English locale (EN_US). Like OPP, it requires that the keys are UTF-8 strings.
This partitioner is rarely employed, as its usefulness is limited.
Byte-Ordered Partitioner
The Byte-Ordered Partitioner is an order-preserving partitioner that treats the data as raw bytes, instead of converting them to strings the way the order-preserving partitioner and collating order-preserving partitioner do. If you need an order-preserving partitioner that doesn't validate your keys as being strings, BOP is recommended for the performance improvement.
Snitches
The job of a snitch is simply to determine relative host proximity. Snitches gather some information about your network topology so that Cassandra can efficiently route requests.
The snitch will figure out where nodes are in relation to other nodes. But inferring data centers is the job of the replication strategy.
You configure which endpoint snitch implementation to use by updating the value for
the <EndPointSnitch> element in the configuration file.
Simple Snitch
If two hosts have the same value in the second octet of their IP addresses, then they are determined to be in the same data center. If two hosts have the same value in the third octet of their IP addresses, then they are determined to be in the same rack. “Determined to be” really means that Cassandra has to guess based on an assumption of how your servers are
located in different VLANs or subnets.
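As an illustration of the heuristic just described (not the snitch's actual source code), the comparison amounts to looking at individual octets of the two addresses:

public class OctetSnitchSketch {
    // Illustrative only: infer proximity from IP octets, as described above.
    // Same second octet -> same data center; same third octet -> same rack.
    static boolean sameDataCenter(String ip1, String ip2) {
        return ip1.split("\\.")[1].equals(ip2.split("\\.")[1]);   // compare second octet
    }

    static boolean sameRack(String ip1, String ip2) {
        return ip1.split("\\.")[2].equals(ip2.split("\\.")[2]);   // compare third octet
    }

    public static void main(String[] args) {
        System.out.println(sameDataCenter("10.20.114.10", "10.20.114.11"));  // true
        System.out.println(sameRack("10.0.0.10", "10.0.1.10"));              // false
    }
}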
PropertyFileSnitch
This snitch helps Cassandra know for certain if two IPs are in the same data center or on the same rack—because you tell it that they are. This is perhaps most useful if you move servers a lot, as operations often need to, or if you have inherited an unwieldy IP scheme.
The default configuration of cassandra-rack.properties looks like this:
# Cassandra Node IP=Data Center:Rack
10.0.0.10=DC1:RAC1
10.0.0.11=DC1:RAC1
10.0.0.12=DC1:RAC2
10.20.114.10=DC2:RAC1
10.20.114.11=DC2:RAC1
10.20.114.15=DC2:RAC2
# default for unknown nodes
default=DC1:r1
Here we see that there are two data centers, each with two racks, which gives Cassandra the information it needs to distribute replicas evenly and with good performance.
Update the values in this file to record each node in your cluster to specify which rack contains the node with that IP and which data center it’s in. Although this may seem difficult to maintain if you expect to add or remove nodes with some frequency, remember that it’s one alternative, and it trades away a little flexibility and ease of maintenance in order to give you more control and better runtime performance, as Cassandra doesn’t have to figure out where nodes are. Instead, you just tell it where they are.
Adding Nodes to a Cluster
Let's add a second node to the ring to share the load.
First, make sure that the nodes in the cluster all have the same cluster name and the same keyspace definitions so that the new node can accept data. Edit the config file on the second node to indicate that the first node will act as the seed.
Then, set autobootstrap to true.
When the second node is started up, it immediately recognizes the first node, but then sleeps for 90 seconds to allow time for the nodes to gossip information about how much data they are storing locally. It then gets the bootstrap token from the first node, so it knows what section of the data it should accept. The bootstrap token is a token that splits the load of the most-loaded node in half. The second node then sleeps for another 30 seconds and starts bootstrapping.
Depending on how much data you have, you could see your new node in this state for some time.
When the transfer is done, we can run nodetool again to make sure that everything is set up properly:
$ bin/nodetool -h 192.168.1.5 ring
Address        Status  Load       Range  Ring
192.168.1.7    Up      229.56 MB  ...    |<--|
192.168.1.5    Up      459.26 MB  ...    |-->|
Multiple Seed Nodes
Cassandra allows you to specify multiple seed nodes. A seed node is used as a contact point for other nodes, so Cassandra can learn the topology of the cluster, that is, what hosts have what ranges.
By default, the configuration file will have only a single seed entry:
seeds:
- 127.0.0.1
To add more seed nodes to your ring, just add another entry to the seeds list.
seeds:
- 192.168.1.5
- 192.168.1.7
Make sure that the autobootstrap element is set to false if you’re using a node’s own IP address as a seed. One approach is to make the first three or five nodes in a cluster seeds, without autobootstrap—that is, by manually choosing tokens or by allowing Cassandra to pick random tokens. The rest of your nodes will not be seeds, so they can be added using autobootstrap.
Next, we need to update the listen address. This is how nodes identify each other, and it’s used for all internal communication.
listen_address: 192.168.1.5
Finally, we need to change the rpc_address because this is how clients will connect to this node.
rpc_address: 192.168.1.5
The rpc_address setting is used only for connections made directly from clients to Cassandra nodes.
Now you can restart Cassandra and start the installations on the other machines.
Dynamic Ring Participation
Nodes in a Cassandra cluster can be brought down and brought back up without disruption to the rest of the cluster (assuming a reasonable replication factor and consistency level).
Security
The authenticator that’s plugged in by default is the org.apache.cassandra.auth.AllowAllAuthenticator. If you want to force clients to provide credentials, another alternative ships with Cassandra, the org.apache.cassandra.auth.SimpleAuthenticator.
There are two files included in the config directory:
access.properties - specifies which users are allowed access to which keyspaces:
Keyspace1=jsmith,Elvis Presley,dilbert
passwd.properties - contains a list of these users and specifies the password for each of them:
jsmith=havebadpass
Elvis\ Presley=graceland4evar
dilbert=nomoovertime
To use the Simple Authenticator, replace the value for the authenticator element in cassandra.yaml.
We also have to tell Cassandra the location of the access and passwords files, using the bin/cassandra.in.sh include script.
JVM_OPTS="
-Dpasswd.properties=/home/eben/books/cassandra/dist/apache-cassandra-0.7.0-beta1/conf/passwd.properties
-Daccess.properties=/home/eben/books/cassandra/dist/apache-cassandra-0.7.0-beta1/conf/access.properties"
Now we can log into the command-line interface with a username and password combination:
[default@unknown] use Keyspace1 jsmith 'havebadpass'
We can also specify the username and password combination when we connect on the CLI:
eben@morpheus: bin/cassandra-cli --host localhost --port 9160 --username jsmith --password havebadpass --keyspace Keyspace1
If you have set up authentication on your keyspace, your client application code will need to log in.
import java.util.*;
import org.apache.cassandra.thrift.*;
import org.apache.thrift.protocol.*;
import org.apache.thrift.transport.*;

// Open a framed Thrift connection to the local node
TTransport tr = new TSocket("localhost", 9160);
TFramedTransport tf = new TFramedTransport(tr);
TProtocol proto = new TBinaryProtocol(tf);
Cassandra.Client client = new Cassandra.Client(proto);
tr.open();

// Build the credentials map expected by SimpleAuthenticator
AuthenticationRequest authRequest = new AuthenticationRequest();
Map<String, String> credentials = new HashMap<String, String>();
credentials.put("username", "jsmith");
credentials.put("password", "havebadpass");
authRequest.setCredentials(credentials);

// Select the keyspace and log in against it
client.set_keyspace("Keyspace1");
AccessLevel access = client.login(authRequest);
Note: The SimpleAuthenticator class has two modes for specifying passwords: plain text and MD5 encrypted.
You enable MD5 in the cassandra.in.sh file by passing the passwd.mode switch to the JVM:
JVM_OPTS=" \
-da \
//other stuff...
-Dpasswd.mode=MD5"
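In MD5 mode, the passwords in passwd.properties are stored as MD5 digests rather than plain text. As a hedged sketch, assuming a hex-encoded digest is what SimpleAuthenticator expects (check the documentation for your exact version), you could generate a digest like this:

import java.security.MessageDigest;

public class Md5PasswordSketch {
    // Hedged sketch: produce a hex-encoded MD5 digest of a password.
    // Whether SimpleAuthenticator expects exactly this encoding is an assumption.
    static String md5Hex(String password) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(password.getBytes("UTF-8"));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("jsmith=" + md5Hex("havebadpass"));
    }
}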
You can also provide your own method of authenticating to Cassandra by implementing the IAuthenticator interface.
There are a few other settings that you may want to investigate further: column_index_size_in_kb, in_memory_compaction_limit_in_mb, gc_grace_seconds, and phi_convict_threshold.
Viewing Keys
You can see which keys are in your SSTables using a script called sstablekeys.
eben@morpheus$ bin/sstablekeys /var/lib/cassandra/data/Hotelier/Hotel-1-Data.db
Starting with version 0.7, the user keyspaces defined in cassandra.yaml are not loaded by default when the server starts.
In order to load them, you need to invoke the loadSchemaFromYaml JMX operation.
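You can invoke the operation from JConsole, or programmatically over JMX. The following is a hedged sketch that assumes the StorageService MBean name org.apache.cassandra.db:type=StorageService and the 0.7 default JMX port of 8080; verify the exact operation name and capitalization in JConsole before relying on it:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class LoadSchemaSketch {
    // Hedged sketch: invoke the schema-loading operation over JMX.
    // The MBean name, port, and operation name are assumptions to verify.
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:8080/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        MBeanServerConnection connection = connector.getMBeanServerConnection();
        ObjectName storageService = new ObjectName("org.apache.cassandra.db:type=StorageService");
        connection.invoke(storageService, "loadSchemaFromYaml", new Object[0], new String[0]);
        connector.close();
    }
}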
Note: If you need to import and export data, Cassandra comes with two scripts for working with JSON: one that imports JSON into Cassandra SSTables, and another that exports existing SSTables into JSON.
Note: If you prefer to use a graphical interface instead of the CLI, you can get information regarding your Cassandra setup using Chiton, which is a Python GTK-based Cassandra
browser written by Brandon Williams. You can find it at http://github.com/driftx/chiton.
In this chapter we looked at how to configure Cassandra. We saw how to set the replication factor and the snitch, and how to use a good replica placement strategy.
If this post has whetted your appetite for the subject, you can find much more information in the book itself: Cassandra: The Definitive Guide.
The excerpt is from the book 'Cassandra: The Definitive Guide', authored by Eben Hewitt, published November 2010 by O'Reilly Media, Copyright 2011 Eben Hewitt.