Monday, March 19, 2012

Performance Tuning in Cassandra

Book Review and Excerpt
Cassandra: The Definitive Guide by Eben Hewitt
Chapter 11



New goodies for today! Performance is always a challenging subject, and with Cassandra we target performance in the context of hundreds of terabytes :)
Let's go!

As a general rule, it’s important to note that simply adding nodes to a cluster will not improve performance on its own. You need to replicate the data appropriately, then send traffic to all the nodes from your clients. If you aren’t distributing client requests, the new nodes could just stand by somewhat idle.

Data Storage
There are two sets of files that Cassandra writes to as part of handling update operations: the commit log and the datafile. Their different purposes need to be considered in order to understand how to treat them during configuration.
The commit log can be thought of as short-term storage. As Cassandra receives updates, every write value is written immediately to the commit log in the form of raw sequential file appends. If you shut down the database or it crashes unexpectedly, the commit log can ensure that data is not lost. That’s because the next time you start the node, the commit log gets replayed. In fact, that’s the only time the commit log is read; clients never read from it. But the normal write operation to the commit log blocks, so it would damage performance to require clients to wait for the write to finish.

The datafile represents the Sorted String Tables (SSTables). Unlike the commit log, data is written to this file asynchronously. The SSTables are periodically merged during major compactions to free up space. To do this, Cassandra will merge keys, combine columns, and delete tombstones.
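If you want to see this merge happen rather than wait for it, you can trigger a major compaction by hand with nodetool (the host below is a placeholder, and the exact flags vary slightly between versions):
$ bin/nodetool -h 192.168.1.5 compact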
Read operations can refer to the in-memory cache and in this case don’t need to go directly to the datafiles on disk. If you can allow Cassandra a few gigabytes of memory, you can improve performance dramatically when the row cache and the key cache are hit.
The commit logs are periodically removed, following a successful flush of all their appended data to the dedicated datafiles. For this reason, the commit logs will not grow to anywhere near the size of the datafiles, so the disks don’t need to be as large; this is something to consider during hardware selection.
If Cassandra runs a flush, you’ll see something in the server logs like this:
INFO 18:26:11,732 Discarding obsolete commit log: CommitLogSegment(/var/lib/cassandra/commitlog/CommitLog-1278894011530.log)
Then, if you check the commit log directory, that file has been deleted.
By default, the commit log and the datafile are stored in the following locations:
<CommitLogDirectory>/var/lib/cassandra/commitlog</CommitLogDirectory>
<DataFileDirectories>
 <DataFileDirectory>/var/lib/cassandra/data</DataFileDirectory>
</DataFileDirectories>
You can change these values to store the datafiles or commit log in different locations.
You can specify multiple datafile directories if you wish.
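If you are using the YAML-based configuration introduced in 0.7, the equivalent cassandra.yaml entries look roughly like this (the paths shown are the same defaults):
commitlog_directory: /var/lib/cassandra/commitlog
data_file_directories:
    - /var/lib/cassandra/data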

Note: You don’t need to update these values for Windows, even if you leave them in the default location, because Windows will automatically adjust the path separator and place them under C:\. Of course, in a real environment, it’s a good idea to specify them separately, as indicated.

It’s recommended that you store the datafiles and the commit logs on separate hard disks for maximum performance. Cassandra, like many databases, is particularly dependent on the speed of the hard disk and the speed of the CPUs (it’s best to have four or eight cores, to take advantage of Cassandra’s highly concurrent construction). So for QA and production environments, make sure to get the fastest disks you can, and get at least two separate ones so that the commit logfiles and the datafiles are not competing for I/O time. It’s more important to have several processing cores than one or two very fast ones.

Reply Timeout
The reply timeout is a setting that indicates how long Cassandra will wait for other nodes to respond before deciding that the request is a failure. This is a common setting in relational databases and messaging systems. This value is set by the RpcTimeoutInMillis element (rpc_timeout_in_ms in YAML). By default, this is 5,000, or five seconds.
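For reference, that is a single element in the XML configuration, or a single line in YAML (5,000 being the default noted above):
<RpcTimeoutInMillis>5000</RpcTimeoutInMillis>
rpc_timeout_in_ms: 5000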
Commit Logs
You can set the value for how large the commit log is allowed to grow before it stops appending new writes to a file and creates a new one. This is similar to setting log rotation on Log4J.
This value is set with the CommitLogRotationThresholdInMB element (commitlog_rotation_threshold_in_mb in YAML). By default, the value is 128MB.
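For example, to let each commit log segment grow to 256MB before rolling over, the YAML form would be as follows (256 is an illustrative value, not a recommendation):
commitlog_rotation_threshold_in_mb: 256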
Another setting related to commit logs is the sync operation, represented by the commitlog_sync element. There are two possible settings for this: periodic and batch. periodic is the default, and it means that the server will make writes durable only at specified intervals. When the server is set to make writes durable periodically, you can potentially lose the data that has not yet been synced to disk from the write-behind cache.
In order to guarantee durability for your Cassandra cluster, you may want to examine this setting.
If your commit log is set to batch, it will block until the write is synced to disk (Cassandra will not acknowledge write operations until the commit log has been completely synced to disk). This clearly will have a negative impact on performance.
You can change the value of the configuration attribute from periodic to batch to specify that Cassandra must flush to disk before it acknowledges a write. Changing this value will require taking some performance metrics, as there is a necessary trade-off here: forcing Cassandra to write more immediately constrains its freedom to manage its own resources. If you do set commitlog_sync to batch, you need to provide a suitable value for CommitLogSyncBatchWindowInMS, where MS is the number of milliseconds between each sync effort. Moreover, this is not generally needed in a multinode cluster when using write replication, because replication by definition means that the write isn’t acknowledged until another node has it.
If you decide to use batch mode, you will probably want to split the commit log onto a separate disk from the SSTables (data) to mitigate the performance impact.
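As a minimal sketch, batch mode in cassandra.yaml pairs the sync mode with its batch window (the 50ms window here is illustrative only; pick a value based on your own durability and latency measurements):
commitlog_sync: batch
commitlog_sync_batch_window_in_ms: 50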

Memtables
Each column family has a single memtable associated with it. There are a few settings around the treatment of memtables. The size that the memtable can grow to before it is flushed to disk as an SSTable is specified with the MemtableSizeInMB element (binary_memtable_throughput_in_mb in YAML). Note that this value is based on the size of the memtable itself in memory, and not heap usage, which will be larger because of the overhead associated with column indexing.
You’ll want to balance this setting with MemtableObjectCountInMillions (memtable_operations_in_millions in YAML), which sets a threshold for the number of column values that will be stored in a memtable before it is flushed. The default value is 0.3, which is approximately 333,000 columns.

You can also configure how long to keep memtables in memory after they’ve been flushed to disk. This value can be set with the memtable_flush_after_mins element.
When the flush is performed, it will write to a flush buffer, and you can configure the size of that buffer with flush_data_buffer_size_in_mb.
Another element related to tuning the memtables is memtable_flush_writers. This setting, which is 1 by default, indicates the number of threads used to write out the memtables when it becomes necessary. If you have a very large heap, it can improve performance to set this count higher, as these threads are blocked during disk I/O.
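Pulling these together, a sketch of the memtable-related entries might look like the following (the values are illustrative rather than recommendations, and depending on your version some of these settings are global while others are declared per column family):
binary_memtable_throughput_in_mb: 256
memtable_flush_after_mins: 60
memtable_flush_writers: 2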
Concurrency
Cassandra differs from many data stores in that it offers much faster write performance than read performance. There are two settings related to how many threads can perform read and write operations: concurrent_reads and concurrent_writes. In general, the defaults provided by Cassandra out of the box are very good. But you might want to update the concurrent_reads setting immediately before you start your server. That’s because the concurrent_reads setting is optimal at two threads per processor core. By default, this setting is 8, assuming a four-core box. If that’s what you have, you’re in business. If you have an eight-core box, tune it up to 16.
The concurrent_writes setting behaves somewhat differently. This should match the number of clients that will write concurrently to the server. If Cassandra is backing a web application server, you can tune this setting from its default of 32 to match the number of threads the application server has available to connect to Cassandra. It is common in Java application servers such as WebLogic to prefer database connection pools no larger than 20 or 30, but if you’re using several application servers in a cluster, you’ll need to factor that in as well.
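As a sketch, on a dedicated eight-core box serving an application tier with roughly 48 writer threads, a starting point might be (illustrative values; measure before and after any change):
concurrent_reads: 16
concurrent_writes: 48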

Caching
There are several settings related to caching, both within Cassandra and at the operating system level. Caches can use considerable memory, and it’s a good idea to tune them carefully once you understand your usage patterns.
There are two primary caches built into Cassandra: a row cache and a key cache. The row cache caches complete rows (all of their columns), so it is a superset of the key cache. If you are using a row cache for a given column family, you will not need to use a key cache on it as well.
Your caching strategy should therefore be tuned in accordance with a few factors:
• Consider your queries, and use the cache type that best fits your queries.
• Consider the ratio of your heap size to your cache size, and do not allow the cache to overwhelm your heap.
• Consider the size of your rows against the size of your keys. Typically keys will be much smaller than entire rows.
The keys_cached setting indicates the number of key locations—not key values—that will be saved in memory. This can be specified as a fractional value (a number between 0 and 1) or as an integer. If you use a fraction, you’re indicating a percentage of keys to cache, and an integer value indicates an absolute number of keys whose locations will be cached.
Note: The keys_cached setting is a per-column family setting, so different column families can have different numbers of key locations cached if some are used more frequently than others.
This setting will consume considerable memory, but can be a good trade-off if your locations are not hot already.

You can also populate the row cache when the server starts up. To do this, use the preload_row_cache element. The default setting for this is false, but you will want to set it to true to improve performance. The cost is that bootstrapping can take longer if there is considerable data in the column family to preload.
The rows_cached setting specifies the number of rows that will be cached. By default, this value is set to 0, meaning that no rows will be cached, so it’s a good idea to turn this on. If you use a fraction, you’re indicating a percentage of everything to cache, and an integer value indicates an absolute number of rows whose locations will be cached.
You’ll want to use this setting carefully, however, as this can easily get out of hand. If your column family gets far more reads than writes, then setting this number very high will needlessly consume considerable server resources. If your column family has a lower ratio of reads to writes, but has rows with lots of data in them (hundreds of columns), then you’ll need to do some math before setting this number very high. And unless you have certain rows that get hit a lot and others that get hit very little, you’re not going to see much of a boost here.
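As a sketch, the per-column family cache settings discussed above would look something like this in a 0.6-style storage-conf.xml (the column family name and the values are hypothetical):
<ColumnFamily Name="UserProfiles" KeysCached="0.5" RowsCached="10000"/>
In the 0.7-style YAML schema, the same ideas appear per column family as keys_cached, rows_cached, and preload_row_cache.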

Buffer Sizes
The buffer sizes represent the memory allocation when performing certain operations: flush_data_buffer_size_in_mb
By default, this is set to 32 megabytes and indicates the size of the buffer to use when memtables get flushed to disk.
flush_index_buffer_size_in_mb
By default, this is set to 8 megabytes. If each key defines only a few columns, then it’s a good idea to increase the index buffer size. Alternatively, if your rows have many columns, then you’ll want to decrease the size of the buffer.
sliced_buffer_size_in_kb
Depending on how variable your queries are, this setting is unlikely to be very useful. It allows you to specify the size, in kilobytes, of the buffer to use when executing slices of adjacent columns. If there is a certain slice query that you perform far more than others, or if your data is laid out with a relatively consistent number of columns per family, then this setting could be moderately helpful on read operations. But note that this setting is defined globally.
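For reference, the three buffer settings might appear together like this (32MB and 8MB are the defaults stated above; the 64KB slice buffer is an illustrative value):
flush_data_buffer_size_in_mb: 32
flush_index_buffer_size_in_mb: 8
sliced_buffer_size_in_kb: 64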

The Python Stress Test
Cassandra ships with a popular utility called py_stress that you can use to run a stress test on your Cassandra cluster. To run py_stress, navigate to the <cassandra-home>/contrib directory. You might want to check out the README.txt file, as it will have the list of dependencies required to run the tool.
Running the Python Stress Test
Navigate to the <cassandra-home>/contrib/py_stress directory. In a terminal, type stress.py to run the test against localhost, or indicate which node you want to connect to:
$ stress.py -d 192.168.1.5
Note: You can execute stress.py -h to get usage on the stress test script. 
The test will run until it inserts one million values, and then stop. I ran this test on a single regular workstation with an Intel I7 processor (which is similar to eight cores) with 4GB of RAM available and a lot of other processes running. Here was my output:
eben@morpheus$ ./stress.py -d 192.168.1.5 -o insert
total,interval_op_rate,avg_latency,elapsed_time
196499,19649,0.0024959407711,10
370589,17409,0.00282591440216,20
510076,13948,0.00295883878841,30
640813,13073,0.00438663874102,40
798070,15725,0.00312562838215,50
950489,15241,0.0029109908417,60
1000000,4951,0.00444872583334,70
Let’s unpack this a bit. What we’ve done is generated and inserted one million values into a completely untuned single node Cassandra server in about 70 seconds. You can see that in the first 10 seconds we inserted 196,499 randomly generated values. The average latency per operation is 0.0025 seconds, or 2.5 milliseconds. But this is using the defaults, and is on a Cassandra server that already has 1GB of data in the database to manage before running the test. Let’s give the test more threads to see if we can squeeze a little better performance out of it:
eben@morpheus$ ./stress.py -d 192.168.1.5 -o insert -t 10
total,interval_op_rate,avg_latency,elapsed_time
219217,21921,0.000410911544945,10
427199,20798,0.000430060066223,20
629062,20186,0.000443717396772,30
832964,20390,0.000437958271074,40
1000000,16703,0.000463042383339,50
What we’ve done here is used the -t flag to use 10 threads at once. This means that in 50 seconds, we wrote 1,000,000 records—about 2 milliseconds latency per write, with a totally untuned database that is already managing 1.5GB of data.
You should run the test several times to get an idea of the right number of threads given your hardware setup. Depending on the number of cores on your system, you’re going to see worse, not better, performance if you set the number of threads arbitrarily high, because the processor will devote more time to managing the threads than to doing your work. You want a rough match between the number of threads and the number of cores available to get a reasonable test.
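One simple way to find that match is to sweep a few candidate thread counts in a shell loop and compare the reported rates (the host and the counts are placeholders):
$ for t in 4 8 16 32; do ./stress.py -d 192.168.1.5 -o insert -t $t; done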
Now that we have all of this data in the database, let’s use the test to read some values too:
$ ./stress.py -d 192.168.1.5 -o read
total,interval_op_rate,avg_latency,elapsed_time
103960,10396,0.00478858081549,10
225999,12203,0.00406984714627,20
355129,12913,0.00384438665076,30
485728,13059,0.00379976526221,40
617036,13130,0.00378045491559,50
749154,13211,0.00375620621777,60
880605,13145,0.00377542658007,70
1000000,11939,0.00374060139004,80
As you can see, Cassandra doesn’t read nearly as fast as it writes; it takes about 80 seconds to read one million values. Remember, though, that this is out of the box, untuned, single-threaded, on a regular workstation running other programs, and the database is 2GB in size. Regardless, this is a great tool to help you do performance tuning for your environment and to get a set of numbers that indicates what to expect in your cluster.

Startup and JVM Settings
Cassandra allows you to configure a variety of options for how the server should start up, how much Java memory should be allocated, and so forth. In this section we look at how to tune the startup.
If you’re using Windows, the startup script is called cassandra.bat, and on Linux it’s cassandra.sh. You can start the server by simply executing this file, which sets several defaults. But there’s another file in the bin directory that allows you to configure a variety of settings related to how Cassandra starts. This file is called cassandra.in.sh and it separates certain options, such as the JVM settings, into a different file to make it easier to update.
Tuning the JVM
Heap Min and Max 
By default, these are set to 256MB and 1GB, respectively. To tune these, set them higher (see following note) and to the same value.
Assertions 
By default, the JVM is passed the -ea option to enable assertions. Changing this option from -ea to -da (disable assertions) can have a positive effect on performance.
Survivor Ratio 
The Java heap is broadly divided into two object spaces: young and old. The young space is subdivided into one for new object allocation (called “eden space”) and another for new objects that are still in use. Older objects still have some reference, and have therefore survived a few garbage collections, so the Survivor Ratio is the ratio of eden space to survivor space in the young object part of the heap.
Increasing the ratio makes sense for applications with lots of new object creation and low object preservation; decreasing it makes sense for applications with longer-living objects. Cassandra sets this value to 8 by default, meaning that the ratio of eden to survivor space is 8:1 (each survivor space will be 1/8 the size of eden). This is fairly low, because the objects are living longer in the memtables. Tune this setting along with MaxTenuringThreshold.
MaxTenuringThreshold
Every Java object has an age field in its header, indicating how many times it has been copied within the young generation space. They’re copied (into a new space) when they survive a young generation garbage collection, and this copying has a cost. Because long-living objects may be copied many times, tuning this value can improve performance. By default, Cassandra has this value set at 1. Set it to 0 to immediately move an object that survives a young generation collection to the tenured generation.
UseConcMarkSweepGC
This instructs the JVM on what garbage collection (GC) strategy to use; specifically, it enables the ConcurrentMarkSweep algorithm. This setting uses more RAM, and uses more CPU power to do frequent garbage collections while the application is running in order to keep the GC pause time to a minimum. When using this strategy, it’s important to set the heap min and max values to the same value, in order to prevent the JVM from having to spend a lot of time growing the heap initially. It is possible to tune this to -XX:+UseParallelGC, which also takes advantage of multiprocessor machines; this will give you peak application performance, but with occasional pauses. Do not use the Serial GC with Cassandra.

Note: If you want to change any of these values, simply open the cassandra.in.sh file in a text editor, change the values, and restart.

The majority of options in the include configuration file surround the Java settings. For example, the default setting for the maximum size of the Java heap memory usage is 1GB. If you’re on a machine capable of using more, you may want to tune this setting. Try setting the -Xmx and -Xms options to the same value to keep Java from having to manage heap growth.

Note: The maximum theoretical heap size for a 32-bit JVM is 4GB. However, do not simply set your JVM to use as much memory as you have available up to 4GB. There are many factors involved here, such as the amount of swap space and memory fragmentation. Simply increasing the size of the heap using -Xmx will not help if you don’t have any swap available.
Typically it is possible to get approximately 1.6GB of heap on a 32-bit JVM on Windows and closer to 2GB on Solaris. Using a 64-bit JVM on a 64-bit system will allow more space. See http://java.sun.com/docs/hotspot/HotSpotFAQ.html for more information.

Tuning some of these options will make stress tests perform better. For example, I saw a 15% performance improvement using the following settings over the defaults:
JVM_OPTS=" \
-da \
-Xms1024M \
-Xmx1024M \
-XX:+UseParallelGC \
-XX:+CMSParallelRemarkEnabled \
-XX:SurvivorRatio=4 \
-XX:MaxTenuringThreshold=0"
When performance tuning, it’s a good idea to set only the heap min and max options, and nothing else at first. Only after real-world usage in your environment and some performance benchmarking with the aid of heap analysis tools and observation of your specific application’s behavior should you dive into tuning the more advanced JVM settings. If you tune your JVM options and see some success using a load-testing tool or something like the Python stress test in contrib, don’t get too excited. You need to test under real-world conditions; don’t simply copy these settings.

Note: For more information on Java 6 performance tuning (Java 6 operates differently than previous versions), see 
http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html

In general, you’ll probably want to make sure that you’ve instructed the heap to dump its state if it hits an out-of-memory error; this is just good practice if you’ve been getting such errors. You can also instruct the JVM to print garbage-collection details. Finally, if you have a lot of data in Cassandra and you’re noticing that garbage collection is causing long pauses, you can lower the heap-occupancy threshold at which the JVM initiates a garbage collection. All of these parameters are shown here:
-XX:CMSInitiatingOccupancyFraction=88 \
-XX:+HeapDumpOnOutOfMemoryError \
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -verbose:gc \

You may want to check out Jonathan Ellis’s blog entry on using a variety of Linux performance monitoring tools at http://spyced.blogspot.com/2010/01/linux-performance-basics.html
In this chapter we looked at the settings available in Cassandra to aid in performance tuning, including caching settings, memory settings, and hardware concerns. We also set up and used the Python stress test tool to write and then read one million rows of data.

If this post has whetted your appetite for the subject, you can find much more information in the book itself: Cassandra: The Definitive Guide

The excerpt is from the book 'Cassandra: The Definitive Guide', authored by Eben Hewitt, published November 2010 by O’Reilly Media, Copyright 2011 Eben Hewitt.
