Thursday, March 1, 2012

Cassandra Data Model

Book Review and Excerpt
Cassandra: The Definitive Guide by Eben Hewitt
Chapter 3


My previous post on Cassandra gave you some hands-on experience with this database.
Let's take a closer look at Cassandra's design goals, data model, and some general behavior characteristics.

We have the basic Cassandra data structures: the Column, which is a name/value pair (and a client-supplied timestamp of when it was last updated), and a Column Family, which is a container for rows that have similar, but not identical, column sets.
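
If a quick sketch helps, a column family can be pictured as a map of row keys to sorted maps of columns, where each column is a name/value pair with a timestamp. This is just an illustration in plain Python with made-up names, not an actual API:

column = {'name': 'email', 'value': 'alice@example.com', 'timestamp': 1330560000000000}

user_column_family = {
    'alice': {                        # row key
        'email': 'alice@example.com',
        'web_site': 'example.com',
    },
    'bob': {                          # a different row can carry a different column set
        'email': 'bob@example.com',
        'city': 'Boise',
    },
}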

TIP: For a quick visual understanding of the differences between basic concepts in RDBMS and Cassandra have a look at the related DataStax article.

A column family is somewhat similar to a table in an RDBMS. But:

Column families are schema-free (no predefined columns). 

You cannot perform joins in Cassandra. If you have designed a data model and find that you need something like a join, you’ll have to either do the work on the client side, or create a denormalized second column family that represents the join results for you. This is common among Cassandra users. Performing joins on the client should be a very rare case; you really want to duplicate (denormalize) the data instead.


Column and Row Sorting in Column Families
Columns are sorted by the “Compare With” type defined on their enclosing column family, and you can choose from the following: AsciiType, BytesType (the default), LexicalUUIDType, IntegerType, LongType, TimeUUIDType, UTF8Type, or a custom comparator that you can define.
Rows, on the other hand, are stored in an order defined by the partitioner (for example, with RandomPartitioner they are stored in effectively random order).

Note: Column sorting is controllable, but key sorting isn’t; row keys always sort in byte order.
Also keep in mind that it is not possible in Cassandra to sort by value, as we’re used to doing in relational databases.
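
If you use a client library such as pycassa, the comparator is set when the column family is defined. The following is only a sketch under that assumption (the keyspace and column family names are invented, and a local node on the default Thrift port is assumed):

from pycassa.system_manager import SystemManager, UTF8_TYPE, LONG_TYPE

sys_mgr = SystemManager('localhost:9160')

# Columns in this family will sort lexically as UTF-8 strings...
sys_mgr.create_column_family('Hotelier', 'Hotel', comparator_type=UTF8_TYPE)

# ...while columns in this family will sort as 64-bit integers.
sys_mgr.create_column_family('Hotelier', 'Readings', comparator_type=LONG_TYPE)

sys_mgr.close()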
 
Column families support super columns (we'll come back to this topic later).

Each column family is stored on disk in its own separate file. So to optimize performance, it’s important to keep columns that you are likely to query together in the same column family, and a super column can be helpful for this.

Cassandra’s columns don’t have to be as simple as predefined name/value pairs; you can store useful data in the key itself (more than a simple String), not only in the value. This is somewhat common when creating indexes in Cassandra.

Cassandra's columns have a timestamp (while rows don't), which is used only for conflict resolution on the server side.

Cassandra 0.7 introduced an optional time to live (TTL) value, which allows columns to expire a certain amount of time after creation. This can potentially prove very useful.
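
For illustration, here is what an expiring write could look like with the pycassa client (assumed; the column family name is made up). The TTL is given in seconds, and the column simply disappears once it elapses:

import pycassa

pool = pycassa.ConnectionPool('Hotelier', ['localhost:9160'])
sessions = pycassa.ColumnFamily(pool, 'UserSessions')   # hypothetical column family

# This column expires 24 hours after it is written.
sessions.insert('alice', {'token': 'abc123'}, ttl=86400)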

On the server side, columns are immutable in order to prevent multithreading issues.

Super Column Family
A row in a super column family still contains columns, each of which then contains subcolumns.

TIP: For a quick visual understanding of the differences between columns and super columns have a look at the related article by Jonathan Lampe (the pictures will tell the story).

Note: There is an important consideration when modeling with super columns: Cassandra
does not index subcolumns, so when you load a super column into memory, all of its columns are loaded as well.

This limitation was discovered by Ryan King, the Cassandra lead at Twitter. It might be fixed in a future release, but the change is pending an update to the underlying storage file (the SSTable).

Composite Keys
You can use a composite key of your own design to help you with queries. A composite key might be something like <userid:lastupdate>.
This could just be something that you consider when modeling, and then check back on later when you come to a hardware sizing exercise. But if your data model anticipates more than several thousand subcolumns, you might want to take a different approach and not use super columns. The alternative involves creating a composite key. Instead of representing columns within a super column, the composite key approach means that you use a regular column family with regular columns, and then employ a custom delimiter in your key name and parse it on client retrieval.


Example of a composite key pattern:
HotelByCity (CF) Key: city:state {
   key: Phoenix:AZ {AZC_043: -, AZS_011: -}
   key: San Francisco:CA {CAS_021: -}
   key: New York:NY {NYN_042: -}
}
 

There are three things happening here. First, we already have defined hotel information in another column family called Hotel. But we can create a second column family called HotelByCity that denormalizes the hotel data. We repeat the same information we already have, but store it in a way that acts similarly to a view in an RDBMS, because it allows us a quick and direct way to write queries. When we know that we’re going to look up hotels by city (because that’s how people tend to search for them), we can create a table that defines a row key for that search. However, there are many states that have cities with the same name (Springfield comes to mind), so we can’t just name the row key after the city; we need to combine it with the state.
We then use another pattern called Valueless Column. All we need to know is what hotels are in the city, and we don’t need to denormalize further. So the hotel ID goes into the column name itself, and the column has no corresponding value. That is, when the column is inserted, we just store an empty byte array with it.
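
Put together, the writes and the read for this pattern could look roughly like this with pycassa (assumed; the names follow the example above). The hotel IDs live in the column names and the values are left empty:

import pycassa

pool = pycassa.ConnectionPool('Hotelier', ['localhost:9160'])
hotel_by_city = pycassa.ColumnFamily(pool, 'HotelByCity')

hotel_by_city.insert('Phoenix:AZ', {'AZC_043': '', 'AZS_011': ''})
hotel_by_city.insert('San Francisco:CA', {'CAS_021': ''})

# "Which hotels are in Phoenix, AZ?" is now a single-row read;
# the column names themselves are the answer.
hotel_ids = list(hotel_by_city.get('Phoenix:AZ').keys())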


Keyspace
A keyspace is somewhat similar to a database schema in an RDBMS.

Clusters 
A cluster holds one or more keyspaces (but typically it holds no more than one).
So the outermost structure in Cassandra is the cluster, sometimes called the ring, because Cassandra assigns data to nodes in the cluster by arranging them in a ring.

A node holds a replica for different ranges of data. If the first node goes down, a replica can respond to queries. The peer-to-peer protocol allows the data to replicate across nodes in a manner transparent to the user, and the replication factor is the number of machines in your cluster that will receive copies of the same data.

For further info on this topic, you can look up the strategies for customizing the basic attributes of a keyspace: the replication factor (the number of nodes that will act as replicas), the replica placement strategy (how the replicas will be placed around the ring), and the column family attributes.

Note: It’s an inherent part of Cassandra’s replica design that all data for a single row must fit on a single machine in the cluster. The reason for this limitation is that rows have an associated row key, which is used to determine the nodes that will act as replicas for that row. Further, the value of a single column cannot exceed 2GB. Keep these things in mind
as you design your data model.



Four or Five-Dimensional Hash
Some people refer to Cassandra column families as similar to a four-dimensional hash:
[Keyspace][ColumnFamily][Key][Column]

But for super columns, it becomes more like a five-dimensional hash:
[Keyspace][ColumnFamily][Key][SuperColumn][SubColumn]
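
As an illustration only (the names here are invented for the hotel example), the same idea in plain Python nested dictionaries:

cluster = {
    'Hotelier': {                                   # keyspace
        'Hotel': {                                  # standard column family
            'AZC_043': {'name': 'Cambria Suites Hayden', 'phone': '480-444-4444'},
        },
        'PointOfInterest': {                        # super column family
            'AZC_043': {                            # row key
                'Spring Training': {'desc': 'Baseball and beer.'},   # super column
            },
        },
    },
}

name = cluster['Hotelier']['Hotel']['AZC_043']['name']                                # four lookups
desc = cluster['Hotelier']['PointOfInterest']['AZC_043']['Spring Training']['desc']   # five, with a super column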


The Thrift API
The Thrift API is the underlying RPC serialization mechanism for performing remote operations on Cassandra. Because the Thrift API has no notion of inheritance, you will sometimes see the API refer to a ColumnOrSuperColumn type; when data structures use this type, you are expected to know whether your underlying column family is of type Super or Standard.


RDBMS vs Cassandra

No query language
Cassandra has no query language such as SQL; you access your data through its API (exposed over Thrift) and the client libraries built on top of it.

No referential integrity 
So we have no joins or cascading deletes in Cassandra.

Secondary Indexes
Support for secondary indexes is currently being added to Cassandra 0.7. This allows you to create indexes on column values. So, if you want to see all the users who live in a given city, for example, secondary index support will save you from doing it from scratch.
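
To make that concrete, here is a sketch of what it could look like through the pycassa client (assumed), indexing a 'city' column on a hypothetical User column family and then querying it:

import pycassa
from pycassa.system_manager import SystemManager, UTF8_TYPE
from pycassa.index import create_index_expression, create_index_clause

sys_mgr = SystemManager('localhost:9160')
sys_mgr.create_index('Hotelier', 'User', 'city', UTF8_TYPE)
sys_mgr.close()

pool = pycassa.ConnectionPool('Hotelier', ['localhost:9160'])
users = pycassa.ColumnFamily(pool, 'User')

# All users whose 'city' column equals 'Boise'.
clause = create_index_clause([create_index_expression('city', 'Boise')])
for key, columns in users.get_indexed_slices(clause):
    print(key, columns)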


Sorting is a Design Decision
In Cassandra, column family definitions include a CompareWith element, which dictates the order in which your columns will be sorted on reads, but this is not configurable per query.
Where an RDBMS constrains you to sorting based on the data type stored in the column, Cassandra only stores byte arrays, so that approach doesn’t make sense. What you can do, however, is sort as if the column were one of several different types (ASCII, Long integer, TimeUUID, lexicographically, etc.). You can also use your own pluggable comparator for sorting if you wish.

Otherwise, there is no support for ORDER BY and GROUP BY statements in Cassandra as there is in SQL. There is a query type called a SliceRange, which is similar to ORDER BY only in that it allows you to reverse the sort order.
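
As a sketch (again assuming pycassa, with an invented column family), a reversed slice over a single row looks like this; it reverses the comparator order, which is the closest analogue to a descending ORDER BY:

import pycassa

pool = pycassa.ConnectionPool('Hotelier', ['localhost:9160'])
timeline = pycassa.ColumnFamily(pool, 'UserTimeline')   # hypothetical column family

# Last 10 columns of the row, newest first (assuming a time-based comparator).
latest = timeline.get('alice', column_count=10, column_reversed=True)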



Denormalization
In the relational world, denormalization violates Codd's normal forms, and we try to avoid it. But in Cassandra, denormalization is, well, perfectly normal. Cassandra performs best when
the data model is denormalized. It's not required if your data model is simple. But don't be afraid of it.
The important point is that instead of modeling the data first and then writing queries, with Cassandra you model the queries and let the data be organized around them.
Think of the most common query paths your application will use, and then create the column families that you need to support them.

Detractors have suggested that this is a problem. But it is perfectly reasonable to expect
that you should think hard about the queries in your application, just as you would,
presumably, think hard about your relational domain. You may get it wrong, and then
you’ll have problems in either world. Or your query needs might change over time, and
then you’ll have to work to update your data set. But this is no different from defining
the wrong tables, or needing additional tables, in RDBMS.



Design Patterns for using Cassandra

Materialized View
It is common to create a secondary index that represents additional queries. Because you don’t have a SQL WHERE clause, you can recreate this effect by writing your data to a second column family that is created specifically to represent that query. 
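
A minimal sketch of the pattern, assuming pycassa and invented column family names: every write goes to the reference column family and to a second, query-shaped one that plays the role of the view.

import pycassa

pool = pycassa.ConnectionPool('Hotelier', ['localhost:9160'])
users = pycassa.ColumnFamily(pool, 'User')
users_by_city = pycassa.ColumnFamily(pool, 'UserByCity')   # the "materialized view"

def add_user(user_id, profile):
    users.insert(user_id, profile)
    # Denormalize: the city becomes the row key, the user id a valueless column.
    users_by_city.insert(profile['city'], {user_id: ''})

add_user('alice', {'name': 'Alice', 'city': 'Boise'})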

Note: As of 0.7, Cassandra has native support for secondary indexes.

Valueless Column
Let’s build on our User/UserCity example (previously mentioned). Because we’re storing the reference data in the User column family, two things arise: one, you need to have unique and thoughtful keys that can enforce referential integrity; and two, the columns in the UserCity column family don’t necessarily need values. If you have a row key of Boise, then the column names can be the names of the users in that city. Because your reference data is in the User column family, the columns don’t really have any meaningful value; you’re just using the column names as a prefabricated list, though you’ll likely use the entries in that list to look up additional data from the reference column family.

Aggregate Key
When you use the Valueless Column pattern, you may also need to employ the Aggregate Key pattern. This pattern fuses together two scalar values with a separator to create an aggregate. To extend our example further, city names typically aren’t unique; many states in the US have a city called Springfield, and there’s a Paris, Texas, and a Paris, Tennessee. So what will work better here is to fuse together the state name and the city name to create an Aggregate Key to use in our Materialized View. This key would look something like: TX:Paris or TN:Paris. By convention, many Cassandra users employ the colon as the separator, but it could be a pipe character or any other character that is not otherwise meaningful in your keys.
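
Building and parsing such a key is trivial on the client side; a tiny sketch (the helper names are just for illustration):

def make_city_key(state, city, sep=':'):
    return '%s%s%s' % (state, sep, city)

def parse_city_key(key, sep=':'):
    state, city = key.split(sep, 1)
    return state, city

key = make_city_key('TX', 'Paris')        # 'TX:Paris'
assert parse_city_key(key) == ('TX', 'Paris')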


Some Things to Keep in Mind
A few things to keep in mind when you’re trying to move from a relational mindset to Cassandra’s data model:
Start with your queries. Ask what queries your application will need, and model the data around that instead of modeling the data first, as you would in the relational world.
You have to supply a timestamp (or clock) with each write, so you need a strategy to synchronize those across multiple clients. This is crucial in order for Cassandra to use the timestamps to determine the most recent write value. One good strategy here is the use of a Network Time Protocol (NTP) server.
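
By convention, clients use microseconds since the Unix epoch for these timestamps; a small sketch of generating one (the insert call in the comment assumes a pycassa-style client):

import time

def write_timestamp():
    return int(time.time() * 1000000)   # microseconds since the epoch

# e.g. users.insert('alice', {'city': 'Boise'}, timestamp=write_timestamp())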

In this chapter we took a gentle approach to understanding Cassandra’s data model of
keyspaces, column families, columns, and super columns. We also explored a few of
the contrasts between RDBMS and Cassandra.


If this post has whetted your appetite for the subject, you can find much more information in the book itself: Cassandra: The Definitive Guide

The excerpt is from the book 'Cassandra: The Definitive Guide', authored by Eben Hewitt, published November 2010 by O’Reilly Media, Copyright 2011 Eben Hewitt.
