Max's Output

A Quick Introduction to the Cassandra Data Model


For newcomers, the Cassandra data model is a mess. Even experienced database developers spend quite a bit of time learning it. There are great articles on the Web that explain the model. Read "WTF is a SuperColumn? An Intro to the Cassandra Data Model" and my favorite one, "Installing and using Apache Cassandra With Java". This blog post is my attempt to explain the Cassandra model to those who would like to understand the key ideas in 15 minutes or less.

In a nutshell, the Cassandra data model can be described as follows:

1) Cassandra is based on a key-value model

A database consists of column families. A column family is a set of key-value pairs. I know the terminology is confusing, but so far it is just the basic key-value model. Drawing an analogy with relational databases, you can think of a column family as a table and of a key-value pair as a record in that table.

2) Cassandra extends the basic key-value model with two levels of nesting

At the first level, the value of a record is in turn a sequence of key-value pairs. These nested key-value pairs are called columns, where the key is the name of the column. In other words, a record in a column family has a key and consists of columns. This level of nesting is mandatory: a record must contain at least one column (so the "value of a record" in the first point above was an intermediate notion, as the value is actually a sequence of columns).

At the second level, which is optional, the value of a nested key-value pair can be a sequence of key-value pairs as well. When the second level of nesting is present, the outer key-value pairs are called super columns, with the key being the name of the super column, and the inner key-value pairs are called columns.
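To make the two levels of nesting concrete, here is a minimal sketch that models column families as plain nested Python dictionaries. This is only a mental picture, not a Cassandra client API; the column family names, keys, and helper functions are made up for illustration.

```python
# Hypothetical in-memory picture of the model (not a real Cassandra API).

# Column family with one level of nesting:
#   record key -> {column name -> column value}
users = {
    "user_1": {"Email": "alice@example.com", "Name": "Alice"},
    "user_2": {"Email": "bob@example.com"},
}

# Column family with two levels of nesting (super columns):
#   record key -> {super column name -> {column name -> column value}}
user_addresses = {
    "user_1": {
        "home": {"City": "Moscow", "Zip": "101000"},
        "work": {"City": "Berlin", "Zip": "10115"},
    },
}


def get_column(column_family, key, column):
    """Look up one column value of one record."""
    return column_family[key][column]


def get_subcolumn(column_family, key, super_column, column):
    """Look up one column value nested inside a super column."""
    return column_family[key][super_column][column]


email = get_column(users, "user_1", "Email")
work_city = get_subcolumn(user_addresses, "user_1", "work", "City")
```

Note that records in the same column family do not need to have the same set of columns: user_2 above simply has no Name column.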

3) The names of both columns and super columns can be used in two ways: as names or as values (usually reference values).

First, names can play the role of attribute names. For example, the name of a column in a record about a User can be Email. That is how we are used to thinking about columns in relational databases.

Second, names can also be used to store values! For example, column names in a record that represents a Blog can be identifiers of the posts of this blog, and the corresponding column values are the posts themselves. You can really use column (or super column) names to store values because (a) theoretically there is no limit on the number of columns (or super columns) for any given record and (b) names are byte arrays, so you can encode any value in them.
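The Blog example can be sketched as follows. This is an illustration under my own assumptions: the blogs dictionary, the record key, and the 8-byte big-endian encoding of post IDs are hypothetical choices, not something Cassandra prescribes.

```python
import struct

# Hypothetical Blogs column family: blog key -> {column name -> column value}.
# Column names are byte arrays that encode post IDs; values are the posts.
blogs = {}


def encode_post_id(post_id: int) -> bytes:
    """Encode a post ID as an 8-byte big-endian column name (assumed encoding)."""
    return struct.pack(">Q", post_id)


def decode_post_id(name: bytes) -> int:
    """Recover the post ID from a column name."""
    return struct.unpack(">Q", name)[0]


def add_post(blog_key: str, post_id: int, body: str) -> None:
    """Adding a post means adding one more column to the blog's record."""
    blogs.setdefault(blog_key, {})[encode_post_id(post_id)] = body


add_post("maxs_output", 1, "Hello, world")
add_post("maxs_output", 2, "A Quick Introduction to the Cassandra Data Model")

# The set of column names of the record is itself the list of post IDs.
post_ids = sorted(decode_post_id(name) for name in blogs["maxs_output"])
```

The record grows by one column per post, which is exactly the situation point (a) above allows.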

4) Columns and super columns are stored ordered by names.

You can specify the sorting behavior by defining how Cassandra treats the names of (super) columns (recall that a name is just a byte array). A name can be treated as BytesType, LongType, AsciiType, UTF8Type, LexicalUUIDType, or TimeUUIDType.

So now you know everything you need to know. Let's consider a classic :) example of a Twitter database to demonstrate the points.

The column family Tweets contains records representing tweets. The key of a record is of Time UUID type and is generated when the tweet is received (we will use this feature in the User_Timelines column family below). The records consist of columns (no super columns here). Columns simply represent attributes of tweets, so it is very similar to how one would store them in a relational database.
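Why the choice of type matters can be seen in a small sketch. Here the same three names, each an 8-byte big-endian signed integer, are ordered in two ways. I emulate the comparators with Python sort keys, assuming that a bytes comparison is unsigned and byte-by-byte and that a long comparison is numeric; the function names are mine.

```python
import struct

# Three column names, each encoding a signed 64-bit integer (big-endian).
names = [struct.pack(">q", value) for value in (10, -1, 2)]


def as_long(name: bytes) -> int:
    """Interpret a column name as a signed 64-bit integer."""
    return struct.unpack(">q", name)[0]


# Bytes-style ordering: Python compares bytes objects byte by byte, unsigned.
bytes_order = [as_long(name) for name in sorted(names)]

# Long-style ordering: compare the decoded numeric values.
long_order = [as_long(name) for name in sorted(names, key=as_long)]
```

Under the byte-wise ordering, -1 (all bytes 0xff) lands after the positive numbers, while the numeric ordering puts it first. The stored names are identical; only their interpretation differs.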

The next example is User_Timelines (i.e. tweets posted by a user). Records are keyed by user IDs (referenced by the User_ID columns in the Tweets column family). User_Timelines demonstrates how column names can be used to store values, tweet IDs in this case. The type of column names is defined as Time UUID. This means that tweet IDs are kept ordered by the time of posting. That is very useful, as we usually want to show the last N tweets for a user. Values of all columns are set to an empty byte array (denoted "-") as they are not used.
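Here is a rough in-memory sketch of Tweets and User_Timelines working together. It is not a Cassandra client: the dictionaries, the helper functions, the Text column, and the explicit timestamps are my assumptions for illustration. Tweet IDs are version 1 (time-based) UUIDs built from a given timestamp so that the example is deterministic.

```python
import uuid


def time_uuid(ts_100ns: int, node: int = 0x0123456789AB) -> uuid.UUID:
    """Build a version 1 (time-based) UUID from an explicit 60-bit timestamp."""
    time_low = ts_100ns & 0xFFFFFFFF
    time_mid = (ts_100ns >> 32) & 0xFFFF
    time_hi_version = ((ts_100ns >> 48) & 0x0FFF) | 0x1000  # version 1
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version, 0x80, 0x00, node))


tweets = {}          # Tweets: tweet ID -> {column name -> value}
user_timelines = {}  # User_Timelines: user ID -> {tweet ID -> empty value}


def post_tweet(user_id: str, text: str, ts_100ns: int) -> uuid.UUID:
    """Store the tweet and reference it from the author's timeline."""
    tweet_id = time_uuid(ts_100ns)
    tweets[tweet_id] = {"User_ID": user_id, "Text": text}
    # The column name carries the data (the tweet ID); the value is unused.
    user_timelines.setdefault(user_id, {})[tweet_id] = b""
    return tweet_id


def last_tweets(user_id: str, n: int) -> list:
    """Return the texts of the last n tweets, newest first."""
    # Emulates Time UUID ordering of column names by sorting on the timestamp.
    ids = sorted(user_timelines.get(user_id, {}), key=lambda u: u.time, reverse=True)
    return [tweets[tweet_id]["Text"] for tweet_id in ids[:n]]


post_tweet("max", "first", 1000)
post_tweet("max", "second", 2000)
post_tweet("max", "third", 3000)
latest = last_tweets("max", 2)
```

In Cassandra the columns are already stored in this order, so reading the last N tweets is a slice of the record rather than a sort at read time.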

To demonstrate super columns let us assume that we want to collect statistics about URLs posted by each user. For that we need to group all the tweets posted by a user by URLs contained in the tweets. It can be stored using super columns as follows.

In User_URLs the names of the super columns are used to store URLs and the names of the nested columns are the corresponding tweet IDs.
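A sketch of User_URLs in the same dictionary style is shown below. The tweet IDs are placeholder strings here (in the real column family they would be Time UUIDs), and the helper functions are hypothetical.

```python
# Hypothetical User_URLs column family:
#   user ID -> {URL (super column name) -> {tweet ID (column name) -> empty value}}
user_urls = {}


def record_url(user_id: str, url: str, tweet_id: str) -> None:
    """Remember that the given tweet of this user contains the URL."""
    user_urls.setdefault(user_id, {}).setdefault(url, {})[tweet_id] = b""


def url_stats(user_id: str) -> dict:
    """Count how many tweets of the user mention each URL."""
    return {url: len(columns) for url, columns in user_urls.get(user_id, {}).items()}


record_url("max", "http://cassandra.apache.org", "tweet_1")
record_url("max", "http://cassandra.apache.org", "tweet_2")
record_url("max", "http://example.com", "tweet_3")
stats = url_stats("max")
```

Both the super column names (URLs) and the nested column names (tweet IDs) carry the data; no column value is used.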

Important note: currently Cassandra automatically supports indexes for column names but does not support indexes for the names of super columns. In our example it means that you cannot efficiently retrieve/update tweet ids by URL.

[Update: The above note is incorrect! It is subcolumn names that are not indexed inside super columns. Super column names are always indexed. This is great news, as it enables the use case of data denormalization to speed up queries. For more on this, see the first comment by Jonathan Ellis below. I cover denormalization use cases in my next post.]

Let me know if I missed anything or something is unclear.


Written by maxgrinev

July 9, 2010 at 9:52 pm

Posted in Uncategorized

10 Responses


  1. Hi Maxim,

    This is an excellent post!

    A couple clarifications:

    - if you don’t have anything relevant to store in the column value, leaving it an empty byte array is fine

    - it is _subcolumn_ names that are not indexed inside supercolumns, meaning that you should only store data inside supercolumns that you plan to access together. In your example the supercolumns would be fine. Top level columns (the supercolumn’s name) are always indexed.

    - The most common use case for supercolumns is for denormalizing data from another columnfamily, e.g. in user_timelines you could make the tweet id a supercolumn name, with subcolumns of the actual tweet field names + values. This makes reads still more efficient, since you don’t have to perform joins manually via multiget at read time.

    Jonathan Ellis

    July 13, 2010 at 3:25 am

  2. [...] A Quick Introduction to the Cassandra Data Model « Max’s Output – August 4th %(postalicious-tags)( tags: cassandra nosql data model tutorial database intro )% [...]

  3. Since each record of User_URLs has a collection of super columns, does that make it a super column family? Or am I misunderstanding the distinction between column families and super column families?

    Carlos Macasaet

    August 16, 2010 at 5:31 am

  4. Great post Maxim! Do you have the direct experience with using it in enterprise model?

    Michael

    September 8, 2010 at 2:19 pm

    • Not yet in enterprise settings. We are successfully using this approach for our Web/social applications.

      maxgrinev

      September 8, 2010 at 3:19 pm

  5. [...] Apache page Data Types in Cassandra Introduction to the Cassandra Data Model One more tutorial on the data model [...]

  6. [...] For quick introduction of Cassandra Data Model : Cassandra Data Model of Max Version [...]

  7. [...] has a relatively simple data model (see also here). It is a mixture of a key-value store and a column-oriented database. The [...]

