Do You Really Need SQL to Do It All in Cassandra?
NoSQL database systems are designed for scalability. The down side of that is a primitive key-value data model and, as the name suggest, no support for SQL. It might sound like a serious limitation – how can I “select”, “join”, “group” and “sort” the data? This post explains how all these operations can be quite naturally and efficiently implemented in one of the most famous NoSQL system – Cassandra.
To understand this post you need to know the Cassandra data model. You can find a quick introduction in my previous post. The power of the Cassandra data model is that it extends a basic key-value store with efficient data nesting (via columns and super columns). It means that you can read/update a column (or a super column) without retrieving the whole record. Below I describe how we can exploit data nesting to support various query operations.
Let’s consider a basic example: departments and employees with one-to-many relationships respectively. So we have two column families: Emps and Deps. In Emps employee IDs are used as keys and there are Name, Birthdate, and City columns. In Deps keys are department IDs and the single column is Name.
1) Select
For example: select * from Emps where Birthdate = '25/04/1975'
To support this query we need to add one more column family named Birthdate_Emps in which key is a date and column names are IDs of those employees that were born on the date. The values are not used here and can be an empty byte array (denoted “-”). Every time when a new employee is inserted/deleted into/from Emps we need to update Birthdate_Emps. To execute the query we just need to retrieve all the columns for the key '25/04/1975' from Birthdate_Emps.
Notice that Birthdate_Emps is essentially an index that allows us to execute the query very efficiently. And this index is scalable as it is distributed across Cassandra nodes. You can go even further to speed up the query by redundantly storing information about employees (i.e. employee’s columns from Emps) in Birthdate_Emps. In this case employee IDs becomes names of super columns that contain corresponding employee columns.
2) Join
For example: select * from Emps e, Deps d where e.dep_id = d.dep_id
What does join essentially mean? It constructs records that represent relationship between entities. Such relationships can be easily (and even more naturally) represented via nesting. To do that add column family Dep_Emps in which key is a department ID and column names are IDs of the corresponding employees.

3) Group By
For example: select count(*) from Emps group by City
From implementation viewpoint Group By is very similar to select/indexing described above. You just need to add a column family City_Emps with cities as keys and employee IDs as column names. In this case you will count the number of employees on retrieval. Or you can have a single column named count which value is the pre-calculated number of employees in the city.
4) Order By
To keep data sorted in Cassandra you can use two mechanisms: (a) records can be sorted by keys using OrderPreservingPartitioner with range queries (more on this in Cassandra: RandomPartitioner vs OrderPreservingPartitioner). To keep nested data sorted you can use automatically supported ordering for column names.
To support all these operations we store redundant data optimized for each particular query. It has two implications:
1) You must know queries in advance (i.e. no support for ad-hoc queries). However, typically in Web applications and enterprise OLTP applications queries are well known in advance, few in number, and do not change often. Read Mike Stonebraker convincingly talking about that. BTW, Constraint Tree Schema, described in the latter paper, also exploits nesting to organize data for predefined queries.
2) We shift the burden from querying to updating because what we essentially do is supporting materialized views (i.e. pre-computed results of queries). But it makes a lot of sense in case of using Cassandra as Cassandra is very much optimized for updates (thanks to eventual consistency and “log-structured” storage borrowed from Google BigTable). So we can use fast updates to speed up query execution. Moreover, use-cases typical for social applications are proven to be only scalable with push-on-change model (i.e. preliminary data propagation via updates with simple queries – the approach taken in this post) in comparison with pull-on-demand model (i.e. data are stored normalized and combined by queries on demand – classical relational approach). On push-on-change versus pull-on-demand read WHY ARE FACEBOOK, DIGG, AND TWITTER SO HARD TO SCALE?





Great post Max.
The examples are very useful and clear. What happens when you need something not so trivial, say that you need to do not just a join on e.dep_id = d.dep_id, but on some other values. What does one do then? Could you please give an example of that, so one can see how one “scales” the modeling further? Thanks.
Otis Gospodnetic
July 16, 2010 at 6:26 pm
Thank you!
I just did not want to make the post too long. Answering your question, the approach seems to scale quite well by combining the above techniques. For example, suppose that you want to query select e.name from Deps d, Emps e where d.dep_id=e.dep_id and dep.located=’Bern’. You can create a column family that maps cities to the IDs of the employees who work in the city. In general the aproach should work fine for ‘equal’ predicates. It is not a perfect solution for ‘range’ predicates as range scan over keys is not quite efficient.
maxgrinev
July 22, 2010 at 10:21 am
Thanks for your posts Max. I’m starting to get a better grasp of how to structure data in Cassandra based on the logical (normalised) data model and the queries that will be performed. However, I’m still not sure about how to structure the data for queries against multiple attributes.
In your example, you created indices for employee birthdays and employee cities. What if I wanted to find all the employees born in 1975 currently in San Francisco? Can such an index be built or would the query need to be performed against the emps column family?
Thanks!
You can build sort of compound index. A key in this index is the concatenation of the year of birth and the city of current location and each record contains all employees born in this year and currently located in this city.
Carlos Macasaet
August 16, 2010 at 6:18 am
[...] Similar to indexes, materialized views store data redundantly. So they also can be supported via triggers. Find general guidelines on managing redundant data (indexes and materialized views) in Cassandra in my post “Do You Really Need SQL to Do It All in Cassandra?”. [...]
Extending Cassandra with Asynchronous Triggers « Max's Output
July 23, 2010 at 3:23 pm
Hi,
I have a requirement of using Cassandra in my application. In my application there is one table with lot of data and most of my application uses that table. Due to lot of data,performance of the application is decreasing when i use that table is in Oracle.
So, I have decided to use the Cassandra database for that one table and all other tables in oracle. Lot of business logic is dependent on that table.
No my question is, Can I use the Cassandra for a table which has lot of business logic.
I am unable to implement lot of where clauses for Cassandra database.
Is there any supporting tool to use Cassandra in an efficient way?
Please let me know…
i am in urgency..
Thanks in advance
By Mallik
mallikarjungunda
August 18, 2010 at 8:47 am
[...] Do You Really Need SQL to Do It All in Cassandra? [...]
Cassandra and PHP become friendly with phpcassa | Nowvu Blog
August 19, 2010 at 2:41 am
What about:
select * from Emps where Birthdate >= ’25/04/1975′;
We have not been able to adopt Cassandra because the notion of this query is nearly impossible (without a fixed data set) with the data model. Maybe I am overlooking something, wouldnt be the first time.
As far as duplicating data, what impact does this have on the databases disk usage footprint?
Basically, I am in a situation where I am collecting a decent amount of data, in a structure that can change, but require the ability of some type of dynamic filter like I mentioned up top… Any suggestions would be great, because I really like the idea of switching from MySql to some NoSql solution, because it would solve most of my dynamic schema issues…
Justin Smith
September 27, 2010 at 10:36 pm
Good information Max.
As you explained for where condition of Birthdate, is we are having two columnfamilies (Emps, Birthdates_Emps) in the Keyspace. If so can explain in brief how storage-conf.xml entry for this.
Thanks in Advance.
sekhar
September 28, 2010 at 7:22 am
[...] Because Cassandra cannot execute SQL query commands, the additional columns would be needed to analyze the data within a specific row, elaborated computer scientist Maxim Grinev in a recent blog entry. [...]
New Cassandra Can Pack Two Billion Columns Into a Row (PC World) – Linux Guide Online
January 14, 2011 at 10:29 pm
[...] Because Cassandra cannot execute SQL query commands, the additional columns would be needed to analyze the data within a specific row, elaborated computer scientist Maxim Grinev in a recent blog entry. [...]
Gibizz News » New Cassandra Can Pack Two Billion Columns Into a Row (PC World)
January 14, 2011 at 11:58 pm
[...] Because Cassandra can't govern SQL query commands, a additional columns would be indispensable to investigate a information within a specific row, elaborated mechanism scientist Maxim Grinev in a new blog entry. [...]
Gibizz.com » New Cassandra Can Pack Two Billion Columns Into a Row (PC World)
January 15, 2011 at 1:44 am
Thanks for this.
Nice and concise.
Justin
January 21, 2011 at 5:49 pm
hi
thanx it really helps. ur site is bookmarked, i’ll learn a lot from u insallah
Saber
April 10, 2011 at 7:03 am
how can I “select”, “join”, “group” and “sort” the data in cassandra database . Give with examples
AJITH PPM
May 11, 2011 at 12:36 pm
[...] 本文翻译自:Do You Really Need SQL to Do It All in Cassandra? [...]
Cassandra中实现SQL操作 | 董的博客
June 30, 2011 at 8:26 am
Eipxzc http://wnbUj5n0mXqpcvm27Hms.biz
joseph
August 24, 2011 at 8:36 am
This is, IMHO, the dark side of the story: suppose you need to know who works in a particular department than you have to create a super column family EMP_DEP where
- The key is the dep.key
- The super columns name is the emp.key
- The super column is the set of the employees (to improve the retrieve operation)
IMHO, it is impressive because there is redundancy and a lot of pre-work.
Is my understanding correct? thanks
Michelangelo
Michelangelo
August 24, 2011 at 4:34 pm
I’d like to cancel this standing order primetime bikini yjgnwe
Tmzokwdk
September 4, 2011 at 5:44 am
Sorry, you must have the wrong number preteen
0926
Fpzicbir
September 7, 2011 at 1:09 am
A few months Lolita Cp ijf
Moulfkta
September 20, 2011 at 2:16 am
I read a lot Pussy Little Model
=[[
Lzgynpoa
September 25, 2011 at 3:43 am