Max's Output

Do You Really Need SQL to Do It All in Cassandra?

with 21 comments

NoSQL database systems are designed for scalability. The down side of that is a primitive key-value data model and, as the name suggest, no support for SQL. It might sound like a serious limitation – how can I “select”, “join”, “group” and “sort” the data? This post explains how all these operations can be quite naturally and efficiently implemented in one of the most famous NoSQL system – Cassandra.

To understand this post you need to know the Cassandra data model. You can find a quick introduction in my previous post. The power of the Cassandra data model is that it extends a basic key-value store with efficient data nesting (via columns and super columns). It means that you can read/update a column (or a super column) without retrieving the whole record. Below I describe how we can exploit data nesting to support various query operations.

Let’s consider a basic example: departments and employees with one-to-many relationships respectively.  So we have two column families: Emps and Deps. In Emps employee IDs are used as keys and there are Name, Birthdate, and City columns. In Deps keys are department IDs and the single column is Name.

1) Select
For example: select * from Emps where Birthdate = '25/04/1975'
To support this query we need to add one more column family named Birthdate_Emps in which key is a date and column names are IDs of those employees that were born on the date. The values are not used here and can be an empty byte array (denoted “-”). Every time when a new employee is inserted/deleted into/from Emps we need to update Birthdate_Emps. To execute the query we just need to retrieve all the columns for the key '25/04/1975' from Birthdate_Emps.

Notice that Birthdate_Emps is essentially an index that allows us to execute the query very efficiently. And this index is scalable as it is distributed across Cassandra nodes. You can go even further to speed up the query by redundantly storing information about employees (i.e. employee’s columns from Emps) in Birthdate_Emps. In this case employee IDs becomes names of super columns that contain corresponding employee columns.

2) Join
For example: select * from Emps e, Deps d where e.dep_id = d.dep_id
What does join essentially mean? It constructs records that represent relationship between entities. Such relationships can be easily (and even more naturally) represented via nesting. To do that add column family Dep_Emps in which key is a department ID and column names are IDs of the corresponding employees.

3) Group By
For example: select count(*) from Emps group by City
From implementation viewpoint Group By is very similar to select/indexing described above. You just need to add a column family City_Emps with cities as keys and employee IDs as column names. In this case you will count the number of employees on retrieval. Or you can have a single column named count which value is the pre-calculated number of employees in the city.

4) Order By
To keep data sorted in Cassandra you can use two mechanisms: (a) records can be sorted by keys using OrderPreservingPartitioner with range queries (more on this in Cassandra: RandomPartitioner vs OrderPreservingPartitioner). To keep nested data sorted you can use automatically supported ordering for column names.

To support all these operations we store redundant data optimized for each particular query. It has two implications:
1) You must know queries in advance (i.e. no support for ad-hoc queries). However, typically in Web applications and enterprise OLTP applications queries are well known in advance, few in number, and do not change often. Read Mike Stonebraker convincingly talking about that.  BTW, Constraint Tree Schema, described in the latter paper, also exploits nesting to organize data for predefined queries.

2) We shift the burden from querying to updating because what we essentially do is supporting materialized views (i.e. pre-computed results of queries). But it makes a lot of sense in case of using Cassandra as Cassandra is very much optimized for updates (thanks to eventual consistency and “log-structured” storage borrowed from Google BigTable). So we can use fast updates to speed up query execution. Moreover, use-cases typical for social applications are proven to be only scalable with push-on-change model (i.e. preliminary data propagation via updates with simple queries – the approach taken in this post) in comparison with pull-on-demand model (i.e. data are stored normalized and combined by queries on demand – classical relational approach). On push-on-change versus pull-on-demand read WHY ARE FACEBOOK, DIGG, AND TWITTER SO HARD TO SCALE?

Advertisement

Written by maxgrinev

July 12, 2010 at 12:56 am

Posted in Cassandra

21 Responses

Subscribe to comments with RSS.

  1. Great post Max.

    The examples are very useful and clear. What happens when you need something not so trivial, say that you need to do not just a join on e.dep_id = d.dep_id, but on some other values. What does one do then? Could you please give an example of that, so one can see how one “scales” the modeling further? Thanks.

    Otis Gospodnetic

    July 16, 2010 at 6:26 pm

    • Thank you!

      I just did not want to make the post too long. Answering your question, the approach seems to scale quite well by combining the above techniques. For example, suppose that you want to query select e.name from Deps d, Emps e where d.dep_id=e.dep_id and dep.located=’Bern’. You can create a column family that maps cities to the IDs of the employees who work in the city. In general the aproach should work fine for ‘equal’ predicates. It is not a perfect solution for ‘range’ predicates as range scan over keys is not quite efficient.

      maxgrinev

      July 22, 2010 at 10:21 am

      • Thanks for your posts Max. I’m starting to get a better grasp of how to structure data in Cassandra based on the logical (normalised) data model and the queries that will be performed. However, I’m still not sure about how to structure the data for queries against multiple attributes.

        In your example, you created indices for employee birthdays and employee cities. What if I wanted to find all the employees born in 1975 currently in San Francisco? Can such an index be built or would the query need to be performed against the emps column family?

        Thanks!

        You can build sort of compound index. A key in this index is the concatenation of the year of birth and the city of current location and each record contains all employees born in this year and currently located in this city.

        Carlos Macasaet

        August 16, 2010 at 6:18 am

  2. [...] Similar to indexes, materialized views store data redundantly. So they also can be supported via triggers. Find general guidelines on managing redundant data (indexes and materialized views) in Cassandra in my post “Do You Really Need SQL to Do It All in Cassandra?”. [...]

  3. Hi,

    I have a requirement of using Cassandra in my application. In my application there is one table with lot of data and most of my application uses that table. Due to lot of data,performance of the application is decreasing when i use that table is in Oracle.

    So, I have decided to use the Cassandra database for that one table and all other tables in oracle. Lot of business logic is dependent on that table.

    No my question is, Can I use the Cassandra for a table which has lot of business logic.

    I am unable to implement lot of where clauses for Cassandra database.

    Is there any supporting tool to use Cassandra in an efficient way?

    Please let me know…
    i am in urgency..

    Thanks in advance

    By Mallik

    mallikarjungunda

    August 18, 2010 at 8:47 am

  4. [...] Do You Really Need SQL to Do It All in Cassandra? [...]

  5. What about:

    select * from Emps where Birthdate >= ’25/04/1975′;

    We have not been able to adopt Cassandra because the notion of this query is nearly impossible (without a fixed data set) with the data model. Maybe I am overlooking something, wouldnt be the first time.

    As far as duplicating data, what impact does this have on the databases disk usage footprint?

    Basically, I am in a situation where I am collecting a decent amount of data, in a structure that can change, but require the ability of some type of dynamic filter like I mentioned up top… Any suggestions would be great, because I really like the idea of switching from MySql to some NoSql solution, because it would solve most of my dynamic schema issues…

    Justin Smith

    September 27, 2010 at 10:36 pm

  6. Good information Max.

    As you explained for where condition of Birthdate, is we are having two columnfamilies (Emps, Birthdates_Emps) in the Keyspace. If so can explain in brief how storage-conf.xml entry for this.

    Thanks in Advance.

    sekhar

    September 28, 2010 at 7:22 am

  7. [...] Because Cassandra cannot execute SQL query commands, the additional columns would be needed to analyze the data within a specific row, elaborated computer scientist Maxim Grinev in a recent blog entry. [...]

  8. [...] Because Cassandra cannot execute SQL query commands, the additional columns would be needed to analyze the data within a specific row, elaborated computer scientist Maxim Grinev in a recent blog entry. [...]

  9. [...] Because Cassandra can't govern SQL query commands, a additional columns would be indispensable to investigate a information within a specific row, elaborated mechanism scientist Maxim Grinev in a new blog entry. [...]

  10. Thanks for this.
    Nice and concise.

    Justin

    January 21, 2011 at 5:49 pm

  11. hi
    thanx it really helps. ur site is bookmarked, i’ll learn a lot from u insallah

    Saber

    April 10, 2011 at 7:03 am

  12. how can I “select”, “join”, “group” and “sort” the data in cassandra database . Give with examples

    AJITH PPM

    May 11, 2011 at 12:36 pm

  13. [...] 本文翻译自:Do You Really Need SQL to Do It All in Cassandra? [...]

  14. This is, IMHO, the dark side of the story: suppose you need to know who works in a particular department than you have to create a super column family EMP_DEP where
    - The key is the dep.key
    - The super columns name is the emp.key
    - The super column is the set of the employees (to improve the retrieve operation)

    IMHO, it is impressive because there is redundancy and a lot of pre-work.

    Is my understanding correct? thanks
    Michelangelo

    Michelangelo

    August 24, 2011 at 4:34 pm

  15. I’d like to cancel this standing order primetime bikini yjgnwe

    Tmzokwdk

    September 4, 2011 at 5:44 am

  16. Sorry, you must have the wrong number preteen
    0926

    Fpzicbir

    September 7, 2011 at 1:09 am

  17. A few months Lolita Cp ijf

    Moulfkta

    September 20, 2011 at 2:16 am

  18. I read a lot Pussy Little Model
    =[[

    Lzgynpoa

    September 25, 2011 at 3:43 am


Leave a Reply

Fill in your details below or click an icon to log in:

Gravatar
WordPress.com Logo

Please log in to WordPress.com to post a comment to your blog.

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.