Max's Output

Do You Really Need SQL to Do It All in Cassandra?

with 49 comments

NoSQL database systems are designed for scalability. The down side of that is a primitive key-value data model and, as the name suggest, no support for SQL. It might sound like a serious limitation – how can I “select”, “join”, “group” and “sort” the data? This post explains how all these operations can be quite naturally and efficiently implemented in one of the most famous NoSQL system – Cassandra.

To understand this post you need to know the Cassandra data model. You can find a quick introduction in my previous post. The power of the Cassandra data model is that it extends a basic key-value store with efficient data nesting (via columns and super columns). It means that you can read/update a column (or a super column) without retrieving the whole record. Below I describe how we can exploit data nesting to support various query operations.

Let’s consider a basic example: departments and employees with one-to-many relationships respectively.  So we have two column families: Emps and Deps. In Emps employee IDs are used as keys and there are Name, Birthdate, and City columns. In Deps keys are department IDs and the single column is Name.

1) Select
For example: select * from Emps where Birthdate = '25/04/1975'
To support this query we need to add one more column family named Birthdate_Emps in which key is a date and column names are IDs of those employees that were born on the date. The values are not used here and can be an empty byte array (denoted “-”). Every time when a new employee is inserted/deleted into/from Emps we need to update Birthdate_Emps. To execute the query we just need to retrieve all the columns for the key '25/04/1975' from Birthdate_Emps.

Notice that Birthdate_Emps is essentially an index that allows us to execute the query very efficiently. And this index is scalable as it is distributed across Cassandra nodes. You can go even further to speed up the query by redundantly storing information about employees (i.e. employee’s columns from Emps) in Birthdate_Emps. In this case employee IDs becomes names of super columns that contain corresponding employee columns.

2) Join
For example: select * from Emps e, Deps d where e.dep_id = d.dep_id
What does join essentially mean? It constructs records that represent relationship between entities. Such relationships can be easily (and even more naturally) represented via nesting. To do that add column family Dep_Emps in which key is a department ID and column names are IDs of the corresponding employees.

3) Group By
For example: select count(*) from Emps group by City
From implementation viewpoint Group By is very similar to select/indexing described above. You just need to add a column family City_Emps with cities as keys and employee IDs as column names. In this case you will count the number of employees on retrieval. Or you can have a single column named count which value is the pre-calculated number of employees in the city.

4) Order By
To keep data sorted in Cassandra you can use two mechanisms: (a) records can be sorted by keys using OrderPreservingPartitioner with range queries (more on this in Cassandra: RandomPartitioner vs OrderPreservingPartitioner). To keep nested data sorted you can use automatically supported ordering for column names.

To support all these operations we store redundant data optimized for each particular query. It has two implications:
1) You must know queries in advance (i.e. no support for ad-hoc queries). However, typically in Web applications and enterprise OLTP applications queries are well known in advance, few in number, and do not change often. Read Mike Stonebraker convincingly talking about that.  BTW, Constraint Tree Schema, described in the latter paper, also exploits nesting to organize data for predefined queries.

2) We shift the burden from querying to updating because what we essentially do is supporting materialized views (i.e. pre-computed results of queries). But it makes a lot of sense in case of using Cassandra as Cassandra is very much optimized for updates (thanks to eventual consistency and “log-structured” storage borrowed from Google BigTable). So we can use fast updates to speed up query execution. Moreover, use-cases typical for social applications are proven to be only scalable with push-on-change model (i.e. preliminary data propagation via updates with simple queries – the approach taken in this post) in comparison with pull-on-demand model (i.e. data are stored normalized and combined by queries on demand – classical relational approach). On push-on-change versus pull-on-demand read WHY ARE FACEBOOK, DIGG, AND TWITTER SO HARD TO SCALE?

About these ads

Written by maxgrinev

July 12, 2010 at 12:56 am

Posted in Cassandra

49 Responses

Subscribe to comments with RSS.

  1. Great post Max.

    The examples are very useful and clear. What happens when you need something not so trivial, say that you need to do not just a join on e.dep_id = d.dep_id, but on some other values. What does one do then? Could you please give an example of that, so one can see how one “scales” the modeling further? Thanks.

    Otis Gospodnetic

    July 16, 2010 at 6:26 pm

    • Thank you!

      I just did not want to make the post too long. Answering your question, the approach seems to scale quite well by combining the above techniques. For example, suppose that you want to query select e.name from Deps d, Emps e where d.dep_id=e.dep_id and dep.located=’Bern’. You can create a column family that maps cities to the IDs of the employees who work in the city. In general the aproach should work fine for ‘equal’ predicates. It is not a perfect solution for ‘range’ predicates as range scan over keys is not quite efficient.

      maxgrinev

      July 22, 2010 at 10:21 am

      • Thanks for your posts Max. I’m starting to get a better grasp of how to structure data in Cassandra based on the logical (normalised) data model and the queries that will be performed. However, I’m still not sure about how to structure the data for queries against multiple attributes.

        In your example, you created indices for employee birthdays and employee cities. What if I wanted to find all the employees born in 1975 currently in San Francisco? Can such an index be built or would the query need to be performed against the emps column family?

        Thanks!

        You can build sort of compound index. A key in this index is the concatenation of the year of birth and the city of current location and each record contains all employees born in this year and currently located in this city.

        Carlos Macasaet

        August 16, 2010 at 6:18 am

    • That’s a knowing answer to a diffcilut question

      Roxy

      August 9, 2012 at 7:43 pm

  2. [...] Similar to indexes, materialized views store data redundantly. So they also can be supported via triggers. Find general guidelines on managing redundant data (indexes and materialized views) in Cassandra in my post “Do You Really Need SQL to Do It All in Cassandra?”. [...]

  3. Hi,

    I have a requirement of using Cassandra in my application. In my application there is one table with lot of data and most of my application uses that table. Due to lot of data,performance of the application is decreasing when i use that table is in Oracle.

    So, I have decided to use the Cassandra database for that one table and all other tables in oracle. Lot of business logic is dependent on that table.

    No my question is, Can I use the Cassandra for a table which has lot of business logic.

    I am unable to implement lot of where clauses for Cassandra database.

    Is there any supporting tool to use Cassandra in an efficient way?

    Please let me know…
    i am in urgency..

    Thanks in advance

    By Mallik

    mallikarjungunda

    August 18, 2010 at 8:47 am

  4. [...] Do You Really Need SQL to Do It All in Cassandra? [...]

  5. What about:

    select * from Emps where Birthdate >= ’25/04/1975′;

    We have not been able to adopt Cassandra because the notion of this query is nearly impossible (without a fixed data set) with the data model. Maybe I am overlooking something, wouldnt be the first time.

    As far as duplicating data, what impact does this have on the databases disk usage footprint?

    Basically, I am in a situation where I am collecting a decent amount of data, in a structure that can change, but require the ability of some type of dynamic filter like I mentioned up top… Any suggestions would be great, because I really like the idea of switching from MySql to some NoSql solution, because it would solve most of my dynamic schema issues…

    Justin Smith

    September 27, 2010 at 10:36 pm

  6. Good information Max.

    As you explained for where condition of Birthdate, is we are having two columnfamilies (Emps, Birthdates_Emps) in the Keyspace. If so can explain in brief how storage-conf.xml entry for this.

    Thanks in Advance.

    sekhar

    September 28, 2010 at 7:22 am

  7. [...] Because Cassandra cannot execute SQL query commands, the additional columns would be needed to analyze the data within a specific row, elaborated computer scientist Maxim Grinev in a recent blog entry. [...]

  8. [...] Because Cassandra cannot execute SQL query commands, the additional columns would be needed to analyze the data within a specific row, elaborated computer scientist Maxim Grinev in a recent blog entry. [...]

  9. [...] Because Cassandra can't govern SQL query commands, a additional columns would be indispensable to investigate a information within a specific row, elaborated mechanism scientist Maxim Grinev in a new blog entry. [...]

  10. Thanks for this.
    Nice and concise.

    Justin

    January 21, 2011 at 5:49 pm

  11. hi
    thanx it really helps. ur site is bookmarked, i’ll learn a lot from u insallah

    Saber

    April 10, 2011 at 7:03 am

  12. how can I “select”, “join”, “group” and “sort” the data in cassandra database . Give with examples

    AJITH PPM

    May 11, 2011 at 12:36 pm

  13. [...] 本文翻译自:Do You Really Need SQL to Do It All in Cassandra? [...]

  14. This is, IMHO, the dark side of the story: suppose you need to know who works in a particular department than you have to create a super column family EMP_DEP where
    - The key is the dep.key
    - The super columns name is the emp.key
    - The super column is the set of the employees (to improve the retrieve operation)

    IMHO, it is impressive because there is redundancy and a lot of pre-work.

    Is my understanding correct? thanks
    Michelangelo

    Michelangelo

    August 24, 2011 at 4:34 pm

  15. I’d like to cancel this standing order primetime bikini yjgnwe

    Tmzokwdk

    September 4, 2011 at 5:44 am

  16. Sorry, you must have the wrong number preteen
    0926

    Fpzicbir

    September 7, 2011 at 1:09 am

  17. A few months Lolita Cp ijf

    Moulfkta

    September 20, 2011 at 2:16 am

  18. I read a lot Pussy Little Model
    =[[

    Lzgynpoa

    September 25, 2011 at 3:43 am

  19. I’ve only just arrived http://osausidutyc.de.tl lz magazin preteen I agree some talk is nice but he should shut up so we could see a cumshot, but iy would be just a little dribble…

    Rosfwxpo

    May 6, 2012 at 9:57 am

  20. instead of making a ColumnFamily for “Birthdate_Emps” why not index that column?

    update column family Emps with column_metadata =
    [
    {
    validation_class: ????,
    column_name : birth_date,
    index_type : KEYS,
    index_name : birth_date
    },
    ];

    Shaun

    May 9, 2012 at 5:51 pm

  21. this post is the clearest i could find. Congrats!

    Lautaro

    July 2, 2012 at 5:15 pm

  22. [...] the behavior of a JOIN. Refer the following blogpost for details on implementing the same in Casssandra. Some noSQL supports SQL like query langauge. For example Cassandra has CQL; Cassandra Query [...]

  23. [...] decentralised, elastic, fault-tolerant, durable; with a rich datamodel that provides often superior approaches to joins, grouping, and ordering than traditional sql. But we’ll spend more time [...]

  24. [...] Do You Really Need SQL to Do It All in Cassandra? [...]

  25. Max, I have two simple questions on your article. First, at the end of the article you say “To keep data sorted in Cassandra you can use two mechanisms: (a)…”. But there is no (b). What is (b)? Secondly, you do not include an example as with the earlier explanations. Can you offer an example?

    Thank you

    Sandy Cairns

    October 8, 2012 at 3:37 am

  26. I have an additional question. In relational, especially Business Intelligence, it is common to grab some data and do an “order by” “group by”. The “group by” is done on the data you just ordered. Your article (I’m sure for simplicity; your article is very helpful.) illustrates them as unrelated queries. So, I ask then: how can you do the group by and order by together.

    The additional question behind this is: how applicable is Cassandra to Business Intelligence querying and reporting. In BI, you might say: give me a breakdown of product” sales and quantity, by customer by product by week. Or, give me Order profitability by customer, order it (i.e., sort it) by the Org Unit that sold the order, but do a “group by” from product to product subgroup to product group”.

    Thank you.

    Sandy Cairns

    October 8, 2012 at 3:58 am

  27. [...] based on the Key-Value model, Cassandra also supports the concept of columns and super-columns – essentially nested Key-Value pairs – that facilitate the modeling of more complex data [...]

  28. SQL Error: line 1:23 mismatched input ‘group’ expecting EOF InvalidRequestException

    Rajiv

    January 21, 2013 at 12:46 pm

    • Im using cassandra 1.1.7 but most of the queries has problem. Please suggest me to come up from this. Advance Thx

      Rajiv

      January 21, 2013 at 12:48 pm

  29. So for every new form of query you need to create another column family (effectively ‘materialized view’) on the data…
    … You call it natural. I call it un-natural

    dbwrangler

    March 21, 2013 at 12:35 pm

  30. Thank you for all the information you have put under the ‘Cassandra’ tag!! Made me understand the beast in a jiffy..

  31. Continue using coconut oil and you’ll see the benefits in good, clear skin, free of dryness and other skin conditions and a lessening of fine lines and wrinkles. Although coconut oil is mainly used within soap for structural purposes, its added skin care benefits make it an element not to be overlooked by the cosmetic world. Coconut oil for skin health will make your skin smoother and softer than you could have dreamed possible.

    coconut oil acne

    April 24, 2013 at 2:27 am

  32. I find it’s a great way to break the fast and stop my stomach grumbling in the morning which helps to maintain and feed my muscles early in the morning while letting my body use my fat for fuel. There are many diet plans that are available for you to try but make sure you have success with them. Regardless of the sometimes popular name of “banana miracle diet,” the banana diet is not a miracle or some magical silver bullet.

    Magdalena

    April 26, 2013 at 2:15 am

  33. They both bruise very easily, so it is best to handle
    them by their stems. It is apparently about half the price of similar quality oils.
    Many new innovative methods are introduced in the
    shale gas drilling such as drilling the well vertically, usage of hydraulic
    fracturing, injecting chemicals, sand and water at high pressure into the rock and allow the gas
    to flow.

    Katrice

    April 27, 2013 at 11:59 pm

  34. I’m thinking you need a spam filter.

    Richard

    May 1, 2013 at 1:19 pm

  35. I’m really enjoying the theme/design of your website. Do you ever run into any internet browser compatibility problems? A small number of my blog audience have complained about my site not working correctly in Explorer but looks great in Opera. Do you have any advice to help fix this problem?

    More Information

    May 8, 2013 at 12:54 am

  36. Hi to every one, since I am in fact eager of reading this weblog’s post to be updated on a regular basis. It contains nice information.

    heart conditions

    May 27, 2013 at 7:34 pm

  37. Hi, I do think this is an excellent site. I stumbledupon
    it ;) I am going to revisit once again since I saved as a favorite it.
    Money and freedom is the best way to change, may you be rich and continue to
    help others.

    green coffee extract

    July 5, 2013 at 1:53 pm

  38. Heya fantastic blog! Does running a blog similar to this
    take a massive amount work? I have absolutely no knowledge of computer programming but I had
    been hoping to start my own blog soon. Anyway,
    should you have any ideas or techniques for new blog owners please share.
    I understand this is off subject but I just
    wanted to ask. Many thanks!

  39. […] maxgrinev.com […]

    Cassandra | Annotary

    July 29, 2013 at 5:25 am

  40. Hi, Very useful information.
    How about select * from Emps where Birthdate >= ’25/04/1975′;
    Please explain.

    venkatesh

    August 29, 2013 at 3:04 am

  41. Whats up very cool website!! Guy .. Beautiful ..
    Superb .. I will bookmark your blog and take the feeds additionally?
    I’m glad to seek out so many helpful info here within the submit, we want develop extra strategies in this regard, thanks for sharing.
    . . . . .

    glowing pendants

    September 20, 2013 at 7:32 am

  42. Helpful info. Lucky me I found your site accidentally, and I’m surprised why this accident did not took place earlier!

    I bookmarked it.

    wayfair coupons

    December 5, 2013 at 11:41 am

  43. Hi there just wanted to give you a quick heads up and let you
    know a few of the images aren’t loading properly. I’m not sure why
    but I think its a linking issue. I’ve tried it in two different browsers and both show the same outcome.

    Delmar

    February 23, 2014 at 3:55 pm

  44. It’s wonderful that you are getting thoughts from this post as well as from our discussion made at this time.

    Chance

    March 21, 2014 at 8:33 pm


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

%d bloggers like this: