Max's Output

Managing Indexes in Cassandra using Async Triggers


[This post is written by Maxim Grinev and Martin Hentschel]

Suppose you are building a Cassandra application and want to speed up your queries with an index. Cassandra does not support secondary indexes out of the box, but storing redundant data in a different layout gives you the same effect. The main drawback is that your application (the code that writes to the DB) has to manage the index itself: every write to the DB must also update the index. This extra work noticeably increases the response time seen by the users of your web application.

In Figure 1, whenever a user request forces a write to the database (Step 1), the application also updates the index. Because at least two operations on the database layer are needed, the response time of the request increases. The advantage of this approach is that your index is always in sync with the data. Any query to the index will see the latest result (Step 2).

Figure 1: Application inserts data and builds index in one step.
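
To make the flow of Figure 1 concrete, here is a minimal in-memory sketch. It does not use the Cassandra client API: two Java maps stand in for the data and the index, and the class name, the method names, and the sample users are assumptions made only for illustration.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

// In-memory stand-in for the data and the index; not Cassandra client code.
public class SyncIndexSketch {
    // Data: user id -> user name (hypothetical sample data)
    private final Map<String, String> users = new ConcurrentHashMap<>();
    // Index: user name -> ids of all users with that name
    private final Map<String, Set<String>> index = new ConcurrentHashMap<>();

    // Step 1: the request performs both writes before it returns.
    public void insertUser(String userId, String name) {
        users.put(userId, name);
        index.computeIfAbsent(name, n -> ConcurrentHashMap.newKeySet()).add(userId);
    }

    // Step 2: a query to the index always reflects every completed write.
    public Set<String> findIdsByName(String name) {
        return new TreeSet<>(index.getOrDefault(name, Collections.emptySet()));
    }

    public static void main(String[] args) {
        SyncIndexSketch db = new SyncIndexSketch();
        db.insertUser("2", "Sue");
        db.insertUser("4", "Sue");
        System.out.println(db.findIdsByName("Sue")); // prints [2, 4]
    }
}
```

The cost described above is visible in insertUser: the caller waits for two writes instead of one.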

We recently proposed an extension to Cassandra, which we call Async triggers. An Async trigger will listen on a column family inside Cassandra. Whenever a modification to this column family is made, the trigger will be scheduled for an asynchronous execution. In our case, the logic to build and maintain the index shifts from the application to the trigger. This means the application has less work to do and can return faster. The response time of a user request will be reduced.

In Figure 2, a write to the database (Step 1) returns as soon as the write is finished. An Async trigger is scheduled to update the index and runs some time after the response has been sent to the client. This means that for a short period of time the data and the index are out of sync (i.e., inconsistent), so a query to the index (Step 2) might not always see the latest results. We believe that this is acceptable for many web applications. For example, on Twitter it is perfectly fine if a search misses tweets that were posted less than a second ago.

Figure 2: An Async trigger maintains the index.
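
The asynchronous flow of Figure 2 can be imitated in plain Java. The sketch below is not the trigger API from our patch: a single background thread plays the role of the component that runs scheduled triggers, and all class and method names are assumptions chosen for illustration.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Simulates the asynchronous flow in plain Java; this is not the trigger API.
public class AsyncIndexSketch {
    private final Map<String, String> users = new ConcurrentHashMap<>();
    private final Map<String, Set<String>> index = new ConcurrentHashMap<>();

    // One daemon thread stands in for whatever runs the scheduled triggers.
    private final ExecutorService triggerExecutor = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "async-trigger");
        t.setDaemon(true);
        return t;
    });

    // Step 1: write the data, schedule the index update, return immediately.
    public CompletableFuture<Void> insertUser(String userId, String name) {
        users.put(userId, name);
        return CompletableFuture.runAsync(
            () -> index.computeIfAbsent(name, n -> ConcurrentHashMap.newKeySet()).add(userId),
            triggerExecutor);
    }

    // Step 2: a query to the index may lag behind the most recent writes.
    public Set<String> findIdsByName(String name) {
        return new TreeSet<>(index.getOrDefault(name, Collections.emptySet()));
    }

    public static void main(String[] args) {
        AsyncIndexSketch db = new AsyncIndexSketch();
        CompletableFuture<Void> indexed = db.insertUser("2", "Sue");
        // A query issued at this point may or may not see user 2 yet.
        indexed.join(); // wait only to make the demo deterministic
        System.out.println(db.findIdsByName("Sue")); // prints [2]
    }
}
```

The returned future exists only so that the demo can wait for the index; a real client would not wait, which is exactly where the short window of inconsistency comes from.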

Of course, you may get similarly good response times with other architectures, for example by explicitly using queues to separate writes to the database from the maintenance of indexes. We are currently researching the advantages and disadvantages of Async triggers compared with such architectures.
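
As a rough illustration of the queue-based alternative, the application could enqueue an index update next to each write and let a separate worker apply the pending updates later. This is only a sketch under that assumption: the IndexUpdate type, the method names, and the in-memory maps are all hypothetical and do not describe any particular system.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Explicit queue between data writes and index maintenance (in-memory sketch).
public class QueueIndexSketch {
    // One pending index update: add userId to the index row for name.
    private static final class IndexUpdate {
        final String name;
        final String userId;
        IndexUpdate(String name, String userId) {
            this.name = name;
            this.userId = userId;
        }
    }

    private final Map<String, String> users = new ConcurrentHashMap<>();
    private final Map<String, Set<String>> index = new ConcurrentHashMap<>();
    private final BlockingQueue<IndexUpdate> pending = new LinkedBlockingQueue<>();

    // The application writes the data and only enqueues the index work.
    public void insertUser(String userId, String name) {
        users.put(userId, name);
        pending.add(new IndexUpdate(name, userId));
    }

    // A separate worker would call this in a loop; returns the number of updates applied.
    public int applyPendingUpdates() {
        int applied = 0;
        IndexUpdate update;
        while ((update = pending.poll()) != null) {
            index.computeIfAbsent(update.name, n -> ConcurrentHashMap.newKeySet()).add(update.userId);
            applied++;
        }
        return applied;
    }

    public Set<String> findIdsByName(String name) {
        return new TreeSet<>(index.getOrDefault(name, Collections.emptySet()));
    }

    public static void main(String[] args) {
        QueueIndexSketch db = new QueueIndexSketch();
        db.insertUser("2", "Sue");
        System.out.println(db.findIdsByName("Sue")); // prints [] because the index lags behind
        db.applyPendingUpdates();
        System.out.println(db.findIdsByName("Sue")); // prints [2]
    }
}
```

In this variant the application itself has to know about the queue and the index, whereas with a trigger that logic lives next to the column family.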

Example: Index users by name

Here is a concrete example of how to index a user database not only by user id, but also by name (a secondary index). Figure 3 shows a possible layout of column families in Cassandra. The first column family “Users” stores data about users. Each row is naturally indexed by its row key (in our case it is the user id). The second column family “Index” stores redundant data to quickly retrieve users by their name. For example if you want to look up the name “Sue”, you will find two users with ids 2 and 4.

Figure 3: Database layout storing users by id and name.
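
The layout can also be pictured as nested sorted maps, with each column family mapping a row key to its columns. The sketch below is an in-memory model only, not Cassandra code. The two users named “Sue” with ids 2 and 4 come from the example above; the helper names are hypothetical, and the placeholder column value "1" in the index is the value our trigger code writes.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Models a column family as: row key -> (column name -> column value).
public class LayoutSketch {
    public static SortedMap<String, SortedMap<String, String>> newColumnFamily() {
        return new TreeMap<String, SortedMap<String, String>>();
    }

    // Sets one column in one row, creating the row if it does not exist yet.
    public static void insert(SortedMap<String, SortedMap<String, String>> columnFamily,
                              String rowKey, String column, String value) {
        columnFamily.computeIfAbsent(rowKey, k -> new TreeMap<String, String>()).put(column, value);
    }

    public static void main(String[] args) {
        // "Users": row key = user id, column "name" holds the user name.
        SortedMap<String, SortedMap<String, String>> users = newColumnFamily();
        insert(users, "2", "name", "Sue");
        insert(users, "4", "name", "Sue");

        // "Index": row key = user name, one column per user id, value is a placeholder.
        SortedMap<String, SortedMap<String, String>> index = newColumnFamily();
        insert(index, "Sue", "2", "1");
        insert(index, "Sue", "4", "1");

        // Looking up "Sue" is a single row read; the column names are the user ids.
        System.out.println(index.get("Sue").keySet()); // prints [2, 4]
    }
}
```

Storing the user ids as column names rather than as values lets one index row hold any number of users with the same name.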

Using an Async trigger to update the index whenever there is a write to the “Users” column family involves two steps. First, we need to implement the trigger, and second, we need to specify the column family that the trigger will listen on.

To implement a trigger, we implement the execute method of the ITrigger interface. First, we connect to the local Cassandra instance. Because the trigger executes within Cassandra, no network overhead is involved. Then we get the name of the user that has just been inserted (or modified). The user id is provided by the key parameter, so we can insert this user id into the index row for that user name. (Note that I have removed all exception handling and null checks for ease of reading.)

public class UpdateIndex implements ITrigger
{
    public void execute(byte[] key, ColumnFamily cf)
    {
        // connect to local Cassandra instance
        CassandraServer client = new CassandraServer();
        client.set_keyspace("TriggerExample");

        // get user name
        byte[] userName = cf.getColumn("name".getBytes()).value();

        // insert the user id into the index
        ColumnParent parent = new ColumnParent("Index");
        byte[] userId = key;
        long timestamp = System.currentTimeMillis();
        Column indexedValueColumn = new Column(userId,
                 "1".getBytes(), new Clock(timestamp));
        client.insert(userName, parent, indexedValueColumn, ConsistencyLevel.ONE);
    }
}

It remains to specify that this trigger should listen on the “Users” column family. The following entry needs to be added to the cassandra.yaml file.

triggers:
    - name: UpdateIndex
      keyspace: TriggerExample
      column_family: Users
      implementation: UpdateIndex

That’s it. The complete example source code along with appropriate scripts to run our example can be found in the directory contrib/trigger_example.

Currently our extension to Cassandra is under submission. In order to try Async triggers and this example, find the patch here: https://issues.apache.org/jira/browse/CASSANDRA-1311

Written by maxgrinev

July 23, 2010 at 3:23 pm

Posted in Cassandra

8 Responses


  1. Absolutely great!

    Augi

    August 5, 2010 at 5:29 pm

  2. Very helpful.

    Frank LoVecchio

    September 13, 2010 at 4:35 pm

  3. Very interesting – I wonder about the locality of the execution of the trigger – will it run on a host that has the modified data? Does it run in the same JVM?

    Henrik Lindberg

    July 12, 2011 at 3:57 pm

  4. Exxcellent way of telling, and pleasant article to obtain facts
    concerning my presentation subject, which i am going tto convey in academy.

    Max Detox Blend

    September 14, 2013 at 5:23 am

  5. I’ve been surfing online more than three hours today, yet I never found any interesting article like yours. It’s pretty worth enough for me. In my opinion, if all webmasters and bloggers made good content as you did, the web will be a lot more useful than ever before.

    Sohail Shaikh

    May 16, 2015 at 9:04 am

  6. I am trying to implement a trigger that populate a View based on another input but
    CassandraServer cassandraServer = new CassandraServer();
    cassandraServer.set_keyspace(“recommendation_engine”);
    throws an exception
    java.lang.AssertionError: null
    at org.apache.cassandra.thrift.ThriftSessionManager.currentSession(ThriftSessionManager.java:55) ~[apache-cassandra-2.1.7.jar:2.1.7]
    at org.apache.cassandra.thrift.CassandraServer.state(CassandraServer.java:103) ~[apache-cassandra-2.1.7.jar:2.1.7]
    at org.apache.cassandra.thrift.CassandraServer.set_keyspace(CassandraServer.java:1750) ~[apache-cassandra-2.1.7.jar:2.1.7]
    at TestTrigger.augment(TestTrigger.java:32) ~[na:na]

    the connection is null can you someone provide some information ?!

    adelinghanayem

    July 22, 2015 at 4:27 am

