Extending Cassandra with Asynchronous Triggers
Latency matters! Amazon found every 100ms of latency cost them 1% in sales. Google found an extra .5 seconds in search page generation time dropped traffic by 20%. Even the smallest delay kills user satisfaction. You can find more on the importance of latency here and here.
Underlying database significantly contribute to the overall application response time, especially on update operations as they cannot be sped up via caching. A common approach to reduce the database-imposed latency is to break a request into sub-operations which are executed asynchronously so that the application will acknowledge the request without waiting to ensure that all the sub-operations are executed. This approach allows for tremendous latency reduction and also makes the latency guaranteed even when sub-operation execution time varies significantly. The implication of asynchronous execution is eventual consistency of data. But eventually consistency is already a common practice that meet requirements of many Web applications.
Asynchronous execution is typically implemented using queues in front of a database. Yet, integrating asynchronous execution into a database provides a number of benefits:
- Much easier programming model and simplified system management – the developer/administrator has to deal with one system (i.e. database) instead of two (i.e. queue and database)
- Asynchronous execution can be used to automate internal database operations with reduced latency – for example, updating indexes.
We have extended Cassandra database system with support for Asynchronous Triggers (CASSANDRA-1311). Asynchronous triggers is a basic mechanism that can be used to implement various use cases of asynchronous execution of application code at database side. Cassandra and Async triggers are a perfect match as they both exploit advantages out of eventual data consistency.
Cassandra Async Triggers
Like traditional database triggers, Cassandra Async trigger is a procedure that is automatically executed by the database in response to certain events on a particular database object (e.g. table or view). The distinguishing feature of Async trigger is that the database responds to the client on successful update execution without waiting for triggers to be executed, thus reducing response latency.
More precisely, Cassandra Async triggers can be described as follows:
- A trigger is set on a column family and is executed in case of any update to the column family. Cassandra triggers are “after” triggers. A trigger is executed after the update operation that fires the trigger and can see the results of the update.
- Trigger procedures are implemented in Java. The application developer implements the
- Cassandra Async triggers are mutation-level triggers. A trigger is executed for each mutation issued to the column family. For example, if a trigger is fired by
batch_mutateit will be executed for each mutation separately. Each trigger is parametrized by the mutation that fires it.
- In contrast to traditional triggers, which are synchronous, Cassandra Async triggers are asynchronous. The database acknowledges update execution to the client after the update is executed and the fired triggers are submitted for execution. Actual execution of fired triggers happens after the acknowledgement to the client. It allows saving latency but leads to eventual consistency of data.
- Our implementation guarantees triggers to be executed at least once. The system is responsible for handling failures during trigger execution so that the client does need to care about it. It makes client-side code easier to develop and maintain. Note that triggers may be executed more than once that requires triggers to be idempotent. This is a requirement to any Cassandra update though. We chose not to implement exactly-once semantics because it introduces essential overhead.
Async triggers is a powerful mechanism that allows for many interesting applications:
1) Index and Materialized View Support
Cassandra does not support secondary indexes. To index data other than by record key or column name, the application developer needs to store data redundantly in a separate column family. In this case triggers can be used to update secondary indexes whenever records in the original column family are modified.
Similar to indexes, materialized views store data redundantly. So they also can be supported via triggers. Find general guidelines on managing redundant data (indexes and materialized views) in Cassandra in my post “Do You Really Need SQL to Do It All in Cassandra?”.
Implementing index and view maintenance synchronously means coupling update and maintenance operations together. This will increase response time of each update operation depending on the number of indexes. Using an asynchronous mechanism will keep the response time constant.
2) Online Analytics
Async triggers can be effectively used to propagate changes (in near real-time) from the “operational” part of the database to the “analytical” part without reducing response time for operational updates. It opens up a way for many new interesting applications! Note that delay in the data propagation caused by asynchronous triggers is negligible to the majority of analytical applications. It is online (near real-time) analytics in comparison, for example, with offline MapReduce analytics.
3) Specific application: Data propagation in social networks
Data propagation in social networks (e.g. sending user’s posts to all her friends) are proven to be only scalable with push-on-change model (i.e. redundant data propagation via updates with simple queries) in comparison with pull-on-demand model (i.e. combining data from normalized storage on demand via queries). Twitter uses the push-on-change model: it employs queues to implement asynchronous data propagation. Using just a few triggers one can easily implement production version of a Twitter-like network.
Asynchronous triggers is submitted as a contribution to Cassandra. You can find it at CASSANDRA-1311