When faced with a task of picking a database for a project we need to consider a lot of tradeoffs. Finding and fully understanding the limitations and implications of using each candidate databases is very hard. It’s often the case that some database behaviors, which have huge impact on our use case, are buried deep in the documentation or even require diving into the codebase. For me personally, one of those unexpected behaviors was the way Cassandra updates ordering works.

Ordering is tricky

One of the classic problem in distributed systems is ordering of messages. It’s problematic in cases when concurrent updates for the same key can be proposed. In such case nodes have to somehow agree on what is the actual order. The problem is however, way easier to solve if there are no concurrent updates. Since Cassandra is leaderless database multiple writes can be accepted by multiple nodes concurrently. It’s therefore crucial to understand how Cassandra handles such cases. Here are some quotes from the documentation:

As every replica can independently accept mutations to every key that it owns, every key must be versioned. Unlike in the original Dynamo paper where deterministic versions and vector clocks were used to reconcile concurrent updates to a key, Cassandra uses a simpler last write wins model where every mutation is timestamped (including deletes) and then the latest version of data is the “winning” value

Later in the same page it’s mentioned that:

Specifically all mutations that enter the system do so with a timestamp provided either from a client clock or, absent a client provided timestamp, from the coordinator node’s clock. Updates resolve according to the conflict resolution rule of last write wins. Cassandra’s correctness does depend on these clocks, so make sure a proper time synchronization process is running such as NTP.

and also:

Rows are guaranteed to be unique by primary key, and each column in a row resolve concurrent mutations according to last-write-wins conflict resolution. This means that updates to different primary keys within a partition can actually resolve without conflict

I may be an exception, but after reading the documentation I personally came to conclusion that Cassandra only needs to resolve conflicts if there are concurrent updates (and uses timestamp for that). I came to this conclusion after reading docs, since there are:

  • Mentions of “concurrent” updates in sections describing reconciliation/conflict resolution.
  • Mentions of “versioned key” statements. To me this indicates there is some kind of version ordering implemented (something equivalent to MVCC or vector clock in Dynamo).
  • Comparisons to Dynamo’s reconciliation process, that implements ordering based on vector clock. In such implementation only concurrent updates impose problem for ordering.
  • Mentions of “Last Write Wins” which hints that “whatever is written last, wins”.

It was not sure if my understanding of documentation is correct, so I decided to test that.

No logical ordering, only timestamps matter

Turns out my assumptions were wrong and conflict resolutions occur even when there are no concurrent updates. If a single client sends updates to Cassandra cluster (sequentially with CONSISTENCY=ALL) there is no guarantee that the final value is going to be the last one that the client has sent. Basically, when performing standard insert/update, all that matters is the timestamp attached to the value. It doesn’t matter what was the actual order of applied updates. If timestamp of new value happens to be smaller than timestamp of existing value, new value will simply be discarded. Cassandra only relies on timestamps provided by client or node itself (with some exceptions mentioned few sections below). How are those timestamps picked then?

Timestamps generated by Cassandra nodes

The timestamp for a statement can be provided either by the client or the coordinator node (the node that received request from a client). If there is a clock skew between cassandra nodes in a cluster, even if your client is sending requests sequentially (using load balancer) the final value might may not be the last one even though there was no concurrent updates and CONSISTENCY was set to ALL.

Timestamps generated by clients

Similarly, imagine you have a stateless app that sends updates to cassandra. This app has multiple instances which are reachable via load balancer using round-robin. Timestamps for Cassandra updates are set by clients (your app instances), but the clocks are out of sync. Since clocks on those clients are all over the place, the timestamps assigned to updates received by cassandra would also be all over the place. Even if you perform requests to the load balancer sequentially the final value might therefore may not be the last one.

Testing ordering wth out of sync clocks

Test source code available on github.

In order to test those scenarios I created a 3 node cassandra cluster using docker compose. Since docker containers share clock with host, all the cassandra nodes are perfectly in sync. In order to simulate clock skews I used libfaketime to set clocks as follows:

  • Cassandra 1: 3 seconds behind host
  • Cassandra 2: 6 seconds behind host
  • Cassandra 3: 9 seconds behind host

I also had to make a small tweak to the cqlsh.py to make sure server timestamps are used: self.session.use_client_timestamp=False.

After starting the cluster and creating test tables I run following script:

docker exec -it cass1 cqlsh -e "CONSISTENCY ALL; INSERT INTO ordering_test.ordering_test(key, value) VALUES('key', 'value_1')"
echo "Inserted 'value_1'"

docker exec -it cass2 cqlsh -e "CONSISTENCY ALL; INSERT INTO ordering_test.ordering_test(key, value) VALUES('key', 'value_2')"
echo "Inserted 'value_2'"

echo "Selecting current value"
docker exec -it cass3 cqlsh -e "CONSISTENCY ALL; SELECT * FROM ordering_test.ordering_test"

The first insert statement is sent to cass1 while second statement is sent to cass2. The select statement is sent to cass3.

Since consistency is set to ALL and queries are executed sequentially (there is no concurrency) it’s logical to expect that the final value would be value_2, right? However, since cass2 clock is 3 seconds behind cass1, value_1 has greater timestamp than value_2 and final result is value_1:

Consistency level set to ALL.
Inserted 'value_1'

Consistency level set to ALL.
Inserted 'value_2'

Selecting current value
Consistency level set to ALL.

 key | value
-----+---------
 key | value_1

I also wanted to make sure if timestamp of select doesn’t affect the result. AFAIK Cassandra doesn’t support MVCC or any similar feature, but it’s worth testing. cass3 (which is used for performing select at the end) has the biggest clock drift (9 seconds behind host). If there was MVCC-like feature present, the select should therefore return an empty value. After all, when cass3 performs select, there is no value with timestamp <= cass3’s timestamp. We can debug this by adding SELECT dateof(now()) FROM system.local to each command like so:

docker exec -it cass1 cqlsh -e "CONSISTENCY ALL; INSERT INTO ordering_test.ordering_test(key, value) VALUES('key', 'value_1'); SELECT dateof(now()) FROM system.local"
echo "Inserted 'value_1'"

docker exec -it cass2 cqlsh -e "CONSISTENCY ALL; INSERT INTO ordering_test.ordering_test(key, value) VALUES('key', 'value_2'); SELECT dateof(now()) FROM system.local"
echo "Inserted 'value_2'"

docker exec -it cass3 cqlsh -e "CONSISTENCY ALL; SELECT dateof(now()) FROM system.local, * FROM ordering_test.ordering_test; SELECT dateof(now()) FROM system.local"

Results:

Consistency level set to ALL.

 system.dateof(system.now())
---------------------------------
 2022-11-14 17:26:51.706000+0000

Inserted 'value_1'



Consistency level set to ALL.

 system.dateof(system.now())
---------------------------------
 2022-11-14 17:26:49.927000+0000

Inserted 'value_2'


Selecting current value
Consistency level set to ALL.

 key | value
-----+---------
 key | value_1
 
 system.dateof(system.now())
---------------------------------
 2022-11-14 17:26:47.828000+0000

As you can see, timestamp at each subsequent statement is smaller than the previous one. This indicates that select timestamp doesn’t influence the results.

Sources

According to my research the reconciliation is implemented by Cell.reconcile(c1, c2):

public static Cell<?> reconcile(Cell<?> c1, Cell<?> c2)
    {
        if (c1 == null || c2 == null)
            return c2 == null ? c1 : c2;

        if (c1.isCounterCell() || c2.isCounterCell())
            return resolveCounter(c1, c2);

        return resolveRegular(c1, c2);
    }

As you can see above counter cells are handled different (more about it in the next section). Apart from that, the reconciliation process for regular cells is very simple and purely based on timestamps. If timestamps are different, value with higher timestamp is picked:

long leftTimestamp = left.timestamp();
long rightTimestamp = right.timestamp();
if (leftTimestamp != rightTimestamp) 
    return leftTimestamp > rightTimestamp ? left : right;

In rare scenario when timestamps are the same, tombstones are prioritezed. If there are no tombstones then greater value is picked:

return compareValues(left, right) >= 0 ? left : right;

Are there any exceptions to timestamp based ordering?

To make things more fun there seems to be some exceptions. There are probably more - those are just the ones I encountered.

COUNTER columns updates actually behave as we would expect. Instead of using timestamps, cells of type COUNTER merges conflicting values. It makes sense, since counters can only be updated by some delta (They can’t be set to specific value). Since counter can only be updated by delta, the order of how those deltas are applied really doesn’t matter, does it?

Cassandra supports lightweight transactions that are using paxos in order to achieve a consensus for a new value proposed by one of the nodes. In our example, it can be triggered by using IF EXISTS like so:

UPDATE ordering_test.ordering_test SET value = 'value_1' where key = 'key' IF EXISTS

By using lightweight transactions the actual ordering of updates is respected and the final value is indeed value_2 as we would expect. However, since paxos is used, a lot of communication between nodes is required to achieve consensus which probably has serous performance implications.

What can we do about it?

We should try to keep our servers clocks in sync. Both clients and Cassandra servers. Properly configured NTP, however, doesn’t guarantee that there will be no problems mentioned above as there can always be some small drifts. Here are few ideas I came up wth to eliminate or at least reduce to minimum clock related issues:

  • Simply do not update existing values at all. If we only insert new values without conflicting keys the problem basically does not exist. Instead of updating you can add new entries and periodically remove old ones. If using CQRS pattern you can even use separate database for serving reads and use cassandra only for writes and streaming events to the read side.
  • Use client timestamps and always perform updates for the same key from the same client. This can be achieved by having consistent hashing so that the same client receives requests for the same entity all the time.
  • Use USING TIMESTAMP in your INSERT/UPDATE explicitly. If your data already has some notion of timestamps/ordering key you can use it to set timestamp on a row explicitly.
  • Use lightweight transactions, but be aware of decreased performance.
  • If possible, model your column using COUNTER type (however it has another set of limitations and quirks).

If I missed something or some of my conclusions are wrong please let me know in the comments.


Jakub Dziworski

Dev Blog