Chapter 9 - Consistency and Consensus (Part One)

· 15 min read

Notes from Chapter 9 of Martin Kleppmann's 'Designing Data-Intensive Applications' book. This chapter is split into two parts.

In this chapter, we focus on some of the abstractions that applications can rely on in building fault-tolerant distributed systems. One of these is Consensus. Once there's a consensus implementation, applications can use it for things like leader election and state machine replication.

Table of Contents

Consistency Guarantees

We've discussed eventual consistency in some of the earlier chapters as one of the consistency guarantees provided by some applications. It means that even though there might be delays in replicating data across multiple nodes, the data will eventually get to all nodes.

However, it is a very weak guarantee as it doesn't say when the replicas will converge, it just says that they will converge.

There are stronger consistency guarantees that can be provided, which we'll touch on in this chapter, but these come at a cost. These stronger guarantees often have worse performance or are less fault-tolerant than systems with weaker guarantees.


The idea behind Linearizability is that the database should always appear as if there is only one copy of the data. This means that making the same request on multiple nodes should always give the same response as long as no update is made between those requests.

It is also a recency guarantee, meaning that the value read must be the most recent or up-to-date value, and is not from a stale cache. Basically, as soon as a client successfully completes a write, all other clients must see the value just written.

If one client's read returns a new value, all subsequent reads must also return the new value.

Note: In the book, Linearizability is said to also be known as atomic consistency, strong consistency, immediate consistency or external consistency. However, Google's Cloud Spanner has a different idea and distinguishes between some of those terms. This distinction is explained in another post.

Linearizability vs Serializability

Linearizability is a recency guarantee on reads and writes of a single object. This guarantee does not group multiple operations together into a transaction (meaning it cannot protect against a problem like write skew, where a transaction makes a write based on a value it read earlier that has now been updated by another concurrently running transaction).

Serializability is an isolation property of transactions that guarantees that transactions behave the same as if they had executed in some serial order i.e. each transaction is completed before the next one starts. There is no guarantee on what serial order these transactions appear to run in, all that matters is that it is a serial order.

When a database provides both serializability and linearizability, the guarantee is known as strict serializability or strong one-copy serializability. I believe external consistency and strict serializability provide the same guarantees.

Two Phase-Locking and Actual Serial Execution are implementations of serializability that are also linearizable. However, serializable snapshot isolation is not linearizable, since a transaction will be reading values from a consistent snapshot.

Question: If serializable snapshot isolation is well implemented by ensuring that it detects writes in a transaction that may affect prior reads (from a consistent snapshot) or that it detects stale reads, wouldn't that make it linearizable as one of these transactions will be aborted and will thus preserve the recency guarantee?

Answer: I'm guessing the risk here is that the stale read might have returned a value now being used outside of the database, which then violates the linearizability guarantee.

Note that a stale read is not a violation of serializability, see here. If Transactions A & B are concurrent and Transaction A commits before Transaction B, serializability is still preserved if the database makes it look like the operations in Transaction B happened before those in Transaction A. The key thing is that the transactions appear to be executed one after the other.

Relying on Linearizability

As good as linearizability is as a guarantee, it is not critical for all applications. However, there are examples of where linearizability is important for making a system work correctly, and we'll cover them here.

Locking and leader election

A system with a single-leader replication model must ensure that there's only ever one leader at a time. One way to implement leader election is by using a lock. All the eligible nodes start up and try to acquire a lock and the successful one becomes the leader.

This lock must be linearizable: once a node owns the lock, all the other nodes must see that it is that node that owns the lock.

Apache ZooKeeper and etcd are often used to implement distributed locks and leader election.

Constraints and uniqueness guarantees

When multiple users are concurrently trying to register a value that must be unique, each user can be thought of as acquiring a lock on that value. E.g. a username or email address system.

We see similar issues in examples like ensuring that a bank account never goes negative, not selling more items than is available in stock, not concurrently booking the same seat on a flight, or in a theater for two people. For these constraints to be implemented properly, there needs to be a single up to date value (the account balance, the stock level, the seat occupancy) that all nodes agree on.

However, note that some of these constraints can be treated loosely and are not always critical, so linearizability may not be needed.

Implementing Linearizable Systems

Seeing that linearizability means the system behaves as if there is only one copy of the data, the simplest way to implement it will be to actually have just one copy of the data. However, that won't be fault-tolerant if the node that has the single copy becomes unavailable.

Since replication is the most common way to make a system fault-tolerant, we'll compare different replication methods here and discuss whether they can be made linearizable.

The Cost of Linearizability

While linearizability is often desirable, the performance costs mean that it is not always an ideal guarantee.

Consider a scenario where we have two data centers and there's a network interruption between those data centers:

The CAP Theorem

The CAP theorem is a popular theorem in Distributed Systems that is often misunderstood. It describes a trade-off in building distributed systems. In relation to the scenario above, this trade-off is as follows:

In the original definition of the CAP Theorem, the behaviour described for Consistency is linearizability. Availability means that any non-failing node must return a response that contains the results of the requested work i.e., not a 500 error or a timeout message.

Therefore, in the face of network partitions or faults, a system has to choose between either total availability or total linearizability. That's the CAP Theorem in simple terms.

Applications that do not require linearizability are more tolerant of network problems since the nodes can continue to serve requests.

Note that while the CAP Theorem has been useful, the definition is quite narrow in scope (it only considers Linearizability as the consistency model and network partitions (i.e. nodes in a network disconnected from each other) as the only types of faults, it says nothing about network delays or dead nodes.

You can read a critique of the CAP theorem in this article, which also proposes alternative ways to analyze systems.

Linearizability and network delays

Fault tolerance is not the only reason for dropping linearizability, performance is another reason why it sometimes gets dropped.

Interestingly, RAM on a modern multi-core CPU is not linearizable. This means that if a thread running on one CPU core writes to a memory address, a thread on another CPU core is not guaranteed to read the latest value written (unless a fence or memory barrier is used).

From StackOverflow:

A memory fence/barrier is a class of instructions that mean memory reads/writes occur in the order you expect. For example a 'full fence' means all reads/writes before the fence are committed before those after the fence.

This happens because every CPU core has its own memory cache and store buffer, and memory access goes to the cache by default. Changes are asynchronously written out to main memory. Accessing data in the cache is faster than going to the main memory, so this feature is useful for good performance on modern CPUs.

We can't say that this tradeoff was made for availability purposes, because we wouldn't expect on CPU core to continue to function properly while disconnected from the rest of the computer.

Linearizability is always slow, not just during a network fault. There's a proof in this paper that if you want linearizability, the response time of read and write requests is at least proportional to the uncertainty of delays in the network.

The response time will certainly be high in networks with highly variable delays. Weaker consistency models can be much faster than linearizability and as it is with everything, there's always a tradeoff.

In Chapter 12, there are some approaches suggested for avoiding linearizability without sacrificing correctness.

Ordering Guarantees

Ordering has been mentioned a lot in this book because it is such a fundamental idea in distributed systems. Some of the contexts in which we've discussed it so far are:

Ordering and Causality

One of the reasons why ordering keeps coming up is that it helps preserve causality. With causality, an ordering of events is guaranteed such that cause always comes before effect. If one event happened before another, causality will ensure that that relationship is captured i.e. the happens-before relationship. This is useful because if one event happens as a result of another one, it can lead to inconsistencies in the system if that order is not captured. Some examples of this are:

A system that obeys the ordering imposed by causality is said to be causally consistent. For example, snapshot isolation provides causal consistency, since when you read some data from it, you must also be able to see any data that causally precedes it (assuming it has not be deleted within the transaction).

The causal order is not a total order

If elements are in a total order, it means that they can always be compared. That is, with any two elements, you can always see which one is greater and which is smaller.

With a partial order, we can sometimes compare the elements and say which is bigger or smaller, but in other cases the elements are incomparable. For example, mathematical sets are not totally ordered. You can't compare {a, b} with {b, c}.

This difference between total order and a partial order is reflected when we compare Linearizability and Causality as consistency models:


We have a total order of operations in a linearizable system. If the system behaves as if there is only one copy of the data, and every operation is atomic (meaning we can always point to before and after that operation), then we can always say which operation happened first.


Two operations are ordered if they are causally related (i.e. we can say which happened before the other), but are incomparable if they are concurrent. With concurrent operations, we can't say that one happened before the other.

This definition means that there is no concurrency in a linearizable database. We can always say which operations happened before the other.

The version history of a system like Git is similar to a graph of causal dependencies. One commit often happens after another, but sometimes they branch off, and we create merges when those concurrently created commits are combined.

Linearizability is stronger than causal consistency

The relationship between linearizability and causal order is that linearizability implies causality. Any system that is linearizable will preserve causality out of the box.

This is part of what makes linearizable systems easy to understand. However, given the cost of linearizability that we've discussed above, many distributed systems have dropped linearizability.

Fortunately, linearizability is not the only way of preserving causality. Causal consistency is actually the strongest possible consistency model that does not slow down due to network delays, and also remains available in the face of network failures. The caveat here is that in the face of network failures, clients must stick to the same server, given that the server captures the effect of all operations that happened causally before the partition.

Capturing causal dependencies

Causal consistency captures the notion that causally-related operations should appear in the same order on all processes—though processes may disagree about the order of causally independent operations - Jepsen

For a causally consistent database, when a replica processes an operation, it needs to ensure that all the operations that happened before it have already been processed; if a preceding operation is missing, the system must hold off on processing the later one until the preceding operation has been processed.

The hard part is determining how to describe the "knowledge" of a node in a system. If a node had seen the value of X when it issued the write Y, X and Y must be causally related.

We discussed 'Detecting Concurrent Writes' earlier where we focused on causality in a leaderless datastore and detecting concurrent writes to the same key in order to prevent lost updates. For causal consistency though, we need to go beyond just keeping track of a single key, but instead tracking causal dependencies across the entire database.

To determine causal ordering, the database needs to keep track of which version of the data was read by an application.

Sequence Number Ordering

A good way of keeping track of causal dependencies in a database is by using sequence numbers or timestamps to order the events. This timestamp can be a logical clock which is an algorithm that generates monotonically increasing numbers for each operation. These sequence numbers provide a total order meaning that if we have two sequence numbers, we can always determine which is greater.

The important thing is to create sequence numbers in a total order that is consistent with causality meaning that if operation A causally happened before B, then the sequence number for A must be lower than that of B. We can order concurrent operations arbitrarily.

With single-leader databases, the replication log defines a total order of write operations that is consistent with causality. Here, the leader can assign a monotonically increasing sequence number to each operation in the log. A follower that applies the writes in the order they appear in the replication log will always be in a causally consistent state.

Noncausal sequence number generators

In a multi-leader or leaderless database, generating sequence numbers for operations can be done in different ways such as:

However, while these options perform better than pushing all operations through a single leader which increments the counter, the problem with them is that they these sequence number are not consistent with causality. They do not capture ordering across different nodes.

If we used the third option, for example, an operation numbered at 1100 on node B could have happened before operation 50 on node A if they process a different number of operations per second. There is no way to capture that using these methods.

Lamport Timestamps

This is one of the most important topics in the field of distributed systems. It’s a simple method for generating sequence numbers across multiple nodes that is consistent with causality.

The idea here is that each node has a unique identifier, and also keeps a counter of the number of operations it has processed. The Lamport timestamp is then a pair of (counter, nodeID). Multiple nodes can have the same counter value, but including the node ID in the timestamp makes it unique.

Lamport timestamps provide a total ordering: if there are two timestamps, the one with the greater counter value is the greater timestamp; if the counter values are the same, then we pick the one with the greater node ID as the greater timestamp.

Quoting the book, what makes Lamport timestamps consistent with causality is the following:

Every node and every client keeps track of the maximum counter value it has seen so far, and includes that maximum on every request. When a node receives a request or response with a maximum counter value greater than its own counter value, it immediately increases its own counter to that maximum.

With every operation, the node increases the maximum counter value it has seen by 1.

Consider the diagram below:

Lamport Timestamp Illustration

Figure 1 - Lamport Timestamps Illustration.

In this figure, the nodes and clients initially have a counter value of 0:

A possible ordering of these operations is (1,1) -> (2, 2) -> (3, 1) -> (3,2), if in the case of the same counter value, our ordering scheme gives precedence to the node with the lower ID value.

This ordering showcases a limitation of Lamport timestamps. Even though operation (3,2) appears to complete before (3,1), the ordering does not reflect that.

The fact that those two have the same counter value means that they are concurrent and the operations do not know about each other, but Lamport timestamps must enforce a total ordering. With the ordering from Lamport timestamps, you cannot tell whether two operations are concurrent or whether they are causally dependent.

Basically, if two events are causally related, the Lamport timestamp ordering will always obey causality. But if one event appears before another in the ordering, it does not mean that they are causally related.

Version Vectors can help distinguish whether two operations are concurrent or whether one causally depends on the other, but Lamport timestamps have the advantage that they are more compact.

Aside: I think Lamport timestamp ordering is also sufficient to provide sequential consistency.

Further Reading:

Timestamp ordering is not sufficient

Although Lamport timestamps are great for defining a total order that is consistent with causality, they do not solve some common problems in distributed systems.

The key thing to note here is that they only define a total order of operations after you have collected all the operations. If one operation needs to decide right now whether a decision should be made, it might need to check with every other node that there's no concurrently executing operation that could affect its decision. Any of the other nodes being down will bring the system to a halt, which is not good for fault tolerance.

For example, if two users concurrently try to create an account with the same username, only one of them should succeed. It might seem as though we could simply pick the one with the lower timestamp as the winner and let the one with the greater timestamp fail. However, if a node needs to decide right now, it might simply not be aware that another node is in the process of concurrently creating an account, or might not know what timestamp the other node may assign to the operation.

It's not enough to have a total ordering of operations, it's also important to know when the order is finalized i.e. what that order is at each point in time.

In the second part of these notes, we'll look at ways to solve the challenge of knowing the order of operations at each point in time.

Last Updated: 15-12-2022.

distributed-systems learning-diary ddia

A small favour

Did you find anything I wrote confusing, outdated, or incorrect? Please let me know by writing a few words below.

Follow along

To get notified when I write something new, you can subscribe to the RSS feed or enter your email below.

← Home