Chapter 8 - The Trouble with Distributed Systems

· 18 min read

Notes from Chapter 8 of Martin Kleppmann's 'Designing Data-Intensive Applications' book

In this chapter, we'll look at the things that may go wrong in distributed systems. We'll cover problems with network, clocks and timing issues, and other faults.

Table of Contents

Faults and Partial Failures

Distributed systems differ from single node computers in that: unlike single node computers where either the system is completely working or completely broken, we can have partial failures in distributed systems.

What makes partial failures more difficult to deal with is that they are nondeterministic. It may sometimes work, and sometimes fail.

Cloud Computing and Supercomputing

We have high-performance computing (HPC) and cloud computing on both extremes of philosophies for building large-scale applications.

High performance computers or Supercomputers have thousands of CPUs used for computationally expensive tasks like weather forecasting. In general, a job will checkpoint the state of its computation and store it durably from time to time. If a node fails, the whole cluster is brought down. The state of computation is restarted from the last checkpoint. This makes supercomputers similar to single node computers.

Nowadays, many internet services need high availability. It's not acceptable to bring down the cluster due to failure in a node.

To make distributed systems work, we must endeavor to build a reliable system from unreliable components.

We sometimes like to assume that faults are rare and then don't account for them, but we must design with fault tolerance in mind.

Building a reliable system from unreliable components is not a unique idea to distributed systems, and is used in other areas as well. E.g.:

Note that even though a system can be more reliable than the parts that it's made of, there's a limit to the level of reliability that can be attained. Error-correcting codes can only deal with a number of single-bit errors and TCP cannot remove delays in the network.

Unreliable Networks

As stated earlier, this book focuses on shared-nothing systems which communicate with each other via a network. The advantage of this approach is that it is comparatively cheap, as it requires no special hardware. We can have a bunch of regular machines as part of the system.

Note that the internet and most internal networks in datacenters are asynchronous packet networks. This means that: one node can send a message to another node, but have no guarantee about when the message will arrive, or whether the message will actually arrive at all. Unfortunately, with this approach, many things could go wrong:

In essence, the sender is unable to tell whether the packet was delivered unless it receives a response message from the recipient. It's impossible to distinguish these issues in an asynchronous network.

These issues are typically handled with a timeout, but that still gives no information about the state of the request.

Network Faults in Practice

Although network faults may be rare, the fact that they can happen means that software needs to be able to handle them. Handling network faults does not necessarily mean tolerating them, a simple approach can just be to show an error message to users. However, there has to be work done to know how the software reacts to network problems, and also ensure that the system can recover from them.

It might be a good idea to deliberately trigger network problems and test the system's response - Chaos Monkey.

Detecting Faults

It's important to automatically detect network faults, as they might be linked to faulty nodes. Detecting faults quickly ensures that:

Due to the uncertainty about the network, it's difficult to tell whether a node is working or not. There are some specific ways to tell though, such as:

Timeouts and Unbounded Delays

We have mentioned that timeouts are often used to detect a fault. However, there is no simple answer too how long a timeout should be. It simply depends.

With a long timeout, it means there can be a wait until a node is declared dead. This means users will have to wait a while or see error messages.

On the other hand, a short timeout means that nodes can be declared dead prematurely, even when it only suffers a temporary breakdown (e.g. due to a load spike on the node or the network). There are downsides of declaring a node dead prematurely:

In an ideal system, we could have a guarantee that the maximum delay for packets transmission will be d, and the node always handles a response within time r. In this kind of system, we could set a timeout for 2d + r and it'll be reasonable.

However, in most systems, we do not have either of those guarantees. Asynchronous networks have unbounded delays.

Network Congestion and Queueing

Queueing is the most common cause of the variability of network packet delays (i.e. unbounded delays). Queues can be formed at different points:

Aside: TCP vs UDP(Transmission Control Protocol vs User-Datagram Protocol)

TCP is a reliable network transmission protocol while UDP is unreliable. It means that TCP implements:

Any messages not acknowledged will be retransmitted in TCP.

UDP is used in latency-sensitive applications like videoconferencing and Voice over IP, where there's less tolerance for delays. In UDP, delayed data is probably worthless so it does not try to retransmit it. E.g. in phone calls, instead of retransmitting, it simply fills the missing packet's time slot with silence. The retry happens at the human layer: "Could you repeat that please?".

In essence, timeouts should typically be chosen experimentally: measure the distribution of network round trip times over an extended period, and over many machines to determine the expected variability of delays.

The Phi Accrual failure detector used in Akka and Cassandra measure response times and automatically adjust timeouts based on the observed response time distribution.

Synchronous Versus Asynchronous Networks

A question that you might have is: why don't we make the network reliable at a hardware level so the software does not need to worry about it?

To address this, it's worth looking at the traditional fixed-line telephone network (non-cellular, non-VoIP) which is apparently very reliable and rarely drops messages.

The way it works is that:

Note that this approach differs from a TCP connection. While there is a fixed amount of reserved bandwidth here that no one else can use while the circuit is established, TCP packets will typically grab whatever network bandwidth is available.

Datacenter networks and the internet make use of the TCP approach of packet switching rather than establishing circuits, because they are optimizing for bursty traffic. Unlike an audio or video call where the number of bits transferred per second is fairly constant, the traffic through the internet is unpredictable. We could be requesting a web page, or sending an email, or transferring a file etc. The goal is to just complete it as quickly as possible.

Using circuits for bursty data transfer will waste network capacity and could make transfers unnecessarily slow, as we would have to guess how much bandwidth to allocate beforehand. TCP dynamically adapts the data transfer rate to the available network capacity.

There's ongoing research to use quality of service and admission control to emulate circuit switching on packet networks, or provide statistically bounded delays.

Unreliable Clocks

Clocks and time are important in distributed systems. Applications use clocks to answer questions like:

Some questions measure duration, while some describe points in time.

Time is tricky because each machine on a network has its own clock, and some may be faster or slower than others. Clocks can be synchronized to a degree though, by using the Network Time Protocol (NTP). It works by adjusting clocks using time reported from a group of servers. The group of servers get their time from a GPS receiver.

Monotonic vs Time-of-Day Clocks

Modern computers have at least two different kinds of clocks: a time-of-day clock and a monotonic clock.

Time-of-day clocks

These are like standard clocks, which just return the current date and time according to a calendar. These clocks are typically synchronized with NTP, which means that timestamps should match across machine ideally.

Note that if the local clock is too far ahead of NTP, it may appear to jump back to a previous point in time. It could also jump due to leap seconds. The tendency of these clocks to jump make them unsuitable for measuring elapsed time.

Monotonic Clocks

These clocks are suitable for measuring a duration like a timeout or response time. They are guaranteed to move forward in time (unlike a time-of-day clock which may jump back in time).

System.nanoTime() in Java is a monotonic clock. With a monotonic clock, you can check the value at a point in time, perform an action, then check the value again and then use the difference between the two values to measure time elapsed between the two checks.

Monotonic clocks are fine for measuring elapsed time, because they do not assume any synchronization between nodes' clocks.

Clock Synchronization and Accuracy

Unlike monotonic clocks which don't need synchronization, time-of-day clocks must be synchronized with an NTP server or another external time source. However, NTP and hard clocks are not as reliable or accurate as one might hope. For example:

Nevertheless, it's possible to achieve very good clock accuracy with significant investment into resources. For example, the MiFID II European regulation for financial companies mandates that HFT funds synchronize their clocks within 100 microseconds of UTC, to help debug market anomalies like "flash crashes".

Relying on Synchronized Clocks

While clocks may seem simple and straightforward, they have a good number of pitfalls. Some of the issues that may arise are:

Like with unreliable networks, robust software must be prepared to deal with incorrect clocks. Dealing with incorrect clocks can be even trickier because the problems caused by this can easily go unnoticed. A faulty CPU or misconfigured network is easier to detect, as the system would not work at all. However, for a defective clock, things will generally look fine. We're more likely to experience silent and subtle data loss than a dramatic crash.

Therefore, if a software requires synchronized clocks, it's essential to monitor the clock offsets between all machines. A node whose clock drifts too far from the others should be labelled as a dead node and removed from the cluster.

Timestamps for ordering events.

Time-of-day clocks are commonly used for ordering events in some systems and they often use the last write wins conflict resolution strategy. Some of these systems are Cassandra and Riak, typically multi-leader replication and leaderless databases. Some implementations of this generate the timestamp on the client's side rather than on the server, but this does not change the problems of LWW which include:

Q: Could NTP synchronization be made accurate enough that such incorrect orderings cannot occur?

A: Probably not. NTP's synchronization accuracy is also limited by the network round-trip time, in addition to other error sources like quartz drift.

Logical clocks are a safer alternative for ordering events than an oscillating quartz crystal. They measure the relative ordering of events, rather than actual elapsed time which physical clocks (like time-of-day and monotonic clocks).

Clock readings have a confidence interval

Clock readings typically have an uncertainty range, like a margin of error. However, most systems don't expose this uncertainty. An exception to this is Google's TrueTime API which is used in Spanner, and gives a confidence interval on the local clock.

Synchronized clocks for global snapshots

Snapshot isolation is commonly implemented by giving each transaction a montonically increasing ID. If a write happened later than the snapshot (i.e. it has a transaction ID greater than the snapshot), the write is invisible to the snapshot transaction. This is easier to manage on a single-node database, as we can use a simple counter.

For a distributed database though, it is more difficult to coordinate a monotonically increasing transaction ID. The transaction ID must reflect causality. If transaction B reads a value written by transaction A, B must have a higher transaction ID than A for it to be consistent.

If we didn’t have uncertainty about clock accuracy, the timestamps from the synchronized time-of-day clocks would be suitable as transaction IDs as later transactions will have a higher timestamp.

However, Google's Spanner implements snapshot isolation across datacenters this way:

Spanner implements snapshot isolation across datacenters in this way. It uses the clock’s confidence interval as reported by the TrueTime API, and is based on the following observation: if you have two confidence intervals, each consisting of an earliest and latest possible timestamp (A = [Aearliest, Alatest] and B = [Bearliest, Blatest]), and those two intervals do not overlap (i.e., Aearliest< Alatest < Bearliest < Blatest), then B definitely happened after A — there can be no doubt. Only if the intervals overlap are we unsure in which order A and B happened.

Kleppmann, Martin. Designing Data-Intensive Applications (Kindle Locations 7547-7554). O'Reilly Media. Kindle Edition.

To ensure that transaction timestamps reflect causality, Spanner waits for the length of the confidence interval before committing a read-write transaction. This means that any transaction that reads the data is at a sufficiently later time, so the confidence intervals do not overlap. For example, if the confidence interval is 7ms, a read-write transaction will wait for 7ms before committing. Remember that with snapshot isolation, a transaction can't read anything that wasn't committed when it started. Therefore, we can be sure that any transaction that reads the now committed read-write transaction happened at a sufficiently later time.

To keep the wait time as small as possible, Google uses a GPS receiver in each datacenter, which allows clocks to be synchronized within about 7ms.

Process Pauses

A node in a distributed system must assume that its execution can be paused for a significant amount of time at any point, even in the middle of a function. When this pause happens, the rest of the system keeps moving and may declare the paused node dead because it's not responding. This paused node may eventually continue running, without noticing that it was asleep until it checks the clock later.

A distributed system must tailor for these pauses which can be caused by:

There's active research into limiting the impact of Garbage Collection pauses. Some of the options are:

These options can't fully prevent GC pauses, but they can reduce their impact on the application.

Knowledge, Truth, and Lies

So far, we have discussed some of the distributed systems problems that can occur, which include: unreliable networks, unreliable clocks, faulty nodes, processing pauses etc. We've also discussed how distributed systems differ from programs running on a single node: there's no shared memory, there's only message passing via an unreliable network with variable delays.

As a result of these issues, a node in a distributed system cannot know anything for sure. It can only guess based on the messages it receives (or doesn't receive) via the network. There has to be a consensus.

In this section, we'll explore the concept of knowledge and truth, and guarantees we can provide under certain assumptions in a distributed system.

The Truth is Defined by the Majority

A node cannot trust its assessment of a situation. A node may think it's the leader, while the other nodes have elected a new one; it may think it's alive, while other nodes have declared it dead. As a result, many distributed algorithms rely on a quorum for making decisions i.e. decisions require a minimum number of votes from several nodes in order to reduce dependence on a single node.

The quorum is typically an absolute majority of more than half the nodes. This is typically safe because there can only be one majority in a system at a time.

The Leader and the Lock

A system often requires there to only be one of a particular thing. For example:

Due to the fact that a node can believe it’s the "chosen one" even when it isn't, the system must be designed to handle such situations and avoid problems like split-brain.

Fencing tokens

One of the ways by which systems handle a situation where a node is under a false belief of being "the chosen one", thereby disrupting the rest of the system, is by using fencing tokens.

Basically, each time a lock server grants a lock or a lease, it also generates a fencing token (a number that increases every time a lock is granted). We can then require that any client which wants to send a write request to the storage service must include the current fencing token.

The lock server will then perform validation on any request with the fencing token included and reject it if it has generated a fencing token with a higher number.

For applications using ZooKeeper as a lock service, the transaction ID zxid or node version cversion can be used as the fencing token, since they are guaranteed to be monotonically increasing - which is a required property for a fencing token.

Byzantine Faults

Fencing tokens can help detect and block a node that is not deliberately acting in error (e.g. because it hasn't yet realized that its lease has expired). However, for a node that is deliberately acting in error, it could simply send messages with a fake fencing token.

In this book, nodes are assumed to be unreliable but honest: any node that does respond is assumed to be telling the truth to the bests of its knowledge.

If there's a risk that nodes may "lie" (e.g. by sending corrupted messages or faulty responses), it becomes a much harder problem to deal with. That behavior is known as a Byzantine fault and systems that are designed to handle these faults are Byzantine Fault Tolerant Systems.

A system is Byzantine fault-tolerant if it continues to operate correctly even if some of the nodes are malfunctioning and not obeying the protocol, or if malicious attackers are interfering with the network.

Kleppmann, Martin. Designing Data-Intensive Applications (Kindle Locations 7812-7813). O'Reilly Media. Kindle Edition.

Dealing with Byzantine faults is relevant in specific circumstances like:

In most server-side data systems, however, the cost of deploying Byzantine fault-tolerant solutions makes them not practical. Web applications need some other controls though to prevent malicious behavior and that's why: input validation, sanitization and output escaping are important.

Weak forms of lying

There are weaker forms of "lying" which are not full-blown Byzantine faults that we can protect against. For example:

System Model and Reality

When coming up with distributed systems algorithms, we need to write them in a way that doesn't depend too much on the hardware and software details. Basically, we need an abstraction for what the algorithm may assume, and the types of faults that we can expect in a system. This abstraction is known as a system model.

System Model for Timing Assumptions
System Model for Node Failures

For modeling real system, the most useful model is generally the partially synchronous model with crash-recovery faults.

Correctness of an algorithm

For an algorithm to be correct, it must have certain properties. E.g. For a sorting algorithm to be correct, it must have certain properties expected of its output. Such as that the element further to the left is smaller than the element further to the right.

Likewise, we can define properties for what it means for a distributed algorithm to be correct. For example, for generating fencing tokens, the algorithm may be required to satisfy the following properties:

It's possible for some properties to hold, while others don't. How do we know which distinguish between which properties must hold and which could tolerate caveats? The next section helps to answer that.

Safety and liveness

There are two different kinds of properties that we can distinguish between: safety and liveness properties. In the example above for a fencing token, uniqueness and monotonic sequence are safety properties, while availability is a liveness property.

Safety properties are informally defined as: nothing bad happens, while liveness properties are defined as something good eventually happens.

These informal definitions are subjective (what's good or bad, really) and it’s best not to read too much into them.

The actual definitions of safety and liveness are precise and mathematical:

Kleppmann, Martin. Designing Data-Intensive Applications (Kindle Locations 7924-7927). O'Reilly Media. Kindle Edition.

For distributed algorithms, it is commonly required that safety properties always hold, in all possible situations of a system model. Therefore, even in the occurrence of all nodes crashing, or entire network failures, the algorithm must always return a correct result.

On the other hand, we are allowed to make caveats with liveness properties. E.g. we could say that a request will only receive a response if majority of nodes have not crashed, and only if the network recovers from an outage eventually.

Mapping system models to the real world

While these theoretical abstractions are useful, reality must also be considered when designing algorithms. We may sometimes have to include code to handle situations where something that was assumed to be impossible actually happens.

Proving the correctness of an algorithm does not mean that the implementation on a real system will always behave correctly. However, theoretical analysis is still a good first step because it can uncover problems that may remain hidden for a long time.

Theoretical analysis and empirical testing are equally important.

learning-diary distributed-systems ddia

A small favour

Did you find anything I wrote confusing, outdated, or incorrect? Please let me know by writing a few words below.

Follow along

To get notified when I write something new, you can subscribe to the RSS feed or enter your email below.

← Home