Timilearning

MIT 6.824: Lectures 6 & 7 - Fault Tolerance(Raft)

One common pattern in the previous systems we have discussed like MapReduce, GFS, and VMware FT is that they all rely on a single entity to make the key decisions. For example:

While this has the advantage of making it easier for the system to come to a decision, the downside of this approach is that the entity is now a single point of failure. If the entity is down for whatever reason, the system may not be able to make any meaningful progress without manual intervention from an operator.

Ideally, we would want a more fault-tolerant system that can withstand the loss of at least one node, even if that node is the one making critical decisions at the time. Such a system would need a mechanism for the other nodes in the cluster to automatically agree on which one of them should take over as the next leader.

It turns out that getting all the nodes in a cluster to agree on a decision is a hard problem in distributed systems. The main difficulty lies in the fact that it is impossible to distinguish between when a node has crashed, and when there is a network fault. These two problems have same symptom, which is that no response will be received from the node. If a node is wrongly declared dead, it may still be able to receive and execute requests from clients, which may lead to inconsistencies between the nodes. The other nodes may need to decide on:

This problem of getting multiple nodes to agree has led to the development of consensus algorithms. Distributed consensus is the ability for components in a distributed system to reach agreement even in the presence of failures and an unreliable network. Paxos and Raft are two of the most popular consensus algorithms today. This post will focus on the Raft algorithm, which tackles the consensus problem. Its original paper can be found here.

Table of Contents

Raft Paper Summary

Before the creation of the Raft algorithm, Paxos had almost solely dominated the landscape of consensus algorithms. The problem with Paxos, however, is that it is quite a challenge to understand it. This challenge is what led to the creation of Raft. The authors had a desire to develop a consensus algorithm that was not only practical but was also understandable. Their primary goal in the creation of the algorithm was understandability.

One of the approaches the authors took to create an understandable algorithm was to decompose the consensus problem into separate parts:

These parts will be discussed separately below.

Replicated State Machines

The replicated state machine approach to replication has been discussed in an earlier post, but to recap, the idea is that if two state machines are fed the same input operations in the same order, their outputs will be the same.

A replicated log is typically used to implement replicated state machines. Each server maintains its log, which contains a series of commands that its state machine must execute in order. These logs must be kept consistent across all the servers. Consistency here means that all the logs must have the same commands in the same order.

Figure 1: Replicated state machine architecture.

The job of a consensus algorithm is to keep this replicated log consistent. Each server has a consensus module for managing its log. The consensus module on a server is responsible for adding client commands to the log and communicating with the consensus modules on the other servers to ensure that their logs eventually contain the same commands in the same order.

Practical consensus algorithms must not violate the following properties:

Raft Overview

Raft works by electing a single leader among the servers, which then becomes responsible for managing the log replication. The leader is responsible for accepting client requests and deciding where log entries should be placed, without having to consult other servers. When the current leader fails, Raft includes a protocol for electing a new leader to take over.

As stated earlier, Raft breaks down consensus into three independent subproblems:

Figure 2: Raft Application

Figure 2: An example of an application where Raft could be used is a key-value database, as shown above. Client requests are converted to log entries on a leader, which are then replicated on the other servers by the Raft module. The state machine converts those log entries into records in the key-value store.

Why Use Logs?

A log is an append-only sequence of records used as a storage abstraction in many distributed systems today. Some of the benefits of using a log include:

Server States

A server is always in one of three states: leader, follower, or candidate. A leader receives requests from clients and communicates them with the other servers. A follower is passive; it only receives log entries from the leader and votes in elections. Any requests from a client to a follower will be redirected to the leader. The candidate state is used for leader elections.

Figure 3: Raft server states

Figure 3 - "Server states. Followers only respond to requests from other servers. If a follower receives no communication, it becomes a candidate and initiates an election. A candidate that receives votes from a majority of the full cluster becomes the new leader. Leaders typically operate until they fail."

Raft Uses Terms

Time in Raft is divided into terms which can be of arbitrary length. Each term begins with an election and has at most one leader. Terms act as a form of a logical clock in a system. Some other key things to note about terms are as follows:

The servers in Raft communicate through RPCs. There are two main RPC methods involved here:

  1. AppendEntries RPC: This method is invoked by a leader to replicate its log entries to the other servers. It includes the leader's term, the new log entries, and identifiers for where to place the entries, among other things.
  2. RequestVote RPC: This is invoked by candidates to gain votes from the other servers. It includes the candidate's term, identifier, and some other items which will be discussed later.

How is a leader elected?

A leader sends periodic heartbeat messages to its followers to maintain its authority. These heartbeat messages are in the form of AppendEntries RPCs which contain no log entry. If a follower has not received any communication from the leader within a specified election timeout, it will transition to candidate state. The follower does this because it assumes there is no leader in the cluster at present, and so it begins an election to choose a new one.

The first step that a follower takes after becoming a candidate is to increase its term number. After doing that, it votes for itself and sends an RPC (including its term number) to all the other servers in parallel to request votes from them. A candidate will remain in its state until any of the following conditions is met:

  1. It wins the election.
  2. Another server establishes itself as the leader.
  3. There's a period with no winner of the election.

For a candidate to win an election, it must have received votes from the majority of the servers in the cluster. Note that the majority is out of all the servers in the cluster, not just the live ones. A server can vote for at most one candidate in a given term, and the decision is made on a first-come-first-served basis. The winning candidate in a term then becomes a leader and immediately sends out heartbeat messages to the other servers to establish its authority. Note that there is another restriction on how servers can vote which will be discussed later.

It is also possible that while waiting for votes, a candidate receives an AppendEntries RPC from another server that claims to be the leader. If the supposed leader's term included in the RPC is greater than or equal to the candidate's term, the candidate will accept that the leader is legitimate and then transition back to follower state. Otherwise, if the leader has a smaller term than the candidate's term, the candidate will reject the RPC and continue in its state.

Lastly, it's possible that a candidate neither wins nor loses an election. This can happen if there are many candidates at the same time, and the votes get split equally among them. If this happens, the candidates will time out, increase their terms, and then begin another round of the election process. Raft takes an extra measure to prevent split votes from happening indefinitely, which will be discussed next.

Election Timeouts Are Randomized

Raft takes measures to prevent split votes in the first place by ensuring that the election timeout of each server is randomly chosen from a fixed interval. This way, the probability that two or more servers time out at the same time and become candidates is reduced. A single server can then win an election and send heartbeat messages to the other servers before their election timeout expires.

Randomized timeouts are also used to handle split votes. Each candidate's election timeout is restarted at the start of an election, and the timeout must elapse before it can start another election. This reduces the probability of another split vote in the new election.

Any Two Majorities Must Overlap in At Least One Server

All of Raft's key decisions rely on getting some confirmation from majority of the servers in the cluster. The insight here is that if there is a partition in the cluster i.e., one or more nodes cannot communicate with the other set of nodes, at most one of the partitions can have the majority. In addition, any subsequent majorities must overlap with the previous ones in at least one server. This is how Raft is able to ensure that a term has at most one leader and prevent a split-brain situation. If we have candidates from separate partitions that cannot communicate, only one of them will be able to gain a majority of votes. There is no chance of two candidates in separate partitions gaining a majority for the same term.

A Raft cluster is typically made up of an odd number of servers. This helps to improve fault tolerance in the cluster. For example, a cluster of 4 nodes needs 3 nodes to make a decision, and a cluster of 5 nodes also needs the same amount of nodes. This means that we can tolerate more failures and still make progress in the cluster. More generally, a cluster of 2f + 1 servers can tolerate f failed servers.

Systems that rely on the overlap of majority set of servers for operation are referred to as quorum systems.

How does log replication work?

A leader is responsible for receiving client requests once it has been elected. These client requests contain commands that must be replicated to the servers in the cluster.

When a request comes in, the leader creates a new log entry containing that command and appends the entry to its log. It then sends AppendEntries RPCs in parallel to its followers. An entry will only be considered as committed when it has been safely replicated on the majority of its followers. Once an entry is marked as committed, it must be durably persisted on all the followers. If a follower crashes or there is a network fault that drops packets to the follower, the leader will keep retrying the request indefinitely until the follower receives the log entry. All followers must eventually store committed log entries. When an entry is committed, it can then be executed by the state machine.

Figure 4: Raft Log

Figure 4 - "Logs are composed of entries, which are numbered sequentially. Each entry contains the term in which it was created (the number in each box) and a command for the state machine. An entry is considered committed if it is safe for that entry to be applied to state machines"

The protocol for log replication in Raft is described as follows:

What happens if the leader crashes?

During normal operations, the protocol described above helps the leader and the followers' logs to remain consistent. However, leader crashes can result in inconsistencies. These inconsistencies can then be compounded by subsequent leader and follower crashes.

The figure below describes how inconsistencies between the logs can play out.

Figure 5: Possible inconsistency scenarios

Figure 5 - Possible inconsistencies in the logs of Raft servers.

It is possible that any of the scenarios (a-f) in Figure 5 could happen in follower logs when the leader at the top is elected. Each box represents a log entry and the number in a box is the term of the entry. In this figure, we see that a follower may be missing entries (as in a and b), may have extra uncommitted entries (c-d), or both scenarios (e-f). Quoting the paper:

Scenario (f) could occur if that server was the leader for term 2, added several entries to its log, then crashed before committing any of them; it restarted quickly, became leader for term 3, and added a few more entries to its log; before any of the entries in either term 2 or term 3 were committed, the server crashed again and remained down for several terms.

Raft handles inconsistencies by forcing the followers' log to duplicate the leader's, meaning that conflicting entries in a follower's log can be overwritten by the entries in the leader's log. To make the logs between a leader and follower consistent, the leader first finds the latest index at which their logs are identical, deletes all the entries in the follower's log after that index, and then sends the follower the entries in its own log that come after that point.

Why is it safe for a new leader to overwrite its follower's logs?

Raft places a restriction on what servers can be elected as a leader with the Leader Completeness Property, which states that:

The leader for any given term must contain all the entries committed in previous terms.

It is able to enforce this because the RequestVote RPC contains information about the candidate's log, and a voter will deny a candidate's vote request if its log is more up-to-date than the candidate's. Since a candidate must get votes from the majority, at least one node in the majority must have the latest entries. A log is more up-to-date than another if it has a higher term number. If the two logs have the same term number, the log with the higher index entry is the more up-to-date one.

By this restriction, it is safe for a leader to overwrite a follower's log since the leader will always have the latest committed entries. Any uncommitted entries can be safely discarded because there is no expectation on the client's side that their request has been executed. Only committed entries guarantee that. Therefore, the server can return an error message to the client, telling it to retry the requests for the uncommitted entries.

What happens when a follower or candidate crashes?

When a follower crashes, any RPCs sent to it will fail. These requests are retried indefinitely until the server restarts and the RPC completes. It is safe for the AppendEntries and RequestVote RPCs to be retried because they are idempotent. For example, a follower can ignore an AppendEntries RPC if the request contains entries that are already present in its logs. Candidate crashes are handled in the same way.

How do we choose the right election timeout?

Although Raft does not depend on timing to ensure the correctness of the logs, timing is important in ensuring that client requests are responded to swiftly. If the election timeout is too long, the servers might be without a leader for some time, which will delay the time it takes to respond to a client request.

The timing requirement that a Raft system should maintain is stated in the paper as:

broadcast Time << election Timeout << MTBF

where broadcast time is the average time taken by a server to send RPCs to all the servers in parallel and receive responses from them, and MTBF stands for the Mean Time Between Failures for a single server.

By selecting an election timeout that is significantly greater than the broadcast time, it means that leaders can reliably send the required heartbeat messages to prevent unnecessary elections.

In addition, the election timeout should be orders of magnitude less than the average time it takes for a server to fail. If the MTBF is less than the chosen election timeout, there could be a delay between when a leader fails and when its followers begin a new election. However, by selecting a timeout that is based on the MTBF, we can reduce that delay on average.

How does Raft tame its logs?

To prevent the log from growing indefinitely, which can increase the time it takes for the state machine to replay a log when it restarts, Raft uses snapshotting for log compaction. A snapshot of the current application's state is written to durable storage, and all the log entries up to the point of the snapshot are deleted from the log. Each server takes its snapshots independently, and snapshots are taken when the log reaches a fixed size in bytes.

A snapshot also contains metadata such as the last included index, which is the index of the last entry in the log being replaced by the snapshot, and the last included term. These metadata are kept because of the AppendEntries consistency check for the first log entry after the snapshot, which needs a previous log entry and term.

Conclusion

There are a few details missed in this post about how Raft manages the changes to the server in a cluster, and how client interactions are handled, among others. I do hope, however, that this post has helped you build some intuition for a consensus algorithm like Raft if you were not previously familiar. The further reading section contains links to resources with more detail.

Note: The study group will be going on a break for the two weeks until 15 June, and so there will be a longer delay before the next lecture summary.

Further Reading

mit-6.824 distributed-systems learning-diary

To get notified when I write something new, you can subscribe to the RSS feed.

A small favour

Did you find anything I wrote confusing, outdated, or incorrect? Or do you have an answer to a question I posed here? Please let me know! Just write a few words below and I'll be sure to amend this post with your suggestions.

← Home