Sunday, 17 April 2016

Raft Consensus Algorithm

Imagine we have a single node database server that stores a single value. We also have a client that can send a value to the database server.

Coming to agreement (or consensus) on that value is easy with one node. But how do we come to consensus if we have more than one node?

Here we need to use distributed consensus. Distributed consensus (i.e. protocols) allows nodes in an unreliable distributed system to agree on an ordering of events. Raft is a protocol for implementing distributed consensus.
Distributed consensus is typically framed in the context of a replicated state machine, drawing a clear distinction between the state machine (the fault tolerant application), the replicated log and the consensus module.
Replicated State Machine
Replicated state machines are typically implemented using a replicated log. Each server stores a log containing a series of commands, which its state machine executes in order. Each log contains the same commands in the same order, so each state ma- chine processes the same sequence of commands. Since the state machines are deterministic, each computes the same state and the same sequence of outputs.
Consensus algorithm is responsible for keeping the replicated log consistent. The consensus module on a server receives commands from clients and adds them to its log. It communicates with the consensus modules on other servers to ensure that every log eventually contains the same requests in the same order, even if some servers fail. Once commands are properly replicated, each server’s state machine processes them in log order, and the outputs are returned to clients. As a result, the servers appear to form a single, highly reliable state machine.

Replicated state machines are used to solve variety of fault tolerant problems in distributed system.
Properties of consensus algorithm
  1. They never return an incorrect result under all non-Byzantine conditions including network delays, partition, packet loss, duplication, reordering
  2. They are fully functional as long as any majority of servers are operational and can communicate with each other and with the clients. For example, a cluster of five servers can tolerate the failure of any two servers
  3. They do not depend on the timing to ensure the consistency of the logs
  4. In the common case, a command can complete as soon as a majority of the servers has responded to a single round of remote procedure calls. A minority of slow servers do not have impact on the overall system performance.
Now we will see how Raft works. Before that we need to make ourself familiar with some Raft concepts.
Raft is a consensus algorithm for managing replicated log. Raft uses strong leadership. At first Raft selects a leader with the complete responsibility for managing the replicated log. The leader accepts the log entries from the clients, replicates the log entries to the other servers and tells them when it is safe to apply these log entries to their state machines. When a leader fails and becomes disconnected from other servers, a new leader gets elected. Clients are external to the system and must contact the leader directly to communicate with the system.
Raft cluster
Typically Raft cluster is set up using five nodes, so that the system can tolerate two failures.

Server States
According to Raft protocol, a node can be one of three states:
  • Follower: A follower is a passive node. It does not issue any request on its own but simply responds to the requests from the leader and the candidates
  • Candidate: A candidate is an active node which is attempting to become a Leader. It initiates a request for votes from other nodes. A candidate that receives votes from a majority of the full cluster becomes the new leader
  • Leader: Leader node is an active node which is currently leading the cluster. This node handles requests from clients. If a client contacts a follower, it redirects the client to the leader

How does Raft detect obsolete information such as stale leader? Raft detects this using a concept called term.
Term is an arbitrary length of time. Terms are numbered with consecutive integers. Terms act as logical clock in Raft. Each term begins with an election in which one or more candidates attempt to become leader. If a candidate wins the election, then it serves as leader for the rest of the term. There may be a situation in which a term ends with no leader, in this case a new term begins with a new election. Raft ensures that there is at most one leader in a given term.

Each server stores its perspective of the term in persistent storage, which increases monotonically over time. A server’s term is only updated when it starts (or restarts) an election, or when it learns from another server that its term is out of date. All messages include the source server’s term. The receiving server checks it, with two possible outcomes: if the receiver’s term is larger, a negative response is sent, while if the receiver’s term is smaller than or equal to the source’s, its term is updated before parsing the message.
Types of messages
Raft servers communicate with each other using remote procedure calls (RPCs). There are three types of message used in Raft:
  • RequestVote: this message is used by the candidates during the election.
  • AppendEntries: this message is initiated by the leader to replicate the log entries and to provide a form of heartbeat to the followers.
  • InstallSnapshot: this message is used by the leader to send a snapshot of it’s log to the followers that are too far behind.
Leader Election
In Raft, there are two timeout settings which control elections. First is the election timeout. The election timeout is the amount of time a follower waits until becoming a candidate. The election timeout is randomized to be between 150ms and 300ms. After the election timeout the follower becomes a candidate and starts a new election term and votes for itself and sends out RequestVote messages to other servers.

If the receiving server hasn't vote yet in this term then it votes for the candidate and the server resets it's election timeout. Once a candidate has a majority of votes it becomes leader. The leader begins sending out AppendEntries messages to its followers. These messages are sent in intervals specified by the heartbeat timeout. Followers then respond to each AppendEntries message. This election term will continue until a follower stops receiving heartbeats and become a candidate. Requiring a majority of votes guarantees that only one leader can be elected per term.

This process is called Leader Election. All the changes to the system will now go through the leader.
There may be a situation when a candidate neither wins nor loses the election. For example, two followers become candidates at the same time and votes could be split so that no candidate obtains a majority. When this happens, each candidate will time out and start a new election by incrementing its term and initiating another round of RequestVote messages. Raft uses randomized election timeouts to ensure that split votes are rare and that they are resolved quickly.
Log Replication
Once a leader is elected, the leader needs to replicate all changes to the system to all servers. This is done by using the same AppendEntries message that are used for heartbeats.
First a client sends a change to the leader. The change is appended to the leader's log. This log entry is currently uncommitted so it will not update the leader server value. Leader then sends the change to the followers on the next heartbeat.

The change gets replicated in the followers' logs.

An entry is committed on the leader server once a majority of followers acknowledge it.

After receiving the acknowledgement from the followers, leader commits the entry.

The leader then notifies the followers that the entry is committed.

The cluster has now come to consensus about the system state and leader sends the response to the client.

This process is called Log Replication.
Now, consider the case that some messages have been lost or servers have failed and recovered, leaving some logs incomplete. It is the responsibility of the leader to fix this by replicating its log to all other servers. When a follower receives an AppendEntries message, it contains the log index and term associated with the previous entry. If this does not match the last entry in the log, the follower sends an unsuccessful response to the leader. The leader is now aware that the follower's log is inconsistent and needs to be updated. The leader decrements the previous log index and term associated with that server. The leader keeps dispatching the AppendEntries message, adding entries to the log until the follower server replies with success and is therefore up to date.
Each server keeps its log in persistent storage, including a history of all commands and their associated terms. Each server also has a commit index, which represents the most recent command to be applied to the replicated state machine. When the commit index is updated, the server passes all commands between the new and old commit index to the local application state machine.
Raft uses State Machine Safety properties:
  • Election Safety: at most one leader can be elected in a given term.
  • Leader Append-Only: a leader never overwrites or deletes entries in its log; it only appends new entries.
  • Log Matching: if two logs contain an entry with the same index and term, then the logs are identical in all entries up through the given index.
  • Leader Completeness: if a log entry is committed in a given term, then that entry will be present in the logs of the leaders for all higher-numbered terms.
  • State Machine Safety: if a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index.
The authors of Raft focus on the understandability. Raft is designed to be easy to understand. According to the authors:
"...our most important goal—and most difficult challenge—was understandability. It must be possible for a large audience to understand the algorithm comfortably. In addition, it must be possible to develop intuitions about the algorithm, so that system builders can make the extensions that are inevitable in real-world implementations."
I like the effort Raft's authors put to make the algorithm understandable. They have given many talks and created course materials. All these you can find here