Raft Consensus Algorithm Unraveled
Bhaskar S | 08/22/2020
Overview
Modern applications are being built the Cloud Native way. This implies distributed processing systems (including Data Stores) running on different hosts and communicating via APIs and/or messaging.
Designing and building distributed, fault-tolerant systems is hard, and keeping the state replicated and in-sync across the different nodes of a distributed data store (Distributed Consensus) is even more challenging. The reason: with many different nodes involved, there are multiple points of failure - one or more nodes can crash, communication between nodes can fail, messages can be lost, etc.
Look at any of the modern distributed data stores, such as etcd, TiDB, or Kudu. What do they all have in common w.r.t. Distributed Consensus ???
YES !!! They all use the Raft Consensus Algorithm under the hood to achieve consensus on the replicated state.
In this article, we will unravel the inner workings of the Raft Consensus algorithm by implementing our own simplified version of it in Golang.
There are 3 important aspects of the Raft Consensus algorithm for the distributed consensus to work properly:
Leader Election :: One of the nodes in the cluster is nominated as the Leader. There can be *ONLY* one Leader, and all updates to the cluster are *ONLY* via the Leader
Log Replication :: All the non-Leader nodes in the cluster get their updates *ONLY* from the Leader to stay in-sync and consistent
Safety :: Only a consistent node will be allowed to become a Leader. The Leader node will append an update only once (no dups) and distribute the update to all the other non-Leaders. Once an update is accepted and committed by all the nodes in the cluster, it implies that all the previous updates are consistent as well
We will now drill into each of the above 3 aspects. Before we do that, let us get some terms defined.
Term | Description |
---|---|
Quorum | If there are N (typically an odd number - 3, 5, 7, etc) nodes in the cluster, then a quorum count is defined as (N/2 + 1). For example, with N = 5, the quorum is 3 |
Follower | All nodes in a cluster start in this state |
Candidate | To initiate the Leader election process, one of the nodes becomes a candidate to request votes from the other nodes in the cluster |
Leader | After a successful Leader election, only one node in the cluster is the Leader and all updates to the cluster are through the Leader |
Term | A monotonically increasing number that represents an instance of the Leader election. Once a Leader is chosen, the Term does not change until the next Leader election |
Index | A monotonically increasing number that represents the next update to be stored and committed |
Election Timer | A timer that is started on each node of the cluster. The node whose timer expires first becomes the Candidate node and initiates the Leader election process |
Heartbeat | A regular periodic ping from the Leader node to all the other Follower nodes in the cluster to indicate the Leader is alive and working |
Every node in the cluster starts a network server, listening for incoming messages. The following code segment shows the starting of the network server and the handling of incoming requests by the goroutine handleRequest:
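As a minimal, self-contained sketch of this server pattern (the listen address and the process helper here are illustrative assumptions, not the exact code):

```go
package main

import (
	"log"
	"net"
)

// handleRequest reads the raw payload from a peer connection and hands it
// off for deserialization and further processing
func handleRequest(conn net.Conn) {
	defer conn.Close()

	buf := make([]byte, 1024)
	n, err := conn.Read(buf)
	if err != nil {
		log.Printf("-- error reading from %s - %v", conn.RemoteAddr(), err)
		return
	}
	// buf[:n] holds the serialized message; the first byte indicates the
	// type of the message payload (see the payload layout below)
	process(buf[:n])
}

// process is an assumed placeholder for deserialization and dispatch
func process(payload []byte) {
	log.Printf("++ received payload of %d byte(s)", len(payload))
}

func main() {
	ln, err := net.Listen("tcp", "192.168.1.179:9001") // assumed node address
	if err != nil {
		log.Fatalf("-- could not start server - %v", err)
	}
	log.Println("++ Ready to accept remote connections")

	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Printf("-- accept failed - %v", err)
			continue
		}
		go handleRequest(conn) // one goroutine per incoming request
	}
}
```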
Every message between nodes is serialized into an array of bytes, with the first byte indicating the type of the message payload. The following diagram illustrates the layout of the network payload exchanged between the nodes in the cluster:
The following code segment shows the functions for serialization and deserialization of the different messages:
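A hedged sketch of what such (de)serialization could look like, using encoding/gob behind the one-byte type tag described above; the message type constants and the RequestVoteMsg struct here are illustrative assumptions:

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// Assumed one-byte tags for the message types exchanged between nodes
const (
	MsgRequestVote byte = iota + 1
	MsgGrantVote
	MsgHeartbeat
)

// RequestVoteMsg is a hypothetical example message
type RequestVoteMsg struct {
	Term  uint64
	Index uint64
	From  string
}

// serialize prefixes the encoded message with its one-byte type tag
func serialize(msgType byte, msg interface{}) ([]byte, error) {
	var buf bytes.Buffer
	buf.WriteByte(msgType)
	if err := gob.NewEncoder(&buf).Encode(msg); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// deserialize reads the type tag and decodes the rest of the payload
func deserialize(payload []byte) (byte, *RequestVoteMsg, error) {
	if len(payload) < 2 {
		return 0, nil, fmt.Errorf("payload too short")
	}
	var msg RequestVoteMsg
	if err := gob.NewDecoder(bytes.NewReader(payload[1:])).Decode(&msg); err != nil {
		return 0, nil, err
	}
	return payload[0], &msg, nil
}

func main() {
	payload, _ := serialize(MsgRequestVote,
		RequestVoteMsg{Term: 2, Index: 0, From: "192.168.1.179:9002"})
	msgType, msg, _ := deserialize(payload)
	fmt.Printf("type=%d msg=%+v\n", msgType, *msg)
}
```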
A node serializes a message and writes it on the network connection of another node as a payload. The target node reads the network payload and deserializes it to a message. The target node then sends the message via a go chan for further processing of the message. The following diagram illustrates the flow of a message between two nodes in the cluster:
Let us now dive into the first aspect of the Raft Consensus algorithm - Leader Election.
Leader Election
The following code segment lists all the data types relevant for Leader Election:
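As a simplified sketch, the types could look as follows; the field names track the descriptions below, but the exact Go types are assumptions:

```go
package main

import (
	"sync"
	"time"
)

// RaftNodeType indicates the type of a node in the cluster
type RaftNodeType int

const (
	Follower RaftNodeType = iota
	Candidate
	Leader
)

// RaftNode encapsulates the necessary state of a node in the cluster
type RaftNode struct {
	sync.Mutex

	Type        RaftNodeType  // Follower, Candidate, or Leader
	CurrentTerm uint64        // the election Term the node is currently in
	Duration    time.Duration // randomly selected election timer interval
	Addr        string        // this node's ip:port (also its unique id)
	PeerAddr    []string      // network addresses of the other nodes
	LeaderAddr  string        // network address of the Leader node
	LastHB      time.Time     // last time this node heard from the Leader
	GrantedVote bool          // whether this node granted a vote to a Candidate
	VotesAck    [2]int        // count of approved [0] and denied [1] votes
	VotesCycles int           // election timeout cycles to wait for a quorum
	PeerC       chan []byte   // channel for messages received via the network
}

func main() {
	node := &RaftNode{
		Type:     Follower,
		Addr:     "192.168.1.179:9001",
		PeerAddr: []string{"192.168.1.179:9002", "192.168.1.179:9003"},
		Duration: 1234 * time.Millisecond,
		PeerC:    make(chan []byte, 16),
	}
	_ = node
}
```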
The type RaftNodeType indicates the type of the node in the cluster - Follower, Candidate, or Leader (defined as constants).
The type RaftNode encapsulates the necessary state of a node in the cluster. The following are descriptions of some of the fields:
CurrentTerm :: indicates the election Term the node is currently in
Duration :: indicates the randomly selected election timer interval of the node
Addr :: indicates the network address (ip:port string) of the node. Every node in the cluster needs to listen on a unique network address to receive commands from other nodes. This network address also serves as a unique identifier of the node in the cluster
PeerAddr :: indicates the list of network addresses of the other nodes in the cluster
LeaderAddr :: indicates the network address of the Leader node
LastHB :: indicates the last time the node heard from the Leader node
GrantedVote :: flag that indicates if this node granted a vote to a Candidate node
VotesAck :: indicates the count of approved and denied votes from the other Follower nodes in the cluster
VotesCycles :: indicates the number of election timeout cycles a Candidate waits before determining if it has a quorum of votes to promote itself to a Leader
PeerC :: indicates the Go channel used to deliver the next message received by the node via the network
Let us assume a cluster of 3 nodes identified as A, B, and C respectively. When a node starts up, it starts as a Follower node. Also, it randomly picks an Election timer duration value (in milliseconds) between 150 and 300. In order to observe and reason about how the Leader Election works, for our demonstration we choose values between 1150 and 1300.
The following diagram illustrates the initial state of the cluster:
From the illustration above in Figure 3, we see that node B's timer will go off first. Node B promotes itself to a Candidate node and broadcasts a RequestVote command to all the other Follower nodes in the cluster. The RequestVote is basically asking the other Follower nodes to grant a vote for the Candidate node to become the Leader node.
The following diagram illustrates the broadcast of the RequestVote command to the nodes of the cluster:
The following code segment shows the handling of the election timer going off to broadcast the RequestVote message:
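A simplified sketch of such a handler, building on the RaftNode and message type sketches above (the send helper, which serializes a message and writes it to a peer's connection, is an assumption):

```go
// onElectionTimeout fires when the node's election timer expires; a
// Follower that has not heard from a Leader promotes itself to a
// Candidate and broadcasts the RequestVote command to every peer
func (n *RaftNode) onElectionTimeout() {
	n.Lock()
	defer n.Unlock()

	// A recent Heartbeat means the Leader is alive; no election needed
	if n.Type == Leader || time.Since(n.LastHB) < n.Duration {
		return
	}

	// Promote this Follower to a Candidate for a new election Term
	n.Type = Candidate
	n.CurrentTerm++
	n.VotesAck = [2]int{1, 0} // YES and NO vote counts; vote for self
	n.VotesCycles = 3         // cycles to wait for a quorum of votes

	// Broadcast the RequestVote command to all the peer nodes
	for _, peer := range n.PeerAddr {
		go n.send(peer, MsgRequestVote, RequestVoteMsg{
			Term: n.CurrentTerm,
			From: n.Addr,
		})
	}
}
```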
Code segments in the article have been trimmed to just show the pieces relevant to the context
Also, lines with three dots '...' indicate code that has been truncated
On receiving the RequestVote command, the Follower nodes A and C of the cluster respond with a GrantVote command. The GrantVote command from a Follower node indicates either an approval or a denial of the RequestVote from the Candidate node.
The following diagram illustrates the GrantVote command from the Follower nodes of the cluster:
The following code segment shows the handling of the RequestVote by the Follower nodes, each responding with a GrantVote message:
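A simplified sketch of this handling, again building on the earlier sketches (GrantVoteMsg and the send helper are assumptions):

```go
// GrantVoteMsg is a Follower's response to a RequestVote command
type GrantVoteMsg struct {
	Vote  bool
	Term  uint64
	Index uint64
	From  string
}

// onRequestVote approves or denies a Candidate's RequestVote command
func (n *RaftNode) onRequestVote(req *RequestVoteMsg) {
	n.Lock()
	defer n.Unlock()

	// Approve only if the request is for a newer Term and this node has
	// not already granted a vote in the current election
	vote := req.Term > n.CurrentTerm && !n.GrantedVote
	if vote {
		n.GrantedVote = true
		n.CurrentTerm = req.Term
	}

	// Respond to the Candidate with an approval or a denial
	go n.send(req.From, MsgGrantVote, GrantVoteMsg{
		Vote: vote,
		Term: n.CurrentTerm,
		From: n.Addr,
	})
}
```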
Once the Candidate node B receives a quorum of approved GrantVote commands, the Candidate node promotes itself to be the Leader node of the cluster. All the other nodes in the cluster reset themselves to Follower nodes.
The following diagram illustrates the successful election of node B as the Leader of the cluster:
The following code segment shows the handling of the GrantVote commands from the Follower nodes by the Candidate node to promote itself to a Leader node and start the Heartbeat messages:
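A simplified sketch of the vote tally and promotion (the startHeartbeatTimer helper is an assumption):

```go
// onGrantVote tallies GrantVote responses on the Candidate; on reaching a
// quorum the Candidate promotes itself to Leader and starts the Heartbeat
func (n *RaftNode) onGrantVote(ack *GrantVoteMsg) {
	n.Lock()
	defer n.Unlock()

	if n.Type != Candidate {
		return // stale response; the election is already decided
	}

	// Tally the approved and denied votes
	if ack.Vote {
		n.VotesAck[0]++
	} else {
		n.VotesAck[1]++
	}

	// Quorum is N/2 + 1, where N includes this node
	quorum := (len(n.PeerAddr)+1)/2 + 1
	if n.VotesAck[0] >= quorum {
		n.Type = Leader
		n.LeaderAddr = n.Addr
		go n.startHeartbeatTimer() // Heartbeats keep the Followers in-sync
	}
}
```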
Once a Leader node is elected, it establishes a Term and synchronizes that state (CurrentTerm) with the other nodes in the cluster via Heartbeat messages at regular intervals.
We will now demonstrate the Leader Election with a 3 node cluster. We will open 3 Terminals one for each of the nodes. We will refer to them as A, B, and C respectively.
The following is the output of node B, which has been elected as the Leader:
++ [192.168.1.179:9002] Ready to start server on Raft node
++ [192.168.1.179:9002] Election timer duration 1234
++ [192.168.1.179:9002] Ready to accept remote connections
++ [192.168.1.179:9002] Election timer started @ 2020-08-22 19:21:59.208772097 -0400 EDT m=+0.001042014
++ [192.168.1.179:9002] Election timeout - {Type: Follower, IsLeaderOk: false, GrantedVote: false, LastHB: 2020-08-22 19:21:59.208549237 -0400 EDT m=+0.000819154}
++ [192.168.1.179:9002] Election timeout - Follower ready to start Leader election
-- [192.168.1.179:9002] Could not establish connection to peer [192.168.1.179:9001] - dial tcp 192.168.1.179:9001: connect: connection refused
-- [192.168.1.179:9002] Could not establish connection to peer [192.168.1.179:9003] - dial tcp 192.168.1.179:9003: connect: connection refused
-- [192.168.1.179:9002] Election timeout - No QUORUM of Peer connections, RESET back to Follower
++ [192.168.1.179:9002] Election timeout - {Type: Follower, IsLeaderOk: false, GrantedVote: false, LastHB: 2020-08-22 19:21:59.208549237 -0400 EDT m=+0.000819154}
++ [192.168.1.179:9002] Election timeout - Follower ready to start Leader election
-- [192.168.1.179:9002] Could not establish connection to peer [192.168.1.179:9001] - dial tcp 192.168.1.179:9001: connect: connection refused
++ [192.168.1.179:9002] GrantVote event - {Term: 1, Index: 0, Leader: []} => {Vote: true, Term: 2, Index: 0, From: [192.168.1.179:9003]}
++ [192.168.1.179:9002] GrantVote - Candidate {Votes: [YES: 2, NO: 0], Quorum: 2}
++ [192.168.1.179:9002] GrantVote - Candidate PROMOTED to Leader
-- [192.168.1.179:9002] Could not establish connection to peer [192.168.1.179:9001] - dial tcp 192.168.1.179:9001: connect: connection refused
++ [192.168.1.179:9002] Heartbeat timer duration 1000
++ [192.168.1.179:9002] Heartbeat timer - {Term: 2, Index: 0, Committed: 0}
++ [192.168.1.179:9002] Election timeout - {Type: Leader, IsLeaderOk: true, GrantedVote: false, LastHB: 2020-08-22 19:22:02.680284618 -0400 EDT m=+3.472554585}
++ [192.168.1.179:9002] Heartbeat timer - {Term: 2, Index: 0, Committed: 0}
++ [192.168.1.179:9002] Election timeout - {Type: Leader, IsLeaderOk: true, GrantedVote: false, LastHB: 2020-08-22 19:22:03.680932944 -0400 EDT m=+4.473202981}
... ... ...
The following is the output of node C that granted a vote for node B:
++ [192.168.1.179:9003] Ready to start server on Raft node
++ [192.168.1.179:9003] Election timer duration 1245
++ [192.168.1.179:9003] Ready to accept remote connections
++ [192.168.1.179:9003] Election timer started @ 2020-08-22 19:22:00.728157567 -0400 EDT m=+0.000939241
++ [192.168.1.179:9003] RequestVote event - {Term: 0, Index: 0, Leader: []} => {Term: 2, Index: 0, From: [192.168.1.179:9002]}
++ [192.168.1.179:9003] RequestVote - Follower will GrantVote to [192.168.1.179:9002]
++ [192.168.1.179:9003] Heartbeat event - {Term: 2, Index: 0, Leader: [192.168.1.179:9002]} => {Term: 2, Index: 0, From: [192.168.1.179:9002]}
++ [192.168.1.179:9003] Heartbeat event - {Term: 2, Index: 0, Leader: [192.168.1.179:9002]} => {Term: 2, Index: 0, From: [192.168.1.179:9002]}
... ... ...
The following is the output of node A:
++ [192.168.1.179:9001] Ready to start server on Raft node
++ [192.168.1.179:9001] Election timer duration 1268
++ [192.168.1.179:9001] Ready to accept remote connections
++ [192.168.1.179:9001] Election timer started @ 2020-08-22 19:22:02.240564441 -0400 EDT m=+0.001018941
++ [192.168.1.179:9001] Heartbeat event - {Term: 0, Index: 0, Leader: []} => {Term: 2, Index: 0, From: [192.168.1.179:9002]}
++ [192.168.1.179:9001] Heartbeat event - {Term: 2, Index: 0, Leader: [192.168.1.179:9002]} => {Term: 2, Index: 0, From: [192.168.1.179:9002]}
++ [192.168.1.179:9001] Heartbeat event - {Term: 2, Index: 0, Leader: [192.168.1.179:9002]} => {Term: 2, Index: 0, From: [192.168.1.179:9002]}
... ... ...
Let us now TERMINATE the Leader node B by pressing CTRL-C.
The following is the output of node A, which has now been elected as the new Leader:
++ [192.168.1.179:9001] Election timeout - {Type: Follower, IsLeaderOk: false, GrantedVote: false, LastHB: 2020-08-22 19:22:09.684588286 -0400 EDT m=+7.445042755}
++ [192.168.1.179:9001] Election timeout - Follower ready to start Leader election
-- [192.168.1.179:9001] Could not establish connection to peer [192.168.1.179:9002] - dial tcp 192.168.1.179:9002: connect: connection refused
++ [192.168.1.179:9001] GrantVote event - {Term: 2, Index: 0, Leader: [192.168.1.179:9002]} => {Vote: true, Term: 3, Index: 0, From: [192.168.1.179:9003]}
++ [192.168.1.179:9001] GrantVote - Candidate {Votes: [YES: 2, NO: 0], Quorum: 2}
++ [192.168.1.179:9001] GrantVote - Candidate PROMOTED to Leader
-- [192.168.1.179:9001] Could not establish connection to peer [192.168.1.179:9002] - dial tcp 192.168.1.179:9002: connect: connection refused
++ [192.168.1.179:9001] Heartbeat timer duration 1000
++ [192.168.1.179:9001] Heartbeat timer - {Term: 3, Index: 0, Committed: 0}
-- [192.168.1.179:9001] Could not establish connection to peer [192.168.1.179:9002] - dial tcp 192.168.1.179:9002: connect: connection refused
... ... ...
The following is the output of node C, which has granted a vote for node A:
++ [192.168.1.179:9003] Election timeout - {Type: Follower, IsLeaderOk: false, GrantedVote: true, LastHB: 2020-08-22 19:22:09.684801758 -0400 EDT m=+8.957583532}
++ [192.168.1.179:9003] RequestVote event - {Term: 2, Index: 0, Leader: [192.168.1.179:9002]} => {Term: 3, Index: 0, From: [192.168.1.179:9001]}
++ [192.168.1.179:9003] RequestVote - Follower will GrantVote to [192.168.1.179:9001]
++ [192.168.1.179:9003] Heartbeat event - {Term: 3, Index: 0, Leader: [192.168.1.179:9001]} => {Term: 3, Index: 0, From: [192.168.1.179:9001]}
++ [192.168.1.179:9003] Heartbeat event - {Term: 3, Index: 0, Leader: [192.168.1.179:9001]} => {Term: 3, Index: 0, From: [192.168.1.179:9001]}
... ... ...
Now, let us dive into the second aspect of the Raft Consensus algorithm - Log Replication.
Log Replication
Log Replication starts with a remote client sending a ClientUpdate message to the Leader for replication across a quorum of nodes. The message is published to the Go channel called LeaderC so the Leader can process it. Once the Leader successfully processes the ClientUpdate message, it publishes an UpdateAck back to a client-specific Go channel, which is then serialized and dispatched to the remote client.
The following code segment shows the client sending the ClientUpdate message to the Leader:
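A hedged, self-contained sketch of such a client; the Leader address, the one-byte type tag framing, and the MsgClientUpdate value are illustrative assumptions:

```go
package main

import (
	"log"
	"net"
)

const MsgClientUpdate byte = 4 // assumed type tag for a client update

func main() {
	conn, err := net.Dial("tcp", "192.168.1.179:9002") // the Leader node
	if err != nil {
		log.Fatalf("-- could not connect to the Leader - %v", err)
	}
	defer conn.Close()

	// Send the type-tagged update entry to the Leader
	payload := append([]byte{MsgClientUpdate},
		[]byte("The secret of getting ahead is getting started")...)
	if _, err := conn.Write(payload); err != nil {
		log.Fatalf("-- could not send ClientUpdate - %v", err)
	}

	// Wait for the UpdateAck from the Leader
	buf := make([]byte, 256)
	n, err := conn.Read(buf)
	if err != nil {
		log.Fatalf("-- could not read UpdateAck - %v", err)
	}
	log.Printf("++ UpdateAck payload of %d byte(s) received", n)
}
```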
Once the Leader receives a ClientUpdate message, it appends the update to its own Log (assigning the next Index value to CurrentIndex) and sends an AppendEntries message to all the Follower nodes in the cluster for replication.
The following diagram illustrates the Leader getting an update from a client and sending a Log Replication message to the Follower nodes:
The following code segment shows the Leader sending an AppendEntries message (in response to a ClientUpdate message) to the Follower nodes:
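A simplified sketch of this step; it assumes the RaftNode sketch above is extended with Log, CurrentIndex, and CommittedIndex fields, and that AppendEntriesMsg mirrors the message fields seen in the outputs below:

```go
const MsgAppendEntries byte = 5 // assumed type tag for AppendEntries

// AppendEntriesMsg carries one log entry to be replicated on the Followers
type AppendEntriesMsg struct {
	Term   uint64
	Index  uint64
	Client string
	Entry  string
}

// onClientUpdate appends the update to the Leader's own Log and replicates
// it to all the Follower nodes in the cluster
func (n *RaftNode) onClientUpdate(client string, entry string) {
	n.Lock()
	defer n.Unlock()

	// Append the update to the Leader's Log, assigning the next Index
	n.CurrentIndex++
	n.Log = append(n.Log, entry)

	// Send the AppendEntries message to every Follower for replication
	for _, peer := range n.PeerAddr {
		go n.send(peer, MsgAppendEntries, AppendEntriesMsg{
			Term:   n.CurrentTerm,
			Index:  n.CurrentIndex,
			Client: client,
			Entry:  entry,
		})
	}
}
```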
Once a Follower node processes the AppendEntries message, it updates its Index (on success) and, based on the processing status (success or failure), sends a CommitAck message back to the Leader.
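A simplified sketch of the Follower-side processing (CommitAckMsg and the MsgCommitAck tag are assumptions):

```go
const MsgCommitAck byte = 6 // assumed type tag for CommitAck

// CommitAckMsg reports a Follower's replication status back to the Leader
type CommitAckMsg struct {
	Ack    bool
	Term   uint64
	Index  uint64
	Peer   string
	Client string
}

// onAppendEntries applies the replicated entry (if it is the very next
// Index) and acknowledges the processing status back to the Leader
func (n *RaftNode) onAppendEntries(msg *AppendEntriesMsg) {
	n.Lock()
	defer n.Unlock()

	ok := msg.Term == n.CurrentTerm && msg.Index == n.CurrentIndex+1
	if ok {
		// Success - append the entry and advance this node's Index
		n.Log = append(n.Log, msg.Entry)
		n.CurrentIndex = msg.Index
	}

	go n.send(n.LeaderAddr, MsgCommitAck, CommitAckMsg{
		Ack:    ok,
		Term:   n.CurrentTerm,
		Index:  msg.Index,
		Peer:   n.Addr,
		Client: msg.Client,
	})
}
```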
The following diagram illustrates the Leader sending replication message to Follower nodes. Once a Follower node processes the replication, it sends an acknowledgement back to the Leader:
The following code segment shows the Leader processing the CommitAck messages from the Follower nodes:
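A simplified sketch of the tally; the CommitAcks counter map and the client-specific ack channel are illustrative assumptions:

```go
// onCommitAck tallies the Followers' acknowledgements for a pending Index;
// on reaching a quorum the Leader finalizes (commits) the Index and signals
// the client-specific channel so an UpdateAck can be sent to the client
func (n *RaftNode) onCommitAck(ack *CommitAckMsg, clientC chan<- bool) {
	n.Lock()
	defer n.Unlock()

	if !ack.Ack || ack.Index <= n.CommittedIndex {
		return // a denial, or an ack for an already committed Index
	}

	n.CommitAcks[ack.Index]++ // count this Follower's acknowledgement

	// The Leader's own append counts toward the quorum of N/2 + 1
	quorum := (len(n.PeerAddr)+1)/2 + 1
	if n.CommitAcks[ack.Index]+1 >= quorum {
		n.CommittedIndex = ack.Index // finalize the Index
		clientC <- true              // triggers the UpdateAck to the client
	}
}
```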
Once the Leader node receives a quorum of CommitAck messages (from the Followers), it finalizes the Index number by updating CommittedIndex. If the Leader does not receive a quorum, it waits a few cycles (to accommodate delays in the network delivering CommitAck messages) and checks again. Finally, the Leader sends an UpdateAck message back to the client.
The following diagram illustrates the Leader processing commit acknowledgement messages from the Follower nodes. Once the Leader node gets either a quorum or exhausts the wait cycles (3 election timeout durations in our case), it sends an update acknowledgement back to the client:
The following code segment shows the Leader processing the CommitAck messages via the wait cycles in the election timer:
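A simplified sketch of that check; the PendingCycles counter is an illustrative assumption:

```go
// onLeaderTick runs on every election timeout cycle on the Leader; if a
// pending update has not reached a quorum after the configured number of
// wait cycles, the Leader stops waiting and signals the outcome so far
func (n *RaftNode) onLeaderTick(clientC chan<- bool) {
	n.Lock()
	defer n.Unlock()

	if n.Type != Leader || n.CurrentIndex == n.CommittedIndex {
		return // nothing pending to commit
	}

	n.PendingCycles++         // one more election timeout cycle has elapsed
	if n.PendingCycles >= 3 { // 3 election timeout durations in our case
		n.PendingCycles = 0
		clientC <- false // quorum not reached in time; inform the client
	}
}
```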
We will now demonstrate the Log Replication capability with the same 3 node cluster. We will start a client that sends updates to the Leader node.
The following is the output of node B, which has been elected as the Leader; it receives updates from the client, replicates them to the Followers, receives commit acknowledgements from the Followers, and finally sends an update acknowledgement to the client:
++ [192.168.1.179:9002] Ready to start server on Raft node
++ [192.168.1.179:9002] Election timer duration 1263
++ [192.168.1.179:9002] Ready to accept remote connections
++ [192.168.1.179:9002] Election timer started @ 2020-08-22 21:48:52.642250222 -0400 EDT m=+0.001114231
++ [192.168.1.179:9002] Election timeout - {Type: Follower, IsLeaderOk: false, GrantedVote: false, LastHB: 2020-08-22 21:48:52.641940769 -0400 EDT m=+0.000804798}
++ [192.168.1.179:9002] Election timeout - Follower ready to start Leader election
-- [192.168.1.179:9002] Could not establish connection to peer [192.168.1.179:9001] - dial tcp 192.168.1.179:9001: connect: connection refused
++ [192.168.1.179:9002] GrantVote event - {Term: 0, Index: 0, Leader: []} => {Vote: true, Term: 1, Index: 0, From: [192.168.1.179:9003]}
++ [192.168.1.179:9002] GrantVote - Candidate {Votes: [YES: 2, NO: 0], Quorum: 2}
++ [192.168.1.179:9002] GrantVote - Candidate PROMOTED to Leader
-- [192.168.1.179:9002] Could not establish connection to peer [192.168.1.179:9001] - dial tcp 192.168.1.179:9001: connect: connection refused
++ [192.168.1.179:9002] Heartbeat timer duration 1000
++ [192.168.1.179:9002] Heartbeat timer - {Term: 1, Index: 0, Committed: 0}
++ [192.168.1.179:9002] Election timeout - {Type: Leader, IsLeaderOk: true, GrantedVote: false, LastHB: 2020-08-22 21:48:54.908250491 -0400 EDT m=+2.267114470}
... ... ...
++ [192.168.1.179:9002] ClientUpdate event from 192.168.1.179:40316
++ [192.168.1.179:9002] ClientUpdate event - {Addr: [192.168.1.179:40316], Entry: It does not matter how slowly you go as long as you do not stop, Time: 2020-08-22 21:49:42.008284085 -0400 EDT}
++ [192.168.1.179:9002] CommitAck event - {Term: 1, Index: 1} => {Ack: true, Term: 1, Index: 1, Peer: [192.168.1.179:9001], Client: [192.168.1.179:40316]}
++ [192.168.1.179:9002] CommitAck - Leader ACCEPTED update {Client: [192.168.1.179:40316], Index: 1}
++ [192.168.1.179:9002] Client - UpdateAck event from [192.168.1.179:9002] {Ack: true, Addr: 192.168.1.179:40316}
++ [192.168.1.179:9002] CommitAck event - {Term: 1, Index: 1} => {Ack: true, Term: 1, Index: 1, Peer: [192.168.1.179:9003], Client: [192.168.1.179:40316]}
... ... ...
++ [192.168.1.179:9002] ClientUpdate event - {Addr: [192.168.1.179:40316], Entry: The secret of getting ahead is getting started, Time: 2020-08-22 21:49:52.010354848 -0400 EDT}
++ [192.168.1.179:9002] CommitAck event - {Term: 1, Index: 2} => {Ack: true, Term: 1, Index: 2, Peer: [192.168.1.179:9001], Client: [192.168.1.179:40316]}
++ [192.168.1.179:9002] CommitAck - Leader ACCEPTED update {Client: [192.168.1.179:40316], Index: 2}
++ [192.168.1.179:9002] Client - UpdateAck event from [192.168.1.179:9002] {Ack: true, Addr: 192.168.1.179:40316}
... ... ...
++ [192.168.1.179:9002] ClientUpdate event - {Addr: [192.168.1.179:40316], Entry: Be kind whenever possible. It is always possible, Time: 2020-08-22 21:50:02.012186897 -0400 EDT}
++ [192.168.1.179:9002] CommitAck event - {Term: 1, Index: 3} => {Ack: true, Term: 1, Index: 3, Peer: [192.168.1.179:9001], Client: [192.168.1.179:40316]}
++ [192.168.1.179:9002] CommitAck - Leader ACCEPTED update {Client: [192.168.1.179:40316], Index: 3}
++ [192.168.1.179:9002] Client - UpdateAck event from [192.168.1.179:9002] {Ack: true, Addr: 192.168.1.179:40316}
++ [192.168.1.179:9002] CommitAck event - {Term: 1, Index: 3} => {Ack: true, Term: 1, Index: 3, Peer: [192.168.1.179:9003], Client: [192.168.1.179:40316]}
... ... ...
The following is the output of node C that granted a vote for node B, receives replication messages from the Leader, updates its own Log, and finally acknowledges the replication commit back to the Leader:
++ [192.168.1.179:9003] Ready to start server on Raft node
++ [192.168.1.179:9003] Election timer duration 1152
++ [192.168.1.179:9003] Ready to accept remote connections
++ [192.168.1.179:9003] Election timer started @ 2020-08-22 21:48:53.873848859 -0400 EDT m=+0.000953318
++ [192.168.1.179:9003] RequestVote event - {Term: 0, Index: 0, Leader: []} => {Term: 1, Index: 0, From: [192.168.1.179:9002]}
++ [192.168.1.179:9003] RequestVote - Follower will GrantVote to [192.168.1.179:9002]
++ [192.168.1.179:9003] Heartbeat event - {Term: 1, Index: 0, Leader: [192.168.1.179:9002]} => {Term: 1, Index: 0, From: [192.168.1.179:9002]}
... ... ...
++ [192.168.1.179:9003] AppendEntries event - {Term: 1, Index: 0, Leader: [192.168.1.179:9002]} => {Term: 1, Index: 1, Client: [192.168.1.179:40316], Entry: It does not matter how slowly you go as long as you do not stop}
++ [192.168.1.179:9003] AppendEntries - UPDATE accepted {Index: 1} with {Entry: It does not matter how slowly you go as long as you do not stop}
++ [192.168.1.179:9003] Heartbeat event - {Term: 1, Index: 1, Leader: [192.168.1.179:9002]} => {Term: 1, Index: 1, From: [192.168.1.179:9002]}
... ... ...
++ [192.168.1.179:9003] AppendEntries event - {Term: 1, Index: 1, Leader: [192.168.1.179:9002]} => {Term: 1, Index: 2, Client: [192.168.1.179:40316], Entry: The secret of getting ahead is getting started}
++ [192.168.1.179:9003] AppendEntries - UPDATE accepted {Index: 2} with {Entry: The secret of getting ahead is getting started}
++ [192.168.1.179:9003] Heartbeat event - {Term: 1, Index: 2, Leader: [192.168.1.179:9002]} => {Term: 1, Index: 2, From: [192.168.1.179:9002]}
... ... ...
++ [192.168.1.179:9003] AppendEntries event - {Term: 1, Index: 2, Leader: [192.168.1.179:9002]} => {Term: 1, Index: 3, Client: [192.168.1.179:40316], Entry: Be kind whenever possible. It is always possible}
++ [192.168.1.179:9003] AppendEntries - UPDATE accepted {Index: 3} with {Entry: Be kind whenever possible. It is always possible}
++ [192.168.1.179:9003] Heartbeat event - {Term: 1, Index: 3, Leader: [192.168.1.179:9002]} => {Term: 1, Index: 3, From: [192.168.1.179:9002]}
... ... ...
The following is the output of node A that receives replication messages from the Leader, updates its own Log, and finally acknowledges the replication commit back to the Leader:
++ [192.168.1.179:9001] Ready to start server on Raft node
++ [192.168.1.179:9001] Election timer duration 1245
++ [192.168.1.179:9001] Ready to accept remote connections
++ [192.168.1.179:9001] Election timer started @ 2020-08-22 21:48:54.626840575 -0400 EDT m=+0.001222294
++ [192.168.1.179:9001] Heartbeat event - {Term: 0, Index: 0, Leader: []} => {Term: 1, Index: 0, From: [192.168.1.179:9002]}
... ... ...
++ [192.168.1.179:9001] AppendEntries event - {Term: 1, Index: 0, Leader: [192.168.1.179:9002]} => {Term: 1, Index: 1, Client: [192.168.1.179:40316], Entry: It does not matter how slowly you go as long as you do not stop}
++ [192.168.1.179:9001] AppendEntries - UPDATE accepted {Index: 1} with {Entry: It does not matter how slowly you go as long as you do not stop}
++ [192.168.1.179:9001] Heartbeat event - {Term: 1, Index: 1, Leader: [192.168.1.179:9002]} => {Term: 1, Index: 1, From: [192.168.1.179:9002]}
... ... ...
++ [192.168.1.179:9001] AppendEntries event - {Term: 1, Index: 1, Leader: [192.168.1.179:9002]} => {Term: 1, Index: 2, Client: [192.168.1.179:40316], Entry: The secret of getting ahead is getting started}
++ [192.168.1.179:9001] AppendEntries - UPDATE accepted {Index: 2} with {Entry: The secret of getting ahead is getting started}
++ [192.168.1.179:9001] Heartbeat event - {Term: 1, Index: 2, Leader: [192.168.1.179:9002]} => {Term: 1, Index: 2, From: [192.168.1.179:9002]}
... ... ...
++ [192.168.1.179:9001] AppendEntries event - {Term: 1, Index: 2, Leader: [192.168.1.179:9002]} => {Term: 1, Index: 3, Client: [192.168.1.179:40316], Entry: Be kind whenever possible. It is always possible}
++ [192.168.1.179:9001] AppendEntries - UPDATE accepted {Index: 3} with {Entry: Be kind whenever possible. It is always possible}
++ [192.168.1.179:9001] Heartbeat event - {Term: 1, Index: 3, Leader: [192.168.1.179:9002]} => {Term: 1, Index: 3, From: [192.168.1.179:9002]}
... ... ...
Let us now TERMINATE the Follower node C by pressing CTRL-C.
The following is the output of node B, which continues to be the Leader and processes updates from the client as there is still a quorum (2 nodes):
... ... ...
++ [192.168.1.179:9002] ClientUpdate event - {Addr: [192.168.1.179:40316], Entry: It always seems impossible until it's done, Time: 2020-08-22 21:50:12.0141336 -0400 EDT}
-- [192.168.1.179:9002] Could not establish connection to peer [192.168.1.179:9003] - dial tcp 192.168.1.179:9003: connect: connection refused
++ [192.168.1.179:9002] CommitAck event - {Term: 1, Index: 4} => {Ack: true, Term: 1, Index: 4, Peer: [192.168.1.179:9001], Client: [192.168.1.179:40316]}
++ [192.168.1.179:9002] CommitAck - Leader ACCEPTED update {Client: [192.168.1.179:40316], Index: 4}
++ [192.168.1.179:9002] Client - UpdateAck event from [192.168.1.179:9002] {Ack: true, Addr: 192.168.1.179:40316}
++ [192.168.1.179:9002] Heartbeat timer - {Term: 1, Index: 4, Committed: 4}
-- [192.168.1.179:9002] Could not establish connection to peer [192.168.1.179:9003] - dial tcp 192.168.1.179:9003: connect: connection refused
++ [192.168.1.179:9002] Election timeout - {Type: Leader, IsLeaderOk: true, GrantedVote: false, LastHB: 2020-08-22 21:50:12.951802123 -0400 EDT m=+80.310666183}
++ [192.168.1.179:9002] Heartbeat timer - {Term: 1, Index: 4, Committed: 4}
-- [192.168.1.179:9002] Could not establish connection to peer [192.168.1.179:9003] - dial tcp 192.168.1.179:9003: connect: connection refused
... ... ...
++ [192.168.1.179:9002] ClientUpdate event - {Addr: [192.168.1.179:40316], Entry: If you're going through hell, keep going, Time: 2020-08-22 21:50:22.015607258 -0400 EDT}
++ [192.168.1.179:9002] CommitAck event - {Term: 1, Index: 5} => {Ack: true, Term: 1, Index: 5, Peer: [192.168.1.179:9001], Client: [192.168.1.179:40316]}
++ [192.168.1.179:9002] CommitAck - Leader ACCEPTED update {Client: [192.168.1.179:40316], Index: 5}
++ [192.168.1.179:9002] Client - UpdateAck event from [192.168.1.179:9002] {Ack: true, Addr: 192.168.1.179:40316}
++ [192.168.1.179:9002] CommitAck event - {Term: 1, Index: 5} => {Ack: true, Term: 1, Index: 5, Peer: [192.168.1.179:9003], Client: [192.168.1.179:40316]}
++ [192.168.1.179:9002] Heartbeat timer - {Term: 1, Index: 5, Committed: 5}
++ [192.168.1.179:9002] Election timeout - {Type: Leader, IsLeaderOk: true, GrantedVote: false, LastHB: 2020-08-22 21:50:22.957223561 -0400 EDT m=+90.316087591}
... ... ...
The following is the output of node A, which continues to process replication messages from the Leader:
... ... ...
++ [192.168.1.179:9001] AppendEntries event - {Term: 1, Index: 3, Leader: [192.168.1.179:9002]} => {Term: 1, Index: 4, Client: [192.168.1.179:40316], Entry: It always seems impossible until it's done}
++ [192.168.1.179:9001] AppendEntries - UPDATE accepted {Index: 4} with {Entry: It always seems impossible until it's done}
++ [192.168.1.179:9001] Heartbeat event - {Term: 1, Index: 4, Leader: [192.168.1.179:9002]} => {Term: 1, Index: 4, From: [192.168.1.179:9002]}
... ... ...
++ [192.168.1.179:9001] AppendEntries event - {Term: 1, Index: 4, Leader: [192.168.1.179:9002]} => {Term: 1, Index: 5, Client: [192.168.1.179:40316], Entry: If you're going through hell, keep going}
++ [192.168.1.179:9001] AppendEntries - UPDATE accepted {Index: 5} with {Entry: If you're going through hell, keep going}
++ [192.168.1.179:9001] Heartbeat event - {Term: 1, Index: 5, Leader: [192.168.1.179:9002]} => {Term: 1, Index: 5, From: [192.168.1.179:9002]}
... ... ...
Now, let us bring BACK node C by restarting it.
The following is the output of node C, which synchronizes with the Leader on the missed updates and catches up:
++ [192.168.1.179:9003] Ready to start server on Raft node
++ [192.168.1.179:9003] Election timer duration 1212
++ [192.168.1.179:9003] Ready to accept remote connections
++ [192.168.1.179:9003] Election timer started @ 2020-08-22 21:50:17.482606286 -0400 EDT m=+0.000887964
++ [192.168.1.179:9003] Heartbeat event - {Term: 0, Index: 0, Leader: []} => {Term: 1, Index: 4, From: [192.168.1.179:9002]}
-- [192.168.1.179:9003] Heartbeat timeout - Node index BEHIND from [192.168.1.179:9002]
++ [192.168.1.179:9003] AppendEntries event - {Term: 1, Index: 0, Leader: [192.168.1.179:9002]} => {Term: 1, Index: 1, Client: [192.168.1.179:9002], Entry: It does not matter how slowly you go as long as you do not stop}
++ [192.168.1.179:9003] AppendEntries - SYNC accepted {Index: 1} with {Entry: It does not matter how slowly you go as long as you do not stop}
++ [192.168.1.179:9003] AppendEntries event - {Term: 1, Index: 1, Leader: [192.168.1.179:9002]} => {Term: 1, Index: 2, Client: [192.168.1.179:9002], Entry: The secret of getting ahead is getting started}
++ [192.168.1.179:9003] AppendEntries - SYNC accepted {Index: 2} with {Entry: The secret of getting ahead is getting started}
++ [192.168.1.179:9003] AppendEntries event - {Term: 1, Index: 2, Leader: [192.168.1.179:9002]} => {Term: 1, Index: 3, Client: [192.168.1.179:9002], Entry: Be kind whenever possible. It is always possible}
++ [192.168.1.179:9003] AppendEntries - SYNC accepted {Index: 3} with {Entry: Be kind whenever possible. It is always possible}
++ [192.168.1.179:9003] AppendEntries event - {Term: 1, Index: 3, Leader: [192.168.1.179:9002]} => {Term: 1, Index: 4, Client: [192.168.1.179:9002], Entry: It always seems impossible until it's done}
++ [192.168.1.179:9003] AppendEntries - SYNC accepted {Index: 4} with {Entry: It always seems impossible until it's done}
++ [192.168.1.179:9003] Heartbeat event - {Term: 1, Index: 4, Leader: [192.168.1.179:9002]} => {Term: 1, Index: 4, From: [192.168.1.179:9002]}
... ... ...
++ [192.168.1.179:9003] AppendEntries event - {Term: 1, Index: 4, Leader: [192.168.1.179:9002]} => {Term: 1, Index: 5, Client: [192.168.1.179:40316], Entry: If you're going through hell, keep going}
++ [192.168.1.179:9003] AppendEntries - UPDATE accepted {Index: 5} with {Entry: If you're going through hell, keep going}
++ [192.168.1.179:9003] Heartbeat event - {Term: 1, Index: 5, Leader: [192.168.1.179:9002]} => {Term: 1, Index: 5, From: [192.168.1.179:9002]}
... ... ...
Now, let us dive into the final aspect of the Raft Consensus algorithm - Safety.
Safety
The Raft Consensus algorithm guarantees consistency at all times if the following properties hold *TRUE*:
Election Safety :: at any point in time, there can be *ONLY* one Leader. This is why we need an odd number of nodes in a cluster for a quorum. Also, only a Candidate with *UP-TO-DATE* committed log entries will be allowed to become a Leader
Leader Append Only :: a Leader will *NEVER* update or delete log entries; it will *ONLY* append log entries
Log Matching :: if two replication logs contain a log entry with the same Term and Index, then the replication logs *WILL* be consistent and identical for all the previous log entries leading up to that Index
Leader Completeness :: if a log entry is committed at a particular Index in a given Term, that log entry *WILL* exist in the replication logs of future Leaders at a higher Term
State Machine Safety :: if a node (including the Leader) has committed a log entry at a particular Index, *NO* other node can commit a different log entry at that same Index
This concludes our practical hands-on approach to unravelling the Raft Consensus algorithm.
The code in this article is a simple implementation of Raft Consensus to understand the inner workings of the algorithm. It is by NO means EXHAUSTIVELY or THOROUGHLY tested. It is PURELY for learning purposes.