Raft (2) FAQ

Q: What are some uses of Raft besides GFS master replication?

A: You could (and will) build a fault-tolerant key/value database using Raft. You could make the MapReduce master fault-tolerant with Raft. You could build a fault-tolerant locking service.

Q: When Raft receives a read request, does it still commit a no-op?

A: Section 8 mentions two different approaches. The leader could send out a heartbeat first; or there could be a convention that the leader can't change for a known period of time after each heartbeat (the lease). Real systems usually use a lease, since it requires less communication. Section 8 says the leader sends out a no-op only at the very beginning of its term.

Q: The paper states that no log writes are required on a read, but then immediately goes on to introduce committing a no-op as a technique for the leader to learn which entries are committed. Is this a contradiction, or are no-ops not considered log "writes"?

A: The no-op only happens at the start of the term, not for each read.

Q: I find the line about the leader needing to commit a no-op entry in order to know which entries are committed pretty confusing. Why does it need to do this?

A: The problem situation is shown in Figure 8, where if S1 becomes leader after (b), it cannot know whether its last log entry (the entry from term 2) is committed or not. The situation in which that last log entry turns out not to be committed is if S1 immediately fails and S5 is the next leader; in that case S5 will force all peers (including S1) to have logs identical to S5's log, which does not include entry 2. But suppose S1 manages to commit a new entry during its term (term 4). If S5 sees the new entry, S5 will erase 3 from its log and accept 2 in its place. If S5 does not see the new entry, S5 cannot be the next leader if S1 fails, because it will fail the Election Restriction. Either way, once S1 has committed a new entry for its term, it can correctly conclude that every preceding entry in its log is committed.

The no-op text at the end of Section 8 is talking about an optimization in which the leader executes and answers read-only commands (e.g. get("k1")) without committing those commands in the log. For example, for get("k1"), the leader just looks up "k1" in its key/value table and sends the result back to the client. If the leader has just started, it may have at the end of its log a put("k1", "v99"). Should the leader send "v99" back to the client, or the value in the leader's key/value table? At first, the leader doesn't know whether that v99 log entry is committed (and must be returned to the client) or not committed (and must not be sent back). So (if you are using this optimization) a new Raft leader first tries to commit a no-op to the log; if the commit succeeds (i.e. the leader doesn't crash), then the leader knows everything before that point is committed.

Q: How does using the heartbeat mechanism to provide leases for read-only operations work, and why does this require timing for safety (e.g. bounded clock skew)?

A: I don't know exactly what the authors had in mind. Perhaps every AppendEntries RPC the leader sends out says or implies that no other leader is allowed to be elected for the next 100 milliseconds. If the leader gets positive responses from a majority, then the leader can serve read-only requests for the next 100 milliseconds without further communication with the followers. This requires the servers to have the same definition of what 100 milliseconds means, i.e. they must have clocks that tick at close to the same rate.
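For what it's worth, here's a rough sketch in Go of how the lease check might look on the leader. This is only my guess at an implementation; the names (leaseDuration, lastMajorityAck, leaseLeader) are invented for illustration and don't come from the paper or the labs.

    package main

    import (
        "errors"
        "fmt"
        "time"
    )

    // leaseDuration is the period during which, by convention, no new leader
    // may be elected after a follower acknowledges a heartbeat. All servers
    // must agree on this value and have clocks that tick at roughly the same rate.
    const leaseDuration = 100 * time.Millisecond

    // leaseLeader holds just enough state for the read-only optimization:
    // the key/value table and the time at which a majority last acknowledged
    // one of this leader's heartbeats.
    type leaseLeader struct {
        kv              map[string]string
        lastMajorityAck time.Time
    }

    // get serves a read without committing anything to the log, but only while
    // the lease from the last round of heartbeats is still fresh; otherwise the
    // caller has to fall back to talking to a majority (or committing a no-op).
    func (l *leaseLeader) get(key string) (string, error) {
        if time.Since(l.lastMajorityAck) >= leaseDuration {
            return "", errors.New("lease expired: contact a majority before serving reads")
        }
        return l.kv[key], nil
    }

    func main() {
        l := &leaseLeader{
            kv:              map[string]string{"k1": "v99"},
            lastMajorityAck: time.Now(), // pretend a majority just replied to a heartbeat
        }
        fmt.Println(l.get("k1"))
    }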
Q: What exactly do the C_old and C_new variables in Section 6 (and Figure 11) represent? Are they the leader in each configuration?

A: They are the sets of servers in the old and new configurations. The paper doesn't provide details, but I believe each is the identities (network names or addresses) of the servers in that configuration.

Q: When transitioning from cluster C_old to cluster C_new, how can we create a hybrid cluster C_{old,new}? I don't really understand what that means. Isn't it following either the network configuration of C_old or of C_new? What if the two networks disagreed on a connection?

A: During the period of joint consensus (while C_{old,new} is active), the leader is required to get a majority from both the servers in C_old and the servers in C_new. There can't really be disagreement, because after C_{old,new} is committed into the logs of both C_old and C_new (i.e. after the period of joint consensus has started), any new leader in either C_old or C_new is guaranteed to see the log entry for C_{old,new}.

Q: I'm confused about Figure 11 in the paper. I'm unsure about how exactly the transition from C_old to C_{old,new} to C_new goes. Why is there the issue of the cluster leader not being a part of the new configuration, where the leader steps down once it has committed the C_new log entry? (The second issue mentioned in Section 6.)

A: Suppose C_old={S1,S2,S3} and C_new={S4,S5,S6}, and that S1 is the leader at the start of the configuration change. At the end of the configuration change, after S1 has committed C_new, S1 should not be participating any more, since S1 isn't in C_new. One of S4, S5, or S6 should take over as leader.

Q: About cluster configuration: during the configuration change, if we have to stop receiving requests from clients, then what's the point of having this automated configuration step? Doesn't it suffice to just 1) stop receiving requests, 2) change the configuration, and 3) restart the system and continue?

A: The challenge here is ensuring that the system is correct even if there are failures during this process, and even if not all servers get the "stop receiving requests" and "change the configuration" commands at the same time. Any scheme has to cope with the possibility of a mix of servers that have and have not seen or completed the configuration change -- this is true even of a non-automated system. The paper's protocol is one way to solve this problem.

Q: The last two paragraphs of Section 6 discuss removed servers interfering with the cluster by trying to get elected even though they've been removed from the configuration. Wouldn't a simpler solution be to require servers to be shut down when they leave the configuration? It seems that leaving the cluster implies that a server can't send or receive RPCs to the rest of the cluster anymore, but the paper doesn't assume that. Why not? Why can't you assume that the servers will shut down right away?

A: I think the immediate problem is that the Section 6 protocol doesn't commit C_new to the old servers; it only commits C_new to the servers in C_new. So the servers that are not in C_new never learn when C_new takes over from C_{old,new}. The paper does say this: when the new configuration has been committed under the rules of C_new, the old configuration is irrelevant and servers not in the new configuration can be shut down. So perhaps the problem only exists during the period of time between the configuration change and when an administrator shuts down the old servers. I don't know why they don't have a more automated scheme.
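To make the joint-consensus rule above concrete, here's a small Go sketch of how a leader might decide that an entry is committed while C_{old,new} is in effect. The representation (server names in a map) and the function names are invented for illustration.

    package main

    import "fmt"

    // hasMajority reports whether the entry is replicated on more than half of
    // the servers in one configuration.
    func hasMajority(replicatedOn map[string]bool, config []string) bool {
        count := 0
        for _, server := range config {
            if replicatedOn[server] {
                count++
            }
        }
        return count > len(config)/2
    }

    // committedInJointConsensus applies the C_{old,new} rule: an entry is
    // committed only if it has a majority in C_old AND a majority in C_new,
    // counted separately.
    func committedInJointConsensus(replicatedOn map[string]bool, cOld, cNew []string) bool {
        return hasMajority(replicatedOn, cOld) && hasMajority(replicatedOn, cNew)
    }

    func main() {
        cOld := []string{"S1", "S2", "S3"}
        cNew := []string{"S4", "S5", "S6"}
        // On S1 and S2 (a majority of C_old) but only on S4 (not a majority of C_new).
        replicatedOn := map[string]bool{"S1": true, "S2": true, "S4": true}
        fmt.Println(committedInJointConsensus(replicatedOn, cOld, cNew)) // false
    }

During normal (non-joint) operation only the single hasMajority check over the current configuration would apply.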
Q: How common is it to get a majority from both the old and new configurations when doing things like elections and entry commitment? If it's uncommon, how badly would this affect performance?

A: I imagine that in most cases there is no failure, and the leader gets both majorities right away. A configuration change probably only takes a few round-trip times, i.e. a few dozen milliseconds, so the requirement to get both majorities will slow the system down for only a small amount of time. Configuration changes are likely to be uncommon (perhaps every few months); a few milliseconds of delay every few months doesn't seem like a high price.

Q: And how important is the decision to have both majorities?

A: Both majorities are required for correctness, to cover the possibility that the leader fails during the configuration change.

Q: Just to be clear, the process of having new members join as non-voting entities isn't to speed up the process of replicating the log, but rather to influence the election process? How does this increase availability? These servers that need to catch up are not going to be available regardless, right?

A: The purpose of non-voting servers is to allow those servers to get a complete copy of the leader's log without holding up new commits. The point is to allow a subsequent configuration change to be quick. If the new servers didn't already have nearly-complete logs, then the leader wouldn't be able to commit C_{old,new} until they caught up; and no new client commands can be executed between the time C_{old,new} is first sent out and the time at which it is committed.

Q: If the cluster leader does not have the new configuration, why doesn't it just remove itself from the majority while committing C_new, and then when done return to being leader? Is there a need for a new election process?

A: Is this about the "second issue" in Section 6? The situation they describe is one in which the leader isn't in the new configuration at all. So after C_new is committed, the leader shouldn't be participating in Raft at all.

Q: How does the non-voting membership status work in the configuration change portion of Raft? Does that server state only last during the changeover (i.e. while C_new is not committed), or do servers only get full voting privileges after being fully "caught up"? If so, at what point are they considered "caught up"?

A: The paper doesn't have much detail here. I imagine that the leader won't start the configuration change until the servers to be added (the non-voting members) are close to completely caught up. When the leader sends out the C_{old,new} log entry in AppendEntries RPCs to those new servers, the leader will bring them fully up to date (using the Figure 2 machinery). The leader won't be able to commit the C_{old,new} entry until a majority of those new servers are fully caught up. Once the C_{old,new} entry is committed, those new servers can vote.
Q: I don't disagree that having servers deny RequestVotes that are less than the minimum election timeout from the last heartbeat is a good idea (it helps prevent unnecessary elections in general), but why did they choose that method specifically to prevent servers not in a configuration from running for election? It seems like it would make more sense to check whether a given server is in the current configuration. E.g., in the lab code we are using, each server has the RPC addresses of all the servers (in the current configuration?), and so should be able to check whether a RequestVote RPC came from a valid (in-configuration) server, no?

A: I agree that the paper's design seems a little awkward, and I don't know why they designed it that way. Your idea seems like a reasonable starting point. One complication is that there may be situations in which a server in C_new is leader during the joint consensus phase, but at that time some servers in C_old may not know about the joint consensus phase (i.e. they only know about C_old, not C_{old,new}); we would not want the latter servers to ignore the legitimate leader.

Q: When exactly does joint consensus begin, and when does it end? Does joint consensus begin at commit time of C_{old,new}?

A: Joint consensus is in progress when the current leader is aware of C_{old,new}. If the leader doesn't manage to commit C_{old,new}, and crashes, and the new leader doesn't have C_{old,new} in its log, then joint consensus ends early. If a leader manages to commit C_{old,new}, then joint consensus has not just started but will eventually complete, when a leader commits C_new.

Q: Can the configuration log entry be overwritten by a subsequent leader (assuming that the log entry has not been committed)?

A: Yes, that is possible, if the original leader trying to send out C_{old,new} crashes before it commits C_{old,new}.

Q: How can the C_{old,new} log entry ever be committed? It seems like it must be replicated to a majority of "old" servers (as well as the "new" servers), but the append of C_{old,new} immediately transitions the old server to new, right?

A: The commit does not change the set of servers in C_old or C_new. For example, perhaps the original configuration contains servers S1, S2, S3; then C_old is {S1,S2,S3}. Perhaps the desired configuration is S4, S5, S6; then C_new is {S4,S5,S6}. Once C_new is committed to the log, the configuration is C_new={S4,S5,S6}; S1, S2, and S3 are no longer part of the configuration.

Q: When snapshots are created, is the data and state that's saved the client application's data? If it's the client's data, then is this something that the client itself would need to support, in addition to the modifications mentioned in the Raft paper?

A: Example: if you are building a key/value server that uses Raft for replication, then there will be a key/value module in the server that stores a table of keys and values. It is that table of keys and values that is saved in the snapshot.

Q: The paper says that "if the follower receives a snapshot that describes a prefix of its log, then log entries covered by the snapshot are deleted but entries following the snapshot are retained". This means that we could potentially be deleting operations on the state machine.

A: I don't think information will be lost. If the snapshot covers a prefix of the log, that means the snapshot includes the effects of all the operations in that prefix. So it's OK to discard that prefix.

Q: It seems that snapshots are useful when they are a lot smaller than the sequence of updates that produced them (e.g., frequent updates to a few keys). What happens when a snapshot is as big as the sum of its updates (e.g., each update inserts a new unique key)? Are there any cost savings from doing snapshots at all in this case?

A: If the snapshot is about as big as the log, then there may not be a lot of value in having snapshots. On the other hand, perhaps the snapshot organizes the data in a way that's easier to access than a log, e.g. in a sorted table. Then it might be faster to re-start the service after a crash+reboot from a snapshotted table than from the log (which you would have to sort). It's much more typical, however, for the log to be much bigger than the state.
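For concreteness, here's one way the key/value service's snapshot bytes might be produced and consumed, using Go's encoding/gob. The struct layout (the table plus a LastIncludedIndex field) is an assumption for illustration, not something prescribed by the paper or the lab skeleton.

    package main

    import (
        "bytes"
        "encoding/gob"
        "fmt"
        "log"
    )

    // kvSnapshot is the service-level state that gets saved: the whole
    // key/value table, plus the index of the last log entry whose effects
    // are already reflected in that table.
    type kvSnapshot struct {
        LastIncludedIndex int
        Table             map[string]string
    }

    // encodeSnapshot turns the table into the opaque bytes that Raft stores
    // and later sends in InstallSnapshot RPCs.
    func encodeSnapshot(lastIncludedIndex int, table map[string]string) []byte {
        var buf bytes.Buffer
        if err := gob.NewEncoder(&buf).Encode(kvSnapshot{lastIncludedIndex, table}); err != nil {
            log.Fatal(err)
        }
        return buf.Bytes()
    }

    // decodeSnapshot is what a restarting (or snapshot-receiving) server uses
    // to rebuild its key/value table.
    func decodeSnapshot(data []byte) kvSnapshot {
        var s kvSnapshot
        if err := gob.NewDecoder(bytes.NewReader(data)).Decode(&s); err != nil {
            log.Fatal(err)
        }
        return s
    }

    func main() {
        data := encodeSnapshot(100, map[string]string{"k1": "v99"})
        s := decodeSnapshot(data)
        fmt.Println(s.LastIncludedIndex, s.Table["k1"]) // 100 v99
    }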
Q: Also, wouldn't an InstallSnapshot incur heavy bandwidth costs?

A: Yes, if the state is large (as it would be for e.g. a database). However, this is not an easy problem to solve. You'd probably want the leader to keep enough of its log to cover all common cases of followers lagging or being temporarily offline. You might also want a way to transfer just the differences in server state, e.g. just the parts of the database that have changed recently.

Q: Is there a concern that writing the snapshot can take longer than the election timeout because of the amount of data that needs to be written?

A: You're right that it's a potential problem for a large server. For example, if you're replicating a database with a gigabyte of data, and your disk can only write at 100 megabytes per second, writing the snapshot will take ten seconds. One possibility is to write the snapshot in the background (i.e. arrange to not wait for the write, perhaps by doing the write from a child process), and to make sure that snapshots are created less often than once per ten seconds.

Q: Under what circumstances would a follower receive a snapshot that is a prefix of its own log?

A: The network can deliver messages out of order, and the RPC handling system can execute them out of order. So, for example, the leader might send a snapshot for log index 100, and then one for log index 110, but the network delivers the second one first.

Q: Additionally, if the follower receives a snapshot that is a prefix of its log, and then replaces the entries in its log up to that point, the entries after that point are ones that the leader is not aware of, right?

A: The follower might have log entries that are not in a received snapshot if the network delays delivery of the snapshot, or if the leader has sent out log entries but not yet committed them.

Q: Will those entries ever get committed?

A: They could get committed.

Q: How does the processing of an InstallSnapshot RPC handle reordering, when the check at step 6 references log entries that have been compacted? Specifically, shouldn't Figure 13 include something like:

  1.5: If lastIncludedIndex < commitIndex, return immediately.

or alternatively

  1.5: If there is already a snapshot and lastIncludedIndex < currentSnapshot.lastIncludedIndex, return immediately.

A: I agree -- for Lab 3B the InstallSnapshot RPC handler must reject stale snapshots. I don't know why Figure 13 doesn't include this test; perhaps the authors' RPC system is better about order than ours. Or perhaps the authors intend that we generalize step 6 in Figure 13 to cover this case.

Q: What happens when the leader sends me an InstallSnapshot command that is for a prefix of my log, but I've already undergone log compaction and my snapshot is ahead? Is it safe to assume that my snapshot that is further forward subsumes the smaller snapshot?

A: Yes, it is correct for the recipient to ignore an InstallSnapshot if the recipient is already ahead of that snapshot. This case can arise in Lab 3, for example if the RPC system delivers RPCs out of order.
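Here's a sketch of the kind of staleness check described above, which the lab's InstallSnapshot handler needs even though Figure 13 omits it. The type and field names are simplified stand-ins, not the lab's actual code.

    package main

    import "fmt"

    // installSnapshotArgs carries a simplified subset of the Figure 13 fields.
    type installSnapshotArgs struct {
        Term              int
        LastIncludedIndex int
        LastIncludedTerm  int
        Data              []byte
    }

    // raftState holds just the fields this check needs.
    type raftState struct {
        currentTerm       int
        commitIndex       int
        snapshotLastIndex int // lastIncludedIndex of the snapshot we already have
    }

    // acceptSnapshot rejects snapshots that arrive late (e.g. reordered by the
    // RPC system) and would move the server's state backwards.
    func (rf *raftState) acceptSnapshot(args *installSnapshotArgs) bool {
        if args.Term < rf.currentTerm {
            return false // from a stale leader
        }
        if args.LastIncludedIndex <= rf.snapshotLastIndex ||
            args.LastIncludedIndex <= rf.commitIndex {
            return false // we already have this state (or newer); ignore it
        }
        // Otherwise: install args.Data, discard log entries covered by the
        // snapshot, and keep any entries that follow it.
        rf.snapshotLastIndex = args.LastIncludedIndex
        return true
    }

    func main() {
        rf := &raftState{currentTerm: 4, commitIndex: 110, snapshotLastIndex: 110}
        stale := &installSnapshotArgs{Term: 4, LastIncludedIndex: 100}
        fmt.Println(rf.acceptSnapshot(stale)) // false: the index-110 snapshot already arrived
    }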
Q: How do leaders decide which servers are lagging and need to be sent a snapshot to install?

A: If a follower rejects an AppendEntries RPC for log index i1 due to rule #2 or #3 (under AppendEntries RPC in Figure 2), and the leader has discarded its log before i1, then the leader will send an InstallSnapshot rather than backing up nextIndex[].

Q: In actual practical use of Raft, how often are snapshots sent?

A: I have not seen an analysis of real-life Raft use. I imagine people using Raft would tune it so that snapshots were rarely needed (e.g. by having leaders keep lots of log entries with which to update lagging followers).

Q: Is InstallSnapshot atomic? If a server crashes after partially installing a snapshot, and the leader re-sends the InstallSnapshot RPC, is this idempotent like the RequestVote and AppendEntries RPCs?

A: The implementation of InstallSnapshot must be atomic. It's harmless for the leader to re-send a snapshot.

Q: Why is an offset needed to index into the data[] of an InstallSnapshot RPC? Is there data not related to the snapshot, or does it overlap previous/future chunks of the same snapshot? Thanks!

A: The complete snapshot may be sent in multiple RPCs, each containing a different part ("chunk") of the complete snapshot. The offset field indicates where this RPC's data should go in the complete snapshot.

Q: How does copy-on-write help with the performance issue of creating snapshots?

A: The basic idea is for the server to fork(), giving the child a complete copy of the in-memory state. If fork() really copied all the memory, and the state was large, this would be slow. But most operating systems don't copy all the memory in fork(); instead they mark the pages as "copy-on-write", and make them read-only in both parent and child. Then the operating system will see a page fault if either tries to write a page, and the operating system will only copy the page at that point. The net effect is usually that the child sees a copy of its parent process' memory at the time of the fork(), but with relatively little copying.

Q: What data compression schemes, such as VIZ, ZIP, Huffman encoding, etc. are most efficient for Raft snapshotting?

A: It depends on what data the service stores. If it stores images, for example, then maybe you'd want to compress them with JPEG. If you are thinking that each snapshot probably shares a lot of content with previous snapshots, then perhaps you'd want to use some kind of tree structure which can share nodes across versions.

Q: Does adding an entry to the log count as an executed operation?

A: No. A server should only execute an operation in a log entry after the leader has indicated that the log entry is committed. "Execute" means handing the operation to the service that's using Raft. In Lab 3, "execute" means that Raft gives the committed log entry to your key/value software, which applies the Put(key,value) or Get(key) to its table of key/value pairs.
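Here's a minimal sketch of that "execute only after commit" rule: Raft hands committed entries to the key/value layer, and only then are they applied to the table. The op type and channel here are simplified stand-ins for the lab's actual ApplyMsg machinery.

    package main

    import "fmt"

    // op is a client operation that was placed in the Raft log.
    type op struct {
        Kind  string // "Put" or "Get"
        Key   string
        Value string
    }

    // applyLoop executes operations only as they arrive on applyCh, i.e. only
    // after Raft has committed them; merely appending an entry to the log does
    // not execute it.
    func applyLoop(applyCh <-chan op, table map[string]string) {
        for o := range applyCh {
            switch o.Kind {
            case "Put":
                table[o.Key] = o.Value
            case "Get":
                // reply to the waiting client with table[o.Key] (omitted here)
            }
        }
    }

    func main() {
        applyCh := make(chan op, 1)
        table := map[string]string{}
        applyCh <- op{Kind: "Put", Key: "k1", Value: "v99"} // a committed entry delivered by Raft
        close(applyCh)
        applyLoop(applyCh, table)
        fmt.Println(table["k1"]) // v99
    }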
Q: According to the paper, a server disregards RequestVote RPCs when it thinks a current leader exists, but the moment it thinks a current leader doesn't exist, I thought it would try to start its own election. So in what case would it actually cast a vote for another server? I'm still confused: what does the paper mean when it says, at the end of Section 6, that a server should disregard a RequestVote RPC when it thinks a current leader exists? In what case would a server think a current leader doesn't exist but not have started its own election? Is it if the server hasn't yet gotten a heartbeat from the leader but its election timeout hasn't expired?

A: Each server waits for a randomly chosen election timeout; if it hasn't heard from the leader for that whole period, and no other server has started an election, then the server starts an election. Whichever server's election timer expires first is likely to get votes from most or all of the servers before any other server's timer expires, and thus is likely to win the election.

Suppose the heartbeat interval is 10 milliseconds (ms). The leader sends out heartbeats at times 10, 20, and 30. Suppose server S1 doesn't hear the heartbeat at time 30. S1's election timer goes off at time 35, and S1 sends out RequestVote RPCs. Suppose server S2 does hear the heartbeat at time 30, so it knows the leader was alive at that time. S2 will set its election timer to go off no sooner than time 40, since only a missing heartbeat indicates a possibly dead leader, and the next heartbeat won't come until time 40. When S2 hears S1's RequestVote at time 35, S2 can ignore the RequestVote, because S2 knows that it heard a heartbeat less than one heartbeat interval ago.

Q: I'm a little confused by the "how to roll back quickly" part of the Lecture 6 notes (and the corresponding part of the paper):

  paper outlines a scheme towards end of Section 5.3:
    if follower rejects, includes this in reply:
      the term of the conflicting entry
      the index of the first entry for conflicting term
    if leader knows about the conflicting term:
      move nextIndex[i] back to its last entry for the conflicting term
    else:
      move nextIndex[i] back to follower's first index

I think according to the paper, the leader should move nextIndex[i] to the index of the first entry for the conflicting term. What does the situation "if leader knows about the conflicting term" mean?

A: The paper's description of the algorithm is not complete, so we have to invent the details for ourselves. The notes have the version I invented; I don't know if it's what the authors had in mind. The specific problem with the paper's "index of the first entry for the conflicting term" is that the leader might not have any entries at all for the conflicting term. Thus my notes cover two cases -- if the leader knows about the conflicting term, and if it doesn't.

Q: What are the tradeoffs in network/performance of decreasing nextIndex by a factor of 2 at each mismatch, i.e. first by 1, then 2, 4, 8, and so on?

A: The leader will overshoot by up to a factor of two, and thus have to send more entries than needed. Of course the Figure 2 approach is also wasteful if one has to back up a lot. Best might be to implement something more precise, for example the optimization outlined towards the end of Section 5.3.
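Here's a small Go sketch of the leader-side rule from the notes quoted above (conflict term plus the follower's first index for that term). This is one way to fill in the details the paper leaves out, not necessarily what the authors intended; the field names are invented.

    package main

    import "fmt"

    // appendEntriesReply is extended with the rejection hints from the notes.
    type appendEntriesReply struct {
        Success       bool
        ConflictTerm  int // term of the conflicting entry in the follower's log
        ConflictIndex int // index of the follower's first entry for ConflictTerm
    }

    // lastIndexForTerm returns the index of the leader's last entry with the
    // given term, or -1 if the leader has no entries for that term at all.
    func lastIndexForTerm(logTerms []int, term int) int {
        for i := len(logTerms) - 1; i >= 0; i-- {
            if logTerms[i] == term {
                return i
            }
        }
        return -1
    }

    // nextIndexAfterReject: if the leader knows about the conflicting term,
    // back up nextIndex to its own last entry for that term; otherwise back up
    // to the follower's first index for that term.
    func nextIndexAfterReject(leaderLogTerms []int, reply appendEntriesReply) int {
        if i := lastIndexForTerm(leaderLogTerms, reply.ConflictTerm); i >= 0 {
            return i
        }
        return reply.ConflictIndex
    }

    func main() {
        leaderLogTerms := []int{1, 1, 4, 4} // term of each leader log entry, by index
        // Follower rejected: its conflicting entry has term 2, first at index 1.
        reply := appendEntriesReply{ConflictTerm: 2, ConflictIndex: 1}
        fmt.Println(nextIndexAfterReject(leaderLogTerms, reply)) // 1: leader has no term-2 entries
    }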
Q: Unrelatedly -- how does your experience teaching Raft and Paxos correspond to Section 9.1 of the paper? Do your experiences support their findings?

A: I was pretty happy with the 6.824 Paxos labs from a few years ago, and I'm pretty happy with the current Raft labs too. The Raft labs are more ambitious: unlike the Paxos labs, the Raft labs have a leader, persistence, and snapshots. I don't think we have any light to shed on the findings in Section 9.1; we didn't perform a side-by-side experiment on the students, and our Raft labs are noticeably more ambitious.

Q: What has been the impact of Raft, from the perspective of academic researchers in the field? Is it considered significant, inspiring, non-incremental work? Or is it more of "okay, this seems like a natural progression, and is a bit easier to teach, so let's teach this"?

A: The Raft paper does a better job than any paper I know of in explaining modern replicated state machine techniques. I think it has inspired lots of people to build their own replication implementations.

Q: The paper states that there are a fair number of implementations of Raft out in the wild. Have there been any improvement suggestions that would make sense to include in a revised version of the algorithm?

A: Here's an example: https://www.cl.cam.ac.uk/~ms705/pub/papers/2015-osr-raft.pdf