ZooKeeper FAQ

Q: Why are only update requests A-linearizable? Why not reads as well?

A: The authors want high total read throughput, so they want replicas to be able to satisfy client reads without involving the leader. A given replica may not know about a committed write (if it's not in the majority that the leader waited for), or may know about a write but not yet know whether it is committed. Thus a replica's state may lag behind the leader and the other replicas, and serving reads from replicas can yield data that doesn't reflect recent writes -- that is, reads can return stale results.

Q: How does linearizability differ from serializability?

A: The usual definition of serializability is much like linearizability, but without the requirement that operations respect real-time ordering. Have a look at this explanation: http://www.bailis.org/blog/linearizability-versus-serializability/

Section 2.3 of the ZooKeeper paper uses "serializable" to indicate that the system behaves as if writes (from all clients combined) were executed one by one in some order. The "FIFO client order" property means that reads occur at specific points in the order of writes, and that a given client's successive reads never move backwards in that order. One thing that's going on here is that the guarantees for writes and reads are different.

Q: What is pipelining?

A: There are two things going on here. First, the ZooKeeper leader (really the leader's Zab layer) batches together multiple client operations in order to send them efficiently over the network, and in order to write them efficiently to disk. For both network and disk, it's often far more efficient to send a batch of N small items all at once than it is to send or write them one at a time. This kind of batching is only effective if the leader sees many client requests at the same time, so it depends on there being lots of active clients.

The second aspect of pipelining is that ZooKeeper makes it easy for each client to keep many write requests outstanding at a time, by supporting asynchronous operations. From the client's point of view, it can send lots of write requests without having to wait for the responses (which arrive later, as notifications after the writes commit). From the leader's point of view, that client behavior gives the leader lots of requests to accumulate into big efficient batches.

A worry with pipelining is that operations that are in flight might be re-ordered, which would cause the problem the authors talk about in Section 2.3. If the leader has many write operations in flight followed by a write to ready, you don't want those operations to be re-ordered, because then other clients might observe ready before the preceding writes have been applied. To ensure that this cannot happen, ZooKeeper guarantees FIFO order for client operations; that is, a given client's operations are applied in the order in which the client issued them.
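Here is a minimal sketch of what client-side pipelining looks like. ZooKeeper's official clients (e.g., the Java client) provide asynchronous variants of the update calls; the Go interface and names below (AsyncConn, SetAsync, the /app paths) are made up for illustration only.

    package zkpipeline

    // AsyncConn is a hypothetical asynchronous ZooKeeper client interface.
    // SetAsync returns immediately; the result arrives later on the
    // returned channel, after the write commits.
    type AsyncConn interface {
        SetAsync(path string, data []byte) <-chan error
    }

    // PublishConfig pipelines many configuration writes and issues a write
    // to "ready" last. The relative order of the config writes doesn't
    // matter here; what matters is that ready is issued after all of them,
    // and FIFO client order means it is also applied after all of them.
    func PublishConfig(c AsyncConn, cfg map[string][]byte) error {
        var pending []<-chan error
        for path, data := range cfg {
            // Issue each write without waiting for the previous one.
            pending = append(pending, c.SetAsync("/app/config/"+path, data))
        }
        // Issue the "ready" write, still without waiting.
        pending = append(pending, c.SetAsync("/app/ready", []byte("1")))

        // Only now wait for the results of the whole pipeline.
        for _, ch := range pending {
            if err := <-ch; err != nil {
                return err
            }
        }
        return nil
    }

Because the client never waits in the middle of the loop, the leader sees a steady stream of requests that it can accumulate into large batches.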
Q: What does wait-free mean?

A: The precise definition: a wait-free implementation of a concurrent data object is one that guarantees that any process can complete any operation in a finite number of steps, regardless of the execution speeds of the other processes. This definition was introduced in the following paper by Herlihy: https://cs.brown.edu/~mph/Herlihy91/p124-herlihy.pdf

ZooKeeper is wait-free because it processes one client's requests without needing to wait for other clients to take action. This is partially a consequence of the API: despite being designed to support client/client coordination and synchronization, no ZooKeeper API call is defined in a way that would require one client to wait for another. In contrast, a system that supported a lock acquire operation that waited for the current lock holder to release the lock would not be wait-free.

Ultimately, however, ZooKeeper clients often need to wait for each other, and ZooKeeper does provide a waiting mechanism -- watches. The main effect of wait-freedom on the API is that watches are factored out from other operations. The combination of atomic test-and-set updates (e.g. file creation and writes conditioned on a version number) with watches allows clients to synthesize more complex blocking abstractions (e.g. Section 2.4's locks and barriers).
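As a concrete example of building a blocking abstraction out of create-if-absent plus watches, here is a sketch of Section 2.4's simple lock, written against the samuel/go-zookeeper client mentioned at the end of this FAQ. Treat it as an illustration of the recipe rather than production code: it has the herd effect the paper warns about, and error handling is minimal.

    package zklock

    import "github.com/samuel/go-zookeeper/zk"

    // Acquire implements the "simple lock" recipe: try to create an
    // ephemeral znode; if someone else holds it, set a watch and wait for
    // the znode to go away, then retry. All waiters wake on each release
    // (the herd effect described in Section 2.4).
    func Acquire(conn *zk.Conn, path string) error {
        for {
            // Atomic test-and-set: creation fails if the znode exists.
            _, err := conn.Create(path, nil, zk.FlagEphemeral, zk.WorldACL(zk.PermAll))
            if err == nil {
                return nil // we hold the lock
            }
            if err != zk.ErrNodeExists {
                return err
            }
            // Lock is held by someone else; watch for its deletion.
            exists, _, events, err := conn.ExistsW(path)
            if err != nil {
                return err
            }
            if !exists {
                continue // released between Create and ExistsW; retry
            }
            <-events // wait for the watch to fire, then retry
        }
    }

    // Release deletes the lock znode. Because the znode is ephemeral,
    // ZooKeeper also deletes it automatically if the holder's session dies.
    func Release(conn *zk.Conn, path string) error {
        return conn.Delete(path, -1)
    }

Note that ZooKeeper itself never blocks here: the waiting happens entirely in the client, on the watch notification.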
Q: How does the leader know the order in which a client wants a bunch of asynchronous updates to be performed?

A: The paper doesn't say. The answer is likely to involve the client numbering its asynchronous requests, and the leader tracking, for each client (really each session), what number it should expect next. The leader has to keep per-session state anyway (for client session timeouts), so it might take little extra work to also track a per-session request sequence number. This information would have to be preserved when a leader fails and another server takes over, so the client sequence numbers are likely passed along in the replicated log entries.

Q: What does a client do if it doesn't get a reply for a request? Does it re-send, in case the network lost a request or reply, or the leader crashed before committing? How does ZooKeeper avoid re-sends leading to duplicate executions?

A: The paper doesn't say how all this works. Probably the leader tracks which request numbers from each session it has received and committed, so that it can filter out duplicate requests. Lab 3 has a similar arrangement.

Q: If a client submits an asynchronous write, and immediately afterwards does a read, will the read see the effect of the write?

A: The paper doesn't explicitly say, but the implication of the "FIFO client order" property of Section 2.3 is that the read will see the write. That seems to imply that a server may block a read until it has received (from the leader) all of the client's preceding writes. ZooKeeper probably manages this by having the client send, in its read request, the zxid of the latest preceding operation that the client submitted.

Q: What is the reason for implementing 'fuzzy snapshots'?

A: A precise snapshot would correspond to a specific point in the log: the snapshot would include every write before that point and no writes after that point, and it would be clear exactly where to start replaying log entries after a reboot to bring the snapshot up to date. However, creating a precise snapshot requires a way to prevent any writes from happening while the snapshot is being created and written to disk. Blocking writes for the duration of snapshot creation would hurt performance a lot.

The point of ZooKeeper's fuzzy snapshots is that ZooKeeper creates the snapshot from its in-memory database while still allowing writes to the database. This means that a snapshot does not correspond to a particular point in the log -- a snapshot includes a more or less random subset of the writes that were concurrent with snapshot creation. After a reboot, ZooKeeper constructs a consistent snapshot by replaying all log entries from the point at which the snapshot started. Because updates in ZooKeeper are idempotent and delivered in the same order, the application state will be correct after reboot and replay -- some updates may be applied twice (once to the state captured in the snapshot and once during replay), but that is OK because they are idempotent. The replay fixes the fuzzy snapshot into a consistent snapshot of the application state.

The ZooKeeper leader turns the operations in the client API into idempotent transactions. For example, if a client issues a conditional setData and the version number in the request matches, the ZooKeeper leader creates a setDataTXN that contains the new data, the new version number, and updated time stamps. This transaction (TXN) is idempotent: ZooKeeper can execute it twice and it will result in the same state.
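To make the idempotence point concrete, here is a simplified sketch (not ZooKeeper's actual code or data structures) of the idea: the version check happens once, at the leader, and the resulting TXN records the absolute new state, so every replica -- and the replay after a fuzzy snapshot -- can apply it unconditionally, any number of times.

    package zktxn

    import "time"

    // Simplified in-memory znode; not ZooKeeper's real representation.
    type Znode struct {
        Data    []byte
        Version int32
        Mtime   time.Time
    }

    // Client-visible conditional request: "set the data only if the znode
    // is still at ExpectedVersion".
    type SetDataRequest struct {
        Path            string
        Data            []byte
        ExpectedVersion int32
    }

    // The idempotent transaction the leader generates. Note that it
    // carries the absolute new version, not "increment the version".
    type SetDataTXN struct {
        Path    string
        Data    []byte
        Version int32
        Mtime   time.Time
    }

    // At the leader: perform the version check once, against the leader's
    // view of the state, and emit a TXN describing the resulting state.
    func prepare(state map[string]*Znode, req SetDataRequest) (SetDataTXN, bool) {
        z, ok := state[req.Path]
        if !ok || z.Version != req.ExpectedVersion {
            return SetDataTXN{}, false // reject; no TXN is logged
        }
        return SetDataTXN{
            Path:    req.Path,
            Data:    req.Data,
            Version: req.ExpectedVersion + 1,
            Mtime:   time.Now(),
        }, true
    }

    // At every replica, and during log replay after a reboot: apply the
    // TXN unconditionally. Applying the same TXN twice writes the same
    // data, version, and mtime, so the final state is identical.
    func apply(state map[string]*Znode, txn SetDataTXN) {
        state[txn.Path] = &Znode{Data: txn.Data, Version: txn.Version, Mtime: txn.Mtime}
    }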
Q: How does ZooKeeper choose leaders?

A: ZooKeeper uses Zab, an atomic broadcast protocol, which has leader election built in, much like Raft. Here's a paper about Zab: http://dl.acm.org/citation.cfm?id=2056409

Q: How does ZooKeeper's performance compare to other systems such as Paxos?

A: It has impressive performance (in particular throughput); ZooKeeper would beat the pants off your implementation of Raft. Three ZooKeeper servers process 21,000 writes per second. Your Raft with three servers commits on the order of tens of operations per second (assuming a magnetic disk for storage), and maybe hundreds per second with SSDs.

Q: How does the ordering guarantee solve the race conditions in Section 2.3?

A: If a client issues many write operations to various znodes, and then a write to Ready, ZooKeeper guarantees that all of the writes will be applied to the znodes before the write to Ready. Thus, if another client observes Ready, all the preceding writes must already have been applied, and it is safe for that client to read the information in the znodes.

Q: How big is the ZooKeeper database? It seems like the server must have a lot of memory.

A: It depends on the application, and, unfortunately, the paper doesn't report the authors' experience in this area. Since ZooKeeper is intended for configuration and coordination, and not as a general-purpose data store, an in-memory database seems reasonable. For example, you could imagine using ZooKeeper in place of GFS's master; that amount of data should fit in the memory of a well-equipped server, as it did for GFS.

Q: What's a universal object?

A: It is a way of characterizing how powerful ZooKeeper's API is, based on a theory of concurrent objects that Herlihy introduced: https://cs.brown.edu/~mph/Herlihy91/p124-herlihy.pdf. We won't spend any time on this theory, but if you care, there is a gentle introduction on this Wikipedia page: https://en.wikipedia.org/wiki/Non-blocking_algorithm. The authors appeal to this concurrent-object theory in order to show that ZooKeeper's API is general-purpose: the API includes enough features to implement any coordination scheme you'd want.

Q: How does a client know when to leave a barrier (top of page 7)?

A: Leaving the barrier involves each client watching the znodes of all the other clients participating in the barrier. Each client waits for all of them to be gone. Once they are all gone, the client leaves the barrier and continues computing.

Q: Is it possible to add more servers into an existing ZooKeeper cluster without taking the service down for a period of time?

A: It is -- although when the original paper was published, cluster membership was static. Nowadays, ZooKeeper supports "dynamic reconfiguration": https://zookeeper.apache.org/doc/r3.5.3-beta/zookeeperReconfig.html ... and there is actually a paper describing the mechanism: https://www.usenix.org/system/files/conference/atc12/atc12-final74.pdf How do you think this compares to Raft's dynamic configuration change via overlapping consensus, which appeared two years later?

Q: How are watches implemented in the client library?

A: It depends on the implementation. In most cases, the client library probably registers a callback function that will be invoked when the watch triggers. For example, a Go client for ZooKeeper implements this by having "GetW()" (get with watch) return a channel; when the watch triggers, an "Event" structure is sent through the channel. The application can then check the channel in a select clause. See https://godoc.org/github.com/samuel/go-zookeeper/zk#Conn.GetW.
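For concreteness, here is a minimal example of that pattern using the samuel/go-zookeeper client linked above; the server address and znode path are made up, and error handling is abbreviated.

    package main

    import (
        "fmt"
        "time"

        "github.com/samuel/go-zookeeper/zk"
    )

    func main() {
        // Connect to a (hypothetical) local ZooKeeper server.
        conn, _, err := zk.Connect([]string{"127.0.0.1:2181"}, time.Second)
        if err != nil {
            panic(err)
        }
        defer conn.Close()

        // GetW reads the znode and leaves a watch; the returned channel
        // delivers a single Event when the watch triggers.
        data, _, events, err := conn.GetW("/app/config")
        if err != nil {
            panic(err)
        }
        fmt.Printf("current config: %s\n", data)

        // Wait for the watch to fire (or time out). A real application
        // would typically do this in a select loop and re-register the
        // watch with another GetW after each event, since watches are
        // one-shot.
        select {
        case ev := <-events:
            fmt.Printf("watch fired: type=%v path=%s\n", ev.Type, ev.Path)
        case <-time.After(30 * time.Second):
            fmt.Println("no change within 30s")
        }
    }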