Paper Review of Compartmentalized Consensus

Salem Alqahtani
10 min read · Jul 8, 2020

In state machine replication systems, designers try to improve throughput by introducing additional techniques. These techniques are not helpful all the time: some increase performance in certain cases, while others hurt the system in other cases.

This paper introduces compartmentalization, which revolves around decoupling and scaling: decoupling divides a component into a number of subcomponents, and scaling then increases the number of instances of those subcomponents.

There are three heuristics in the compartmentalized consensus approach:

Heuristic 1: Decouple control flow from data flow.

Heuristic 2: Decouple unscalable nodes from scalable nodes.

Heuristic 3: Process independent commands independently.

These compartmentalization techniques are applied to MultiPaxos, Mencius, and S-Paxos.

Step 1: Proxy Leaders

The paper introduces a set of at least f + 1 proxy leaders, as shown in Figure 4. The leader is responsible for sequencing commands, while the proxy leaders are responsible for getting commands chosen and broadcasting the chosen commands to the replicas. A similar technique was already introduced in PigPaxos, where the corresponding nodes are called relay nodes.

Upon receiving a Phase 2a message, a proxy leader broadcasts it to the acceptors, gathers a quorum of (f + 1) Phase 2b responses, and notifies the replicas of the chosen value. Without proxy leaders, the leader processes 3f + 4 messages per command, but with proxy leaders, the leader only processes 2. This makes the leader significantly less of a throughput bottleneck.
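To make the message flow concrete, here is a minimal, illustrative sketch of a proxy leader. The class and method names (send_phase2a, deliver_chosen) are assumptions made for this sketch, not the paper's API.

```python
from dataclasses import dataclass

@dataclass
class Phase2a:
    slot: int
    value: str

class ProxyLeader:
    """Sketch: relay Phase 2a messages to the acceptors, collect a quorum
    of f + 1 Phase 2b votes, and notify the replicas of the chosen value."""

    def __init__(self, acceptors, replicas, f):
        self.acceptors = acceptors
        self.replicas = replicas
        self.f = f
        self.votes = {}  # slot -> set of acceptor ids that voted

    def on_phase2a(self, msg: Phase2a):
        # Broadcast the leader's Phase 2a message to all acceptors.
        for acceptor in self.acceptors:
            acceptor.send_phase2a(msg, reply_to=self)

    def on_phase2b(self, acceptor_id, msg: Phase2a):
        # Record the vote; once f + 1 acceptors have voted, the value is chosen.
        voted = self.votes.setdefault(msg.slot, set())
        voted.add(acceptor_id)
        if len(voted) == self.f + 1:
            for replica in self.replicas:
                replica.deliver_chosen(msg.slot, msg.value)
```

The leader only sends one Phase 2a message per command to a proxy leader and receives nothing back for it, which is where the drop from 3f + 4 to 2 messages comes from.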

Step 2: Multiple Acceptor Groups

The acceptors used to choose the first command can be disjoint from the acceptors used to choose the second. The paper introduces multiple acceptor groups, with each acceptor group consisting of 2f + 1 acceptors. This is illustrated in Figure 5. Log entries are round-robin partitioned among the acceptor groups: given n acceptor groups, when a proxy leader receives a Phase 2a message for slot s, it contacts acceptor group i where i ≡ s (mod n).
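The slot-to-group mapping is plain round-robin partitioning. A tiny sketch, assuming groups are numbered 0 through n − 1:

```python
def acceptor_group_for_slot(slot: int, num_groups: int) -> int:
    """Round-robin partition log slots among acceptor groups:
    slot s is handled by acceptor group s mod n."""
    return slot % num_groups

# With n = 3 acceptor groups, consecutive slots rotate through the groups:
# slots 0, 3, 6, ... -> group 0; slots 1, 4, 7, ... -> group 1; and so on.
assert [acceptor_group_for_slot(s, 3) for s in range(6)] == [0, 1, 2, 0, 1, 2]
```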

Step 3: Scaling Replicas

Certain aspects of the replicas do not scale. For example, every replica must receive and execute every state machine command. This is unavoidable, and adding more replicas does not reduce this overhead. However, recall that for every state machine command, only one of the replicas has to send the result of executing the command back to the client. Thus, with n replicas, every replica only has to send back results for (1/n) of the commands. If we increase the number of replicas, we reduce the number of messages that each replica has to send. This reduces the load on the replicas and helps prevent them from becoming a throughput bottleneck. This is illustrated in Figure 6.
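A minimal sketch of this idea, assuming the responder for slot s is the replica with index s mod n; the toy key-value state machine and method names are illustrative, not the paper's implementation.

```python
def responding_replica(slot: int, num_replicas: int) -> int:
    """One replica replies per command; here the replica whose index equals
    slot mod n is the designated responder, so each replica answers roughly
    1/n of the commands."""
    return slot % num_replicas

class Replica:
    def __init__(self, index: int, num_replicas: int):
        self.index = index
        self.num_replicas = num_replicas
        self.state_machine = {}  # toy key-value state machine

    def execute(self, slot: int, command, client):
        key, value = command
        self.state_machine[key] = value  # every replica executes every command
        if responding_replica(slot, self.num_replicas) == self.index:
            client.reply(slot, value)    # but only one replica replies
```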

Compartmentalized MultiPaxos can process 200,000 commands per second, 8× more than MultiPaxos without compartmentalization.

Step 4: Batching

Batching is a well-known technique for increasing throughput by amortizing the communication and computation cost of processing a command. Without batching, a proxy leader has to process at least 3f + 4 messages per command. With batching, however, a proxy leader has to process 3f + 4 messages per batch.

Step 5: Batchers

To process a single batch of n commands, the leader has to receive n messages and send one message. Ideally, it would only have to receive one message and send one message. This makes the leader a potential throughput bottleneck. To remove the bottleneck, the paper applies Heuristic 1 and Heuristic 2 and decouples the leader.

The leader has two responsibilities: forming batches and sequencing batches. We decouple the two responsibilities by introducing a set of at least f + 1 batchers, as illustrated in Figure 8. The batchers are responsible for receiving commands from clients and forming batches, while the leader is responsible for sequencing batches.

A client sends the command to a randomly selected batcher. After receiving sufficiently many commands from the clients (or after a timeout expires), a batcher places the commands in a batch and forwards it to the leader. When the leader receives a batch of commands, it assigns it a log entry, forms a Phase 2a message, and sends the Phase 2a message to a proxy leader. Without batchers, the leader has to receive n messages per batch of n commands. With batchers, the leader only has to receive one. This either reduces the load on the bottleneck leader or eliminates it as a bottleneck completely.
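A rough sketch of such a batcher follows, with illustrative size and timeout thresholds; in a real implementation the timeout would fire on its own timer rather than only being checked when a command arrives.

```python
import time

class Batcher:
    """Sketch: accumulate client commands and forward a batch to the
    leader once it is large enough or a timeout expires."""

    def __init__(self, leader, max_batch_size=100, timeout_s=0.005):
        self.leader = leader
        self.max_batch_size = max_batch_size
        self.timeout_s = timeout_s
        self.batch = []
        self.batch_started = None

    def on_client_command(self, command):
        if not self.batch:
            self.batch_started = time.monotonic()
        self.batch.append(command)
        if len(self.batch) >= self.max_batch_size or self._timed_out():
            self._flush()

    def _timed_out(self):
        return time.monotonic() - self.batch_started >= self.timeout_s

    def _flush(self):
        # The leader now receives one message per batch instead of one per command.
        self.leader.propose_batch(self.batch)
        self.batch = []
```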

Step 6: Unbatchers

After executing a batch of n commands, a replica has to send n messages back to the n clients. Thus, the replicas (like the leader without batchers) suffer communication overheads linear in the number of commands rather than the number of batches. To solve this, the paper applies the heuristics and decouples the replicas, as illustrated in Figure 9.

The replicas are responsible for executing commands, while the unbatchers are responsible for sending the results of executing the commands back to the clients. With batching, batchers, and unbatchers, Compartmentalized MultiPaxos is able to process 900,000 commands per second.
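A minimal sketch of this replica/unbatcher split, again with a toy key-value state machine and illustrative method names:

```python
class Unbatcher:
    """Sketch: fan the results of an executed batch back out to the clients."""

    def deliver(self, results):
        # results: list of (client, result) pairs produced by a replica
        for client, result in results:
            client.reply(result)

class BatchReplica:
    """Executes a batch of commands, then hands the per-command results
    to an unbatcher instead of replying to n clients itself."""

    def __init__(self, unbatcher):
        self.unbatcher = unbatcher
        self.state_machine = {}

    def execute_batch(self, batch):
        results = []
        for client, (key, value) in batch:
            self.state_machine[key] = value
            results.append((client, value))
        # One message to the unbatcher, rather than n messages to n clients.
        self.unbatcher.deliver(results)
```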

Mencius without Compartmentalization

Mencius is a MultiPaxos variant that attempts to eliminate the single-leader bottleneck by using more than one leader. Rather than having a single leader sequence all commands in the log, Mencius round-robin partitions the log among multiple leaders. When a client wants to propose a state machine command x, it sends x to any of the servers. Upon receiving command x, a server s1 plays the role of a leader: it assigns x to a slot i and sends a Phase 2a message containing x and i to the other servers. Upon receiving a Phase 2a message, each other server plays the role of an acceptor and replies with a Phase 2b message.
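A sketch of a Mencius server's dual role under this round-robin slot partitioning; message handling is reduced to returning tuples purely for illustration.

```python
class MenciusServer:
    """Sketch: each server is a leader for the log slots it owns
    (round-robin partitioned by server index) and an acceptor for the
    slots owned by the other servers."""

    def __init__(self, index, num_servers):
        self.index = index
        self.num_servers = num_servers
        self.next_owned_slot = index  # slots index, index + n, index + 2n, ...

    def owns(self, slot):
        return slot % self.num_servers == self.index

    def on_client_command(self, command):
        # Leader role: assign the command to our next owned slot and
        # send a Phase 2a message for it to the other servers.
        slot = self.next_owned_slot
        self.next_owned_slot += self.num_servers
        return ("Phase2a", slot, command)

    def on_phase2a(self, slot, command):
        # Acceptor role for a slot owned by another server.
        assert not self.owns(slot)
        return ("Phase2b", slot, command)
```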

Mencius with Compartmentalization

In this design, every server plays every role and is therefore heavily loaded. We can solve this problem by decoupling the servers: instead of deploying a set of heavily loaded servers, we view Mencius as a MultiPaxos variant and deploy it as a set of proposers, a set of acceptors, and a set of replicas. This is illustrated in Figure 13. Now, Mencius is equivalent to MultiPaxos with the following key differences. First, every proposer is a leader, with the log round-robin partitioned among all the proposers. If a client wants to propose a command, it can send it to any of the proposers. Second, the proposers periodically broadcast their next available slots to one another. Every proposer uses this information to gauge whether it is lagging behind; if it is, it chooses noops in its vacant slots.
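A sketch of the gap-filling rule described above, assuming each proposer learns the peers' next available slots from the periodic broadcasts; the exact policy in Mencius differs, this only illustrates the idea.

```python
def noop_slots_to_fill(my_next_slot, peer_next_slots, my_index, num_servers):
    """Sketch: if peers have advanced far past our next available slot,
    propose noops in our vacant owned slots up to the highest slot any
    peer has reached, so the log has no holes that block execution."""
    horizon = max(peer_next_slots, default=my_next_slot)
    return [s for s in range(my_next_slot, horizon)
            if s % num_servers == my_index]

# Example: proposer 0 of 3 is at slot 0 while peers have reached slot 9,
# so it fills its owned slots 0, 3, and 6 with noops.
assert noop_slots_to_fill(0, [9, 8], 0, 3) == [0, 3, 6]
```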

By applying the heuristics to compartmentalize Mencius, the paper leverages proxy leaders, multiple acceptor groups, and more replicas. This is illustrated in Figure 14.

To support batching, the paper introduces batchers and unbatchers, as in MultiPaxos. Without compartmentalization, Mencius can process 30,000 commands per second without batching and 200,000 with batching. Compartmentalized Mencius can process 250,000 commands per second without batching and 850,000 commands per second with batching.

S-Paxos

S-Paxos decouples command dissemination from command sequencing — separating control from data flow — and distributes command dissemination across all nodes. More concretely, an S-Paxos deployment that tolerates f faults consists of 2f + 1 servers, as illustrated in Figure 15. Every server plays the role of a MultiPaxos proposer, acceptor, and replica. It also plays the role of a disseminator and stabilizer, two roles that will become clear momentarily.
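A rough sketch of this data-path/control-path split, assuming a command is considered stable once f + 1 servers have stored it; the store/ack interfaces and the uuid-based command ids are assumptions for the sketch, not S-Paxos' actual API.

```python
import uuid

class SPaxosServer:
    """Sketch of the S-Paxos split between the data path (disseminating
    command payloads) and the control path (ordering only command ids)."""

    def __init__(self, servers, f):
        self.servers = servers
        self.f = f
        self.payloads = {}   # command id -> command payload
        self.acks = {}       # command id -> set of server ids that stored it

    def on_client_command(self, command):
        # Disseminator role: assign an id and broadcast the payload.
        cmd_id = str(uuid.uuid4())
        for server in self.servers:
            server.store(cmd_id, command, ack_to=self)
        return cmd_id

    def on_ack(self, cmd_id, server_id):
        # Stabilizer role: the command is stable once f + 1 servers have it.
        acked = self.acks.setdefault(cmd_id, set())
        acked.add(server_id)
        if len(acked) == self.f + 1:
            # Only the small command id enters the MultiPaxos ordering layer.
            self.propose_id_for_ordering(cmd_id)

    def propose_id_for_ordering(self, cmd_id):
        pass  # hand the id to the colocated MultiPaxos proposer
```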

Though S-Paxos relieves the MultiPaxos leader of its duty to broadcast commands, the leader still has to broadcast command ids. In other words, the leader is no longer a bottleneck on the data path but is still a bottleneck on the control path. Moreover, disseminators and stabilizers are potential bottlenecks. We can resolve these issues by compartmentalizing S-Paxos similar to how we compartmentalized MultiPaxos.

The paper introduces proxy leaders, multiple acceptor groups, and more replicas. This is illustrated in Figure 17.

Discussion

The paper measures the throughput and median latency of the three protocols under workloads with 1, 10, 50, 100, 300, 600, 1000, 2000, and 4000 clients. Each client issues state machine commands in a closed loop: it waits to receive the result of executing its most recently proposed command before it issues another.
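A closed-loop client can be sketched as follows, where propose is a blocking call that returns only after the command's result arrives; the function and command names are illustrative.

```python
import time

def closed_loop_client(propose, num_commands):
    """Closed-loop workload generator: wait for the result of the most
    recently proposed command before issuing the next one."""
    latencies = []
    for i in range(num_commands):
        start = time.monotonic()
        propose(("set", f"key-{i}", i))   # blocks until the result arrives
        latencies.append(time.monotonic() - start)
    return latencies
```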

The results of the experiment are shown in Figure 18. In Figure 18a, we see the median latency and throughput of the three protocols without batching. MultiPaxos is able to process at most 25,000 commands per second. Compartmentalized MultiPaxos is able to process roughly 200,000 commands per second, an 8× throughput improvement. Moreover, at peak throughput, MultiPaxos’ median latency is three times that of Compartmentalized MultiPaxos.

The median latency and throughput of the three protocols with batching are shown in Figure 18b. With 4,000 clients, MultiPaxos, Compartmentalized MultiPaxos, and the unreplicated state machine achieve 200,000, 900,000, and 1,100,000 commands per second respectively. With batching, Compartmentalized MultiPaxos comes closer to matching the throughput of the unreplicated state machine since batching amortizes the overheads of the leader.

Figure 18c shows the median latency of the three protocols when subjected to load generated by 1, 10, and 100 clients. We refer to the number of network delays that a client must wait between proposing a command x and receiving the result of executing x as the commit delay. Referring to Figure 6, we see that Compartmentalized MultiPaxos has a commit delay of six, while MultiPaxos has a commit delay of only four. With only one client, MultiPaxos' smaller commit delay translates directly into lower latency.

MultiPaxos achieves a median latency of 0.27 milliseconds compared to Compartmentalized MultiPaxos’ 0.38 milliseconds. However, with fast networks and moderate to heavy loads, queueing times (rather than network traversals) become the determining factor of commit latency.

With as few as 10 or 100 clients, Compartmentalized MultiPaxos is able to achieve lower latency than MultiPaxos. We note though that this result is particular to our deployment within a single data center. For geo-replicated protocols deployed on the WAN, commit delay is the determining factor of commit latency. Compartmentalized protocols are not a good fit for this scenario. Also, note that the primary goal of this experiment is to measure the relative speedups that compartmentalization provides. While the absolute performance numbers that we achieve are important, they are not our primary focus.

Mencius Latency-Throughput

Compartmentalized Mencius uses the same number of machines as Compartmentalized MultiPaxos, except it uses three proposers instead of two. The results are shown in Figure 19. Without batching, Mencius can process roughly 30,000 commands per second. Compartmentalized Mencius can process roughly 250,000 commands per second, an 8.3× improvement.

Compartmentalized Mencius outperforms Compartmentalized MultiPaxos and comes close to matching the performance of the unreplicated state machine by avoiding the single leader bottleneck. With batching, Mencius and Compartmentalized Mencius achieve peak throughputs of nearly 200,000 and 850,000 commands per second respectively, a 4.25× improvement.

The latencies reported in Figure 18c confirm that Compartmentalized Mencius has higher latency than Mencius under low load but lower latency under moderate to heavy load.

S-Paxos Latency-Throughput

Without batching, Compartmentalized S-Paxos achieves a peak throughput of 180,000 commands per second compared to S-Paxos' throughput of 22,000 (an 8.2× improvement). With batching, Compartmentalized S-Paxos achieves a peak throughput of 750,000 commands per second compared to S-Paxos' throughput of 180,000 (a 4.16× improvement).

Note that our implementation of S-Paxos is not as optimized as our other two implementations, so its absolute performance is lower. As noted above though, demonstrating absolute performance is a secondary goal to demonstrating relative speedups.

Experiment Description

The previous experiments confirm that compartmentalization can increase the throughput of a state machine protocol by as much as 8.3× without batching and by as much as 4.5× with batching.

Results

The unbatched ablation study results are shown in Figure 20a. MultiPaxos achieves a throughput of roughly 25,000 commands per second. If we decouple the protocol (separating the proposers, acceptors, and replicas) and introduce proxy leaders, we achieve a throughput of roughly 55,000 commands per second. This decoupled MultiPaxos uses the bare minimum number of proposers (2), proxy leaders (2), acceptors (3), and replicas (2).

In summary, the paper applies these compartmentalization techniques to three existing protocols, MultiPaxos, Mencius, and S-Paxos, increasing their throughput by as much as 8.3×.
