Paper Review of PigPaxos Consensus Protocol

Salem Alqahtani
5 min read · Jul 5, 2020

Paxos is the basic consensus protocol; it is well studied and defined in many papers. However, Paxos is not a scalable protocol: when the number of nodes in the cluster exceeds a certain point, its performance diminishes.

This bottleneck is caused by the high communication load on the leader node. To address it, PigPaxos decouples the communication process from the consensus process at the leader.

So, what exactly is PigPaxos?

PigPaxos is a round-based consensus protocol. The leader randomly selects nodes from acceptor groups to relay its messages to the rest of the acceptors, using a relay/aggregate-based message flow. In Figure 4, I marked where the relay nodes sit among the followers (I circled them in the figure). Each relay node sends to all nodes in its group and collects their replies synchronously, using a timeout mechanism. This limits the leader's communication to a few nodes, compared to all nodes in Paxos.
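To make the relay selection concrete, here is a minimal sketch in Go of how a leader might pick one random node from each acceptor group to act as the relay for a round. The names (`Group`, `pickRelays`) are my own illustration, not identifiers from the PigPaxos paper or its implementation.

```go
// Sketch: the leader picks one random relay per acceptor group each round.
package main

import (
	"fmt"
	"math/rand"
)

// Group is a static partition of follower node IDs.
type Group []int

// pickRelays chooses one random member of each group to act as the relay
// for this round; the leader then sends only to these nodes.
func pickRelays(groups []Group) []int {
	relays := make([]int, 0, len(groups))
	for _, g := range groups {
		relays = append(relays, g[rand.Intn(len(g))])
	}
	return relays
}

func main() {
	groups := []Group{{1, 2, 3, 4}, {5, 6, 7, 8}} // leader is node 0
	fmt.Println("relays for this round:", pickRelays(groups))
}
```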

PigPaxos keeps the Paxos failure model, where nodes fail only by crashing. However, the communication topology is different. In Paxos, the topology is a one-to-all communication pattern, while in PigPaxos it is a tree-based pattern that reduces the message-exchange bottleneck at the leader. This is a nice optimization for consensus protocols.

Instead of sending the user request to all of the followers, the leader transmits it to a small set of relay nodes (2 nodes in Figure 4), which propagate the message to the remaining followers. The relay nodes also act as aggregators: they collect the followers' responses and pass the combined result to the leader (a very straightforward communication logic).

As illustrated in Figure 4, there are three phases: Phase-1, Phase-2, and Phase-3. As long as the stable-leader assumption holds, PigPaxos only performs Phase-2 for each consensus instance, with Phase-3 messages piggybacked onto Phase-2.
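One way to picture the piggybacking is a Phase-2 (accept) message that also carries the commit notification for earlier slots. The sketch below is my own guess at such a message shape; the field names are assumptions, not the paper's wire format.

```go
// Sketch: a Phase-2 accept message with a piggybacked Phase-3 commit marker.
package main

import "fmt"

type Accept struct {
	Ballot     int    // leader's current ballot
	Slot       int    // consensus instance being decided
	Command    string // client command for this slot
	CommitUpTo int    // piggybacked Phase-3: slots up to this index are committed
}

func main() {
	msg := Accept{Ballot: 3, Slot: 42, Command: "set x=1", CommitUpTo: 41}
	fmt.Printf("%+v\n", msg)
}
```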

Phase-1 executes only once, and its result holds as long as the leader remains stable. The leader picks relay nodes and tries to lead them with its ballot. The relay nodes accept the leadership ballot only if it is larger than any ballot they have seen so far, which helps prevent two leaders being active at the same time. Also, rotating the relay nodes is important for load-balancing the communication burden across all followers and avoiding hotspots.
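The ballot check itself is the standard Paxos rule. Here is a minimal sketch of the follower side of Phase-1 under that rule; the type and method names are illustrative.

```go
// Sketch: a follower promises to a new leader only for a strictly larger ballot.
package main

import "fmt"

type Follower struct {
	promisedBallot int // highest ballot this node has promised so far
}

// OnPrepare returns true (a promise) only for a strictly larger ballot,
// which prevents two leaders from being accepted under the same ballot.
func (f *Follower) OnPrepare(ballot int) bool {
	if ballot > f.promisedBallot {
		f.promisedBallot = ballot
		return true
	}
	return false
}

func main() {
	f := &Follower{promisedBallot: 2}
	fmt.Println(f.OnPrepare(3)) // true: new leader with a higher ballot
	fmt.Println(f.OnPrepare(1)) // false: stale leader is rejected
}
```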

PigPaxos communication can be summarized as follows: upon receiving a message from the leader, a relay node processes it as a regular follower and resends it to the remaining nodes in its relay group. Upon receiving the message, those followers respond to the relay node as if they were responding directly to the leader. The relay node waits for the followers' responses and piggybacks them together into a single message. By default, a relay waits for all followers in its group to respond. To avoid a liveness violation at the relay nodes, PigPaxos employs a tight timeout: if some followers do not reply within the timeout, the relay stops waiting and replies to the leader with all responses collected so far.
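The relay-side flow described above can be sketched with a channel standing in for the real transport. Everything here (the `Ack` type, `relayRound`, the channel-based RPC) is my own simplification of that description, not the paper's code.

```go
// Sketch: a relay collects follower replies until everyone answers or a
// tight timeout fires, then returns the aggregated batch to the leader.
package main

import (
	"fmt"
	"time"
)

type Ack struct{ NodeID int }

// relayRound forwards to the group (omitted here) and aggregates replies
// until either every follower answers or the timeout expires.
func relayRound(group []int, replies <-chan Ack, timeout time.Duration) []Ack {
	collected := make([]Ack, 0, len(group))
	deadline := time.After(timeout)
	for len(collected) < len(group) {
		select {
		case ack := <-replies:
			collected = append(collected, ack)
		case <-deadline:
			return collected // reply to the leader with whatever arrived in time
		}
	}
	return collected
}

func main() {
	replies := make(chan Ack, 3)
	replies <- Ack{NodeID: 1}
	replies <- Ack{NodeID: 2} // node 3 never answers
	got := relayRound([]int{1, 2, 3}, replies, 50*time.Millisecond)
	fmt.Println("aggregated acks:", got)
}
```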

The PigPaxos leader has to receive replies from a majority of the relay nodes, and each relay node needs to hear from all followers in its group. If a relay node or a follower crashes, the timeout mechanism triggers to prevent a liveness violation.
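As a rough sketch, here is one plausible way the leader could tally the batched acks toward a cluster-wide majority; the counting rule and names are simplified by me relative to the paper.

```go
// Sketch: the leader counts the acks batched by the relays against a
// majority of the whole cluster (counting itself).
package main

import "fmt"

func haveQuorum(clusterSize int, relayReplies [][]int) bool {
	acks := 1 // the leader counts itself
	for _, group := range relayReplies {
		acks += len(group)
	}
	return acks >= clusterSize/2+1
}

func main() {
	// 9-node cluster: leader plus two relay groups of four followers each.
	replies := [][]int{{1, 2, 3}, {5}}  // some followers timed out
	fmt.Println(haveQuorum(9, replies)) // true: 1 + 3 + 1 = 5 >= 5
}
```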

A number of optimizations are implemented in PigPaxos, such as dynamic relay groups (the leader picks different relay nodes every time a new request enters the system), partial response collection, and improved reads.

Figure 8 illustrates the latency and throughput of the evaluated protocols in a 25-node cluster.

Figure 9 illustrates PigPaxos in a WAN deployment across geographic regions. Each region represents a separate PigPaxos relay group, with the leader node located in the Virginia region. In this setup, latency is dominated by cross-region distance, so the difference between Paxos and PigPaxos is not observable at low loads. Similarly to the local-area experiments, PigPaxos maintains low latency at much higher levels of throughput.

Figure 10 illustrates a 5-node cluster running Paxos and PigPaxos with 2 relay groups. PigPaxos scales to higher throughput than Paxos because it communicates with fewer nodes: PigPaxos talks to just two nodes, which is exactly how many followers Paxos needs to contact for a majority quorum, but Paxos still sends four messages in each round.

The next experiment runs PigPaxos and Paxos on a 25-node cluster, where PigPaxos uses 3 relay groups. To gauge performance and scalability with respect to payload size, the authors measured the maximum throughput of each system under a write-only workload generated by 150 clients running on 3 VMs.

Figure 12a shows the maximum throughput of PigPaxos and Paxos at payload sizes varying from 8 to 1280 bytes. While the two protocols operate at drastically different levels of performance, they exhibit a similar relative degradation as the payload size increases. Figure 12b illustrates the throughput normalized to the maximum observed value.

Finally, the paper does not mention what happens when the leader fails. My guess is that a new leader comes along with a new ballot number, and if the followers have already promised a previous leader to commit a value, the new leader learns that value and commits it. This is exactly what Paxos does when the leader crashes.
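For reference, the standard Paxos recovery rule I am guessing at looks like this: the new leader runs Phase-1 with a higher ballot and, if any promise reports a previously accepted value, it must re-propose the value attached to the highest accepted ballot instead of its own. The sketch below illustrates only that rule; names are mine.

```go
// Sketch: after Phase-1, the new leader adopts the value accepted under the
// highest ballot, falling back to its own value only if none was accepted.
package main

import "fmt"

type Promise struct {
	AcceptedBallot int // -1 if this node has accepted nothing yet
	AcceptedValue  string
}

func valueToPropose(promises []Promise, own string) string {
	best, value := -1, own
	for _, p := range promises {
		if p.AcceptedBallot > best {
			best, value = p.AcceptedBallot, p.AcceptedValue
		}
	}
	return value
}

func main() {
	promises := []Promise{{-1, ""}, {2, "set x=1"}, {1, "set y=2"}}
	fmt.Println(valueToPropose(promises, "set z=3")) // prints "set x=1"
}
```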
