My Understanding of WPaxos

Salem Alqahtani
4 min readOct 6, 2020

Before WPaxos, most of us know what Paxos is and how it works. If you have not, please read this link. The single leader-paradigm in Paxos imposes limitations on the protocol scalability. In order to overcome Paxos limitations, WPaxos is a strong candidate to scale in WAN deployment despite many solutions have proposed so far such as leaderless protocol EPaxos.

WPaxos is a consensus protocol for geo-replication systems. WPaxos objectives are low latency, high throughput and availability in WAN setting comparing to EPaxos and others. WPaxos uses a modification version of flexible quorum to reduce the communication cost.

Each leader in WPaxos shares a dynamic ownership of the object space by using object stealing. Leaders steal the object from each other by performing phase(1) of the protocol and use phase(2) to commit locally until the object has stolen by other leader.

Here is my speculation of what object actually is. I guess the object is like an account number in a financial system. Key and value are part of the command that executes on the object. Key and value form the object space. Ballot number is an identification of the object to be lead by the leader who owns the ballot number. Slots are holding the state transitions history. Every object has a ballot number and many slots (a log). Every leader maintains an object and maintains a lineralizability in each object’s log. Log is a sequence of instances ordered by slot number where each slot has ballot number lead by leader.

Quorum in Paxos is different than the quorum in FPaxos. In Paxos, quorums members should intersect while in FPaxos quorum from different phases can intersect. Quorum intersection in WPaxos cannot be empty, |Q1+Q2|< N. f_z represents number of zone failure while f_n represents number of node failures inside a zone. WPaxos picks (f_n+1) in a zone from total number of nodes l and f_z out of Z to tolerate the failure. Q1 selected from (Z-f_z) zones and (f_z +1).

Every node in WPaxos acts as a leader for a subset of objects in the system. The protocol process requests for objects under different leaders concurrently. Every object has its own log. WPaxos is two phases protocol. Phase(1) is to steal the object and phase(2) is to commit the update-requests of the object. Below figure is the WPaxos common case message pattern exchange.

Object stealing: leader who receives a client request tries to steal the object to perform the client request. The leader has to check its cache for latest ballot number and increased the ballot number by 1. Then, leader performs phase(1). If the attempt succeeded (ballot number is higher than ballot number for existing leader). If old leader is not from a new leader q_1 quorum, the stealing object is still work because an intersection between q_1 and q_2.

There is a Ballot number for each object but not per leader. Every leader in the system has set of variables and list of commands. Each object executes a single command. Every leader has a set of objects own{}.

Prepare is a phase (1). A client sends a request to a leader in the closest zone to reduce communication cost. The request has a command and objects. Upon receiving, leader checks if the object is locally exist. If yes, the leader go to phase (2), otherwise, start the phase(1). Phase(1a) selects a larger ballot number and performs a Q1.

In Phase(1b), the node accepts Phase(1a) message if the sender has a higher ballot number than the receiver node has. Then, node sends back its acknowledgment.

Accept is a phase (2) that executes after phase(1) has finished or if there is no need to perform phase(1). Phase(2) forms Q2 from (F+1) closest zones. The leader leads the object if Q1 satisfied.

In Phase(2b), followers acknowledged the sender and replies with 2b message.

Last message is commit message. It has to be send by the leader if Q2 satisfied.

WPaxos adapts locality for client requests. Based on the most frequently access zone, the object transfer to local zone. Also, with less frequently used, the local zone redirect the request to remote zone.

In evaluation section. object space is only many keys, preloaded to the system say 1000. Every leader will hold a index of all keys and what they are leaders.

--

--