A Summary of Ziplog’s White Paper

Salem Alqahtani
Oct 20, 2020 · 9 min read

A log is a general abstract data structure. It is an ordered list in which data changes are captured as events and stored in the log. These changes are stored as a sequence of records R, each of a specific size. A record is appended to the end of the log, and reads proceed left-to-right. Each record is assigned a unique, sequential log number. The purpose of the log is to record what happened and when. The figure below illustrates an abstract view of the log structure. Read this for more details.

Log Structure: reference(1)
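
To make the abstraction concrete, here is a minimal append-only log sketch in Python; the class and method names are mine, chosen for illustration, not taken from any of the systems discussed here.

```python
class Log:
    """A minimal append-only log: records are appended at the end and
    each record receives a unique, monotonically increasing sequence number."""

    def __init__(self):
        self.records = []

    def append(self, record):
        """Append a record and return its log sequence number."""
        self.records.append(record)
        return len(self.records) - 1

    def read(self, seq):
        """Read the record stored at a given sequence number."""
        return self.records[seq]

    def read_from(self, seq):
        """Read left-to-right starting at a given sequence number."""
        return self.records[seq:]


log = Log()
first = log.append({"event": "user_created", "id": 42})
second = log.append({"event": "user_renamed", "id": 42})
assert log.read(first)["event"] == "user_created"
```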

Ziplog is a totally ordered shared log that achieves low latency and high throughput. It provides strong consistency through its total-order property, which is achieved using the insertAfter() API. Ziplog has low latency (two or three message delays) and high throughput thanks to sharding, which divides the log across a number of shards, each consisting of multiple machines. Ziplog provides three methods on the log: write, read, and subscribe.

In contrast to previous work such as Scalog, Corfu, and vCorfu, Ziplog needs no cross-shard coordination to keep its log linearizable. It does, however, coordinate across shards during reconfiguration or failure recovery.

Ziplog’s client library sends a record to a shard; the shard replicates the record, assigns it a global sequence number (gsn), and returns the gsn to the client library. The client library then returns a record identifier to the client.

The record identifier is the key mechanism for achieving consistency. A client calls insertAfter(R, previous record identifier) and receives the new record identifier assigned to R. This guarantees that R is ordered after the record named by the previous identifier.
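
To make the flow concrete, here is a toy sketch of how a client might chain insertAfter() calls; FakeZiplogClient is an in-memory stand-in I wrote for illustration, not Ziplog’s real client library.

```python
import itertools

class FakeZiplogClient:
    """Toy in-memory stand-in for the client library, used only to show how
    insertAfter() chains record identifiers (not Ziplog's real implementation)."""

    def __init__(self):
        self._next_id = itertools.count()
        self._log = {}

    def insert_after(self, record, prev_record_id):
        # A real shard would replicate the record and assign a gsn here;
        # this stub just hands out the next identifier in order.
        rid = next(self._next_id)
        self._log[rid] = (record, prev_record_id)
        return rid

client = FakeZiplogClient()
rid1 = client.insert_after(b"balance=100", prev_record_id=None)
rid2 = client.insert_after(b"balance=80", prev_record_id=rid1)   # ordered after rid1
rid3 = client.insert_after(b"balance=95", prev_record_id=rid2)   # ordered after rid2
```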

How does the client know the identifiers for all records? The answer is LogDB, which stores all the mappings from keys to record identifiers, together with commit and abort status.

In case of failure or reconfiguration, Ziplog requires cross-shard coordination, carried out by Paxos-based services. In the meantime, all insertAfter() operations on the affected shards are suspended, and subscribe() operations return only after the shards are back online.

Ziplog’s protocol provides three functionalities for adding records to the total order.

The Global Sequence Number Assignment algorithm lets each shard independently assign a gsn to each record.

This raises two challenges: first, the set of shards may change; second, different shards may see different rates of updates. To overcome these challenges, Ziplog introduces a membership management service (MMS), replicated for fault tolerance using Paxos.

The MMS provides each shard with up-to-date information about the set of shards that can be used to insert new records into the log. Ziplog then uses a sequence of epochs, where the length of each epoch is defined, at each shard, by the time interval between two successive broadcasts from the MMS.

Each broadcast starts a new epoch by sharing four items with all shards: the identifiers of the shards for the current epoch, the expected rate of updates for each shard in the current epoch, and the base and ending gsn of the epoch.
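
As a rough sketch, such a broadcast could be modeled as a small record like the one below; the field names are my own, not the paper’s.

```python
from dataclasses import dataclass

@dataclass
class EpochBroadcast:
    """Sketch of the four items an MMS broadcast carries for a new epoch.
    Field names are illustrative, not taken from the paper."""
    shard_ids: list[int]            # shards active in this epoch
    expected_rates: dict[int, int]  # expected records per epoch for each shard
    base_gsn: int                   # first gsn of the epoch
    ending_gsn: int                 # last gsn of the epoch
```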

An epoch changes when the set of shards that accept inserts changes, or to adjust the shards’ expected rates of updates.

Each shard can compute the expected arrival time of records at every other shard, and assign gsns to its records in a way that reflects that order, using shard identifiers to break ties (see Figure 2).

When the record arrival rates for the current epoch are accurate, the log can be filled without no-ops and without delays. If an expected record does not arrive within a certain timeout, the shard assigns the corresponding gsn to a no-op. If records arrive at a higher rate than expected, the shard simply assigns them its next gsns for this epoch. In the transition between two epochs, each shard creates a single special no-op token covering its remaining gsns in the epoch.
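
Building on the EpochBroadcast sketch above, the following simplified function shows one way a shard could precompute which gsns it expects to assign, interleaving shards by expected arrival time and breaking ties by shard identifier. This is my own simplification of the idea, not the paper’s algorithm.

```python
def gsn_schedule(broadcast):
    """Assign each gsn in the epoch to a shard by interleaving the shards'
    expected record arrivals, breaking ties by shard identifier.
    Simplified sketch: assumes arrivals are evenly spaced within the epoch."""
    epoch_len = broadcast.ending_gsn - broadcast.base_gsn + 1
    arrivals = []
    for shard in broadcast.shard_ids:
        rate = broadcast.expected_rates[shard]
        for k in range(rate):
            expected_time = k / rate          # expected arrival time within the epoch
            arrivals.append((expected_time, shard))
    arrivals.sort()                           # equal times are broken by shard identifier
    schedule = {}
    for offset, (_, shard) in enumerate(arrivals[:epoch_len]):
        schedule[broadcast.base_gsn + offset] = shard
    return schedule

# Example: shard 0 expects 2 records and shard 1 expects 4 in gsns 100..105.
b = EpochBroadcast(shard_ids=[0, 1], expected_rates={0: 2, 1: 4},
                   base_gsn=100, ending_gsn=105)
print(gsn_schedule(b))   # {100: 0, 101: 1, 102: 1, 103: 0, 104: 1, 105: 1}
```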

New epochs can also be triggered by reconfigurations. To add a new shard, the MMS lists it in its next broadcast, together with its expected rate of updates. When a shard is about to be removed, it notifies its clients, directing them to submit any future record to another shard. Once they have done so, the shard reports to the MMS an expected update rate of zero for the next epoch: this tells the MMS to remove the shard from its next broadcast. The shard will henceforth no longer accept new records, but will continue to be available for read operations.

In the replication protocol, each shard stores a subset of the records. A shard consists of a collection of storage servers, one acting as the primary. To insert a record R into the log, the client library selects a shard and broadcasts the record to all storage servers of the shard. Each storage server (i) speculatively assigns a global sequence number to the record, (ii) sends to the client library an acknowledgment containing the gsn it has independently calculated, and (iii) makes the record persistent in log position gsn. If the client library receives identical gsns from all storage servers, it can conclude that the record is ordered, and the insert operation completes after two message delays (Figure 3a).

Otherwise, the client continues waiting (Figure 3b). The primary, having sent its gsn (call it gsn_p) to the client, concurrently forwards the tuple (R, gsn_p) to the shard’s backups. When a backup receives this tuple from the primary, it checks whether the record it is currently holding in log position gsn_p is indeed R. If so, no further action is needed; if not, the backup stores R in log position gsn_p and sends an updated acknowledgment to the client that initiated R.
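
To summarize the two paths, here is a tiny sketch of the decision the client library makes once it has one speculative gsn per storage server; resolve_insert and the server names are illustrative, not from the paper.

```python
def resolve_insert(acks):
    """acks: mapping from server name to the gsn it speculatively assigned to R.
    Fast path: all servers agree, so the insert completes in two message delays.
    Slow path: the client keeps waiting for updated acknowledgments that match
    the primary's gsn (three message delays in total).
    Illustrative simplification, not the paper's exact logic."""
    gsns = set(acks.values())
    if len(gsns) == 1:
        return ("fast path", gsns.pop())
    return ("slow path: wait for backups to confirm the primary's gsn", None)

print(resolve_insert({"primary": 17, "backup1": 17, "backup2": 17}))
print(resolve_insert({"primary": 17, "backup1": 17, "backup2": 18}))
```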

Ziplog’s stable-records confirmation protocol allows the client library to retrieve a record from a single storage server, instead of requiring it to hear from all of them. To achieve this, each backup periodically sends to the primary the highest gsn of the longest prefix of consecutive records it has received from the primary and stored locally. The primary collects this information and periodically sends the backups the minimum among those values, which is the number of consecutive records on which all storage servers agree. When responding to a read request from the client library, the storage server indicates whether the record is stable; only if it is not does the client library need to get a matching response from all storage servers. Otherwise, a single response suffices.
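
Here is a small sketch of that stability computation, assuming the primary has collected each server’s highest contiguous gsn; the function and the values below are illustrative only.

```python
def stable_gsn(primary_prefix, backup_prefixes):
    """Each server reports the highest gsn of its longest prefix of consecutive
    records; every record up to the minimum of those values is stored on all
    servers and can therefore be read from any single one of them."""
    return min([primary_prefix] + list(backup_prefixes))

# The primary holds records up to gsn 120, the backups up to 118 and 119:
# everything up to gsn 118 is stable and safe to serve from one server.
print(stable_gsn(120, [118, 119]))   # -> 118
```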

Failure Recovery: a failure of a storage server inside a shard makes the insert operation unavailable. Ziplog uses failure handlers to deal with suspected storage-server failures. Like the membership management service, a failure handler is a logically centralized entity, replicated for fault tolerance using Paxos. Each failure handler manages failure recovery for a subset of shards. Failure recovery in Ziplog involves three phases: (i) failure suspicion; (ii) state acquisition; and (iii) state recovery.

Failure suspicion: using heartbeat messages, the storage servers inside a shard track each other’s availability. If a storage server fails, the other storage servers report the failure to the failure handler. At least one storage server will report the failure, because each shard tolerates at most f failures and has at least f+1 servers. The recovery process is triggered once the handler receives at least one failure report. A server that reports a failure stops accepting new records, and records accepted by the other servers after this point are not preserved.

State acquisition: each surviving storage server s shares with the failure handler the state of each of its log positions, which can be: filled and known to be stable, filled, or unfilled.
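
For illustration, these three states could be represented as follows; the enum and its names are mine, not the paper’s.

```python
from enum import Enum, auto

class PositionState(Enum):
    """State a surviving server reports for each log position during state
    acquisition (names are illustrative)."""
    FILLED_STABLE = auto()   # record present and known to be stable
    FILLED = auto()          # record present, stability not yet confirmed
    UNFILLED = auto()        # no record stored at this position
```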

LogDB: LogDB is a transactional key/value store that uses a shared log as the storage layer. Besides the shared log, LogDB includes a transaction coordinator and a client library. The shared log is used to store key/value mappings and also to record and order transactions.

The transaction coordinator stores a mapping from each key to the latest record identifier in the log and is responsible for deciding whether a transaction should commit or abort. The client library interacts with both the shared log and the transaction coordinator to submit transactions.

In each transaction, a client can interactively read and write keys using the client library. To read the value of a key, the client library first interacts with the transaction coordinator to retrieve the record identifier, then reads the shared log to retrieve the value. To update a key/value pair, the client library inserts a record in the log and stores the record identifier. At the end of the transaction, the client library writes to the log a commit record, which includes the read set and the write set for all keys involved in the transaction and their corresponding record identifiers.
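
Here is a rough sketch of that read/write/commit flow from the client library’s point of view. The class and the method names (append, lookup, decide) are hypothetical stand-ins I chose for illustration, not LogDB’s actual interface.

```python
class LogDBTransaction:
    """Illustrative sketch of a LogDB transaction: reads go through the
    transaction coordinator to find record identifiers, writes append records
    to the shared log, and commit appends a commit record carrying the
    transaction's read set and write set."""

    def __init__(self, log, coordinator):
        self.log = log                  # shared-log client; assumed to offer append()/read()
        self.coordinator = coordinator  # maps keys to latest record identifiers; decides commit/abort
        self.read_set = {}              # key -> record identifier observed by this transaction
        self.write_set = {}             # key -> record identifier written by this transaction

    def read(self, key):
        rid = self.coordinator.lookup(key)   # 1. ask the coordinator for the record identifier
        self.read_set[key] = rid
        return self.log.read(rid)            # 2. fetch the value from the shared log

    def write(self, key, value):
        rid = self.log.append(value)         # insert the new value as a log record
        self.write_set[key] = rid            # remember its record identifier

    def commit(self):
        # The commit record captures the read and write sets with their record
        # identifiers; the coordinator uses it to decide commit or abort.
        commit_record = {"reads": self.read_set, "writes": self.write_set}
        self.log.append(commit_record)
        return self.coordinator.decide(commit_record)
```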

The transaction coordinator stores its state in memory and is not replicated. If it fails, a new transaction coordinator can replay the log to reconstruct the state. To reduce this reconstruction time, the transaction coordinator could periodically store a snapshot and compact the shared log using techniques like the segment cleaner in log-structured file systems.

Evaluation: Write Latency

The write latency is the time from when a client invokes the insertAfter interface until it returns. The comparison is between Ziplog, Scalog, and NOPaxos. The experiment consists of six shards; Scalog is run with its aggregation layer disabled, and NOPaxos does not support shards.

Ziplog has two execution paths in the failure-free case: a fast path of two message delays (no contention) and a slow path of three message delays (when the primary resolves contention).

The fast path takes roughly 150 µs on average, and the slow path takes 220 µs. The slow-path latency is similar to the latency of NOPaxos.

Figure 4 presents the ratio of operations that complete on the fast path. The fast-path ratio decreases until either the shards are saturated or there are more than 16 clients per shard, at which point the ratio is 70%. The record size is 4 KB, and each shard reaches 16.8K records per second, bottlenecked by SSDs. InsertAfter calls are separated by 60 µs on average.

Figure 5 shows the write latency for one client (no contention) and 256 clients (significant contention). Ziplog achieves a write latency of 150 µs at low contention and 200 µs at high contention, while NOPaxos’s latency is between 220 µs and 240 µs. Scalog’s latency is around 900 µs.

Scalable Write Throughput

The number of shards that Ziplog can support depends on how frequently the membership management service broadcasts membership information. With a 10 ms broadcast interval, Ziplog can support about 4,000 shards, slightly more than the number of shards Scalog can support.

Reconfiguration and Failure Recovery

A shard notifies its clients to redirect to other shards before sending the removal request to the membership management service, so Ziplog’s throughput is unaffected by adding or removing shards. When a shard fails, Ziplog’s write throughput temporarily decreases (by one sixth, since one of the six shards is down) until the failure is detected and all clients connected to the failed shard are redirected to other shards. The failure timeout in this experiment is 1 s.

Impact on LogDB

This experiment measures throughput with a variable number of clients. It shows that LogDB on Ziplog achieves higher throughput than LogDB on Scalog. The difference is explained by the significantly reduced abort rate due to Ziplog’s lower latency.

End of the post.
