Part 1: Paper Summary of CockroachDB: The Resilient Geo-Distributed SQL Database
In this first part post, I will summarize the CockroachDB system. In the following posts, I will go deep into the system’s architecture layers.
In the database world, there are many classes. Here I will just mention the relevant classes to the CockroachDB such as Relational, NoSQL, and NewSQL.
Many distributed database systems found that the relational model/class and the strong ACID semantics are very challenging [1, 2] to have in distributed databases. To overcome the such challenge, NoSQL is developed, a database class for improving system scalability. NoSQL supports more flexible data models and weaker consistency. The weaker consistency model is eventual consistency which makes no guarantees about the order of reading and writing operations. NoSQL systems compromise the consistency property for availability. An example of such systems includes key-value stores e.g, Redis.
Recently, a new distributed database class was developed called NewSQL. NewSQL class aims to restore the relational model and ACID semantics without sacrificing much scalability. NewSQL has drawn attention since Google introduced Spanner, the first NewSQL system. It was followed by a few database vendors, such as CockroachDB.
An Introduction to CockroachDB:
CockroachDB is a distributed SQL database built on top of a transactional and consistent key-value store. The primary design goals are supported for ACID transactions, horizontal scalability, high availability, and strong consistency. CockroachDB uses RAFT consensus for replication. With replication, CockroachDB achieved consistency and fault tolerance.
The Architecture Of CockroachDB:
- SQL Layer: This layer is a top layer that provides an interface to run SQL queries as well as the parser, optimizer, and SQL execution engine.
- Transactional Key: Value Store Layer: This is a distributed key-value store.
- Distributed Layer: Transaction is the core and fundamental part of the applications. The implementation of distributed transactions includes the monolithic logical key space ordered by key. The key space divides in ranges from 64 MiB. Ranges are ordered by a two-level index.
- Replication: This layer uses Raft to duplicate the ranges in three copies either in the same node or different nodes.
- Store: Each node contains one or more stores, and each store contains potentially many ranges. Every store is managed with RocksDB, an open-source storage engine from Facebook based on Google’s LevelDB.
Transactions of CockroachDB:
A SQL transaction starts at the gateway node for the SQL connection. The gateway node receives and responds to the client nodes. Clients connect to nearby gateway nodes for improving latency.
SQL requires that a response to the current operation must be returned before the next operation is issued. This comes with a performance cost. To improve system performance and employs parallel transactions, cockroachDB proposed two optimizations called write pipelining and parallel commits.
Write pipelining allows returning a result without waiting for the replication of the current operation. Parallel commits let the commit operation and the write pipeline replicate in parallel. Combining both optimizations, CockroachDB completes multiple SQL transactions in one round of replication. However, the coordinator should track all operations which may not be fully replicated.
The following figure is an illustration of unpipelined transactions.
The following figure is an illustration of pipelined transaction.
The following figure is an illustration of parallel commits.
CockroachDB is a multi-version concurrency control. CockroachDB keys do not store a single value, but rather store multiple timestamped versions of that value. New writes do not overwrite old values, but rather create a new version with a later timestamp.
Conflict types:
- Read-Write (RW) — The second operation overwrites a value that was read by the first operation.
- Write-Read (WR) — The second operation reads a value that was written by the first operation.
- Write-Write (WW) — The second operation overwrites a value that was written by the first operation.
Read operations on a key return the most recent version with a lower timestamp than the operation. CockroachDB does not form WR conflicts with later transactions because read operations will never read a value with a later timestamp.
On any read operation, the timestamp of that read operation is recorded in a node-local timestamp cache that returns the most recent timestamp at which the key was read. If the returned timestamp is greater than the operation timestamp, this indicates RW conflicts with a later timestamp and must be aborted and restarted with a later timestamp.
If a write operation attempts to write to a key, but that key already has a version with a later timestamp than the operation itself, allowing the operation would create a WW conflict with the later transaction and must be aborted and restarted with a later timestamp.