Summary of Spanner: Google’s Globally-Distributed Database
In this post, I will explain Google Spanner: a globally distributed database built by Google with unique features including scalability, external consistency, fault tolerance, multi-versioned data, and a SQL-based data model. Currently, Spanner backs Google's advertising systems and other applications.
Why Google Spanner?
The main motivation for building Spanner was managing data replicated across datacenters. Relational databases work great on a single machine and provide ACID-compliant transactions, but as data grows they do not scale horizontally, and scaling is hard to achieve while keeping ACID guarantees. NoSQL systems instead adopt a simple data model that gives up some relational functionality (for example, no join statements and no strong consistency guarantee, only a weaker one called eventual consistency) in exchange for fault tolerance and low latency in a distributed database.
Developers who use NoSQL must take responsibility for many system-level design properties, including application correctness, given the weak consistency supported by the underlying database. So SQL-based and NoSQL-based systems each lack features that the other has. The figure below illustrates the main features.
Two common ways to horizontally scale relational databases are master-slave replication and sharding, but both models introduce issues:
- Hot shard problem: one shard is accessed far more than the others.
- Rehashing: tries to solve the hot shard problem, but the resulting data migration is expensive.
- Application-aware sharding: puts a lot of responsibility on the application for how the data is sharded.
- In the master-slave approach, keeping slaves consistent and scaling the single master node both pose issues.
So a relational database would not scale, and a NoSQL database would not provide the kind of consistency guarantees Google needed. Hence, Spanner was born.
Spanner Architecture
Spanner is a database that shards data across many sets of Paxos state machines located in datacenters all over the world. Its high-level architecture is based on two layers: a transaction layer and a replication layer. The client (a web browser) submits its request to the system's client (a web server) that hosts the client library; the web server then interacts with the Spanner database and its server nodes. For simplicity, I will not focus on the application layer. The figure below illustrates the storage (the Spanner database), the servers (spanservers), and how they are geographically replicated.
Later, I will go into detail about the Spanner server (spanserver) architecture. For now, just remember that each server/node serves client data. A split in the storage layer represents a set of rows from a database table that is replicated in different zones. Each spanserver is responsible for between 100 and 1000 instances of a data structure called a tablet, illustrated in the figure below. A tablet implements a bag of mappings of the form (key, timestamp) -> value: Spanner assigns timestamps to data, which makes it more of a multi-version database than a plain key-value store. A toy illustration of this mapping follows.
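To make the mapping concrete, here is a minimal sketch (in Python, with hypothetical names; not Spanner's actual code) of a tablet as a multi-version key-value store, where a read at a timestamp returns the latest value written at or before that timestamp:

```python
import bisect
from collections import defaultdict

class Tablet:
    """Toy multi-version store implementing (key, timestamp) -> value."""

    def __init__(self):
        # For each key, keep parallel lists of commit timestamps and values,
        # sorted by timestamp.
        self._timestamps = defaultdict(list)
        self._values = defaultdict(list)

    def write(self, key, timestamp, value):
        # Writes are assumed to arrive in increasing timestamp order per key.
        self._timestamps[key].append(timestamp)
        self._values[key].append(value)

    def read(self, key, timestamp):
        # Return the latest value whose commit timestamp is <= the read timestamp.
        i = bisect.bisect_right(self._timestamps[key], timestamp)
        return self._values[key][i - 1] if i > 0 else None


t = Tablet()
t.write("user:1", 10, "alice")
t.write("user:1", 20, "alice-v2")
print(t.read("user:1", 15))  # -> "alice": a snapshot read at timestamp 15
```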
Spanserver designs and techniques:
A Spanner database stores one or more relational tables. Tables are sharded into tablets, and each tablet's state is stored as checkpoints and a write-ahead log (WAL) on a distributed file system called Colossus. Normally, each tablet is managed by three tablet servers that replicate its state across zones. A set of tablet servers replicating one tablet is called a Paxos group. Inside the group, one server acts as the leader for some period of time and the rest are followers.
The leader handles client operations, grabbing locks to ensure isolation from concurrent transactions and appending log entries. Across tablets, the leaders of the replica groups use two-phase commit (2PC) to preserve atomicity: one group's leader acts as the coordinator and the others are participants. If the coordinator becomes unresponsive, plain 2PC would bring the system to a halt; in Spanner, however, 2PC is used in conjunction with Paxos, so the leader of each group is backed by many followers and any follower can become the new leader. The figure below illustrates the spanserver architecture.
Spanner has further design choices, such as the directory. A directory is the unit of data placement (geographic-replication properties) that manages replication and locality. When data is moved between Paxos groups, it is moved directory by directory; a 50 MB directory can be moved in a few seconds.
Spanner's data model also has a nice parent-child relationship called interleaving, which improves performance and storage locality. Interleaving lets Spanner colocate parent and child rows in storage. For instance, consider Customers and Invoices tables: if the application frequently fetches a customer together with all of its invoices, it makes sense to define Invoices as an interleaved child table of Customers. Spanner will then store each customer row together with its invoice rows. See the figure below for an illustration of interleaving, and the sketch after this paragraph for the idea at the key level.
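Conceptually, interleaving prefixes each child row's key with its parent's key, so sorting by key places a customer and its invoices next to each other and a single range scan fetches them all. A minimal sketch (Python, hypothetical table and key names, not Spanner's actual storage format):

```python
# Rows are stored sorted by key. Interleaving a child table under its parent
# means the child key starts with the parent key, so a range scan beginning at
# the parent key returns the parent row followed by all of its child rows.
rows = {
    ("Customers", 1): {"name": "Ada"},
    ("Customers", 1, "Invoices", 101): {"amount": 30},
    ("Customers", 1, "Invoices", 102): {"amount": 55},
    ("Customers", 2): {"name": "Bob"},
    ("Customers", 2, "Invoices", 201): {"amount": 12},
}

def fetch_customer_with_invoices(customer_id):
    # One contiguous scan over the sorted key space instead of a join
    # across separately stored tables.
    prefix = ("Customers", customer_id)
    return {k: v for k, v in sorted(rows.items()) if k[:2] == prefix}

print(fetch_customer_with_invoices(1))
```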
TrueTime:
Getting a global notion of real time in a distributed system is difficult. Clocks on individual machines drift over time, so we synchronize them periodically against a reference time using protocols such as the Network Time Protocol. However, the resulting error can be large and cannot be precisely bounded. To address this issue, Google introduced TrueTime. TrueTime relies on specialized hardware and tightly bounds the synchronization error of any machine in the datacenter relative to a set of reference clocks.
TrueTime architecture:
- Each datacenter has a set of GPS receivers and atomic clocks. These act as the reference clocks.
- Each machine has a regular clock and runs a time daemon, which polls multiple reference clocks from multiple datacenters and runs an algorithm to synchronize its own clock with theirs. Throughout, the synchronization error is precisely bounded.
TTinterval is an interval with bounded time uncertainty. The endpoints of a TTinterval are of type TTstamp. The TT.now() method returns a TTinterval that is guaranteed to contain the absolute time during which TT.now() was invoked.
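To show the shape of the API, here is a minimal TrueTime-like sketch in Python; the uncertainty bound epsilon_ms and the class/field names are assumptions for illustration, not the real implementation:

```python
import time
from dataclasses import dataclass

@dataclass
class TTinterval:
    earliest: float  # lower bound on absolute time (seconds)
    latest: float    # upper bound on absolute time (seconds)

class TrueTime:
    """Toy TrueTime: wraps the local clock in an uncertainty interval."""

    def __init__(self, epsilon_ms=7.0):
        # epsilon is the instantaneous error bound reported by the time daemon.
        self.epsilon = epsilon_ms / 1000.0

    def now(self):
        t = time.time()
        # The true absolute time is guaranteed to lie inside [earliest, latest].
        return TTinterval(earliest=t - self.epsilon, latest=t + self.epsilon)

    def after(self, t):
        # True if timestamp t has definitely passed.
        return self.now().earliest > t

    def before(self, t):
        # True if timestamp t has definitely not arrived yet.
        return self.now().latest < t


tt = TrueTime()
interval = tt.now()
print(interval.latest - interval.earliest)  # interval width = 2 * epsilon
```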
In Spanner, TrueTime is used to implement the concurrency-control mechanisms. With many transactions running concurrently, TrueTime is what makes it possible to guarantee strict serializability among all of them. Spanner supports read-write transactions (RW transactions), read-only transactions (RO transactions), and snapshot reads.
Standalone writes are implemented as RW transactions. Non-snapshot standalone reads are implemented as RO transactions. A snapshot read is a read in the past that executes without locking.
RW transactions contain read and write operations, provide ACID properties, and are strictly serializable (externally consistent). A read-write transaction executes atomically at a single point in time, using two-phase locking (2PL) for isolation and 2PC for atomicity.
RW Transaction details:
- The client directs all read operations to the leader of the relevant replica group. The leader acquires read locks on the rows and columns involved before serving the reads. Every read also returns the timestamp of the data it read.
- Write operations are buffered locally at the client until the transaction commits. While the transaction is open, the client sends keepalive messages to prevent participant leaders from timing out the transaction.
- After the client completes all reads and buffers all writes, it starts two-phase commit: it chooses one leader as the coordinator leader and sends a prepare request to all participant leaders.
- Every participant leader acquires the necessary write locks, chooses a prepare timestamp s_i that is larger than any timestamp it has assigned to previous transactions, and logs a prepare record in its replica group through Paxos. It then responds to the coordinator leader with its prepare timestamp.
- The coordinator leader logs and replicates the commit record. Upon success, it sends the commit decision and the commit timestamp to all participant leaders, and then replies to the client. (A simplified sketch of this flow appears below.)
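Below is a highly simplified sketch of this flow (Python, with invented class and function names; real Spanner replicates every step through Paxos within each group and handles failures, locking, and commit wait, which this sketch omits or only hints at):

```python
import time

class GroupLeader:
    """Hypothetical Paxos-group leader; Paxos replication itself is omitted."""

    def __init__(self, name):
        self.name = name
        self.last_timestamp = 0.0
        self.store = {}

    def prepare(self, writes):
        # Acquire write locks (omitted), choose a prepare timestamp at least as
        # large as any timestamp this group has handed out, and log a prepare record.
        self.last_timestamp = max(self.last_timestamp, time.time())
        return self.last_timestamp

    def commit(self, s_commit, writes):
        # Apply the buffered writes at the commit timestamp and release locks.
        for key, value in writes.items():
            self.store[(key, s_commit)] = value


def run_read_write_transaction(buffered_writes, coordinator, participants):
    # Phase 1: prepare on every leader taking part in the transaction.
    leaders = [coordinator] + participants
    prepare_timestamps = [leader.prepare(buffered_writes) for leader in leaders]

    # The coordinator picks a commit timestamp no smaller than any prepare
    # timestamp. (With TrueTime it must also be >= TT.now().latest, and the
    # coordinator performs "commit wait" -- see the timestamp rules below.)
    s_commit = max(prepare_timestamps)

    # Phase 2: commit on every leader, then reply to the client.
    for leader in leaders:
        leader.commit(s_commit, buffered_writes)
    return s_commit


leaders = [GroupLeader("g1"), GroupLeader("g2"), GroupLeader("g3")]
print(run_read_write_transaction({"user:1": "alice"}, leaders[0], leaders[1:]))
```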
How does Spanner assign timestamps to RW transactions?
Monotonicity invariant: within each Paxos group, Spanner assigns timestamps to Paxos writes in monotonically increasing order, even across leaders. Other invariants, such as disjointness of leader leases, help maintain this ordering across leaders. The external-consistency invariant guarantees that if a transaction commits before another transaction starts, its commit timestamp is smaller than the new transaction's timestamp.
There are two rules for assigning timestamps and executing writes:
- Start rule:
1. A write request arrives at the Paxos leader.
2. The leader sets the write timestamp s to TT.now().latest.
3. Example: w1 arrives at group 1 with TT = [20, 30], so s1 = 30; w2 arrives at group 2 with TT = [21, 31], so s2 = 31.
- Commit rule:
1. The leader delays until s is less than TT.now().earliest (commit wait), which guarantees that s is in the past compared to absolute time.
2. Example: for w1 (s1 = 30), the leader commit-waits until TT = [31, 41] and then replies to the client; for w2 (s2 = 31), it commit-waits until TT = [32, 42] and then replies to the client. (A minimal sketch of commit wait follows this list.)
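A minimal sketch of the two rules, reusing the toy TrueTime class from the earlier sketch (hypothetical names, not Spanner's code):

```python
import time  # TrueTime here is the toy class from the TrueTime sketch above

def assign_and_commit(tt, apply_write):
    # Start rule: the leader assigns the write timestamp s = TT.now().latest.
    s = tt.now().latest

    # Commit rule (commit wait): delay until TT.now().earliest > s, which
    # guarantees s is in the past in absolute time. In the example above,
    # s1 = 30 forces a wait until the interval has advanced to [31, 41].
    while not tt.after(s):
        time.sleep(0.0001)

    # Only now does the leader apply the write and reply to the client.
    apply_write(s)
    return s
```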
Serving reads at a timestamp:
- Safe time (t_safe): the maximum timestamp at which a replica is up to date. A replica can serve a read at timestamp t if t ≤ t_safe.
- t_safe = min(Paxos t_safe, TM t_safe). TM t_safe is bounded by transactions that are prepared but not yet committed at replica group g; if there is no such transaction, TM t_safe is infinity. Paxos t_safe is the timestamp of the highest-applied Paxos write at the replica. (See the small sketch below.)
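A tiny sketch of how a replica might decide whether it can serve a read (Python, hypothetical field names; the real formula also subtracts one from the earliest prepare timestamp):

```python
import math

def safe_time(highest_applied_paxos_write, prepared_but_uncommitted):
    # Paxos t_safe: timestamp of the highest-applied Paxos write at the replica.
    t_paxos = highest_applied_paxos_write
    # TM t_safe: bounded by prepared-but-uncommitted transactions; infinity if none.
    t_tm = min(prepared_but_uncommitted) if prepared_but_uncommitted else math.inf
    return min(t_paxos, t_tm)

def can_serve_read(t_read, highest_applied_paxos_write, prepared_but_uncommitted):
    return t_read <= safe_time(highest_applied_paxos_write, prepared_but_uncommitted)

print(can_serve_read(15, 20, []))    # True: the replica has applied writes up to 20
print(can_serve_read(15, 20, [12]))  # False: a transaction prepared at 12 is still pending
```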
Read-only Transactions:
- Allow the client to perform multiple reads at the same timestamp; these reads are guaranteed to be strictly serializable.
- Properties: no locks and no blocking of other transactions.
- Why? Any concurrent or future writes will update the data at a later timestamp.
- The leader assigns a timestamp (for external consistency), and then a snapshot read is performed on any sufficiently up-to-date replica.
- External consistency: the read should see all previously committed writes; S_read = TT.now().latest satisfies this.
- If the read involves only one Paxos group, S_read can instead be the timestamp of the last committed write in that group (LastTS()).
- If the read involves more than one Paxos group, there are options:
1. Do a round of communication to negotiate S_read based on the LastTS() of every involved group;
2. The simpler choice: the client has its read execute at S_read = TT.now().latest (which may have to wait for the safe time to advance).
- Once S_read is selected, the read at timestamp t_read can be served by any replica that is up to date, i.e. t_read ≤ t_safe. (A small sketch follows.)
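A sketch of the simple choice for serving a read-only transaction (Python; it reuses the toy TrueTime class from above, and the replica object with safe_time() and read() methods is hypothetical):

```python
import time

def read_only_transaction(keys, replica, tt):
    # Pick the read timestamp: TT.now().latest is guaranteed to be at least as
    # large as the commit timestamp of every transaction committed before this call.
    s_read = tt.now().latest

    # Wait until the chosen replica's safe time has advanced past s_read,
    # then serve the snapshot read without taking any locks.
    while replica.safe_time() < s_read:
        time.sleep(0.0001)
    return {key: replica.read(key, s_read) for key in keys}
```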
How Spanner handles deadlocks: 2PL can result in deadlocks, so Spanner uses a wound-wait scheme: Tx1 aborts a Tx2 that holds the desired lock only if Tx1 is older than Tx2; otherwise Tx1 simply waits. A small sketch follows.
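A minimal sketch of the wound-wait decision (Python; representing a transaction's age by its start timestamp is an assumption of this sketch):

```python
def on_lock_conflict(requester_start_ts, holder_start_ts):
    """Decide what a transaction does when it requests a lock someone else holds.

    Wound-wait: an older requester (smaller start timestamp) "wounds", i.e.
    aborts, the younger holder; a younger requester simply waits.
    """
    if requester_start_ts < holder_start_ts:
        return "wound: abort the holder, requester takes the lock"
    return "wait: requester blocks until the holder releases the lock"

print(on_lock_conflict(requester_start_ts=5, holder_start_ts=9))  # older requester wounds
print(on_lock_conflict(requester_start_ts=9, holder_start_ts=5))  # younger requester waits
```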
Standalone Read:
A strong read is guaranteed to see all data committed up until the start of the read. A stale read reads at a timestamp in the past.
For performance results, please take a look at the paper; it has a very nice explanation of them.
To sum up, Spanner offers cross-datacenter replication and a semi-relational data model. Transactions commit across groups that are replicated with the Paxos algorithm, and the TrueTime API, which relies on GPS receivers and atomic clocks, underpins concurrency control in Spanner.