The Second Part of Concurrency Control Summary
The transaction is the execution of a sequence of one or more operations e.g. SQL queries on a database to perform some high-level function. The transaction is the basic unit of change in a database system that should execute completely to satisfy atomicity. As you may know, atomicity is one of the transaction ACID properties which means all or none of all operations in the transaction to commit or abort.
To execute the transaction on the database, the database can schedule the received transaction one by one based on who arrived first. We call this schedule a serial schedule. With today's load, the databases need to handle concurrent transactions for better utilization of resources, more throughput, and less latency. With that being said, concurrent execution should maintain the fairness and correctness properties of the database systems.
To maintain the correctness of the current transactions in the database, we need to guarantee the isolation property that deals with interleaving transactions. Specifically, we need concurrency control mechanisms that decide the proper interleaving of operations from multiple transactions.
Important definitions:
The schedule is correct if it is equivalent to some serial execution. A serial schedule is a schedule that does not interleave the actions of different transactions. A serializable schedule is a schedule that is equivalent to some serial execution of the transactions.
The below figure is an example of serial schedules and an interleaving schedule.
In some schedules, there are conflict operations that break the guarantee of serializability. We say that the two operations are in conflict if: 1) They are by different transactions. 2) They are on the same object and at least one of them is a write. There are three types of conflict operations:
Read-Write Conflict: unrepeatable read violates the isolation property.
Write-Read Conflict: read uncommitted data(dirty read).
Write-Write Conflict: overwriting uncommitted data.
To detect the conflict, there are two ways called swapping and dependency graphs. Swapping is a way of swapping operations in the schedule to determine if the schedule can produce a serializable schedule. This is not easy as the number of transactions increases. The dependency graph determines that the schedule is conflict serializable if the graph has no cycle. In the figure below, the example illustrates how the dependency graph works.
There is an isolation level that determines which anomalies the system can tolerate(or what conflicts the system can tolerate) for transactions:
Based on the figure above, you can see that the snapshot and serializable scheduling are the two types that do not allow any conflict in the schedules. Note: there is a stronger isolation level called external consistency which combines linearizability and serializability with a real-time order.
There are two categories of concurrency control protocols that avoid violation of isolation properties:
- Pessimistic which means you avoid the violation of isolated property. An example of this type is two-phase locking(2PL).
- Optimistic assume conflict is very rare and deal with them after they happen. An example of this type is multi-version concurrency control.
Multi-version Concurrency Control(MVCC): this is an optimization that can be used with any protocol. Unlike 2PL where there is a single global copy of the database, MVCC has multiple copies(versions) of the database. The MVCC versioning costs more hardware capacity.
MVCC creates a new version of the data objects when there is a write operation. When a transaction reads an object, it reads the newest version that existed when the transaction started. Write and read do not block each other while the write and write affect each other. The read-only transaction can read a consistent snapshot without acquiring locks and use timestamps to determine visibility. Below is an example of a versioning database.
Similar to the above with more active transactions.
MVCC is used to implement snapshot isolation levels. In general, most optimistic protocols did not make the change visible to other transactions, but they perform the change in a protected. The protected area is determined by the data versioning. There are many ways to version the data like log DB.
Here I will give an example of row-wise versioning.
MVCC has three phases called simulation, validation, and commit.
The below figure has two transactions. They only see the database before their timestamp.
In the validation phase, the transaction should determine if there is a commit while the transaction running.
The third phase is to commit the data to the database. After this commit, in some transaction isolation mode, the value will be visible.
The second transaction will abort because there is an active transaction running in parallel accessing the same data item.
To sum up, this is a short summary that explains briefly the transaction, scheduling, isolation level, and concurrency mechanisms.