An Intro to SQL Layer in Cockroach DB

Salem Alqahtani
5 min readOct 31, 2022

The top layer of cockroach DB is called the SQL layer. The SQL layer takes SQL statements from the application by providing SQL API and converts them into low-level read-and-write requests to the underlined key-value store. The key-value pairs are passed to the transaction layer. The below Figure illustrates where SQL is in the cockroach DB layers.

After a preliminary check of the SQL statements, there are sub-layers including SQL API which is a user interface, parser which transfers SQL statements to Abstract Syntax Tree (AST), cost-based optimizer which transfers AST to optimized logical query plan, physical planner which transfers logical query plan to physical query plan, and SQL execution engine which transfers physical plan to make read and write requests to the underline key-value store.

Query optimization is the process of choosing an efficient plan for executing an SQL statement by applying rules to rewrite the tree of operators that is invoked in a query. One of the query optimizations is called a cost-based optimizer.

Cockroach DB and most distributed SQL generate a default plan from the SQL query. Then, perform a series of transformations that transfers a query to the logically equivalent alternative. These alternatives are stored in a compact data structure called a memo.

Cockroach DB uses a cost-based optimizer (CBO) to improve query plans. It acts as an advisor automatically enhancing the efficiency of your queries. Given that SQL is a declarative language, the database chooses the right plan without user intervention.

The SQL optimizer analyzes a SQL query and chooses the most efficient way to execute it. The simplest query has only one way to execute while complex queries can have thousands or even millions of ways. The better the optimizer, the closer it gets to choosing the optimal execution plan.

CBO allocates a cost in numerical form which is related to each step of a possible plan and then finds these values together to get a cost estimate for the plan. After calculating the costs of all possible plans, the optimizer tries to choose a plan which will have the possible lowest cost estimate and hands it off for execution.

The most important factor in determining the cost of the plans:

1- Hardware configuration (not considered in Cockroach DB).

2- Data distribution like only in US data or in both US and Europe data.

3- Types of operation (Scan needs disk I/O, Filter needs CPU operations). The cost refers to the amount of money spent on the system to optimize the system. The measure of cost fully depends upon the work done or the number of resources used.

4- Number of rows processed by each operator. Cardinality is a fancy way of saying the number of rows used by the query. The lower the cardinality, that is the fewer number of rows that each execution operator has to process, the faster the query will be.

Optimization stages or the phases of the plan generating. I illustrate the stages in the Figure below.

1- Parse: converts the SQL string to an abstract syntax tree (AST). An example is below.

The abstract syntax tree is illustrated in the Figure below.

2- OptBuild: analyze AST semantically and convert it to a rational expression tree. This part of the code understands SQL and includes resolving DB object names as well as type checking and type inference. An example is illustrated in the Figure below.

Semantic analysis in OptBuild makes sure the query makes logical sense.

3- Normalize: optimizer has rewritten transformations (normalization rules). These rules convert a rational expression to another equivalent expression that is likely to result in a better plan. Examples are push-down filters, decorrelation, or simplification of unnecessary operators.

4- Explore: Exploration rules are transformations that generate new relational expressions that are equivalent to starting expressions and which might or might not result in faster execution. For each expression, cockroach DB estimates a cost and at the end of the process and picks the cheapest expression. To efficiently store and process these expressions, cockroach DB uses a memo data structure that organizes expressions into equivalent groups.

The memo data structure takes a join query that could have several identical logical plans, except that one plan uses a hash join, another uses a merge join, and the third uses a lookup join.

Selected Plan after exploring all possible plans which will be passed to the execution engine.

5- DistSQL Planning: the final step is to convert the final optimized expression that was generated for a single machine to a data structure that is understood by the execution engine.

How to assign costs for alternatives?

Example of building a histogram.

I was thinking to use machine learning. However, it seems [Balsa] people have done it already.