SQL Query Optimizers

Understand how SQL query optimizers parse declarations, apply algebraic transformations, estimate executing costs, and build physical query plans.

AdvancedDatabasesChapter: Database Systems15 min read

The Concept

SQL is a declarative language: you write what data you want to retrieve (SELECT * FROM users WHERE active = true), not how to access it. Under the hood, the database must convert this declaration into a procedural sequence of execution steps.

The database component responsible for determining the execution pathway is the SQL Query Optimizer.

If a query contains joins, filters, and projections, there are thousands of valid execution paths. The difference in execution speed between a naive, unoptimized plan and an optimized one can be the difference between a query taking 10 milliseconds or running for 10 hours. The query optimizer evaluates these alternatives to construct the most efficient physical execution plan.

The Query Compilation Workflow

Before executing a query, the SQL engine passes it through several distinct compiler and optimizer stages:

1. Parsing and Semantic Checking

The database parses the raw SQL text, checking syntax rules and compiling it into an Abstract Syntax Tree (AST). The engine then performs semantic validation checks: verifying that the target tables and columns exist, and confirming the user has read permissions.

2. Logical Query Plan Formulation

The parser converts the AST into a Logical Query Plan. A logical plan represents the query as a tree of mathematical operations using Relational Algebra:

Selection (σ): Filters rows matching specific conditions (similar to SQL WHERE).
Projection (π): Retains only specified columns, discarding the rest (similar to SQL SELECT name, email).
Join (⋈): Merges records from two relation sources on a matching attribute.

3. Rule-Based Optimization (RBO)

The engine applies a series of heuristics (rules) to rewrite the logical plan into a more efficient form:

Predicate Pushdown: Moves filters down the plan tree to execute them before joins. This reduces the number of rows that must be processed during the expensive join stage.
Projection Pruning: Discards unused columns early in the plan tree, saving memory and processing time.

4. Cost-Based Optimization (CBO)

After applying rules, the optimizer converts the logical plan into a Physical Query Plan. A physical plan replaces relational algebra operations with concrete database algorithms (e.g. replacing a generic logical join with a physical Hash Join or Nested Loop Join).

Because there are many physical plan alternatives, the optimizer estimates the resource cost (CPU cycles and disk access seeks) of each plan. The optimizer uses database statistics to determine this cost, choosing the plan with the lowest score.

Cost-Based Estimations and Statistics

To calculate costs, the optimizer retrieves statistics from the database catalog:

Row Cardinality: The total row count in each table.
Selectivity Ratio: The fraction of rows expected to pass a filter condition. For example, filtering by a unique primary key has a selectivity of 1/N, whereas filtering by a boolean state might have a selectivity of 0.5.
Histograms: Databases divide column value ranges into buckets to track value distribution details. This enables accurate selectivity estimates on non-uniform data (e.g. calculating how many users are aged between 20 and 30).

If the catalog statistics are stale, the optimizer might construct an inefficient physical plan (e.g., choosing a full table scan because it incorrectly estimates a table contains only 10 rows). To prevent this, databases run periodic background analyzer routines (like PostgreSQL's ANALYZE command) to refresh statistics.

Physical Join Algorithms

The optimizer chooses specific algorithms to execute joins, matching the selected strategy to the dataset size and indexing details:

Nested Loop Join

For every row in the outer table, the engine scans the inner table to find matches.

Complexity: O(N * M)
Use Case: Efficient when the outer table is small, or when the join column in the inner table has a B+ Tree index, permitting fast index seeks instead of full table scans.

Hash Join

The engine reads the smaller table, builds a temporary hash map in RAM, and scans the larger table to find matches in the hash table.

Complexity: O(N + M)
Use Case: The standard strategy for joining large, unindexed tables. It requires enough RAM to hold the temporary hash table. If the tables are too large to fit in memory, the engine uses a Grace Hash Join, partitioning the tables to disk.

Sort-Merge Join

The engine sorts both tables by the join column, then traverses them in parallel to merge matches.

Complexity: O(N log N + M log M)
Use Case: Efficient when the tables are already sorted on the join column (e.g. by a index). If sorting must be done on the fly, it is slower than a Hash Join.

Search Frameworks: System R vs Volcano

To evaluate candidate plans, optimizers use specific search frameworks:

System R (Dynamic Programming): Evaluates plans bottom-up, pruning inefficient sub-trees early to reduce search space. It excels at finding efficient join orderings for small numbers of tables but scales poorly when queries contain dozens of joins.
Volcano/Cascades: A top-down search framework using object-oriented rule matching and memoization. It models optimization rules as pluggable transformations, allowing developers to extend the optimizer with custom rules.

Verifying Plans with EXPLAIN

To understand the optimization decisions made by a database, prefix your query with EXPLAIN (or EXPLAIN ANALYZE to execute the query and measure runtime performance):

sql

EXPLAIN ANALYZE 
SELECT users.name, orders.id 
FROM users 
JOIN orders ON users.id = orders.user_id 
WHERE users.active = true;

Reading the output reveals the selected physical strategies, estimated row counts, index lookups, and the actual execution times.

Prerequisites

B+ Trees Database Indexes

Code Examples

Core Literature References

Access Path Selection in a Relational Database Management System

by Patricia G. Selinger, Morton M. Astrahan, Donald D. Chamberlin, Raymond A. Lorie, and Thomas G. Price — Proceedings of the 1979 ACM SIGMOD International Conference on Management of Data, pp. 23-34

View source

The Volcano Optimizer Generator: Extensibility and Efficient Search

by Goetz Graefe — IEEE International Conference on Data Engineering, pp. 341-350

View source

Continue learning

ACID & Isolation Levels

Deep dive into database transaction guarantees, isolation levels, concurrency anomalies like write skew, and control mechanisms such as MVCC, 2PL, and SSI.

API Gateways

Understand the API Gateway pattern as the central ingress point for microservices, handling routing, auth, rate limiting, and protocol translation.

API Security & OAuth 2.0

Understand API authentication and authorization mechanisms, JWT security, and the OAuth 2.0 framework including Authorization Code Flow with PKCE.