SQL Query Optimizers
Understand how SQL query optimizers parse declarations, apply algebraic transformations, estimate executing costs, and build physical query plans.
The Concept
SQL is a declarative language: you write what data you want to retrieve (SELECT * FROM users WHERE active = true), not how to access it. Under the hood, the database must convert this declaration into a procedural sequence of execution steps.
The database component responsible for determining the execution pathway is the SQL Query Optimizer.
If a query contains joins, filters, and projections, there are thousands of valid execution paths. The difference in execution speed between a naive, unoptimized plan and an optimized one can be the difference between a query taking 10 milliseconds or running for 10 hours. The query optimizer evaluates these alternatives to construct the most efficient physical execution plan.
The Query Compilation Workflow
Before executing a query, the SQL engine passes it through several distinct compiler and optimizer stages:
1. Parsing and Semantic Checking
The database parses the raw SQL text, checking syntax rules and compiling it into an Abstract Syntax Tree (AST). The engine then performs semantic validation checks: verifying that the target tables and columns exist, and confirming the user has read permissions.
2. Logical Query Plan Formulation
The parser converts the AST into a Logical Query Plan. A logical plan represents the query as a tree of mathematical operations using Relational Algebra:
- Selection (σ): Filters rows matching specific conditions (similar to SQL
WHERE). - Projection (π): Retains only specified columns, discarding the rest (similar to SQL
SELECT name, email). - Join (⋈): Merges records from two relation sources on a matching attribute.
3. Rule-Based Optimization (RBO)
The engine applies a series of heuristics (rules) to rewrite the logical plan into a more efficient form:
- Predicate Pushdown: Moves filters down the plan tree to execute them before joins. This reduces the number of rows that must be processed during the expensive join stage.
- Projection Pruning: Discards unused columns early in the plan tree, saving memory and processing time.
4. Cost-Based Optimization (CBO)
After applying rules, the optimizer converts the logical plan into a Physical Query Plan. A physical plan replaces relational algebra operations with concrete database algorithms (e.g. replacing a generic logical join with a physical Hash Join or Nested Loop Join).
Because there are many physical plan alternatives, the optimizer estimates the resource cost (CPU cycles and disk access seeks) of each plan. The optimizer uses database statistics to determine this cost, choosing the plan with the lowest score.
Cost-Based Estimations and Statistics
To calculate costs, the optimizer retrieves statistics from the database catalog:
- Row Cardinality: The total row count in each table.
- Selectivity Ratio: The fraction of rows expected to pass a filter condition. For example, filtering by a unique primary key has a selectivity of
1/N, whereas filtering by a boolean state might have a selectivity of0.5. - Histograms: Databases divide column value ranges into buckets to track value distribution details. This enables accurate selectivity estimates on non-uniform data (e.g. calculating how many users are aged between 20 and 30).
If the catalog statistics are stale, the optimizer might construct an inefficient physical plan (e.g., choosing a full table scan because it incorrectly estimates a table contains only 10 rows). To prevent this, databases run periodic background analyzer routines (like PostgreSQL's ANALYZE command) to refresh statistics.
Physical Join Algorithms
The optimizer chooses specific algorithms to execute joins, matching the selected strategy to the dataset size and indexing details:
Nested Loop Join
For every row in the outer table, the engine scans the inner table to find matches.
- Complexity:
O(N * M) - Use Case: Efficient when the outer table is small, or when the join column in the inner table has a B+ Tree index, permitting fast index seeks instead of full table scans.
Hash Join
The engine reads the smaller table, builds a temporary hash map in RAM, and scans the larger table to find matches in the hash table.
- Complexity:
O(N + M) - Use Case: The standard strategy for joining large, unindexed tables. It requires enough RAM to hold the temporary hash table. If the tables are too large to fit in memory, the engine uses a Grace Hash Join, partitioning the tables to disk.
Sort-Merge Join
The engine sorts both tables by the join column, then traverses them in parallel to merge matches.
- Complexity:
O(N log N + M log M) - Use Case: Efficient when the tables are already sorted on the join column (e.g. by a index). If sorting must be done on the fly, it is slower than a Hash Join.
Search Frameworks: System R vs Volcano
To evaluate candidate plans, optimizers use specific search frameworks:
- System R (Dynamic Programming): Evaluates plans bottom-up, pruning inefficient sub-trees early to reduce search space. It excels at finding efficient join orderings for small numbers of tables but scales poorly when queries contain dozens of joins.
- Volcano/Cascades: A top-down search framework using object-oriented rule matching and memoization. It models optimization rules as pluggable transformations, allowing developers to extend the optimizer with custom rules.
Verifying Plans with EXPLAIN
To understand the optimization decisions made by a database, prefix your query with EXPLAIN (or EXPLAIN ANALYZE to execute the query and measure runtime performance):
EXPLAIN ANALYZE
SELECT users.name, orders.id
FROM users
JOIN orders ON users.id = orders.user_id
WHERE users.active = true;
Reading the output reveals the selected physical strategies, estimated row counts, index lookups, and the actual execution times.
Further Reading
- Access Path Selection in a Relational Database Management System — The seminal System R optimizer paper by Patricia Selinger.
- The Volcano Optimizer Generator Paper — Goetz Graefe's framework design for extensible query optimization.
- PostgreSQL Query Plan Visualization — Official documentation explaining how to read and interpret PostgreSQL EXPLAIN plans.
Prerequisites
Code Examples
Core Literature References
Access Path Selection in a Relational Database Management System
by Patricia G. Selinger, Morton M. Astrahan, Donald D. Chamberlin, Raymond A. Lorie, and Thomas G. Price — Proceedings of the 1979 ACM SIGMOD International Conference on Management of Data, pp. 23-34
View sourceThe Volcano Optimizer Generator: Extensibility and Efficient Search
by Goetz Graefe — IEEE International Conference on Data Engineering, pp. 341-350
View sourceContinue learning
ACID & Isolation Levels
Deep dive into database transaction guarantees, isolation levels, concurrency anomalies like write skew, and control mechanisms such as MVCC, 2PL, and SSI.
API Gateways
Understand the API Gateway pattern as the central ingress point for microservices, handling routing, auth, rate limiting, and protocol translation.
API Security & OAuth 2.0
Understand API authentication and authorization mechanisms, JWT security, and the OAuth 2.0 framework including Authorization Code Flow with PKCE.