SQL Joins & Performance

Explore the mathematical and physical execution layers of SQL joins, including nested loop, hash, and sort-merge strategies, and how to optimize join performance.

BeginnerDatabasesChapter: Database Systems10 min read

The Problem: Connecting Isolated Relations

In a normalized relational database, related entities are isolated into separate tables to prevent redundancy and maintain consistency. For example, a customer's personal details are stored in a users table, while their transactions are stored in an orders table.

However, to answer business queries (such as "Retrieve all orders placed by customers living in New York"), the database engine must reconnect these tables at query time. This operation is called a join.

Because tables can contain millions of rows, joining them requires the database engine to compare key fields across files. Naive comparisons can crash a server. Backend engineers must understand how database query planners choose physical algorithms to execute joins, and how indexes speed up these operations.

Logical Joins in Relational Algebra

At the logical layer, SQL queries are mapped to relational algebra operators. The type of join determines how the database handles unmatched records:

Inner Join: Returns only the rows where the join keys match in both tables. Unmatched rows are discarded.
Left Outer Join: Returns all rows from the left table, plus matching rows from the right table. If there is no match, the columns from the right table are filled with NULL.
Right Outer Join: Returns all rows from the right table, plus matching rows from the left table. If there is no match, the left table's columns are filled with NULL.
Full Outer Join: Returns all rows from both tables, filling in NULL values for missing matches on either side.
Cross Join (Cartesian Product): Combines every row of the first table with every row of the second table, creating a result set of size $N \times M$ rows.
Self Join: A table joined with itself, which is useful for querying hierarchical relationships (such as an employees table containing a manager_id pointing back to the same table's primary key).

Physical Join Execution Algorithms

When executing a logical join, the database query optimizer translates it into a physical algorithm based on table sizes, memory limits, and indexes:

1. Nested Loop Join

The database loops through each row of the outer table, and for each row, scans the entire inner table to find matches.

Complexity: $O(N \times M)$
Trade-off: Highly inefficient for large datasets. However, if the inner table has a B+ Tree index on the join column, the scan is replaced by an Index Nested Loop Join, turning the inner search into a fast $O(\log M)$ tree seek.

2. Hash Join

The database reads the smaller table and builds an in-memory hash table of the join keys. It then scans the larger table, hashes each join key, and probes the hash table for matches.

Complexity: $O(N + M)$
Trade-off: The standard strategy for large, unindexed tables. It requires enough memory to hold the build table's hash map. If the map exceeds memory limits, the engine must split partitions to disk (a process known as disk spilling or Grace Hash Join), which slows down execution due to disk I/O.

3. Sort-Merge Join

The database sorts both tables by their join keys, then scans both sorted inputs in a single parallel pass to merge matching values.

Complexity: $O(N \log N + M \log M)$
Trade-off: If sorting must be done on the fly, it is slower than a Hash Join. However, if the tables are already sorted (for example, if they are indexed on the join columns), the sort step is skipped. The merge phase runs in $O(N + M)$ time, making it highly efficient.

Indexing and Join Optimization

The query optimizer uses database statistics and indexes to determine the execution plan:

Outer vs Inner Table Selection: In a nested loop, the optimizer assigns the smaller table as the outer loop. This is because the outer loop determines how many times the database must scan or seek the inner table.
Index Seeks: An index on the inner table's join column allows the optimizer to replace expensive table scans with fast B-Tree lookups. For instance, when joining orders and users on user_id, an index on users(id) allows the database to read an order, jump directly to the user record in the B-Tree index, and merge the data instantly.
Join Order Optimization: When queries join three or more tables, the query planner uses search algorithms (such as dynamic programming or greedy search heuristics) to calculate the cheapest order of operations. Joining tables in the wrong order can generate massive intermediate tables, stalling execution.

Subquery to Join Rewrites

Engineers often write queries using nested subqueries (e.g. SELECT * FROM users WHERE id IN (SELECT user_id FROM orders)). Modern query optimizers analyze these statements and rewrite correlated subqueries into joins under the hood.

This process, called query flattening, allows the optimizer to choose efficient physical algorithms like Hash Joins or Index Nested Loops rather than running the subquery once for every single row in the outer table.

Prerequisites

Database Indexes

Code Examples

Core Literature References

Join Processing in Relational Databases

by Dmitry Mishkin, et al. — ACM Computing Surveys, pp. 1-36

View source

Query Evaluation Techniques for Large Databases

by Goetz Graefe — ACM Computing Surveys, Vol. 25, No. 2, pp. 73-169

View source

Continue learning

ACID & Isolation Levels

Deep dive into database transaction guarantees, isolation levels, concurrency anomalies like write skew, and control mechanisms such as MVCC, 2PL, and SSI.

API Gateways

Understand the API Gateway pattern as the central ingress point for microservices, handling routing, auth, rate limiting, and protocol translation.

API Security & OAuth 2.0

Understand API authentication and authorization mechanisms, JWT security, and the OAuth 2.0 framework including Authorization Code Flow with PKCE.