B+ Trees
Understand the structural design, traversal efficiency, write mutation mechanics, and concurrency structures of B+ Tree database indexes.
The Concept
In database storage design, reading data from disk is the primary performance bottleneck. While in-memory operations take nanoseconds, retrieving a block from a spinning hard disk or solid-state drive (SSD) takes milliseconds or microseconds. To minimize these expensive disk I/O operations, databases structure their indexes using B+ Trees.
A B+ Tree is a self-balancing search tree adapted to block storage environments. Unlike binary search trees, which have a fan-out of two, a B+ Tree has a very large fan-out, often holding hundreds or thousands of child references per node. This minimizes the height of the tree, ensuring that even in a database with billions of rows, a target record can be located in three or four disk page lookups.
Unlike standard B-Trees, which store keys and data records across all levels, a B+ Tree strictly separates routing metadata from raw data records. This structural optimization makes range queries highly efficient and maximizes cache utility.
B+ Tree Architecture
A B+ Tree consists of three distinct types of nodes: the root node, internal routing nodes, and leaf nodes. All nodes are sized to fit exactly within one or more operating system page boundaries, typically 4KB to 16KB, to prevent partial page write alignment penalties.
Internal Routing Nodes vs Leaf Nodes
The structural layout differs significantly between internal levels and leaf levels:
- Internal Nodes: Store routing keys and child page address pointers. They do not hold raw row data. Because they hold only index metadata, their capacity is high. A single 8KB internal page can contain thousands of child pointers, facilitating rapid branching.
- Leaf Nodes: Store key-value pairs (the actual row data or pointers to data files) along with pointers to both the left and right sibling leaves. These sibling links form a doubly linked list, enabling the database to perform high-speed sequential range scans (
WHERE age >= 21 AND age <= 35) entirely within the leaf layer, bypassing any need to traverse up or down the parent nodes.
B+ Tree Operations
Search Operations
To search for a key, the database executes a binary search within the current node starting at the root. The search returns the location of the highest key that is less than or equal to the target. The search follows the corresponding child pointer down to the next level. This process repeats recursively until reaching a leaf node, where another binary search determines if the target key is present.
Insertion and Page Splits
When inserting a key, the database traverses down to the target leaf node. If the leaf has available capacity, the key is inserted in sorted order. If the leaf is full, the node must be split to maintain balance:
- A new leaf node is allocated.
- The elements of the full leaf node are divided. The lower half remains in the original leaf, while the upper half is moved to the new leaf.
- The smallest key of the new leaf is copied (promoted) up to the parent internal node to serve as a routing boundary.
- Sibling pointers are updated to link the new leaf into the leaf sequence.
If the parent internal node is also full, it splits in turn. In this case, the middle key is moved (rather than copied) up to the grandparent node. If the root node splits, a new root node is created, increasing the height of the tree by one. Because all splits propagate upward, a B+ Tree remains balanced: every leaf node is always the exact same distance from the root.
Engineering Trade-offs
Fan-Out and OS Page Alignment
The performance of a database index is determined by the number of disk accesses required to read nodes. The number of accesses is bounded by the height of the tree. The tree height depends on the node capacity:
Height = O(logB N)
Here, B represents the branch factor (fan-out), and N is the total number of records. By maximizing the fan-out, we minimize the tree height.
To maximize the branching factor B, index nodes are designed to match OS memory page boundaries, typically 4KB or 8KB. Since database systems perform disk reads and writes in blocks, keeping node structures aligned with these boundaries prevents write amplification. Write amplification occurs when writing a single byte requires reading, modifying, and rewriting multiple physical disk blocks.
Latch Crabbing Concurrency
When multiple threads query and update a B+ Tree concurrently, they must coordinate access using locks (commonly called latches in database contexts to distinguish them from transaction locks). Without coordination, a search thread could read a page while an insertion thread is splitting it, causing pointers to point to invalid memory.
To prevent this corruption without blocking the entire tree, databases implement latch crabbing:
- When traversing down the tree, a thread acquires a shared latch on the parent node.
- The thread then acquires a latch on the child node.
- If the child node is safe (cannot split during insertion, or cannot merge during deletion), the thread releases (unpin) the latch on the parent node. This resembles a crab crawling down the tree.
For write operations:
- The thread acquires exclusive locks down the path.
- If a child is safe, the thread releases all exclusive locks on the ancestors.
- This localizes locks to only the sub-tree undergoing modification, allowing other reader threads to query unrelated sections of the index.
B+ Trees vs LSM Trees
Modern databases choose index structures based on workload patterns:
- B+ Trees: Optimized for read-heavy workloads and range queries. Reads require a constant number of disk page accesses (typically 3 or 4). However, write mutations require random disk writes to update leaf pages, which can degrade performance on write-heavy systems.
- LSM Trees (Log-Structured Merge Trees): Optimized for write throughput. Writes are appended sequentially to an in-memory buffer, which is periodically flushed to sequential files on disk. Range queries and random reads are slower, as they must search across multiple disk files (SSTables) and reconcile updates.
Further Reading
- B-Trees and Database Indexes Explained — PlanetScale's exploration of B-tree layout and mechanics.
- latch crabbing in B+ Trees — Carnegie Mellon University's database slides detailing latch concurrency.
- Modern B-Tree Techniques — Goetz Graefe's comprehensive survey of B-tree optimizations and concurrency strategies.
Prerequisites
Code Examples
Core Literature References
Database Management Systems
by Raghu Ramakrishnan and Johannes Gehrke — Chapter 10: Tree-Structured Indexing, pp. 338-372
View sourceDatabase System Concepts
by Abraham Silberschatz, Henry F. Korth, and S. Sudarshan — Chapter 14: Indexing, pp. 625-650
View sourceContinue learning
ACID & Isolation Levels
Deep dive into database transaction guarantees, isolation levels, concurrency anomalies like write skew, and control mechanisms such as MVCC, 2PL, and SSI.
API Gateways
Understand the API Gateway pattern as the central ingress point for microservices, handling routing, auth, rate limiting, and protocol translation.
API Security & OAuth 2.0
Understand API authentication and authorization mechanisms, JWT security, and the OAuth 2.0 framework including Authorization Code Flow with PKCE.