B+ Trees

Understand the structural design, traversal efficiency, write mutation mechanics, and concurrency structures of B+ Tree database indexes.

AdvancedDatabasesChapter: Database Systems18 min read

The Concept

In database storage design, reading data from disk is the primary performance bottleneck. While in-memory operations take nanoseconds, retrieving a block from a spinning hard disk or solid-state drive (SSD) takes milliseconds or microseconds. To minimize these expensive disk I/O operations, databases structure their indexes using B+ Trees.

A B+ Tree is a self-balancing search tree adapted to block storage environments. Unlike binary search trees, which have a fan-out of two, a B+ Tree has a very large fan-out, often holding hundreds or thousands of child references per node. This minimizes the height of the tree, ensuring that even in a database with billions of rows, a target record can be located in three or four disk page lookups.

Unlike standard B-Trees, which store keys and data records across all levels, a B+ Tree strictly separates routing metadata from raw data records. This structural optimization makes range queries highly efficient and maximizes cache utility.


B+ Tree Architecture

A B+ Tree consists of three distinct types of nodes: the root node, internal routing nodes, and leaf nodes. All nodes are sized to fit exactly within one or more operating system page boundaries, typically 4KB to 16KB, to prevent partial page write alignment penalties.

B+ Tree Structural Architecture Root: [ 20 | 50 ] Internal: [ 10 ] Internal: [ 35 ] Internal: [ 65 ] Leaf: [ 5, 8 ] Leaf: [ 10, 15 ] Leaf: [ 20, 30 ] Leaf: [ 35, 45 ] Leaf: [ 50, 60 ] Leaf: [ 65, 70 ] Horizontal links connect leaf nodes (yellow lines) for high-speed sequential scans.

Internal Routing Nodes vs Leaf Nodes

The structural layout differs significantly between internal levels and leaf levels:

  • Internal Nodes: Store routing keys and child page address pointers. They do not hold raw row data. Because they hold only index metadata, their capacity is high. A single 8KB internal page can contain thousands of child pointers, facilitating rapid branching.
  • Leaf Nodes: Store key-value pairs (the actual row data or pointers to data files) along with pointers to both the left and right sibling leaves. These sibling links form a doubly linked list, enabling the database to perform high-speed sequential range scans (WHERE age >= 21 AND age <= 35) entirely within the leaf layer, bypassing any need to traverse up or down the parent nodes.

B+ Tree Operations

Search Operations

To search for a key, the database executes a binary search within the current node starting at the root. The search returns the location of the highest key that is less than or equal to the target. The search follows the corresponding child pointer down to the next level. This process repeats recursively until reaching a leaf node, where another binary search determines if the target key is present.

Insertion and Page Splits

When inserting a key, the database traverses down to the target leaf node. If the leaf has available capacity, the key is inserted in sorted order. If the leaf is full, the node must be split to maintain balance:

  1. A new leaf node is allocated.
  2. The elements of the full leaf node are divided. The lower half remains in the original leaf, while the upper half is moved to the new leaf.
  3. The smallest key of the new leaf is copied (promoted) up to the parent internal node to serve as a routing boundary.
  4. Sibling pointers are updated to link the new leaf into the leaf sequence.

If the parent internal node is also full, it splits in turn. In this case, the middle key is moved (rather than copied) up to the grandparent node. If the root node splits, a new root node is created, increasing the height of the tree by one. Because all splits propagate upward, a B+ Tree remains balanced: every leaf node is always the exact same distance from the root.


Engineering Trade-offs

Fan-Out and OS Page Alignment

The performance of a database index is determined by the number of disk accesses required to read nodes. The number of accesses is bounded by the height of the tree. The tree height depends on the node capacity:

Height = O(logB N)

Here, B represents the branch factor (fan-out), and N is the total number of records. By maximizing the fan-out, we minimize the tree height.

To maximize the branching factor B, index nodes are designed to match OS memory page boundaries, typically 4KB or 8KB. Since database systems perform disk reads and writes in blocks, keeping node structures aligned with these boundaries prevents write amplification. Write amplification occurs when writing a single byte requires reading, modifying, and rewriting multiple physical disk blocks.

Latch Crabbing Concurrency

When multiple threads query and update a B+ Tree concurrently, they must coordinate access using locks (commonly called latches in database contexts to distinguish them from transaction locks). Without coordination, a search thread could read a page while an insertion thread is splitting it, causing pointers to point to invalid memory.

To prevent this corruption without blocking the entire tree, databases implement latch crabbing:

  1. When traversing down the tree, a thread acquires a shared latch on the parent node.
  2. The thread then acquires a latch on the child node.
  3. If the child node is safe (cannot split during insertion, or cannot merge during deletion), the thread releases (unpin) the latch on the parent node. This resembles a crab crawling down the tree.

For write operations:

  • The thread acquires exclusive locks down the path.
  • If a child is safe, the thread releases all exclusive locks on the ancestors.
  • This localizes locks to only the sub-tree undergoing modification, allowing other reader threads to query unrelated sections of the index.

B+ Trees vs LSM Trees

Modern databases choose index structures based on workload patterns:

  • B+ Trees: Optimized for read-heavy workloads and range queries. Reads require a constant number of disk page accesses (typically 3 or 4). However, write mutations require random disk writes to update leaf pages, which can degrade performance on write-heavy systems.
  • LSM Trees (Log-Structured Merge Trees): Optimized for write throughput. Writes are appended sequentially to an in-memory buffer, which is periodically flushed to sequential files on disk. Range queries and random reads are slower, as they must search across multiple disk files (SSTables) and reconcile updates.

Further Reading

Prerequisites

Code Examples

Core Literature References

Database Management Systems

by Raghu Ramakrishnan and Johannes Gehrke — Chapter 10: Tree-Structured Indexing, pp. 338-372

View source

Database System Concepts

by Abraham Silberschatz, Henry F. Korth, and S. Sudarshan — Chapter 14: Indexing, pp. 625-650

View source