Bitcask Log-Structured Hash Tables

Discover the design, write append speed, single-seek reads, and compaction mechanics of Bitcask log-structured storage engines.

IntermediateDatabasesChapter: Database Systems12 min read

The Concept

In traditional database systems, write operations are slow because they require random page modifications. A single row update in a B+ Tree index might require finding a page on disk, modifying its bytes, and writing it back to a random location.

To optimize write throughput, modern log-structured database systems use a different design pattern. Bitcask represents the cleanest implementation of this concept. It executes writes by appending updates sequentially to the end of a log file, transforming slow random I/O operations into high-speed sequential disk writes.

However, log-structured layouts present a challenge for read performance. If data is scattered across files in the order it was written, reading a key could require scanning the entire history of log files.

Bitcask solves this read performance challenge using a simple design: it pairs an append-only log-structured file storage on disk with an in-memory hash table called the Keydir.

Read/Write Architecture

A Bitcask instance consists of a directory containing a series of append-only log files (where only one file is marked active for writing) and an in-memory hash table directory.

The Keydir Hash Map

The Keydir is an in-memory hash table containing every key present in the database. Instead of storing the values directly in RAM, Keydir maps each key to a lightweight location descriptor:

File ID: The identifier of the log file containing the value.
Value Size: The size of the value in bytes.
Offset: The exact byte offset location inside the target log file.
Timestamp: The epoch timestamp indicating when the write occurred.

The Write Path

To insert or update a key-value pair, the database engine:

Formats a record containing a fixed-size header (timestamp, key size, and value size) followed by the raw key and value bytes.
Appends the formatted record to the active log file in a single write operation.
Obtains the starting offset of the write.
Updates the in-memory Keydir hash table with the file ID, value size, and offset.

Because the write path consists of a single sequential write append to disk and an in-memory hash table update, it completes in O(1) time, maximizing write throughput.

The Read Path

To read a key, the database engine:

Looks up the key in the in-memory Keydir hash table.
If the key exists, it retrieves the location descriptor (file ID, offset, and size).
Executes a file seek to that offset and reads the value from the log file.

Because the database uses the memory-based Keydir to resolve the exact disk location of a key, reading a value requires only a single disk seek lookup, executing in O(1) time.

Log Compaction and Merging

Because updates are appended to the log, overwriting a key leaves duplicate records in the log files. The old, superseded values are dead space.

text

Log File:
[LSN 0: u_101="Alice"] -> [LSN 45: u_102="Bob"] -> [LSN 93: u_101="Alice Cooper"]
                                                    ^
                                                    Active Keydir references this LSN

To prevent disk space exhaustion, Bitcask uses a background process called compaction (or merging).

The compaction process:

Iterates through the cold log files (files that are no longer active).
For each record, it compares its timestamp with the current in-memory Keydir entry.
If the record LSN matches the current Keydir offset, it is active. The engine copies the active record to a new compacted log file.
If the record offset does not match Keydir, it is stale. The engine discards it.
Once a file has been processed, the old redundant log file is deleted.

Startup Recovery and Hint Files

If a Bitcask database crashes or restarts, the in-memory Keydir is lost. The engine must rebuild the Keydir map from the disk files on startup.

The naive way to rebuild Keydir is to read every log file from start to finish, parsing every record header. On large databases with gigabytes of logs, this process is slow.

To speed up startup, Bitcask writes a hint file alongside each log file during compaction:

The hint file contains the record headers (key, value size, offset, and timestamp) but excludes the value bytes.
When the database restarts, it parses the lightweight hint files rather than the raw data logs.
This allows the database to rebuild the Keydir map at memory speeds, reducing startup recovery times.

Engineering Trade-offs

When choosing a storage engine, consider the design trade-offs of the Bitcask model:

Pros

High Write Throughput: Writes are appended sequentially, minimizing random disk operations.
Low Read Latency: Reading a value requires at most a single disk seek.
Simple Recovery: Recovery requires parsing append-only records, preventing indexes from becoming corrupted or inconsistent.

Cons

RAM Constraints: Every key must fit in RAM. If the database contains millions of keys, the Keydir memory footprint will grow, even if values are kept on disk. This limits the engine to datasets with small key spaces.
No Range Queries: Because Keydir is structured as a hash table, the database cannot perform range scans (WHERE age > 21). To find a range, the engine would have to scan the entire Keydir or perform a full table scan of the logs.

Prerequisites

File Descriptors Hash Functions

Code Examples

Core Literature References

Bitcask: A Log-Structured Hash Table for Fast Key/Value Data

by Justin Shearer, Dave Smith, and Andy Gross — Basho Technologies Whitepaper, pp. 1-15

View source

Continue learning

ACID & Isolation Levels

Deep dive into database transaction guarantees, isolation levels, concurrency anomalies like write skew, and control mechanisms such as MVCC, 2PL, and SSI.

API Gateways

Understand the API Gateway pattern as the central ingress point for microservices, handling routing, auth, rate limiting, and protocol translation.

API Security & OAuth 2.0

Understand API authentication and authorization mechanisms, JWT security, and the OAuth 2.0 framework including Authorization Code Flow with PKCE.