Bloom Filters

A Bloom filter is a space-efficient probabilistic data structure that can definitively say an item is absent, trading a small false-positive rate for dramatically lower memory use than a hash set.

IntermediateDatabasesChapter: Database Systems10 min read

The Problem: Expensive Lookups for Missing Keys

Imagine you run a database with millions of rows spread across hundreds of disk files. A user queries for a record that does not exist. Without help, your database must open file after file, read index pages, and confirm the record is missing, only to discover nothing is there. Each file access can cost milliseconds on spinning disk.

The question Bloom filters answer is: can we cheaply confirm that an item definitely does NOT exist before paying for any expensive I/O?

Think of it like checking your coat pockets before tearing apart your whole apartment looking for lost keys. If you feel nothing in any pocket, you know for certain the keys are not there. You have saved yourself a lengthy search.

What Is a Bloom Filter?

A Bloom filter is a probabilistic data structure built from two ingredients:

  • A bit array of m bits, all initially set to 0.
  • k independent hash functions, each mapping any input to a position in the bit array.

Inserting an item

To insert an item, run it through all k hash functions. Each function produces a bit-array index. Set all k positions to 1.

Checking membership

To test whether an item exists, run it through the same k hash functions and check those positions:

  • If any bit is 0, the item is definitely absent. It could never have been inserted without setting all k bits to 1.
  • If all bits are 1, the item is probably present, but not certainly. Those bits might have been set by other items that happened to share those positions (a false positive).

This asymmetry is the essential property: no false negatives, possible false positives.

Diagram: Bit Array in Action

Bloom Filter — 10-bit array, 3 hash functions 0 1 2 3 4 5 6 7 8 9 1 1 1 1 1 0 0 0 0 insert 'alice' h1=2, h2=5, h3=8 insert 'bob' h1=1, h2=5, h3=9 Check 'charlie' — h1=2, h2=4, h3=7 1 ✓ set 0 ✗ zero! 0 not checked 'charlie' DEFINITELY NOT present Check 'dave' — h1=1, h2=2, h3=5 — all bits are 1 'dave' MAYBE present (false positive)

False Positives: The Acceptable Trade-off

The filter can answer "definitely no" or "probably yes." It cannot answer "definitely yes." This is fine for the use cases it targets.

In the Cassandra example: a false positive means the database reads a disk file unnecessarily and then discovers the key is missing. That is one wasted I/O. Without a Bloom filter it would waste that I/O on every SSTable. With one, it wastes it only on the small fraction of false positives.

The false positive rate depends on:

  • m: the number of bits. More bits, lower collision probability.
  • k: the number of hash functions. Too few means too few bits set per item; too many means the array fills quickly. There is an optimal k for any target false positive rate.
  • n: the number of items inserted. More items fill more bits.

A common rule of thumb: roughly 10 bits per item gives about a 1% false positive rate with the optimal k.

No Deletions

Standard Bloom filters are append-only. You cannot "unset" a bit for a deleted item because that bit may also have been set by a completely different item. Clearing it would corrupt other items' membership information.

Counting Bloom filters solve this by replacing each bit with a small counter. Inserting increments the counters, deleting decrements them, and a position is "set" when its counter is greater than zero. The trade-off is higher memory use (typically 4 bits per cell instead of 1).

Real-World Uses

  • Apache Cassandra: maintains one Bloom filter per SSTable on disk. When reading a key, Cassandra checks each SSTable's filter first. If the filter says "definitely absent," the SSTable is skipped entirely, avoiding a disk read.
  • Google Chrome Safe Browsing: your browser keeps a locally cached Bloom filter of known malicious URLs. When you visit a site, Chrome checks the local filter. Only if the filter says "possibly present" does Chrome send a network request to Google's servers to confirm. This saves a network round-trip for the vast majority of safe URLs.
  • RocksDB and LevelDB: each SSTable file carries a Bloom filter in its index block. Reads that miss all levels do not need to open any file.
  • Redis: the RedisBloom module provides Bloom filter commands (BF.ADD, BF.EXISTS) for use cases like deduplicating event streams.
  • Akamai CDN: reportedly uses Bloom filters to avoid caching one-hit-wonder content that is only requested once and would pollute the cache.

Bloom Filter vs. Hash Set

Property Hash Set Bloom Filter
False negatives Never Never
False positives Never Possible (tunable)
Memory O(n) items stored Fixed m bits
Deletions Supported Not in standard form
Lookup cost O(1) average O(k) hash operations
Scales with item size Yes (stores the item) No (only bits are stored)

The key insight is that a Bloom filter never stores the items themselves. You cannot retrieve them and you cannot enumerate what is in it. It is purely a membership oracle.

Code Example

python
import hashlib

class BloomFilter:
    def __init__(self, m: int, k: int):
        # m: number of bits in the array
        # k: number of hash functions
        self.m = m
        self.k = k
        self.bits = [0] * m

    def _hashes(self, item: str):
        # Derive k independent positions using a seeded SHA-256
        for seed in range(self.k):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item: str) -> None:
        for pos in self._hashes(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # False  -> DEFINITELY absent (safe to skip the disk read)
        # True   -> PROBABLY present (do the actual lookup to confirm)
        return all(self.bits[pos] == 1 for pos in self._hashes(item))


bf = BloomFilter(m=1000, k=3)
bf.add("alice")
bf.add("bob")

print(bf.might_contain("alice"))    # True  — probably present
print(bf.might_contain("charlie"))  # False — definitely absent, skip I/O

In production you would use a well-tested library (e.g., pybloom-live in Python, or the built-in filters in RocksDB and Cassandra) rather than rolling your own. The implementation above is for illustration.

Prerequisites

Code Examples