Vector Databases

Explore the mathematics of high-dimensional spaces, metric calculations, and indexing designs like HNSW and Product Quantization that power vector databases.

AdvancedDatabasesChapter: Database Systems15 min read

The Concept

Traditionally, databases indexed simple, scalar data types: integers, strings, dates, and booleans. Finding a record required matches using structures like B+ Trees.

However, artificial intelligence and machine learning models analyze unstructured data: text documents, raw images, audio clips, and video streams. To process this unstructured data, models convert it into dense vectors (embeddings). An embedding is a high-dimensional array of floating-point numbers, representing the semantic meaning of the source object.

Searching for semantically similar items requires a database that can query high-dimensional spaces. Standard B+ Trees or Hash indexes cannot search these spaces. This requirement led to the development of the Vector Database.


Approximate Nearest Neighbor (ANN) Indexing

To bypass the dimensional bottleneck, vector databases trade mathematical precision for query speed using Approximate Nearest Neighbor (ANN) indexing. ANN algorithms do not guarantee finding the absolute nearest neighbor, but they locate an approximate match with high probability in logarithmic time.

Hierarchical Navigable Small World (HNSW) Graphs

The industry standard ANN index is the Hierarchical Navigable Small World (HNSW) graph. HNSW applies the skip-list data structure concept to a multi-layer graph model.

HNSW Hierarchical Graph Routing Layer 2 (Sparse Routing) Node A Node B Layer 1 (Medium Density) Layer 0 (All Vectors - Dense) G (Match) Step 1: Jump to closest (A -> B) Step 2: Drop Step 3: Route Left Step 4: Drop Step 5: Final Converge

In an HNSW index:

  • Layer 2 (Sparse Layer): Contains a sparse selection of vectors. Traversal hops across long distances to find the general neighborhood of the query.
  • Layer 1 (Medium Layer): Houses more vectors. Traversal performs finer local search adjustments.
  • Layer 0 (Dense Layer): Contains every vector in the index. The search performs small adjustments to identify the nearest neighbor.

This layered structure allows HNSW graphs to route queries in O(log N) search time.


Vector Compression: Quantization

Storing millions of high-dimensional floating-point vectors in RAM requires significant memory. If each float requires 4 bytes, a 1536-dimensional vector requires 6KB of memory. Ten million vectors would require 60GB of RAM.

To reduce memory consumption, vector databases compress vectors using quantization:

  • Scalar Quantization (SQ): Translates floating-point numbers into 8-bit integers (int8). This reduces the memory footprint by 75% while retaining most of the vector's relative distance details.
  • Product Quantization (PQ): Divides a high-dimensional vector space into smaller sub-vectors, runs a clustering algorithm (like K-Means) to identify the centroids of these spaces, and represents each sub-vector as a 1-byte pointer to the nearest centroid. This compresses high-dimensional vectors by up to 95%, permitting massive datasets to fit in RAM at the cost of minor recall precision.

Hybrid Search Filtering

Real-world applications rarely query vectors in isolation. Applications often pair semantic searches with specific metadata filters, such as:

"Find posts semantically similar to 'databases', but only those published in the last 24 hours by premium users."

Vector databases coordinate this hybrid search using one of two execution paths:

Pre-Filtering

The database first filters the dataset based on metadata criteria, leaving a small subset of candidate nodes. The engine then runs the vector search on this subset.

  • Problem: If the metadata filter is restrictive (e.g. only 5 out of 100,000 documents match), the HNSW graph connections may be broken on those 5 nodes, preventing routing algorithms from finding the target vectors.

Post-Filtering

The database first runs the vector search on the entire dataset to retrieve the top 100 candidates. It then discards candidates that fail the metadata criteria.

  • Problem: If the metadata filter is restrictive, all top 100 vector candidates might be discarded, returning zero results to the user even though matching documents exist in the database.

To balance this trade-off, modern vector databases implement Single-Stage Filtering. This approach traverses the HNSW graph while evaluating metadata criteria on each step, pruning paths dynamically to ensure valid results are returned without breaking graph connectivity.


Further Reading

Code Examples

Core Literature References

Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs

by Yu A. Malkov and D. A. Yashunin — IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), pp. 824-835

View source

Product Quantization for Nearest Neighbor Search

by Herve Jegou, Matthijs Douze, and Cordelia Schmid — IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), pp. 117-128

View source