Vector Databases
Explore the mathematics of high-dimensional spaces, metric calculations, and indexing designs like HNSW and Product Quantization that power vector databases.
The Concept
Traditionally, databases indexed simple, scalar data types: integers, strings, dates, and booleans. Finding a record required matches using structures like B+ Trees.
However, artificial intelligence and machine learning models analyze unstructured data: text documents, raw images, audio clips, and video streams. To process this unstructured data, models convert it into dense vectors (embeddings). An embedding is a high-dimensional array of floating-point numbers, representing the semantic meaning of the source object.
Searching for semantically similar items requires a database that can query high-dimensional spaces. Standard B+ Trees or Hash indexes cannot search these spaces. This requirement led to the development of the Vector Database.
High-Dimensional Vector Search
A vector database stores arrays of numbers representing items. The dimensionality of these arrays is high, often ranging from 128 to over 1,536 elements.
To determine if two items are semantically similar, the database calculates the mathematical distance between their embeddings.
Distance Metrics Formulas
Euclidean Distance (L2): Measures the straight-line distance between two points in Euclidean space:
L2 = √ Σ (ui - vi)2
Cosine Similarity: Measures the cosine of the angle between two vectors, evaluating their directional alignment rather than their absolute magnitude:
Similarity = (u · v) / (||u|| ||v||)
Dot Product: Multiplies corresponding components of the vectors. If vectors are normalized to a length of 1, the dot product matches cosine similarity:
Dot Product = Σ ui · vi
The Dimensional Bottleneck
In a 2D or 3D space, dividing the space with trees (like KD-trees) is efficient. However, as the number of dimensions increases, KD-tree partition boundaries fail to prune irrelevant search spaces.
This performance drop is called the curse of dimensionality. In high dimensions, the distance between any two vectors converges: almost all vectors become equidistant from one another. This renders partitioning indexes useless, forcing queries to fallback to a brute-force full table scan.
Running a brute-force scan requires calculating the distance between the query vector and every vector in the database. When storing millions of embeddings, this calculation is too slow.
Approximate Nearest Neighbor (ANN) Indexing
To bypass the dimensional bottleneck, vector databases trade mathematical precision for query speed using Approximate Nearest Neighbor (ANN) indexing. ANN algorithms do not guarantee finding the absolute nearest neighbor, but they locate an approximate match with high probability in logarithmic time.
Hierarchical Navigable Small World (HNSW) Graphs
The industry standard ANN index is the Hierarchical Navigable Small World (HNSW) graph. HNSW applies the skip-list data structure concept to a multi-layer graph model.
In an HNSW index:
- Layer 2 (Sparse Layer): Contains a sparse selection of vectors. Traversal hops across long distances to find the general neighborhood of the query.
- Layer 1 (Medium Layer): Houses more vectors. Traversal performs finer local search adjustments.
- Layer 0 (Dense Layer): Contains every vector in the index. The search performs small adjustments to identify the nearest neighbor.
This layered structure allows HNSW graphs to route queries in O(log N) search time.
Vector Compression: Quantization
Storing millions of high-dimensional floating-point vectors in RAM requires significant memory. If each float requires 4 bytes, a 1536-dimensional vector requires 6KB of memory. Ten million vectors would require 60GB of RAM.
To reduce memory consumption, vector databases compress vectors using quantization:
- Scalar Quantization (SQ): Translates floating-point numbers into 8-bit integers (int8). This reduces the memory footprint by 75% while retaining most of the vector's relative distance details.
- Product Quantization (PQ): Divides a high-dimensional vector space into smaller sub-vectors, runs a clustering algorithm (like K-Means) to identify the centroids of these spaces, and represents each sub-vector as a 1-byte pointer to the nearest centroid. This compresses high-dimensional vectors by up to 95%, permitting massive datasets to fit in RAM at the cost of minor recall precision.
Hybrid Search Filtering
Real-world applications rarely query vectors in isolation. Applications often pair semantic searches with specific metadata filters, such as:
"Find posts semantically similar to 'databases', but only those published in the last 24 hours by premium users."
Vector databases coordinate this hybrid search using one of two execution paths:
Pre-Filtering
The database first filters the dataset based on metadata criteria, leaving a small subset of candidate nodes. The engine then runs the vector search on this subset.
- Problem: If the metadata filter is restrictive (e.g. only 5 out of 100,000 documents match), the HNSW graph connections may be broken on those 5 nodes, preventing routing algorithms from finding the target vectors.
Post-Filtering
The database first runs the vector search on the entire dataset to retrieve the top 100 candidates. It then discards candidates that fail the metadata criteria.
- Problem: If the metadata filter is restrictive, all top 100 vector candidates might be discarded, returning zero results to the user even though matching documents exist in the database.
To balance this trade-off, modern vector databases implement Single-Stage Filtering. This approach traverses the HNSW graph while evaluating metadata criteria on each step, pruning paths dynamically to ensure valid results are returned without breaking graph connectivity.
Further Reading
- The HNSW Paper — Yu A. Malkov's research introducing hierarchical navigable small world graphs.
- Pinecone: What is a Vector Database? — A primer on embeddings, distance metrics, and vector indexes.
- Product Quantization for Similarity Search — An illustrated explanation of product quantization concepts and code implementations.
Prerequisites
Code Examples
Core Literature References
Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs
by Yu A. Malkov and D. A. Yashunin — IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), pp. 824-835
View sourceProduct Quantization for Nearest Neighbor Search
by Herve Jegou, Matthijs Douze, and Cordelia Schmid — IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), pp. 117-128
View sourceContinue learning
ACID & Isolation Levels
Deep dive into database transaction guarantees, isolation levels, concurrency anomalies like write skew, and control mechanisms such as MVCC, 2PL, and SSI.
API Gateways
Understand the API Gateway pattern as the central ingress point for microservices, handling routing, auth, rate limiting, and protocol translation.
API Security & OAuth 2.0
Understand API authentication and authorization mechanisms, JWT security, and the OAuth 2.0 framework including Authorization Code Flow with PKCE.