Vector Databases

Explore the mathematics of high-dimensional spaces, metric calculations, and indexing designs like HNSW and Product Quantization that power vector databases.

AdvancedDatabasesChapter: Database Systems15 min read

The Concept

Traditionally, databases indexed simple, scalar data types: integers, strings, dates, and booleans. Finding a record required matches using structures like B+ Trees.

However, artificial intelligence and machine learning models analyze unstructured data: text documents, raw images, audio clips, and video streams. To process this unstructured data, models convert it into dense vectors (embeddings). An embedding is a high-dimensional array of floating-point numbers, representing the semantic meaning of the source object.

Searching for semantically similar items requires a database that can query high-dimensional spaces. Standard B+ Trees or Hash indexes cannot search these spaces. This requirement led to the development of the Vector Database.

High-Dimensional Vector Search

A vector database stores arrays of numbers representing items. The dimensionality of these arrays is high, often ranging from 128 to over 1,536 elements.

To determine if two items are semantically similar, the database calculates the mathematical distance between their embeddings.

Distance Metrics Formulas

Euclidean Distance (L2): Measures the straight-line distance between two points in Euclidean space:

L2 = √ Σ (u_i - v_i)²
Cosine Similarity: Measures the cosine of the angle between two vectors, evaluating their directional alignment rather than their absolute magnitude:

Similarity = (u · v) / (||u|| ||v||)
Dot Product: Multiplies corresponding components of the vectors. If vectors are normalized to a length of 1, the dot product matches cosine similarity:

Dot Product = Σ u_i · v_i

The Dimensional Bottleneck

In a 2D or 3D space, dividing the space with trees (like KD-trees) is efficient. However, as the number of dimensions increases, KD-tree partition boundaries fail to prune irrelevant search spaces.

This performance drop is called the curse of dimensionality. In high dimensions, the distance between any two vectors converges: almost all vectors become equidistant from one another. This renders partitioning indexes useless, forcing queries to fallback to a brute-force full table scan.

Running a brute-force scan requires calculating the distance between the query vector and every vector in the database. When storing millions of embeddings, this calculation is too slow.

Approximate Nearest Neighbor (ANN) Indexing

To bypass the dimensional bottleneck, vector databases trade mathematical precision for query speed using Approximate Nearest Neighbor (ANN) indexing. ANN algorithms do not guarantee finding the absolute nearest neighbor, but they locate an approximate match with high probability in logarithmic time.

Hierarchical Navigable Small World (HNSW) Graphs

The industry standard ANN index is the Hierarchical Navigable Small World (HNSW) graph. HNSW applies the skip-list data structure concept to a multi-layer graph model.

In an HNSW index:

Layer 2 (Sparse Layer): Contains a sparse selection of vectors. Traversal hops across long distances to find the general neighborhood of the query.
Layer 1 (Medium Layer): Houses more vectors. Traversal performs finer local search adjustments.
Layer 0 (Dense Layer): Contains every vector in the index. The search performs small adjustments to identify the nearest neighbor.

This layered structure allows HNSW graphs to route queries in O(log N) search time.

Vector Compression: Quantization

Storing millions of high-dimensional floating-point vectors in RAM requires significant memory. If each float requires 4 bytes, a 1536-dimensional vector requires 6KB of memory. Ten million vectors would require 60GB of RAM.

To reduce memory consumption, vector databases compress vectors using quantization:

Scalar Quantization (SQ): Translates floating-point numbers into 8-bit integers (int8). This reduces the memory footprint by 75% while retaining most of the vector's relative distance details.
Product Quantization (PQ): Divides a high-dimensional vector space into smaller sub-vectors, runs a clustering algorithm (like K-Means) to identify the centroids of these spaces, and represents each sub-vector as a 1-byte pointer to the nearest centroid. This compresses high-dimensional vectors by up to 95%, permitting massive datasets to fit in RAM at the cost of minor recall precision.

Hybrid Search Filtering

Real-world applications rarely query vectors in isolation. Applications often pair semantic searches with specific metadata filters, such as:

"Find posts semantically similar to 'databases', but only those published in the last 24 hours by premium users."

Vector databases coordinate this hybrid search using one of two execution paths:

Pre-Filtering

The database first filters the dataset based on metadata criteria, leaving a small subset of candidate nodes. The engine then runs the vector search on this subset.

Problem: If the metadata filter is restrictive (e.g. only 5 out of 100,000 documents match), the HNSW graph connections may be broken on those 5 nodes, preventing routing algorithms from finding the target vectors.

Post-Filtering

The database first runs the vector search on the entire dataset to retrieve the top 100 candidates. It then discards candidates that fail the metadata criteria.

Problem: If the metadata filter is restrictive, all top 100 vector candidates might be discarded, returning zero results to the user even though matching documents exist in the database.

To balance this trade-off, modern vector databases implement Single-Stage Filtering. This approach traverses the HNSW graph while evaluating metadata criteria on each step, pruning paths dynamically to ensure valid results are returned without breaking graph connectivity.

Prerequisites

Hash Functions Database Indexes

Code Examples

Core Literature References

Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs

by Yu A. Malkov and D. A. Yashunin — IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), pp. 824-835

View source

Product Quantization for Nearest Neighbor Search

by Herve Jegou, Matthijs Douze, and Cordelia Schmid — IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), pp. 117-128

View source

Continue learning

ACID & Isolation Levels

Deep dive into database transaction guarantees, isolation levels, concurrency anomalies like write skew, and control mechanisms such as MVCC, 2PL, and SSI.

API Gateways

Understand the API Gateway pattern as the central ingress point for microservices, handling routing, auth, rate limiting, and protocol translation.

API Security & OAuth 2.0

Understand API authentication and authorization mechanisms, JWT security, and the OAuth 2.0 framework including Authorization Code Flow with PKCE.