Hash Functions
The mathematical algorithms mapping arbitrary data to fixed-size signatures, powering HashTables, load balancing, and cryptographic security.
Hash Function vs. Hash Table
Before diving into the mathematics, we must draw a clear line between the function and the structure:
- Hash Function: A pure mathematical algorithm. It takes raw input data of any size (like a string, a file, or an object) and translates it into a fixed-size number (e.g., a 32-bit integer). It holds no data; it is just a translator.
- Hash Table (HashMap): A data structure that stores key-value pairs in memory. It uses a hash function to convert keys into indices of a physical array, allowing you to store and read values in near-instant (
O(1)) time.
The Hashing Pipeline
Here is how a hash function translates keys to locate memory slots:
The 3 Hard Rules of Hashing
To be useful for computers, a hash function must obey three core laws:
1. Absolute Determinism
Same input must always yield the exact same hash output.
- Why? If you store "Alice" at memory index
8today, but the hash function translates "Alice" to25tomorrow, you will look in the wrong slot and your data is lost forever.
2. Fixed-Length Output
No matter if the input is a single character ("A") or a 4GB raw video file, the output size must remain identical (e.g. always a 32-bit integer or a 64-character hash string).
- Why? Computer memory relies on predictable layouts. Fixed-length outputs ensure database buckets and array memory frames can be pre-allocated with uniform spacing.
3. One-Way Transmission (Pre-Image Resistance)
You can easily go from input to output, but it must be mathematically impossible to work backward from the output hash to reconstruct the input.
- Why? Security. If database passwords are saved as hashes, an attacker stealing the database only sees the output. If the function is one-way, they cannot reconstruct the plain-text passwords.
The Pigeonhole Principle & Collisions
A Collision occurs when two completely different inputs yield the identical hash output.
This is governed by the Pigeonhole Principle: If you have 10 pigeons but only 9 nesting holes, at least one hole must contain more than one pigeon. Because the possible inputs to a hash function are infinite (any string of any length) but the output hash range is finite (e.g. 232 combinations for a 32-bit integer), collisions are mathematically guaranteed to happen.
Resolution Strategies
- Separate Chaining: If keys collide at Index 2, store them inside a linked list nested under Index 2.
- Open Addressing: If Index 2 is full, search linearly down the array until you find the next empty slot.
Hashing Tradeoffs: Fast vs. Slow Functions
Backend engineers choose hash functions based on a critical tradeoff: Execution Speed vs. Cryptographic Security.
Fast Hash Functions (Non-Cryptographic)
- Algorithms: MurmurHash, FNV-1a, xxHash.
- Use Case: Router load balancers, database index lookups, cache keys.
- Pros: Incredibly fast. Can compute hashes for millions of requests per second with negligible CPU usage.
- Cons: Vulnerable to Hash Collision DoS (Denial of Service) attacks. An attacker can craft a list of payload keys that purposefully collide on the same array index, turning a fast
O(1)lookup hash table into a slowO(N)linked list, consuming 100% CPU and crashing the API server.
Slow Hash Functions (Cryptographic & Key Derivation)
- Algorithms: bcrypt, Argon2, PBKDF2, SHA-256.
- Use Case: Password hashing, credential storage, integrity checks.
- Pros: Incredibly secure. They are designed to be computationally expensive (taking 100ms+ per hash) and utilize salt values.
- Cons: Heavy CPU cost. Running a slow hash function inside a cache lookup index would instantly bottleneck database throughput and slow the application to a crawl.
Prerequisites
Continue learning
ACID & Isolation Levels
Deep dive into database transaction guarantees, isolation levels, concurrency anomalies like write skew, and control mechanisms such as MVCC, 2PL, and SSI.
API Gateways
Understand the API Gateway pattern as the central ingress point for microservices, handling routing, auth, rate limiting, and protocol translation.
API Security & OAuth 2.0
Understand API authentication and authorization mechanisms, JWT security, and the OAuth 2.0 framework including Authorization Code Flow with PKCE.