Hash Functions

The mathematical algorithms mapping arbitrary data to fixed-size signatures, powering HashTables, load balancing, and cryptographic security.

BeginnerAlgorithmsChapter: Foundations10 min read

Hash Function vs. Hash Table

Before diving into the mathematics, we must draw a clear line between the function and the structure:

  • Hash Function: A pure mathematical algorithm. It takes raw input data of any size (like a string, a file, or an object) and translates it into a fixed-size number (e.g., a 32-bit integer). It holds no data; it is just a translator.
  • Hash Table (HashMap): A data structure that stores key-value pairs in memory. It uses a hash function to convert keys into indices of a physical array, allowing you to store and read values in near-instant (O(1)) time.

The Hashing Pipeline

Here is how a hash function translates keys to locate memory slots:

"Alice" "Bob" "Charlie" Hash Function hash(key) % Buckets Index 01: [Alice] Index 02: [Bob] [Charlie] (Collision!) Index 03: Empty

The 3 Hard Rules of Hashing

To be useful for computers, a hash function must obey three core laws:

1. Absolute Determinism

Same input must always yield the exact same hash output.

  • Why? If you store "Alice" at memory index 8 today, but the hash function translates "Alice" to 25 tomorrow, you will look in the wrong slot and your data is lost forever.

2. Fixed-Length Output

No matter if the input is a single character ("A") or a 4GB raw video file, the output size must remain identical (e.g. always a 32-bit integer or a 64-character hash string).

  • Why? Computer memory relies on predictable layouts. Fixed-length outputs ensure database buckets and array memory frames can be pre-allocated with uniform spacing.

3. One-Way Transmission (Pre-Image Resistance)

You can easily go from input to output, but it must be mathematically impossible to work backward from the output hash to reconstruct the input.

  • Why? Security. If database passwords are saved as hashes, an attacker stealing the database only sees the output. If the function is one-way, they cannot reconstruct the plain-text passwords.

The Pigeonhole Principle & Collisions

A Collision occurs when two completely different inputs yield the identical hash output.

This is governed by the Pigeonhole Principle: If you have 10 pigeons but only 9 nesting holes, at least one hole must contain more than one pigeon. Because the possible inputs to a hash function are infinite (any string of any length) but the output hash range is finite (e.g. 232 combinations for a 32-bit integer), collisions are mathematically guaranteed to happen.

Resolution Strategies

  • Separate Chaining: If keys collide at Index 2, store them inside a linked list nested under Index 2.
  • Open Addressing: If Index 2 is full, search linearly down the array until you find the next empty slot.

Hashing Tradeoffs: Fast vs. Slow Functions

Backend engineers choose hash functions based on a critical tradeoff: Execution Speed vs. Cryptographic Security.

Fast Hash Functions (Non-Cryptographic)

  • Algorithms: MurmurHash, FNV-1a, xxHash.
  • Use Case: Router load balancers, database index lookups, cache keys.
  • Pros: Incredibly fast. Can compute hashes for millions of requests per second with negligible CPU usage.
  • Cons: Vulnerable to Hash Collision DoS (Denial of Service) attacks. An attacker can craft a list of payload keys that purposefully collide on the same array index, turning a fast O(1) lookup hash table into a slow O(N) linked list, consuming 100% CPU and crashing the API server.

Slow Hash Functions (Cryptographic & Key Derivation)

  • Algorithms: bcrypt, Argon2, PBKDF2, SHA-256.
  • Use Case: Password hashing, credential storage, integrity checks.
  • Pros: Incredibly secure. They are designed to be computationally expensive (taking 100ms+ per hash) and utilize salt values.
  • Cons: Heavy CPU cost. Running a slow hash function inside a cache lookup index would instantly bottleneck database throughput and slow the application to a crawl.

Prerequisites