Rate Limiting
Understand API rate limiting: Token Bucket, Leaky Bucket, Sliding Window algorithms, distributed synchronization, and HTTP client signaling.
Why Rate Limit?
In web infrastructure, APIs must handle fluctuating traffic volumes. An unexpected surge in requests can consume server threads, exhaust database connections, and degrade performance for all users. Rate limiting is the architectural pattern of throttling the rate of incoming requests to protect backend resources.
We rate limit APIs for three primary reasons:
- Denial of Service (DoS) Mitigation: Prevents malicious actors or buggy clients from executing volumetric resource exhaustion attacks against endpoints.
- API Monetization: Enforces usage boundaries for tiered pricing models, restricting basic accounts while allocating higher capacity to premium users.
- Resource Preservation: Protects downstream dependencies, such as third-party payment gateways, from being overwhelmed by app-side spikes.
Rate limiting works by inspecting request metadata (such as IP addresses, authorization tokens, or API keys), evaluating usage history against a tracking algorithm, and either allowing the request or rejecting it.
The Token Bucket Algorithm
The Token Bucket algorithm is a widely used rate limiting strategy that permits traffic bursts up to a defined limit.
The algorithm uses a simple analogy: Imagine a self-serve bakery where customers must grab a physical ticket (token) from a dispenser to buy bread.
- The ticket dispenser has a maximum capacity
B. - A timer adds new tickets to the dispenser at a constant rate of
rtokens per second. - When a customer arrives, they try to pull a ticket. If a ticket is available, they proceed with their purchase.
- If a line forms, customers consume tickets immediately, allowing a sudden burst of sales until the dispenser is empty.
- If no tickets are left, customers are turned away until the dispenser generates a new ticket.
token_count = min(capacity, token_count + rate * elapsed_time)
If a client makes a request and the current token_count is at least 1, the controller decrements the count by 1 and allows the request. If the count is 0, the request is blocked.
This algorithm allows apps to handle transient traffic spikes gracefully: if a user is idle, their bucket fills to capacity, allowing them to make a burst of requests immediately when they return.
The Leaky Bucket Algorithm
The Leaky Bucket algorithm smooths out traffic bursts, transmitting requests at a strictly constant rate.
This algorithm can be compared to a queue line at an amusement park ride:
- Customers enter a staging queue (the bucket) of maximum capacity
Q. - Customers arrive in random clumps and bursts, piling into the queue.
- At the exit of the queue, a turnstile lets exactly one customer through to the ride every 5 seconds (constant leak rate
L). - If the staging queue fills to capacity, subsequent customers are immediately rejected.
Unlike the Token Bucket, which allows bursts to exhaust tokens instantly, the Leaky Bucket forces all requests through a first-in, first-out (FIFO) queue. The queue leaks requests to the backend at a uniform rate. This makes the Leaky Bucket ideal for traffic-shaping configurations where downstream services require a highly predictable, flat request volume.
Fixed Window vs Sliding Window Algorithms
Runtimes can track request limits using window-based counting algorithms.
Fixed Window Counter
The Fixed Window Counter partition time into static windows (e.g., 1-minute blocks from 12:00 to 12:01). Every request increments a counter associated with the current window. Once the counter hits the limit, all subsequent requests are blocked until the next window boundary begins.
This algorithm has a major flaw: the double-dipping spike. If a user is allowed 10 requests per minute, they can transmit 10 requests at 12:00:59, wait one second for the window to reset at 12:01:00, and transmit 10 more requests immediately. This allows the user to execute 20 requests in a 2-second window, bypassing the intended limit.
Sliding Window Log
The Sliding Window Log resolves this by storing a sorted log of request timestamps for each user. When a request arrives, the algorithm:
- Deletes all timestamps older than
now - window_width. - Counts the remaining timestamps in the log.
- If the count is less than the limit, appends the current timestamp to the log and allows the request.
- Pros: Extremely precise, eliminating the double-dipping vulnerability.
- Cons: Storing individual timestamps for millions of active users consumes a significant amount of memory.
Sliding Window Counter
The Sliding Window Counter combines the low memory foot-print of Fixed Window with the precision of Sliding Log. It stores only the request counts of the current and preceding windows. When a request arrives, the algorithm computes a weighted sum of requests over the rolling window:
Requests = (Preceding_Count * (1 - ratio)) + Current_Count
Here, ratio is the progress percentage through the current window. If this calculated sum is below the limit, the current window's count is incremented.
Distributed Rate Limiting and Race Conditions
When scaling an application horizontally across multiple server nodes, rate limiting state must be stored in a centralized cache like Redis. A naive implementation can suffer from race conditions (specifically, read-modify-write conflicts).
If two server instances concurrently evaluate a user's rate limit:
- Both read the user's current token count (e.g., 1 token left).
- Both confirm the request is allowed.
- Both write back a decremented token count of
0.
The user has successfully executed two requests, even though only one token remained.
Atomic Redis Lua Scripting
To prevent these conflicts, systems execute rate-limiting logic inside Redis Lua scripts. Redis runs Lua scripts atomically on its main execution thread. No other commands can run while the script is executing, ensuring that the read, calculation, and write steps are performed as a single, atomic operation without distributed locking overhead.
HTTP Headers and Client Signaling
When a request is blocked by a rate limiter, the server must communicate this status to the client using standardized HTTP protocols:
- HTTP Status Code: The server returns
429 Too Many Requests. - Retry-After: A header indicating how many seconds the client must wait before retrying (e.g.,
Retry-After: 30).
Additionally, servers include metadata headers on successful responses to help clients manage their rate of requests:
X-RateLimit-Limit: The maximum number of allowed requests in the current window.X-RateLimit-Remaining: The number of remaining requests the client can make in the current window.X-RateLimit-Reset: The Unix epoch timestamp indicating when the current window resets.
System Performance Trade-offs
Choosing a rate limiting architecture requires evaluating trade-offs between memory overhead, CPU performance, and precision:
| Algorithm | Memory Cost | CPU Complexity | Spiky Traffic Support | Implementation Complexity |
|---|---|---|---|---|
| Token Bucket | Low (2 variables per key) | O(1) (lazy updates) |
Yes (bursts up to B) |
Low |
| Leaky Bucket | Medium (FIFO Queue space) | O(1) |
No (forces smooth flow) | Medium |
| Fixed Window | Low (1 counter per key) | O(1) |
Yes (allows double-dipping) | Low |
| Sliding Log | High (list of timestamps) | O(N) (unbounded logs) |
Yes (highly precise) | High |
| Sliding Counter | Low (2 variables per key) | O(1) |
Yes (weighted math) | Medium |
Further Reading
- Computer Networks — Classic networks textbook chapter outlining token and leaky bucket logic.
- RFC 6585: Additional HTTP Status Codes — The official IETF RFC establishing status code
429 Too Many Requests. - Redis Rate Limiter Pattern — Guide on atomic operations for counter-based rate limit patterns.
- Scaling Rate Limiting at Stripe — Engineering write-up on real-world multi-tier distributed rate limiter configurations.
Prerequisites
Code Examples
Core Literature References
Computer Networks
by Andrew S. Tanenbaum and David J. Wetherall — Chapter 5: The Network Layer - Traffic Shaping (Token and Leaky Buckets), pp. 394-400
View sourceContinue learning
ACID & Isolation Levels
Deep dive into database transaction guarantees, isolation levels, concurrency anomalies like write skew, and control mechanisms such as MVCC, 2PL, and SSI.
API Gateways
Understand the API Gateway pattern as the central ingress point for microservices, handling routing, auth, rate limiting, and protocol translation.
API Security & OAuth 2.0
Understand API authentication and authorization mechanisms, JWT security, and the OAuth 2.0 framework including Authorization Code Flow with PKCE.