Rate Limiting

Understand API rate limiting: Token Bucket, Leaky Bucket, Sliding Window algorithms, distributed synchronization, and HTTP client signaling.

IntermediateAPI DesignChapter: API Design12 min read

Why Rate Limit?

In web infrastructure, APIs must handle fluctuating traffic volumes. An unexpected surge in requests can consume server threads, exhaust database connections, and degrade performance for all users. Rate limiting is the architectural pattern of throttling the rate of incoming requests to protect backend resources.

We rate limit APIs for three primary reasons:

  • Denial of Service (DoS) Mitigation: Prevents malicious actors or buggy clients from executing volumetric resource exhaustion attacks against endpoints.
  • API Monetization: Enforces usage boundaries for tiered pricing models, restricting basic accounts while allocating higher capacity to premium users.
  • Resource Preservation: Protects downstream dependencies, such as third-party payment gateways, from being overwhelmed by app-side spikes.

Rate limiting works by inspecting request metadata (such as IP addresses, authorization tokens, or API keys), evaluating usage history against a tracking algorithm, and either allowing the request or rejecting it.


The Token Bucket Algorithm

The Token Bucket algorithm is a widely used rate limiting strategy that permits traffic bursts up to a defined limit.

The algorithm uses a simple analogy: Imagine a self-serve bakery where customers must grab a physical ticket (token) from a dispenser to buy bread.

  • The ticket dispenser has a maximum capacity B.
  • A timer adds new tickets to the dispenser at a constant rate of r tokens per second.
  • When a customer arrives, they try to pull a ticket. If a ticket is available, they proceed with their purchase.
  • If a line forms, customers consume tickets immediately, allowing a sudden burst of sales until the dispenser is empty.
  • If no tickets are left, customers are turned away until the dispenser generates a new ticket.
text
token_count = min(capacity, token_count + rate * elapsed_time)

If a client makes a request and the current token_count is at least 1, the controller decrements the count by 1 and allows the request. If the count is 0, the request is blocked.

This algorithm allows apps to handle transient traffic spikes gracefully: if a user is idle, their bucket fills to capacity, allowing them to make a burst of requests immediately when they return.


The Leaky Bucket Algorithm

The Leaky Bucket algorithm smooths out traffic bursts, transmitting requests at a strictly constant rate.

This algorithm can be compared to a queue line at an amusement park ride:

  • Customers enter a staging queue (the bucket) of maximum capacity Q.
  • Customers arrive in random clumps and bursts, piling into the queue.
  • At the exit of the queue, a turnstile lets exactly one customer through to the ride every 5 seconds (constant leak rate L).
  • If the staging queue fills to capacity, subsequent customers are immediately rejected.

Unlike the Token Bucket, which allows bursts to exhaust tokens instantly, the Leaky Bucket forces all requests through a first-in, first-out (FIFO) queue. The queue leaks requests to the backend at a uniform rate. This makes the Leaky Bucket ideal for traffic-shaping configurations where downstream services require a highly predictable, flat request volume.


Fixed Window vs Sliding Window Algorithms

Runtimes can track request limits using window-based counting algorithms.

Fixed Window Counter

The Fixed Window Counter partition time into static windows (e.g., 1-minute blocks from 12:00 to 12:01). Every request increments a counter associated with the current window. Once the counter hits the limit, all subsequent requests are blocked until the next window boundary begins.

This algorithm has a major flaw: the double-dipping spike. If a user is allowed 10 requests per minute, they can transmit 10 requests at 12:00:59, wait one second for the window to reset at 12:01:00, and transmit 10 more requests immediately. This allows the user to execute 20 requests in a 2-second window, bypassing the intended limit.

Sliding Window Log

The Sliding Window Log resolves this by storing a sorted log of request timestamps for each user. When a request arrives, the algorithm:

  1. Deletes all timestamps older than now - window_width.
  2. Counts the remaining timestamps in the log.
  3. If the count is less than the limit, appends the current timestamp to the log and allows the request.
  • Pros: Extremely precise, eliminating the double-dipping vulnerability.
  • Cons: Storing individual timestamps for millions of active users consumes a significant amount of memory.

Sliding Window Counter

The Sliding Window Counter combines the low memory foot-print of Fixed Window with the precision of Sliding Log. It stores only the request counts of the current and preceding windows. When a request arrives, the algorithm computes a weighted sum of requests over the rolling window:

text
Requests = (Preceding_Count * (1 - ratio)) + Current_Count

Here, ratio is the progress percentage through the current window. If this calculated sum is below the limit, the current window's count is incremented.


Distributed Rate Limiting and Race Conditions

When scaling an application horizontally across multiple server nodes, rate limiting state must be stored in a centralized cache like Redis. A naive implementation can suffer from race conditions (specifically, read-modify-write conflicts).

If two server instances concurrently evaluate a user's rate limit:

  1. Both read the user's current token count (e.g., 1 token left).
  2. Both confirm the request is allowed.
  3. Both write back a decremented token count of 0.

The user has successfully executed two requests, even though only one token remained.

Atomic Redis Lua Scripting

To prevent these conflicts, systems execute rate-limiting logic inside Redis Lua scripts. Redis runs Lua scripts atomically on its main execution thread. No other commands can run while the script is executing, ensuring that the read, calculation, and write steps are performed as a single, atomic operation without distributed locking overhead.


HTTP Headers and Client Signaling

When a request is blocked by a rate limiter, the server must communicate this status to the client using standardized HTTP protocols:

  • HTTP Status Code: The server returns 429 Too Many Requests.
  • Retry-After: A header indicating how many seconds the client must wait before retrying (e.g., Retry-After: 30).

Additionally, servers include metadata headers on successful responses to help clients manage their rate of requests:

  • X-RateLimit-Limit: The maximum number of allowed requests in the current window.
  • X-RateLimit-Remaining: The number of remaining requests the client can make in the current window.
  • X-RateLimit-Reset: The Unix epoch timestamp indicating when the current window resets.

System Performance Trade-offs

Choosing a rate limiting architecture requires evaluating trade-offs between memory overhead, CPU performance, and precision:

Algorithm Memory Cost CPU Complexity Spiky Traffic Support Implementation Complexity
Token Bucket Low (2 variables per key) O(1) (lazy updates) Yes (bursts up to B) Low
Leaky Bucket Medium (FIFO Queue space) O(1) No (forces smooth flow) Medium
Fixed Window Low (1 counter per key) O(1) Yes (allows double-dipping) Low
Sliding Log High (list of timestamps) O(N) (unbounded logs) Yes (highly precise) High
Sliding Counter Low (2 variables per key) O(1) Yes (weighted math) Medium
Comparison of Key Rate Limiting Algorithms Token Bucket Tokens fill at rate 'r' Max capacity limit 'B' Requests consume tokens Spiky Bursts: Allowed Idle Accumulation: Yes Memory: Low CPU: O(1) lazy math Leaky Bucket Requests join queue 'Q' Steady leak rate 'L' Overflow: Discarded Spiky Bursts: Throttled Idle Accumulation: No Memory: Medium CPU: O(1) queue shift Sliding Counter Current window count Preceding window count Weighted rolling sum Spiky Bursts: Controlled Idle Accumulation: No Memory: Low CPU: O(1) math only Token Bucket allows instant bursts; Leaky Bucket flattens traffic flow; Sliding Counter balances accuracy and memory cost.

Further Reading

Prerequisites

Code Examples

Core Literature References

Computer Networks

by Andrew S. Tanenbaum and David J. Wetherall — Chapter 5: The Network Layer - Traffic Shaping (Token and Leaky Buckets), pp. 394-400

View source