Circuit Breakers
Preventing cascading failures by detecting and isolating failing downstream services before they take the whole system down.
The Cascading Failure Problem
Imagine Service A calls Service B on every incoming request. Service B starts timing out after a bad deployment. Each call from A now waits for the full 30-second timeout before giving up. With 100 concurrent requests, A has 100 threads blocked waiting on B. A's thread pool is exhausted, so A starts timing out too. Every service calling A now faces the same problem, and the failure cascades up the entire call graph.
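To put rough numbers on the scenario above, here is a back-of-the-envelope sketch. The 30-second timeout comes from the text. The pool size of 100 threads and the 50 ms healthy latency are assumptions chosen only for illustration.

```python
# Illustrative numbers only: POOL_SIZE and HEALTHY_LATENCY_S are assumptions,
# TIMEOUT_S is the 30-second timeout from the scenario above.
POOL_SIZE = 100            # worker threads in Service A (assumed)
TIMEOUT_S = 30.0           # how long a call to B blocks once B hangs
HEALTHY_LATENCY_S = 0.05   # latency of B when healthy (assumed)


def max_throughput(pool_size: int, seconds_per_request: float) -> float:
    """Upper bound on requests/second when each request holds one thread."""
    return pool_size / seconds_per_request


healthy = max_throughput(POOL_SIZE, HEALTHY_LATENCY_S)
degraded = max_throughput(POOL_SIZE, TIMEOUT_S)

print(f"healthy:  {healthy:.0f} req/s")
print(f"degraded: {degraded:.1f} req/s")
```

Under these assumptions, A's capacity drops by a factor of several hundred the moment B starts hanging, even though A itself has no bug.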
A circuit breaker is a safety switch that prevents this. When it detects B is failing, it stops sending requests to B immediately and fails fast, freeing A's threads to handle other work.
The Three States
- CLOSED: Normal operation. Every request passes through to the downstream service. The circuit breaker tracks failures in a sliding window.
- OPEN: The failure threshold was exceeded. The circuit is tripped. All requests fail immediately without touching the downstream service. This is the fast-fail behaviour that protects your thread pool.
- HALF-OPEN: After a configured wait in the OPEN state, the breaker allows a small number of probe requests through to test whether the downstream has recovered. If they succeed, it resets to CLOSED. If any of them fail, it snaps back to OPEN.
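The three states can be sketched as a small state machine. This is a minimal, single-threaded illustration rather than a production implementation: the class name, parameter names, and `CircuitOpenError` are all hypothetical, it counts consecutive failures instead of using a sliding window to keep the code short, and it does no locking.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the downstream."""


class CircuitBreaker:
    """Hypothetical sketch; not thread-safe, consecutive-failure counting."""

    def __init__(self, failure_threshold=3, open_timeout_s=10.0,
                 probe_successes=2, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_timeout_s = open_timeout_s
        self.probe_successes = probe_successes
        self.clock = clock  # injectable so tests can control time
        self.state = State.CLOSED
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.open_timeout_s:
                raise CircuitOpenError("circuit is open; failing fast")
            # Wait elapsed: move to HALF-OPEN and let this call probe.
            self.state = State.HALF_OPEN
            self.successes = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # Any failed probe re-opens; in CLOSED we wait for the threshold.
        if (self.state is State.HALF_OPEN
                or self.failures >= self.failure_threshold):
            self.state = State.OPEN
            self.opened_at = self.clock()
            self.failures = 0

    def _on_success(self):
        if self.state is State.HALF_OPEN:
            self.successes += 1
            if self.successes >= self.probe_successes:
                self.state = State.CLOSED
                self.failures = 0
        else:
            self.failures = 0  # a success breaks the failure streak
```

A caller would wrap each downstream call, for example `breaker.call(fetch_profile, user_id)`, where `fetch_profile` stands in for whatever function talks to Service B.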
Fallback Responses
A circuit breaker is most useful when paired with a fallback. When the circuit is OPEN, instead of propagating an error, return something useful:
- Serve stale cached data from a previous successful response
- Return a default/empty response (e.g. an empty product list instead of an error)
- Queue the operation for retry later via a message queue
- Return a user-friendly degraded UI response
This is the difference between "the recommendations section is empty" and "the entire page is broken".
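As one possible shape for the first two fallbacks in the list above, the sketch below serves the last successful response when the breaker rejects a call and falls back to an empty list when nothing is cached. `CircuitOpenError`, `recommendations_with_fallback`, and the in-memory `_last_good` dictionary are hypothetical names for illustration; a real service would more likely use a shared cache with expiry.

```python
from typing import Callable, Dict, List


class CircuitOpenError(Exception):
    """Hypothetical error a breaker raises when it rejects a call."""


# Last successful response per user, kept in memory for this sketch only.
_last_good: Dict[str, List[str]] = {}


def recommendations_with_fallback(
    user_id: str,
    fetch: Callable[[str], List[str]],
) -> List[str]:
    """Return fresh data if possible, else stale data, else an empty list."""
    try:
        items = fetch(user_id)
    except CircuitOpenError:
        # Circuit is OPEN: degrade instead of propagating the error.
        return _last_good.get(user_id, [])
    _last_good[user_id] = items  # remember the latest good response
    return items
```

Here `fetch` is assumed to be the breaker-wrapped call to the recommendations service. Only the rejection error is caught, so genuine bugs in `fetch` still surface.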
Configuration Tradeoffs
- Threshold too low: the circuit trips on transient errors (for example, a single slow second), cutting off a downstream service that is actually healthy
- Threshold too high: the circuit takes too long to trip, allowing a failing service to exhaust your thread pool before protection kicks in
- Open-state timeout too short: the circuit moves to HALF-OPEN before the downstream has actually recovered, the probes fail, it re-trips immediately, and you get rapid oscillation between OPEN and HALF-OPEN
- Sliding window vs consecutive count: a consecutive failure count is simpler, but a single success resets it, so a service that fails most (but not all) calls may never trip the circuit. A percentage-based sliding window (e.g. trip when 50% of requests in the last 10 seconds fail) is more robust in production
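The difference between the two counting strategies can be seen in a short sketch. Both classes are hypothetical illustrations: the 10-second window and 50% threshold mirror the example in the list above, while `min_calls` (a minimum sample size before the rate is evaluated) and the consecutive threshold of 3 are assumptions added here.

```python
from collections import deque


class SlidingWindowRate:
    """Time-based sliding window over call outcomes (illustrative sketch)."""

    def __init__(self, window_s=10.0, rate_threshold=0.5, min_calls=5):
        self.window_s = window_s
        self.rate_threshold = rate_threshold
        self.min_calls = min_calls  # assumed guard against tiny samples
        self.events = deque()       # (timestamp, failed) pairs, oldest first

    def record(self, timestamp: float, failed: bool) -> None:
        self.events.append((timestamp, failed))
        # Evict outcomes that have aged out of the window.
        while self.events and self.events[0][0] <= timestamp - self.window_s:
            self.events.popleft()

    def should_trip(self) -> bool:
        total = len(self.events)
        if total < self.min_calls:
            return False  # too little data to judge
        failures = sum(1 for _, failed in self.events if failed)
        return failures / total >= self.rate_threshold


class ConsecutiveCount:
    """Trips only after `threshold` failures in a row."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.streak = 0

    def record(self, failed: bool) -> None:
        self.streak = self.streak + 1 if failed else 0

    def should_trip(self) -> bool:
        return self.streak >= self.threshold


# Intermittent pattern: two failures, then one success, repeated.
outcomes = [True, True, False] * 4
window = SlidingWindowRate()
streak = ConsecutiveCount()
for i, failed in enumerate(outcomes):
    window.record(timestamp=i * 0.5, failed=failed)
    streak.record(failed)

print("sliding window trips:", window.should_trip())
print("consecutive count trips:", streak.should_trip())
```

With this pattern two thirds of the calls fail, yet the streak never reaches three because every third call succeeds, so only the window-based check reports that the circuit should trip.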
Further Reading
- Martin Fowler on the Circuit Breaker pattern — the canonical description of the pattern
- Resilience4j documentation — the most popular JVM circuit breaker library
- AWS re:Post: implementing circuit breakers — practical guide for cloud environments
- Release It! by Michael Nygard — the book that popularised the circuit breaker pattern