Circuit Breakers

Preventing cascading failures by detecting and isolating failing downstream services before they take the whole system down.

Advanced · Reliability · Chapter: Reliability & Scalability · 14 min read

The Cascading Failure Problem

Imagine Service A calls Service B on every incoming request. After a bad deployment, Service B starts timing out. Each call from A now waits out the full 30-second timeout before giving up. With 100 concurrent requests, A has 100 threads blocked waiting on B. A's thread pool is exhausted, so A starts timing out too. Every service that calls A now faces the same problem, and the failure cascades up the entire call graph.

A circuit breaker is a safety switch that prevents this. When it detects B is failing, it stops sending requests to B immediately and fails fast, freeing A's threads to handle other work.
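
To put rough numbers on the thread exhaustion described above, the sketch below applies Little's law (threads in use ≈ arrival rate × time each request holds a thread). The 30-second timeout comes from the example; the 50 requests per second, the 100 ms healthy latency, and the 1 ms fast-fail cost are illustrative assumptions, and `busy_threads` is a hypothetical helper name.

```python
import math


def busy_threads(requests_per_second: int, hold_time_ms: int) -> int:
    """Little's law: average threads in use = arrival rate * time held per request."""
    return math.ceil(requests_per_second * hold_time_ms / 1000)


RATE = 50  # assumed arrival rate at Service A, in requests per second

healthy = busy_threads(RATE, 100)        # B answers in ~100 ms (assumed)
timing_out = busy_threads(RATE, 30_000)  # every call waits the full 30 s timeout
fast_fail = busy_threads(RATE, 1)        # breaker open: rejected in ~1 ms (assumed)

print(healthy, timing_out, fast_fail)
```

Under these assumptions the same traffic needs a handful of threads when B is healthy, far more than a typical pool holds when every call waits for the timeout, and almost none once calls are rejected immediately.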


The Three States

Figure: Circuit breaker state machine. CLOSED to OPEN when failures exceed the threshold; OPEN to HALF-OPEN when the timeout expires; HALF-OPEN to CLOSED when a probe succeeds; HALF-OPEN to OPEN when a probe fails.

  • CLOSED: Normal operation. Every request passes through to the downstream service. The circuit breaker tracks failures in a sliding window.
  • OPEN: The failure threshold was exceeded. The circuit is tripped. All requests fail immediately without touching the downstream service. This is the fast-fail behaviour that protects your thread pool.
  • HALF-OPEN: After a configured timeout, the breaker allows a small number of probe requests through to test if the downstream has recovered. If they succeed, it resets to CLOSED. If they fail, it snaps back to OPEN.
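
The three states can be sketched as a small Python class. This is a minimal, single-threaded illustration rather than a production implementation: it counts consecutive failures instead of using a sliding window, lets one probe through at a time, and every name in it (`CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, `reset_timeout`, `clock`) is an assumption made for the example.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half-open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected without touching the downstream."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before tripping
        self.reset_timeout = reset_timeout          # seconds to stay OPEN
        self.clock = clock                          # injectable for testing
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open; failing fast")
            # Timeout expired: let this call through as a probe.
            self.state = State.HALF_OPEN
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # Any success (including a successful probe) resets to CLOSED.
        self.failures = 0
        self.state = State.CLOSED

    def _on_failure(self):
        self.failures += 1
        # A failed probe re-opens immediately; otherwise wait for the threshold.
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()
```

A caller would wrap each downstream request as `breaker.call(fetch, arg)` and treat `CircuitOpenError` as the signal to fail fast. A real implementation would also need locking so that concurrent callers cannot all become probes at once.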

Fallback Responses

A circuit breaker is most useful when paired with a fallback. When the circuit is OPEN, instead of propagating an error, return something useful:

  • Serve stale cached data from a previous successful response
  • Return a default/empty response (e.g. an empty product list instead of an error)
  • Queue the operation for retry later via a message queue
  • Return a user-friendly degraded UI response

This is the difference between "the recommendations section is empty" and "the entire page is broken".
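
As a rough sketch of the first two fallback options, the function below serves the last good response when the guarded call fails and falls back to an empty list when nothing is cached. The names `recommendations_with_fallback`, `fetch`, and `_last_good` are hypothetical; `fetch` stands in for whatever breaker-protected call your service makes, and it is assumed to raise an exception when the circuit is open or the downstream fails.

```python
from typing import Callable, Dict, List

# In-memory cache of the last successful response per user (assumed; a real
# service would more likely use a shared cache with an expiry policy).
_last_good: Dict[str, List[str]] = {}


def recommendations_with_fallback(
    user_id: str, fetch: Callable[[str], List[str]]
) -> List[str]:
    """Return fresh recommendations, or degrade gracefully if the call fails."""
    try:
        items = fetch(user_id)
    except Exception:
        # Circuit open or downstream error: serve stale data if we have it,
        # otherwise an empty list so the rest of the page still renders.
        return _last_good.get(user_id, [])
    _last_good[user_id] = items
    return items
```

Catching every exception keeps the sketch short; in practice you would likely catch only the breaker's rejection and known transport errors, so that programming bugs still surface.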


Configuration Tradeoffs

  • Threshold too low: the circuit trips on transient errors (a single slow second), causing unnecessary outages for healthy services
  • Threshold too high: the circuit takes too long to trip, allowing a failing service to exhaust your thread pool before protection kicks in
  • Timeout too short: the circuit moves to HALF-OPEN before the downstream has actually recovered, the probe fails, and the breaker re-trips, so it oscillates rapidly between OPEN and HALF-OPEN
  • Sliding window vs consecutive count: a consecutive failure count is simpler, but a single success resets it, so a service that fails most (but not all) requests may never trip the breaker. A percentage-based sliding window (e.g. 50% of requests in the last 10 seconds fail) is more robust in production
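
The difference between the two counting strategies shows up on an intermittent failure pattern. The sketch below is illustrative only: `SlidingWindowRate`, `consecutive_should_trip`, and the `min_calls` guard (which stops a single early failure from counting as a 100% failure rate) are assumptions made for the example, not part of any particular library.

```python
from collections import deque


class SlidingWindowRate:
    """Trips when the failure rate over the last `window_seconds` is too high."""

    def __init__(self, window_seconds=10.0, failure_rate_threshold=0.5, min_calls=10):
        self.window_seconds = window_seconds
        self.failure_rate_threshold = failure_rate_threshold
        self.min_calls = min_calls  # assumed guard against tiny samples
        self.events = deque()       # (timestamp, failed) pairs, oldest first

    def record(self, timestamp, failed):
        self.events.append((timestamp, failed))
        self._evict(timestamp)

    def _evict(self, now):
        # Drop outcomes that have aged out of the window.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            self.events.popleft()

    def should_trip(self, now):
        self._evict(now)
        total = len(self.events)
        if total < self.min_calls:
            return False
        failures = sum(1 for _, failed in self.events if failed)
        return failures / total >= self.failure_rate_threshold


def consecutive_should_trip(outcomes, threshold):
    """Trips only if `threshold` failures occur in a row."""
    streak = 0
    for failed in outcomes:
        streak = streak + 1 if failed else 0
        if streak >= threshold:
            return True
    return False


# Two failures then one success, repeated: two thirds of calls fail,
# but there are never three failures in a row.
pattern = [True, True, False] * 4

window = SlidingWindowRate()
for i, failed in enumerate(pattern):
    window.record(i * 0.5, failed)

print(consecutive_should_trip(pattern, threshold=3))  # streak never reaches 3
print(window.should_trip(now=5.5))                    # 8 of 12 calls failed
```

With this pattern the consecutive counter never trips, while the windowed rate does, which is the robustness argument made in the last bullet.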
