Circuit Breakers

Preventing cascading failures by detecting and isolating faulting downstream services before they take the whole system down.

Advanced | Reliability | Chapter: Reliability & Scalability | 14 min read

The Cascading Failure Problem

Imagine Service A calls Service B on every incoming request. Service B starts timing out after a deployment issue. Each call from A waits for the full 30-second timeout before giving up. With 100 concurrent users, A now has 100 threads all blocked waiting on B. A's thread pool exhausts. A starts timing out too. Every service calling A now faces the same problem. The failure cascades up the entire call graph.

A circuit breaker is a safety switch that prevents this. When it detects B is failing, it stops sending requests to B immediately and fails fast, freeing A's threads to handle other work.
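To get a feel for how quickly the thread pool in the scenario above runs out, here is a rough back-of-the-envelope sketch. The function name and the request rates are illustrative assumptions, not figures from a real system, and the model assumes every call to B hangs for the full timeout.

```python
def seconds_until_exhaustion(pool_size, requests_per_second, timeout_seconds):
    """Rough time until every thread is blocked, or None if the pool keeps up.

    Assumes the pool starts idle and every downstream call hangs for the
    full timeout (a simplified, hypothetical model).
    """
    # Each hung call holds a thread for timeout_seconds, so steady-state
    # demand is requests_per_second * timeout_seconds threads (Little's law).
    if requests_per_second * timeout_seconds <= pool_size:
        return None
    # Threads are claimed at the arrival rate and none are released in time.
    return pool_size / requests_per_second


if __name__ == "__main__":
    # 100 threads, 30-second timeout, an assumed 20 requests per second.
    print(seconds_until_exhaustion(100, 20, 30))
```

Under these assumptions the pool is gone within seconds, long before the first timeout fires, which is why failing fast matters more than shortening the timeout slightly.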

The Three States

CLOSED: Normal operation. Every request passes through to the downstream service. The circuit breaker tracks failures in a sliding window.
OPEN: The failure threshold was exceeded. The circuit is tripped. All requests fail immediately without touching the downstream service. This is the fast-fail behaviour that protects your thread pool.
HALF-OPEN: After a configured timeout, the breaker allows a small number of probe requests through to test if the downstream has recovered. If they succeed, it resets to CLOSED. If they fail, it snaps back to OPEN.
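The three states can be sketched as a small state machine. This is a minimal illustration rather than a production implementation: the class and parameter names are assumptions, it trips on a consecutive failure count instead of a sliding window to stay short, it lets a single probe through in HALF-OPEN, and it is not thread-safe.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the downstream."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before tripping
        self.reset_timeout = reset_timeout          # seconds to stay OPEN before probing
        self.clock = clock                          # injectable so tests can control time
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast: the downstream is never touched.
                raise CircuitOpenError("circuit is open; failing fast")
            # Timeout elapsed: let this call through as a probe.
            self.state = State.HALF_OPEN
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # Any success (including a successful probe) resets to CLOSED.
        self.failures = 0
        self.state = State.CLOSED

    def _on_failure(self):
        self.failures += 1
        # A failed probe re-trips at once; otherwise trip at the threshold.
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()
```

A caller would wrap each downstream request, for example `breaker.call(fetch_profile, user_id)` with a hypothetical `fetch_profile` function, and treat `CircuitOpenError` as the signal to degrade gracefully.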

Fallback Responses

A circuit breaker is most useful when paired with a fallback. When the circuit is OPEN, instead of propagating an error, return something useful:

Serve stale cached data from a previous successful response
Return a default/empty response (e.g. an empty product list instead of an error)
Queue the operation for retry later via a message queue
Return a user-friendly degraded UI response

This is the difference between "the recommendations section is empty" and "the entire page is broken".
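One way to combine the first two fallbacks, stale cached data and then a default, is sketched below. The function name, the in-memory cache, and the `fetch` callable are hypothetical; a real service would more likely use a shared cache with expiry, and might catch only the breaker's rejection error rather than every exception.

```python
_last_good = {}  # key -> most recent successful response (stale-cache store)


def fetch_with_fallback(key, fetch, default):
    """Try the protected call; on failure serve stale data, then a default.

    `fetch` stands in for a call routed through a circuit breaker, so it
    may raise either a downstream error or the breaker's fast-fail error.
    """
    try:
        value = fetch(key)
    except Exception:
        # Degrade instead of propagating: stale data if we have it,
        # otherwise the caller-supplied default (e.g. an empty list).
        return _last_good.get(key, default)
    _last_good[key] = value  # remember the latest good response
    return value
```

With this shape, a recommendations widget could call `fetch_with_fallback(user_id, get_recommendations, [])` and render an empty section while the circuit is OPEN, where `get_recommendations` is an assumed name for the protected call.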

Configuration Tradeoffs

Threshold too low: the circuit trips on transient errors (a single slow second), causing unnecessary outages for healthy services
Threshold too high: the circuit takes too long to trip, allowing a failing service to exhaust your thread pool before protection kicks in
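Timeout too short: the circuit moves to HALF-OPEN before the downstream has actually recovered, the probe requests fail, and it re-trips immediately, so the breaker oscillates rapidly between OPEN and HALF-OPEN
Sliding window vs consecutive count: a consecutive failure count is simpler but handles intermittent errors poorly, because a single success resets the count and a service failing most of its requests may never trip the breaker. A percentage-based sliding window (e.g. trip when at least 50% of requests in the last 10 seconds failed) is more robust in production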
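The percentage-based sliding window can be sketched as follows. The class name and parameters are illustrative assumptions; in particular `min_calls` is an extra guard that the text above does not mention, added here so that one failure out of one request does not count as a 100% failure rate.

```python
import time
from collections import deque


class SlidingWindowRate:
    def __init__(self, window_seconds=10.0, failure_rate=0.5, min_calls=10,
                 clock=time.monotonic):
        self.window_seconds = window_seconds  # how far back to look
        self.failure_rate = failure_rate      # fraction of failures that trips
        self.min_calls = min_calls            # assumed guard against tiny samples
        self.clock = clock                    # injectable so tests can control time
        self.events = deque()                 # (timestamp, succeeded) in arrival order

    def record(self, succeeded):
        now = self.clock()
        self.events.append((now, succeeded))
        self._evict(now)

    def _evict(self, now):
        # Drop outcomes older than the window from the left of the deque.
        while self.events and now - self.events[0][0] > self.window_seconds:
            self.events.popleft()

    def should_trip(self):
        self._evict(self.clock())
        total = len(self.events)
        if total < self.min_calls:
            return False  # not enough traffic to judge
        failures = sum(1 for _, ok in self.events if not ok)
        return failures / total >= self.failure_rate
```

A breaker would call `record(...)` after every request while CLOSED and move to OPEN once `should_trip()` returns true. Unlike a consecutive count, an occasional success inside the window lowers the rate without erasing the history.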

Prerequisites

HTTP Idempotency
