Failure Scenarios in Production Systems
4 ways production systems fail at scale
Some of the nastiest outages aren’t total blackouts like a us-east-1 meltdown; they’re weird, emergent behaviors under load. Here are four you should know cold (plus how to tame them).
Thundering herd (when things go down)
- What happens: A service hiccups or dies. All traffic shifts to what’s left. The “survivors” get slammed and often fail too.
How to fix:
- Stagger risky events: slow rolling deploys, offset health checks, add TTL jitter per shard, support multi-leader failover.
- Coalesce work: let one worker rebuild a missing result while others wait for it.
- Serve stale briefly: return slightly old data (stale-while-revalidate / soft TTL) instead of hammering the source. Both coalescing and stale serving are sketched below.
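Here’s a minimal sketch in Go of the coalescing and serve-stale ideas, using the golang.org/x/sync/singleflight package. The Cache type, the loader function, and the TTL values are hypothetical, and a production version would need locking around the map; the point is that only one caller per key rebuilds a missing result, while everyone else shares it or briefly gets a slightly stale value.

```go
package cache

import (
	"context"
	"time"

	"golang.org/x/sync/singleflight"
)

type entry struct {
	value     string
	softTTLAt time.Time // after this, the value is stale but still servable
	hardTTLAt time.Time // after this, the value must not be served
}

// Cache is a hypothetical in-memory cache; a real one needs a mutex around data.
type Cache struct {
	group singleflight.Group
	data  map[string]entry
	load  func(ctx context.Context, key string) (string, error)
}

func NewCache(load func(ctx context.Context, key string) (string, error)) *Cache {
	return &Cache{data: make(map[string]entry), load: load}
}

// Get coalesces concurrent misses: exactly one caller per key runs the loader,
// everyone else waits for (and shares) that result. Past the soft TTL but
// before the hard TTL, callers get the stale value immediately while a single
// background refresh runs, instead of all of them hammering the source.
func (c *Cache) Get(ctx context.Context, key string) (string, error) {
	if e, ok := c.data[key]; ok {
		now := time.Now()
		if now.Before(e.softTTLAt) {
			return e.value, nil // fresh
		}
		if now.Before(e.hardTTLAt) {
			go c.group.Do(key, func() (any, error) { // refresh once, off the hot path
				return c.refresh(context.Background(), key)
			})
			return e.value, nil // serve stale briefly
		}
	}
	// Hard miss: coalesce all concurrent callers onto one load.
	v, err, _ := c.group.Do(key, func() (any, error) { return c.refresh(ctx, key) })
	if err != nil {
		return "", err
	}
	return v.(string), nil
}

func (c *Cache) refresh(ctx context.Context, key string) (string, error) {
	v, err := c.load(ctx, key)
	if err != nil {
		return "", err
	}
	c.data[key] = entry{
		value:     v,
		softTTLAt: time.Now().Add(30 * time.Second), // illustrative TTLs
		hardTTLAt: time.Now().Add(5 * time.Minute),
	}
	return v, nil
}
```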
Brownouts
- What happens: The system is up but slow and flaky. Error rate and p95–p99 latency jump.
How to fix:
- Degrade gracefully: turn off nice-to-have features first.
- Shed load by priority: fail fast on low-value requests.
- Limit concurrency per downstream; apply backpressure instead of growing queues (see the sketch after this list).
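A rough Go sketch of both ideas together, built on golang.org/x/sync/semaphore: a fixed number of in-flight calls per downstream, low-priority requests shed immediately when the limit is hit, and high-priority requests allowed only a short, bounded wait. The Limiter name, the 50 ms wait, and the two-level priority split are illustrative choices, not a standard API.

```go
package downstream

import (
	"context"
	"errors"
	"time"

	"golang.org/x/sync/semaphore"
)

// ErrShed is returned when a request is dropped instead of being queued.
var ErrShed = errors.New("load shed: downstream at capacity")

// Limiter caps in-flight calls to one downstream dependency.
type Limiter struct {
	sem *semaphore.Weighted
}

func NewLimiter(maxInFlight int64) *Limiter {
	return &Limiter{sem: semaphore.NewWeighted(maxInFlight)}
}

// Call applies backpressure instead of letting a queue grow without bound:
// low-priority requests fail fast if no slot is free, and high-priority
// requests may wait briefly for a slot, but never past their own deadline.
func (l *Limiter) Call(ctx context.Context, highPriority bool, fn func(context.Context) error) error {
	if highPriority {
		waitCtx, cancel := context.WithTimeout(ctx, 50*time.Millisecond) // bounded wait
		defer cancel()
		if err := l.sem.Acquire(waitCtx, 1); err != nil {
			return ErrShed
		}
	} else if !l.sem.TryAcquire(1) {
		return ErrShed // shed the nice-to-have work first
	}
	defer l.sem.Release(1)
	return fn(ctx)
}
```

The key property is that the wait is bounded: when the downstream slows down, excess work is rejected quickly at the edge rather than piling up in memory as an ever-growing queue.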
Flapping
- What happens: A service flips between healthy and unhealthy. Autoscaling and caches never settle.
How to fix:
- Use different “fail” vs “recover” thresholds (hysteresis; see the sketch after this list).
- Back off restarts; use circuit breakers with cooldowns.
- Damp health signals in the load balancer; drain connections gracefully.
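A minimal hysteresis sketch in Go: the instance only flips to unhealthy after several consecutive failed probes, and only flips back after a longer run of good ones. The threshold values are arbitrary examples; the point is that the bar to recover is higher than the bar to fail, so a single borderline probe can’t make the instance flap.

```go
package health

// Checker damps a raw probe signal with hysteresis: different thresholds for
// marking an instance down versus marking it healthy again.
type Checker struct {
	failThreshold    int // consecutive failures before marking unhealthy
	recoverThreshold int // consecutive successes before marking healthy again

	healthy   bool
	failures  int
	successes int
}

// NewChecker uses illustrative thresholds: 3 bad probes to go down,
// 10 good probes in a row to come back.
func NewChecker() *Checker {
	return &Checker{failThreshold: 3, recoverThreshold: 10, healthy: true}
}

// Observe records one probe result and returns the damped health state.
func (c *Checker) Observe(probeOK bool) bool {
	if probeOK {
		c.successes++
		c.failures = 0
		if !c.healthy && c.successes >= c.recoverThreshold {
			c.healthy = true
		}
	} else {
		c.failures++
		c.successes = 0
		if c.healthy && c.failures >= c.failThreshold {
			c.healthy = false
		}
	}
	return c.healthy
}
```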
Retry storms
- What happens: Clients retry at the same time and DDoS your own backend. A tiny blip turns into a wave.
How to fix:
- Cap retries with exponential backoff + full jitter; send your clients a “Retry-After” header so they know when to call back (see the sketch after this list).
- Use per-request budgets: retries must fit within the deadline; trip circuit breakers if not.
- Use idempotency keys and request coalescing to avoid duplicate work.
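Tying those bullets together, here’s a hedged Go sketch of a client-side retry helper: attempts are capped, backoff is exponential with full jitter, a Retry-After header (seconds form only) overrides the computed sleep, and the helper gives up rather than sleeping past the caller’s deadline. The function name and all the constants are illustrative defaults, not a standard library API.

```go
package client

import (
	"context"
	"errors"
	"math/rand"
	"net/http"
	"time"
)

// doWithRetries retries a request a bounded number of times with exponential
// backoff and full jitter, and only while the caller's deadline has room.
// newReq must build a fresh request per attempt (bodies can't be reused).
func doWithRetries(ctx context.Context, client *http.Client, newReq func() (*http.Request, error)) (*http.Response, error) {
	const (
		maxAttempts = 4 // illustrative defaults
		baseDelay   = 100 * time.Millisecond
		maxDelay    = 5 * time.Second
	)
	var lastErr error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		req, err := newReq()
		if err != nil {
			return nil, err
		}
		resp, err := client.Do(req.WithContext(ctx))
		if err == nil && resp.StatusCode < 500 && resp.StatusCode != http.StatusTooManyRequests {
			return resp, nil // success, or a client error that retrying won't fix
		}

		// Full jitter: sleep a random duration in [0, min(cap, base*2^attempt)).
		backoff := baseDelay << attempt
		if backoff > maxDelay {
			backoff = maxDelay
		}
		sleep := time.Duration(rand.Int63n(int64(backoff)))

		if err != nil {
			lastErr = err
		} else {
			lastErr = errors.New("retryable status: " + resp.Status)
			// Honor Retry-After when the server sends one (seconds form only here).
			if ra := resp.Header.Get("Retry-After"); ra != "" {
				if d, perr := time.ParseDuration(ra + "s"); perr == nil {
					sleep = d
				}
			}
			resp.Body.Close() // not using this response; free the connection
		}

		// Retry budget: if sleeping would blow the caller's deadline, stop now.
		if deadline, ok := ctx.Deadline(); ok && time.Now().Add(sleep).After(deadline) {
			return nil, lastErr
		}
		select {
		case <-time.After(sleep):
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
	return nil, lastErr
}
```

Idempotency keys and request coalescing aren’t shown here; the sketch assumes whatever newReq builds is safe to send more than once.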