Failure Scenarios in Production Systems

4 ways production systems fail at scale

Some of the nastiest outages aren’t total blackouts like a us-east-1 outage; they’re weird, emergent behaviors that only show up under load. Here are four you should know cold (plus how to tame them).

  1. Thundering herd (when things go down)

    • What happens: A service hiccups or dies. All traffic shifts to what’s left. The “survivors” get slammed and often fail too.

    How to fix:

    • Stagger risky events: slow down rolling deploys, offset health checks, add per-shard TTL jitter, and support multi-leader failover.
    • Coalesce work: let one worker rebuild a missing result while others wait for it.
    • Serve stale briefly: return slightly old data (stale-while-revalidate / soft TTL) instead of hammering the source. Both ideas appear in the sketch after this list.
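
To make coalescing and serve-stale concrete, here is a minimal in-process sketch in Python. The class name, the soft-TTL default, and the jitter fraction are all illustrative rather than taken from any particular library; in practice this logic usually lives in your cache layer or CDN.

```python
import random
import threading
import time
from typing import Any, Callable, Dict, Tuple


class SingleFlightCache:
    """One caller rebuilds an expired entry; everyone else either waits for
    that result or is served slightly stale data instead of hitting the source."""

    def __init__(self, soft_ttl: float = 30.0, jitter: float = 0.2):
        self._lock = threading.Lock()
        self._data: Dict[str, Tuple[Any, float]] = {}    # key -> (value, expires_at)
        self._inflight: Dict[str, threading.Event] = {}  # key -> rebuild in progress
        self._soft_ttl = soft_ttl
        self._jitter = jitter  # fraction of TTL to randomize, so entries don't expire in lockstep

    def get(self, key: str, rebuild: Callable[[], Any]) -> Any:
        with self._lock:
            value, expires_at = self._data.get(key, (None, 0.0))
            if time.time() < expires_at:
                return value                      # fast path: still within soft TTL
            event = self._inflight.get(key)
            leader = event is None
            if leader:
                event = threading.Event()
                self._inflight[key] = event       # claim the single flight

        if not leader:
            if value is not None:
                return value                      # serve stale instead of piling on
            event.wait()                          # nothing cached yet: wait for the leader
            with self._lock:
                return self._data.get(key, (None, 0.0))[0]

        try:
            new_value = rebuild()                 # only one caller hits the backend
            ttl = self._soft_ttl * (1 + random.uniform(-self._jitter, self._jitter))
            with self._lock:
                self._data[key] = (new_value, time.time() + ttl)
            return new_value
        finally:
            with self._lock:
                self._inflight.pop(key, None)
            event.set()                           # release any waiters


cache = SingleFlightCache()
print(cache.get("user:42", lambda: "expensive lookup result"))
```

The jittered TTL also covers the "stagger risky events" point: entries written at the same moment won't all expire at the same moment.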
  2. Brownouts

    • What happens: The system is up but slow and flaky. Error rate and p95–p99 latency jump.

    How to fix:

    • Degrade gracefully: turn off nice-to-have features first.
    • Shed load by priority: fail fast on low-value requests.
    • Limit concurrency per downstream; apply backpressure instead of letting queues grow unbounded (see the sketch after this list).
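
A minimal sketch of priority-based shedding plus a per-downstream concurrency cap, assuming a simple in-process semaphore. The class names, the limit of 50, and the recommendation-service call are placeholders; a real system would more likely enforce this in a proxy or service mesh.

```python
import threading
from contextlib import contextmanager


class OverloadedError(Exception):
    """Raised instead of queueing; callers should degrade or return 503."""


class DownstreamLimiter:
    """Caps in-flight calls to one downstream and sheds low-priority work
    instead of letting a queue grow without bound."""

    def __init__(self, max_concurrent: int = 50):
        self._sem = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def slot(self, priority: str, timeout: float = 0.05):
        # Low-value requests don't get to wait: fail fast so the caller can degrade.
        wait = 0.0 if priority == "low" else timeout
        if not self._sem.acquire(timeout=wait):
            raise OverloadedError(f"shedding {priority}-priority request")
        try:
            yield
        finally:
            self._sem.release()


def call_recommendation_service(user_id: str):
    return [f"rec-for-{user_id}"]   # stand-in for the real downstream call


limiter = DownstreamLimiter(max_concurrent=50)


def recommendations_for(user_id: str):
    # Nice-to-have feature: first thing to turn off under pressure.
    try:
        with limiter.slot(priority="low"):
            return call_recommendation_service(user_id)
    except OverloadedError:
        return []   # graceful degradation: empty widget, not an error page


print(recommendations_for("42"))
```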
  3. Flapping

    • What happens: A service flips between healthy and unhealthy. Autoscaling and caches never settle.

    How to fix:

    • Use different “fail” vs “recover” thresholds (hysteresis; see the sketch after this list).
    • Back off restarts; use circuit breakers with cooldowns.
    • Damp health signals in the load balancer; drain connections gracefully.
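
Here is a minimal sketch of the first point, hysteresis in a health check: it takes fewer consecutive bad probes to mark an instance down than good probes to mark it up again, so one noisy probe can't flip the state back and forth. The thresholds (3 and 5) are illustrative.

```python
class HysteresisHealthCheck:
    """Marks an instance unhealthy after a few consecutive failures, but
    requires a longer streak of successes before trusting it again."""

    def __init__(self, fail_threshold: int = 3, recover_threshold: int = 5):
        self.fail_threshold = fail_threshold
        self.recover_threshold = recover_threshold
        self.healthy = True
        self._fails = 0
        self._successes = 0

    def record(self, probe_ok: bool) -> bool:
        if probe_ok:
            self._fails = 0
            self._successes += 1
            if not self.healthy and self._successes >= self.recover_threshold:
                self.healthy = True      # recover only after a sustained streak
        else:
            self._successes = 0
            self._fails += 1
            if self.healthy and self._fails >= self.fail_threshold:
                self.healthy = False     # fail faster than we recover
        return self.healthy


check = HysteresisHealthCheck()
for ok in [False, False, False, True, True, True, True, True]:
    print(check.record(ok))   # flips to False on the 3rd failure, back to True on the 5th success
```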
  4. Retry storms

    • What happens: Clients retry at the same time and DDoS your own backend. A tiny blip turns into a wave.

    How to fix:

    • Cap retries with exponential backoff + full jitter; send clients a Retry-After header so they know when to call back.
    • Use per-request budgets: retries must fit within the remaining deadline; trip circuit breakers if not (see the sketch after this list).
    • Use idempotency keys and request coalescing to avoid duplicate work.
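
A sketch tying the first two points together: capped retries with full jitter that also respect the request's overall deadline. The parameter defaults and the TransientError type are placeholders for whatever your client library actually raises.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Stand-in for a retryable failure (e.g. a 503, ideally carrying Retry-After)."""


def call_with_retries(
    call: Callable[[], T],
    deadline: float,             # absolute time.monotonic() value for the whole request
    max_attempts: int = 4,
    base_delay: float = 0.1,     # seconds
    max_delay: float = 2.0,
) -> T:
    """Exponential backoff with full jitter, bounded by the caller's deadline:
    a retry that can't finish in time only adds load, so fail fast instead."""
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts - 1:
                raise                                  # retries capped
            # Full jitter: sleep a random amount in [0, min(cap, base * 2^attempt)].
            delay = random.uniform(0, min(max_delay, base_delay * (2 ** attempt)))
            if time.monotonic() + delay >= deadline:
                raise                                  # would blow the budget: give up now
            time.sleep(delay)
    raise RuntimeError("unreachable: max_attempts must be >= 1")


# Usage: the whole request gets one budget that all retries must fit inside.
deadline = time.monotonic() + 1.0
print(call_with_retries(lambda: "backend response", deadline))
```

Idempotency keys and request coalescing (the third point) belong on the server side of this exchange and aren't shown here.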