Availability: Designing for Uptime

Architectural tactics and patterns for building highly available systems that minimize downtime and ensure business continuity.

TL;DR

Availability measures the proportion of time a system is operational and accessible. Design for availability by eliminating single points of failure, implementing redundancy, detecting failures quickly, and recovering automatically. Target availability should be driven by business impact, not technical pride.

Key Takeaways

  • Availability = MTBF / (MTBF + MTTR): Maximize uptime, minimize recovery time
  • Redundancy is the foundation: No single point of failure should bring down the system
  • Detection matters: You can't fix what you can't see—monitoring is critical
  • Recovery automation: Manual recovery extends downtime; automate where possible
  • Cost increases exponentially: Each additional nine of availability costs significantly more

Why This Matters

Downtime has real business consequences: lost revenue, damaged reputation, regulatory penalties, and customer churn. A widely cited industry estimate puts the average cost of IT downtime at about $5,600 per minute, though the real figure varies considerably by business. However, designing for 99.999% availability when 99.9% is sufficient wastes resources. Understanding availability requirements and the tactics to achieve them enables informed trade-off decisions.

The Nines

Availability is often expressed in "nines": 99% = "two nines" (87.6 hours downtime/year), 99.99% = "four nines" (52.6 minutes downtime/year). Each additional nine is roughly 10x harder and more expensive to achieve.


Availability Fundamentals

Calculating Availability

AVAILABILITY FORMULA

                   MTBF
Availability = ───────────
               MTBF + MTTR

MTBF = Mean Time Between Failures
MTTR = Mean Time To Repair/Recover

IMPROVING AVAILABILITY
├── Increase MTBF: Prevent failures (harder)
└── Decrease MTTR: Recover faster (often easier)
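
To make the formula concrete, here is a minimal Python sketch. The function name `availability` and the MTBF/MTTR figures are illustrative assumptions, not measurements from any real system.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical service: one failure per 1,000 hours, one hour to recover.
baseline = availability(mtbf_hours=1000, mttr_hours=1.0)
longer_mtbf = availability(mtbf_hours=2000, mttr_hours=1.0)      # fail half as often
faster_recovery = availability(mtbf_hours=1000, mttr_hours=0.5)  # recover twice as fast

print(f"baseline:        {baseline:.4%}")
print(f"double MTBF:     {longer_mtbf:.4%}")
print(f"halve MTTR:      {faster_recovery:.4%}")
```

Doubling MTBF and halving MTTR produce the same availability under this formula, so the choice between them comes down to which is cheaper to achieve.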

The Nines Table

Availability        Annual Downtime   Monthly Downtime   Weekly Downtime
99% (two 9s)        3.65 days         7.31 hours         1.68 hours
99.9% (three 9s)    8.77 hours        43.83 minutes      10.08 minutes
99.95%              4.38 hours        21.92 minutes      5.04 minutes
99.99% (four 9s)    52.6 minutes      4.38 minutes       1.01 minutes
99.999% (five 9s)   5.26 minutes      26.3 seconds       6.05 seconds
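
The downtime figures follow directly from the availability percentage. The sketch below shows the arithmetic, assuming a 365.25-day year and a month of one twelfth of a year (which is what the table's figures correspond to). The names `allowed_downtime` and `PERIOD_HOURS` are illustrative.

```python
from datetime import timedelta

HOURS_PER_YEAR = 365.25 * 24  # assumption: average year including leap days

# Period lengths in hours; a "month" is taken as one twelfth of a year.
PERIOD_HOURS = {
    "year": HOURS_PER_YEAR,
    "month": HOURS_PER_YEAR / 12,
    "week": 7 * 24,
}


def allowed_downtime(availability_pct: float, period_hours: float) -> timedelta:
    """Downtime permitted in a period at the given availability percentage."""
    unavailable_fraction = 1 - availability_pct / 100
    return timedelta(hours=period_hours * unavailable_fraction)


for pct in (99.0, 99.9, 99.95, 99.99, 99.999):
    row = ", ".join(
        f"{name}: {allowed_downtime(pct, hours)}"
        for name, hours in PERIOD_HOURS.items()
    )
    print(f"{pct}% -> {row}")
```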

Series vs Parallel Availability

SERIES (All must work)
A ─── B ─── C

Availability = A × B × C
Example: 0.99 × 0.99 × 0.99 = 0.97 (97%)

PARALLEL (Any one works)
    ┌── A ──┐
────┤       ├────
    └── A' ─┘

Availability = 1 - (1-A) × (1-A')
Example: 1 - (0.01 × 0.01) = 0.9999 (99.99%)

PRACTICAL IMPLICATION
Adding components in series decreases availability
Adding redundancy in parallel increases availability
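
The two formulas compose, so a whole architecture can be estimated from its parts. This is a minimal sketch with hypothetical helper names (`series`, `parallel`). It assumes component failures are independent, which real systems often violate (shared power, shared deploys, shared dependencies), so treat the results as upper bounds.

```python
from math import prod


def series(*availabilities: float) -> float:
    """All components must work: multiply the availabilities."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """Any one component suffices: 1 minus the product of the failure probabilities."""
    return 1 - prod(1 - a for a in availabilities)


three_in_series = series(0.99, 0.99, 0.99)
redundant_pair = parallel(0.99, 0.99)

# Hypothetical composite: a redundant app tier in front of a single 99.9% database.
app_then_db = series(parallel(0.99, 0.99), 0.999)

print(f"series of three: {three_in_series:.4f}")
print(f"redundant pair:  {redundant_pair:.4f}")
print(f"app tier + db:   {app_then_db:.4f}")
```

In the composite case the single database dominates the result: redundancy in one tier cannot compensate for a non-redundant component in series with it.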

Availability Tactics

Goal

Reduce the frequency of failures before they occur.

Tactics

Tactic                 Description                      Implementation
Capacity Planning      Ensure sufficient resources      Load testing, auto-scaling
Exception Prevention   Eliminate common error sources   Input validation, defensive coding
Resource Pooling       Prevent resource exhaustion      Connection pools, thread pools
Health Checks          Remove unhealthy instances       Liveness/readiness probes

Example: Resource Exhaustion Prevention

CONNECTION POOL CONFIGURATION

Without Pooling:
Request → New Connection → Database → Close Connection
Problem: Connection creation is expensive, can exhaust DB connections

With Pooling:
┌─────────────────────────────────┐
│        Connection Pool          │
│  ┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐ │
│  │ C │ │ C │ │ C │ │ C │ │ C │ │
│  └───┘ └───┘ └───┘ └───┘ └───┘ │
└─────────────────────────────────┘
         │
Request → Borrow → Use → Return
Benefits: Reuse connections, limit concurrent access
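
The borrow/use/return cycle can be sketched with a bounded queue. This is an illustrative toy, not a production pool: `ConnectionPool`, `create_connection`, and `borrow_timeout_s` are hypothetical names, and real pools also validate, recycle, and replace broken connections.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Fixed-size pool: connections are created once and reused."""

    def __init__(self, create_connection, size: int, borrow_timeout_s: float = 2.0):
        self._idle = queue.Queue(maxsize=size)
        self._borrow_timeout_s = borrow_timeout_s
        for _ in range(size):
            self._idle.put(create_connection())

    @contextmanager
    def connection(self):
        """Borrow a connection, yield it, and always return it to the pool."""
        try:
            conn = self._idle.get(timeout=self._borrow_timeout_s)
        except queue.Empty:
            # Fail fast instead of opening yet another database connection.
            raise TimeoutError("connection pool exhausted") from None
        try:
            yield conn
        finally:
            self._idle.put(conn)


# Stand-in factory: plain objects play the role of database connections.
pool = ConnectionPool(create_connection=object, size=2, borrow_timeout_s=0.01)
with pool.connection() as conn:
    print("borrowed", conn)
```

The bounded queue is what limits concurrent access: once every connection is borrowed, further callers wait up to the timeout and then fail, rather than overloading the database.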

Pool Sizing

Set pool size based on: (connections needed per request) × (peak concurrent requests) × (safety factor). Monitor for pool exhaustion.
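
A worked version of the sizing rule, as a sketch. The traffic figures, the `fits_database_limit` check, and the database connection limit are assumptions added for illustration. The second function reflects the earlier point that unpooled connections can exhaust the database.

```python
import math


def pool_size(connections_per_request: float,
              peak_concurrent_requests: int,
              safety_factor: float) -> int:
    """Pool size = connections per request x peak concurrent requests x safety factor."""
    return math.ceil(connections_per_request * peak_concurrent_requests * safety_factor)


def fits_database_limit(size_per_instance: int,
                        app_instances: int,
                        db_max_connections: int) -> bool:
    """Check that all app instances together stay within the database's limit."""
    return size_per_instance * app_instances <= db_max_connections


# Hypothetical workload: 1 connection per request, 40 concurrent requests at peak.
size = pool_size(connections_per_request=1, peak_concurrent_requests=40, safety_factor=1.5)
print("pool size per instance:", size)
print("3 instances fit a 200-connection database:", fits_database_limit(size, 3, 200))
print("4 instances fit a 200-connection database:", fits_database_limit(size, 4, 200))
```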


Availability by Architecture

Cloud-Native Availability

MULTI-AZ DEPLOYMENT

Region: us-east-1
├── AZ-a
│   ├── App Instance 1
│   └── DB Primary
├── AZ-b
│   ├── App Instance 2
│   └── DB Replica
└── AZ-c
    ├── App Instance 3
    └── DB Replica

Single AZ failure: remaining AZs keep serving traffic (if AZ-a is lost, a DB replica must be promoted to primary)
Typical availability: 99.99%
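
The application-tier behavior during a zone outage can be sketched as a load balancer that routes only to healthy instances. `Instance` and `ZoneAwareBalancer` are hypothetical names. In practice health would come from probes rather than a manual `mark_zone` call, and database failover is outside the scope of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    zone: str
    healthy: bool = True


class ZoneAwareBalancer:
    """Round-robin over healthy instances only."""

    def __init__(self, instances):
        self._instances = list(instances)
        self._counter = 0

    def mark_zone(self, zone: str, healthy: bool) -> None:
        """Flip the health of every instance in a zone (stands in for health probes)."""
        for instance in self._instances:
            if instance.zone == zone:
                instance.healthy = healthy

    def pick(self) -> Instance:
        candidates = [i for i in self._instances if i.healthy]
        if not candidates:
            raise RuntimeError("no healthy instances in any zone")
        chosen = candidates[self._counter % len(candidates)]
        self._counter += 1
        return chosen


balancer = ZoneAwareBalancer([
    Instance("App Instance 1", "AZ-a"),
    Instance("App Instance 2", "AZ-b"),
    Instance("App Instance 3", "AZ-c"),
])
balancer.mark_zone("AZ-a", healthy=False)  # simulate losing one zone
served_by = [balancer.pick().zone for _ in range(4)]
print(served_by)
```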

Multi-Region Availability

MULTI-REGION ACTIVE-ACTIVE

┌──────────────────────────────────────────────────┐
│               Global Load Balancer               │
│               (Route 53, Cloudflare)             │
└─────────────────────┬────────────────────────────┘
                      │
        ┌─────────────┴─────────────┐
        ▼                           ▼
┌───────────────┐           ┌───────────────┐
│   US-EAST-1   │           │   EU-WEST-1   │
│  ┌─────────┐  │           │  ┌─────────┐  │
│  │   App   │  │           │  │   App   │  │
│  └────┬────┘  │           │  └────┬────┘  │
│       │       │           │       │       │
│  ┌────▼────┐  │  sync     │  ┌────▼────┐  │
│  │   DB    │◄─┼───────────┼─►│   DB    │  │
│  └─────────┘  │           │  └─────────┘  │
└───────────────┘           └───────────────┘

Complexity: High (data consistency challenges)
Typical availability: 99.999%

Measuring Availability

SLIs, SLOs, and SLAs

SERVICE LEVEL HIERARCHY

SLI (Indicator): What you measure
├── Uptime percentage
├── Request success rate
├── Latency percentiles
└── Error rate

SLO (Objective): Internal target
├── "99.9% of requests succeed"
├── "p99 latency < 200ms"
└── Buffer above SLA for safety

SLA (Agreement): External commitment
├── Legal/contractual obligation
├── Penalties for missing
└── Conservative (achievable) targets
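
The relationship between the three levels can be sketched in a few lines: the SLI is a measured number, and the SLO and SLA are thresholds it is compared against. The request counts and the 99.5% SLA figure below are hypothetical, chosen only to show an SLO miss that still honors a looser SLA.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestWindow:
    """Request counts observed over a measurement window."""
    total: int
    succeeded: int


def success_rate_sli(window: RequestWindow) -> float:
    """SLI: the fraction of requests that succeeded."""
    if window.total == 0:
        return 1.0  # assumption: no traffic counts as meeting the target
    return window.succeeded / window.total


SLO_TARGET = 0.999  # internal objective: "99.9% of requests succeed"
SLA_TARGET = 0.995  # hypothetical external commitment, looser than the SLO

window = RequestWindow(total=1_000_000, succeeded=998_500)
sli = success_rate_sli(window)

print(f"SLI: {sli:.4%}")
print("SLO met:", sli >= SLO_TARGET)
print("SLA met:", sli >= SLA_TARGET)
```

Because the SLO is stricter than the SLA, an SLO miss works as an early warning before any contractual penalty applies.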

Error Budget

ERROR BUDGET MODEL

If SLO = 99.9% availability per month:
Error budget = 0.1% = 43.83 minutes downtime allowed

Error Budget Remaining = Budget - Actual Downtime

┌────────────────────────────────────────────────┐
│ Month Progress: ████████████░░░░░░░░ 60%       │
│ Error Budget:   ████████░░░░░░░░░░░░ 40% used  │
│ Status: ✓ On track                             │
└────────────────────────────────────────────────┘

Budget exhausted → Freeze deployments, focus on reliability
Budget available → Deploy new features
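
The budget arithmetic and the resulting deployment policy can be sketched as follows. The month length assumes one twelfth of a 365.25-day year, which gives the 43.83-minute figure. The function names and the downtime values are illustrative.

```python
MINUTES_PER_MONTH = 365.25 * 24 * 60 / 12  # 43,830 minutes in an average month


def error_budget_minutes(slo: float, period_minutes: float = MINUTES_PER_MONTH) -> float:
    """Downtime the SLO permits over the period."""
    return (1 - slo) * period_minutes


def deployment_policy(slo: float, downtime_minutes: float,
                      period_minutes: float = MINUTES_PER_MONTH) -> str:
    """Apply the rule: an exhausted budget freezes deployments."""
    remaining = error_budget_minutes(slo, period_minutes) - downtime_minutes
    if remaining <= 0:
        return "freeze deployments, focus on reliability"
    return "deploy new features"


budget = error_budget_minutes(0.999)
used = 0.40 * budget  # the "40% used" state from the status panel

print(f"budget: {budget:.2f} min, used: {used:.2f} min, remaining: {budget - used:.2f} min")
print("policy:", deployment_policy(0.999, downtime_minutes=used))
print("policy after a 50-minute outage:", deployment_policy(0.999, downtime_minutes=50))
```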

Quick Reference Card

┌─────────────────────────────────────────────────────────────┐
│               AVAILABILITY CHEAT SHEET                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  FORMULA: Availability = MTBF / (MTBF + MTTR)               │
│                                                             │
│  THE NINES                                                  │
│  ─────────────────────────────────────────────────────────  │
│  99%    = 3.65 days/year    │  99.99%  = 52.6 min/year     │
│  99.9%  = 8.77 hours/year   │  99.999% = 5.26 min/year     │
│                                                             │
│  TACTICS                                                    │
│  ─────────────────────────────────────────────────────────  │
│  PREVENT    → Capacity planning, input validation           │
│  DETECT     → Health checks, monitoring, alerting           │
│  RECOVER    → Redundancy, failover, retry                   │
│  DEGRADE    → Circuit breaker, fallback, load shed          │
│                                                             │
│  REDUNDANCY                                                 │
│  ─────────────────────────────────────────────────────────  │
│  Series:   A × B × C         (multiply)                     │
│  Parallel: 1 - (1-A)(1-A')   (complement)                   │
│                                                             │
│  QUICK WINS                                                 │
│  ─────────────────────────────────────────────────────────  │
│  1. Eliminate single points of failure                      │
│  2. Add health checks and monitoring                        │
│  3. Implement circuit breakers                              │
│  4. Automate failover                                       │
│  5. Define and track error budgets                          │
│                                                             │
└─────────────────────────────────────────────────────────────┘
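
The cheat sheet names circuit breakers without showing one, so here is a minimal sketch of the idea: after repeated failures, stop calling the dependency for a cooling-off period and fail fast instead. The class name, thresholds, and injectable `clock` are assumptions for illustration. Production libraries add per-call timeouts, metrics, and thread safety.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout_s: float = 30.0,
                 clock=time.monotonic):
        self._failure_threshold = failure_threshold
        self._reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, so let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call re-opens; otherwise open once the threshold is hit.
            if self._opened_at is not None or self._failures >= self._failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


breaker = CircuitBreaker(failure_threshold=3, reset_timeout_s=30.0)
print(breaker.call(lambda: "ok"))
```

A caller would typically catch `CircuitOpenError` and serve a fallback response, which is the degrade path the cheat sheet lists next to circuit breakers.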

