Availability: Designing for Uptime
Architectural tactics and patterns for building highly available systems that minimize downtime and ensure business continuity.
TL;DR
Availability measures the proportion of time a system is operational and accessible. Design for availability by eliminating single points of failure, implementing redundancy, detecting failures quickly, and recovering automatically. Target availability should be driven by business impact, not technical pride.
Key Takeaways
- Availability = MTBF / (MTBF + MTTR): Maximize uptime, minimize recovery time
- Redundancy is the foundation: No single point of failure should bring down the system
- Detection matters: You can't fix what you can't see—monitoring is critical
- Recovery automation: Manual recovery extends downtime; automate where possible
- Cost increases exponentially: Each additional nine of availability costs significantly more
Why This Matters
Downtime has real business consequences: lost revenue, damaged reputation, regulatory penalties, and customer churn. Industry studies have estimated the average cost of IT downtime at around $5,600 per minute. However, designing for 99.999% availability when 99.9% is sufficient wastes resources. Understanding availability requirements and the tactics to achieve them enables informed trade-off decisions.
The Nines
Availability is often expressed in "nines": 99% = "two nines" (87.6 hours downtime/year), 99.99% = "four nines" (52.6 minutes downtime/year). Each additional nine is roughly 10x harder and more expensive to achieve.
Availability Fundamentals
Calculating Availability
AVAILABILITY FORMULA
MTBF
Availability = ─────────────────────
MTBF + MTTR
MTBF = Mean Time Between Failures
MTTR = Mean Time To Repair/Recover
IMPROVING AVAILABILITY
├── Increase MTBF: Prevent failures (harder)
└── Decrease MTTR: Recover faster (often easier)
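The formula translates directly into code. A minimal sketch in Python (function and variable names are illustrative, not from the source):

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures (MTBF)
    and mean time to repair/recover (MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A service that fails once every 30 days (720 h) and takes 1 h to recover:
a = availability(720, 1)            # ≈ 0.99861
# Halving MTTR buys almost as much as doubling MTBF, and is usually easier:
a_fast_recovery = availability(720, 0.5)  # ≈ 0.99931
```

Note how the second call improves availability purely by recovering faster, which is why the tactics below emphasize detection and automated recovery.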
The Nines Table
| Availability | Annual Downtime | Monthly Downtime | Weekly Downtime |
|---|---|---|---|
| 99% (two 9s) | 3.65 days | 7.31 hours | 1.68 hours |
| 99.9% (three 9s) | 8.77 hours | 43.83 minutes | 10.08 minutes |
| 99.95% | 4.38 hours | 21.92 minutes | 5.04 minutes |
| 99.99% (four 9s) | 52.6 minutes | 4.38 minutes | 1.01 minutes |
| 99.999% (five 9s) | 5.26 minutes | 26.3 seconds | 6.05 seconds |
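The downtime budgets in the table follow from simple arithmetic. A small helper, sketched in Python (names are illustrative):

```python
# Minutes in an average year (365.25 days) and month (year / 12)
MINUTES_PER_YEAR = 365.25 * 24 * 60

def annual_downtime_minutes(availability: float) -> float:
    """Allowed downtime per year for a given availability target."""
    return (1 - availability) * MINUTES_PER_YEAR

annual_downtime_minutes(0.9999)   # ≈ 52.6 min ("four nines")
annual_downtime_minutes(0.99999)  # ≈ 5.26 min ("five nines")
```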
Series vs Parallel Availability
SERIES (All must work)
A ─── B ─── C
Availability = A × B × C
Example: 0.99 × 0.99 × 0.99 = 0.97 (97%)
PARALLEL (Any one works)
┌── A ──┐
────┤ ├────
└── A' ─┘
Availability = 1 - (1-A) × (1-A')
Example: 1 - (0.01 × 0.01) = 0.9999 (99.99%)
PRACTICAL IMPLICATION
Adding components in series decreases availability
Adding redundancy in parallel increases availability
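The series and parallel rules can be sketched as two small Python functions (a minimal illustration, not a production reliability model):

```python
from functools import reduce

def series(*components: float) -> float:
    """All components must work: availabilities multiply."""
    return reduce(lambda a, b: a * b, components)

def parallel(*components: float) -> float:
    """Any one component suffices: take the complement of all failing at once."""
    return 1 - reduce(lambda a, b: a * b, (1 - c for c in components))

series(0.99, 0.99, 0.99)   # ≈ 0.9703 — three 99% services in a chain
parallel(0.99, 0.99)       # 0.9999 — one 99% service with a redundant pair
```

The asymmetry is the practical takeaway: chaining three 99% components costs you roughly two nines of headroom, while duplicating one of them gains you two.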
Availability Tactics
Prevent Faults
Goal: reduce the frequency of failures before they occur.
Tactics
| Tactic | Description | Implementation |
|---|---|---|
| Capacity Planning | Ensure sufficient resources | Load testing, auto-scaling |
| Exception Prevention | Eliminate common error sources | Input validation, defensive coding |
| Resource Pooling | Prevent resource exhaustion | Connection pools, thread pools |
| Health Checks | Remove unhealthy instances | Liveness/readiness probes |
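The health-check tactic amounts to routing traffic only to instances that pass a probe. A minimal sketch in Python (the instance type, probe, and handler are hypothetical stand-ins for a real load balancer's machinery):

```python
import random

def route(instances, is_healthy, handle):
    """Pick a random healthy instance and dispatch the request to it,
    skipping any instance that fails its health check."""
    healthy = [i for i in instances if is_healthy(i)]
    if not healthy:
        raise RuntimeError("no healthy instances available")
    return handle(random.choice(healthy))

# Instance "b" fails its probe, so traffic only ever reaches "a" and "c":
result = route(["a", "b", "c"], lambda i: i != "b", lambda i: f"served by {i}")
```

Real systems distinguish liveness (restart me) from readiness (stop sending me traffic); this sketch models only the readiness side.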
Example: Resource Exhaustion Prevention
CONNECTION POOL CONFIGURATION
Without Pooling:
Request → New Connection → Database → Close Connection
Problem: Connection creation is expensive, can exhaust DB connections
With Pooling:
┌─────────────────────────────────┐
│ Connection Pool │
│ ┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐ │
│ │ C │ │ C │ │ C │ │ C │ │ C │ │
│ └───┘ └───┘ └───┘ └───┘ └───┘ │
└─────────────────────────────────┘
│
Request → Borrow → Use → Return
Benefits: Reuse connections, limit concurrent access
Pool Sizing
Set pool size based on: (connections needed per request) × (peak concurrent requests) × (safety factor). Monitor for pool exhaustion.
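The sizing rule above can be expressed directly (the default safety factor here is an illustrative assumption, not a recommendation from the source):

```python
import math

def pool_size(conns_per_request: float,
              peak_concurrent_requests: int,
              safety_factor: float = 1.5) -> int:
    """Upper bound on simultaneously borrowed connections, with headroom."""
    return math.ceil(conns_per_request * peak_concurrent_requests * safety_factor)

pool_size(1, 40)         # 60: one connection per request, 40 peak requests
pool_size(2, 40, 1.25)   # 100: requests that touch two data sources
```

Treat the result as a starting point: monitor borrow wait times and pool exhaustion in production and adjust from there.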
Availability by Architecture
Cloud-Native Availability
MULTI-AZ DEPLOYMENT
Region: us-east-1
├── AZ-a
│ ├── App Instance 1
│ └── DB Primary
├── AZ-b
│ ├── App Instance 2
│ └── DB Replica
└── AZ-c
├── App Instance 3
└── DB Replica
Single AZ failure: No impact (other AZs serve traffic)
Typical availability: 99.99%
Multi-Region Availability
MULTI-REGION ACTIVE-ACTIVE
┌──────────────────────────────────────────────────┐
│ Global Load Balancer │
│ (Route 53, CloudFlare) │
└─────────────────────┬────────────────────────────┘
│
┌─────────────┴─────────────┐
▼ ▼
┌───────────────┐ ┌───────────────┐
│ US-EAST-1 │ │ EU-WEST-1 │
│ ┌─────────┐ │ │ ┌─────────┐ │
│ │ App │ │ │ │ App │ │
│ └────┬────┘ │ │ └────┬────┘ │
│ │ │ │ │ │
│ ┌────▼────┐ │ sync │ ┌────▼────┐ │
│ │ DB │◄─┼───────────┼─►│ DB │ │
│ └─────────┘ │ │ └─────────┘ │
└───────────────┘ └───────────────┘
Complexity: High (data consistency challenges)
Typical availability: 99.999%
Measuring Availability
SLIs, SLOs, and SLAs
SERVICE LEVEL HIERARCHY
SLI (Indicator): What you measure
├── Uptime percentage
├── Request success rate
├── Latency percentiles
└── Error rate
SLO (Objective): Internal target
├── "99.9% of requests succeed"
├── "p99 latency < 200ms"
└── Set tighter than the SLA as a safety buffer
SLA (Agreement): External commitment
├── Legal/contractual obligation
├── Penalties for missing
└── Conservative (achievable) targets
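A request-success-rate SLI, the most common availability indicator, is just a ratio checked against the SLO. A minimal sketch in Python (names and the default SLO are illustrative):

```python
def success_rate_sli(total_requests: int, failed_requests: int) -> float:
    """Fraction of requests that succeeded over the measurement window."""
    if total_requests == 0:
        return 1.0  # no traffic, no failures
    return (total_requests - failed_requests) / total_requests

def meets_slo(sli: float, slo: float = 0.999) -> bool:
    return sli >= slo

sli = success_rate_sli(1_000_000, 800)   # 0.9992
meets_slo(sli)                           # True: above the 99.9% objective
```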
Error Budget
ERROR BUDGET MODEL
If SLO = 99.9% availability per month:
Error budget = 0.1% = 43.83 minutes downtime allowed
Error Budget Remaining = Budget - Actual Downtime
┌────────────────────────────────────────────────┐
│ Month Progress: ████████████░░░░░░░░ 60% │
│ Error Budget: ████████░░░░░░░░░░░░ 40% used │
│ Status: ✓ On track │
└────────────────────────────────────────────────┘
Budget exhausted → Freeze deployments, focus on reliability
Budget available → Deploy new features
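The budget arithmetic above is simple enough to automate as a deployment gate. A sketch in Python (the 43,830-minute month matches the table earlier; function names are illustrative):

```python
MINUTES_PER_MONTH = 43_830  # average month: 365.25 days / 12

def error_budget_minutes(slo: float) -> float:
    """Allowed downtime per month for a given SLO."""
    return (1 - slo) * MINUTES_PER_MONTH

def budget_remaining(slo: float, actual_downtime_min: float) -> float:
    return error_budget_minutes(slo) - actual_downtime_min

def can_deploy(slo: float, actual_downtime_min: float) -> bool:
    """Gate feature deploys on having budget left."""
    return budget_remaining(slo, actual_downtime_min) > 0

error_budget_minutes(0.999)      # ≈ 43.83 minutes/month
budget_remaining(0.999, 17.5)    # ≈ 26.33 minutes left
can_deploy(0.999, 50.0)          # False: budget exhausted, freeze deploys
```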
Quick Reference Card
┌─────────────────────────────────────────────────────────────┐
│ AVAILABILITY CHEAT SHEET │
├─────────────────────────────────────────────────────────────┤
│ │
│ FORMULA: Availability = MTBF / (MTBF + MTTR) │
│ │
│ THE NINES │
│ ───────────────────────────────────────────────────────── │
│ 99% = 3.65 days/year │ 99.99% = 52.6 min/year │
│ 99.9% = 8.77 hours/year │ 99.999% = 5.26 min/year │
│ │
│ TACTICS │
│ ───────────────────────────────────────────────────────── │
│ PREVENT → Capacity planning, input validation │
│ DETECT → Health checks, monitoring, alerting │
│ RECOVER → Redundancy, failover, retry │
│ DEGRADE → Circuit breaker, fallback, load shed │
│ │
│ REDUNDANCY │
│ ───────────────────────────────────────────────────────── │
│ Series: A × B × C (multiply) │
│ Parallel: 1 - (1-A)(1-A') (complement) │
│ │
│ QUICK WINS │
│ ───────────────────────────────────────────────────────── │
│ 1. Eliminate single points of failure │
│ 2. Add health checks and monitoring │
│ 3. Implement circuit breakers │
│ 4. Automate failover │
│ 5. Define and track error budgets │
│ │
└─────────────────────────────────────────────────────────────┘
Related Topics
- Quality Attributes Overview - All quality attributes
- Performance - Performance and availability trade-offs
- Cloud Architecture - Cloud availability patterns
Sources
- Site Reliability Engineering - Google SRE Book
- Release It! - Michael Nygard
- AWS Reliability Pillar
- Azure Reliability Patterns