Availability: Designing for Uptime
Architectural tactics and patterns for building highly available systems that minimize downtime and ensure business continuity.
TL;DR
Availability measures the proportion of time a system is operational and accessible. Design for availability by eliminating single points of failure, implementing redundancy, detecting failures quickly, and recovering automatically. Target availability should be driven by business impact, not technical pride.
Key Takeaways
- Availability = MTBF / (MTBF + MTTR): Maximize uptime, minimize recovery time
- Redundancy is the foundation: No single point of failure should bring down the system
- Detection matters: You can't fix what you can't see—monitoring is critical
- Recovery automation: Manual recovery extends downtime; automate where possible
- Cost increases exponentially: Each additional nine of availability costs significantly more
Why This Matters
Downtime has real business consequences: lost revenue, damaged reputation, regulatory penalties, and customer churn. Industry studies have estimated the average cost of IT downtime at around $5,600 per minute. However, designing for 99.999% availability when 99.9% is sufficient wastes resources. Understanding availability requirements and the tactics to achieve them enables informed trade-off decisions.
The Nines
Availability is often expressed in "nines": 99% = "two nines" (87.6 hours downtime/year), 99.99% = "four nines" (52.6 minutes downtime/year). Each additional nine is roughly 10x harder and more expensive to achieve.
Availability Fundamentals
Calculating Availability
AVAILABILITY FORMULA
MTBF
Availability = ─────────────────────
MTBF + MTTR
MTBF = Mean Time Between Failures
MTTR = Mean Time To Repair/Recover
IMPROVING AVAILABILITY
├── Increase MTBF: Prevent failures (harder)
└── Decrease MTTR: Recover faster (often easier)
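The formula translates directly into code. A minimal sketch in Python (function and variable names are illustrative, not from the source):

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures (MTBF)
    and mean time to repair/recover (MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A service that fails once every 30 days (720 h) and takes 1 h to recover:
a = availability(720, 1)            # ≈ 0.99861
# Halving MTTR buys almost as much as doubling MTBF, and is usually easier:
a_fast_recovery = availability(720, 0.5)  # ≈ 0.99931
```

Note how the second call improves availability purely by recovering faster, which is why the tactics below emphasize detection and automated recovery.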
The Nines Table
| Availability | Annual Downtime | Monthly Downtime | Weekly Downtime |
|---|---|---|---|
| 99% (two 9s) | 3.65 days | 7.31 hours | 1.68 hours |
| 99.9% (three 9s) | 8.77 hours | 43.83 minutes | 10.08 minutes |
| 99.95% | 4.38 hours | 21.92 minutes | 5.04 minutes |
| 99.99% (four 9s) | 52.6 minutes | 4.38 minutes | 1.01 minutes |
| 99.999% (five 9s) | 5.26 minutes | 26.3 seconds | 6.05 seconds |
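The downtime budgets in the table follow from simple arithmetic. A small helper, sketched in Python (names are illustrative):

```python
# Minutes in an average year (365.25 days) and month (year / 12)
MINUTES_PER_YEAR = 365.25 * 24 * 60

def annual_downtime_minutes(availability: float) -> float:
    """Allowed downtime per year for a given availability target."""
    return (1 - availability) * MINUTES_PER_YEAR

annual_downtime_minutes(0.9999)   # ≈ 52.6 min ("four nines")
annual_downtime_minutes(0.99999)  # ≈ 5.26 min ("five nines")
```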
Series vs Parallel Availability
SERIES (All must work)
A ─── B ─── C
Availability = A × B × C
Example: 0.99 × 0.99 × 0.99 = 0.97 (97%)
PARALLEL (Any one works)
┌── A ──┐
────┤ ├────
└── A' ─┘
Availability = 1 - (1-A) × (1-A')
Example: 1 - (0.01 × 0.01) = 0.9999 (99.99%)
PRACTICAL IMPLICATION
Adding components in series decreases availability
Adding redundancy in parallel increases availability
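The series and parallel rules can be sketched as two small Python functions (a minimal illustration, not a production reliability model):

```python
from functools import reduce

def series(*components: float) -> float:
    """All components must work: availabilities multiply."""
    return reduce(lambda a, b: a * b, components)

def parallel(*components: float) -> float:
    """Any one component suffices: take the complement of all failing at once."""
    return 1 - reduce(lambda a, b: a * b, (1 - c for c in components))

series(0.99, 0.99, 0.99)   # ≈ 0.9703 — three 99% services in a chain
parallel(0.99, 0.99)       # 0.9999 — one 99% service with a redundant pair
```

The asymmetry is the practical takeaway: chaining three 99% components costs you roughly two nines of headroom, while duplicating one of them gains you two.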
Availability Tactics
Prevent Faults
Goal: reduce the frequency of failures before they occur.
Tactics
| Tactic | Description | Implementation |
|---|---|---|
| Capacity Planning | Ensure sufficient resources | Load testing, auto-scaling |
| Exception Prevention | Eliminate common error sources | Input validation, defensive coding |
| Resource Pooling | Prevent resource exhaustion | Connection pools, thread pools |
| Health Checks | Remove unhealthy instances | Liveness/readiness probes |
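The health-check tactic amounts to routing traffic only to instances that pass a probe. A minimal sketch in Python (the instance type, probe, and handler are hypothetical stand-ins for a real load balancer's machinery):

```python
import random

def route(instances, is_healthy, handle):
    """Pick a random healthy instance and dispatch the request to it,
    skipping any instance that fails its health check."""
    healthy = [i for i in instances if is_healthy(i)]
    if not healthy:
        raise RuntimeError("no healthy instances available")
    return handle(random.choice(healthy))

# Instance "b" fails its probe, so traffic only ever reaches "a" and "c":
result = route(["a", "b", "c"], lambda i: i != "b", lambda i: f"served by {i}")
```

Real systems distinguish liveness (restart me) from readiness (stop sending me traffic); this sketch models only the readiness side.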
Example: Resource Exhaustion Prevention
CONNECTION POOL CONFIGURATION
Without Pooling:
Request → New Connection → Database → Close Connection
Problem: Connection creation is expensive, can exhaust DB connections
With Pooling:
┌─────────────────────────────────┐
│ Connection Pool │
│ ┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐ │
│ │ C │ │ C │ │ C │ │ C │ │ C │ │
│ └───┘ └───┘ └───┘ └───┘ └───┘ │
└─────────────────────────────────┘
│
Request → Borrow → Use → Return
Benefits: Reuse connections, limit concurrent access
Pool Sizing
Set pool size based on: (connections needed per request) × (peak concurrent requests) × (safety factor). Monitor for pool exhaustion.
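The sizing rule above can be expressed directly (the default safety factor here is an illustrative assumption, not a recommendation from the source):

```python
import math

def pool_size(conns_per_request: float,
              peak_concurrent_requests: int,
              safety_factor: float = 1.5) -> int:
    """Upper bound on simultaneously borrowed connections, with headroom."""
    return math.ceil(conns_per_request * peak_concurrent_requests * safety_factor)

pool_size(1, 40)         # 60: one connection per request, 40 peak requests
pool_size(2, 40, 1.25)   # 100: requests that touch two data sources
```

Treat the result as a starting point: monitor borrow wait times and pool exhaustion in production and adjust from there.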
Availability by Architecture
Cloud-Native Availability
MULTI-AZ DEPLOYMENT
Region: us-east-1
├── AZ-a
│ ├── App Instance 1
│ └── DB Primary
├── AZ-b
│ ├── App Instance 2
│ └── DB Replica
└── AZ-c
├── App Instance 3
└── DB Replica
Single AZ failure: No impact (other AZs serve traffic)
Typical availability: 99.99%
Multi-Region Availability
MULTI-REGION ACTIVE-ACTIVE
┌──────────────────────────────────────────────────┐
│ Global Load Balancer │
│ (Route 53, CloudFlare) │
└─────────────────────┬────────────────────────────┘
│
┌─────────────┴─────────────┐
▼ ▼
┌───────────────┐ ┌───────────────┐
│ US-EAST-1 │ │ EU-WEST-1 │
│ ┌─────────┐ │ │ ┌─────────┐ │
│ │ App │ │ │ │ App │ │
│ └────┬────┘ │ │ └────┬────┘ │
│ │ │ │ │ │
│ ┌────▼────┐ │ sync │ ┌────▼────┐ │
│ │ DB │◄─┼───────────┼─►│ DB │ │
│ └─────────┘ │ │ └─────────┘ │
└───────────────┘ └───────────────┘
Complexity: High (data consistency challenges)
Typical availability: 99.999%
Measuring Availability
SLIs, SLOs, and SLAs
SERVICE LEVEL HIERARCHY
SLI (Indicator): What you measure
├── Uptime percentage
├── Request success rate
├── Latency percentiles
└── Error rate
SLO (Objective): Internal target
├── "99.9% of requests succeed"
├── "p99 latency < 200ms"
└── Set tighter than the SLA as a safety buffer
SLA (Agreement): External commitment
├── Legal/contractual obligation
├── Penalties for missing
└── Conservative (achievable) targets
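A request-success-rate SLI, the most common availability indicator, is just a ratio checked against the SLO. A minimal sketch in Python (names and the default SLO are illustrative):

```python
def success_rate_sli(total_requests: int, failed_requests: int) -> float:
    """Fraction of requests that succeeded over the measurement window."""
    if total_requests == 0:
        return 1.0  # no traffic, no failures
    return (total_requests - failed_requests) / total_requests

def meets_slo(sli: float, slo: float = 0.999) -> bool:
    return sli >= slo

sli = success_rate_sli(1_000_000, 800)   # 0.9992
meets_slo(sli)                           # True: above the 99.9% objective
```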
Error Budget
ERROR BUDGET MODEL
If SLO = 99.9% availability per month:
Error budget = 0.1% = 43.83 minutes downtime allowed
Error Budget Remaining = Budget - Actual Downtime
┌────────────────────────────────────────────────┐
│ Month Progress: ████████████░░░░░░░░ 60% │
│ Error Budget: ████████░░░░░░░░░░░░ 40% used │
│ Status: ✓ On track │
└────────────────────────────────────────────────┘
Budget exhausted → Freeze deployments, focus on reliability
Budget available → Deploy new features
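The budget arithmetic above is simple enough to automate as a deployment gate. A sketch in Python (the 43,830-minute month matches the table earlier; function names are illustrative):

```python
MINUTES_PER_MONTH = 43_830  # average month: 365.25 days / 12

def error_budget_minutes(slo: float) -> float:
    """Allowed downtime per month for a given SLO."""
    return (1 - slo) * MINUTES_PER_MONTH

def budget_remaining(slo: float, actual_downtime_min: float) -> float:
    return error_budget_minutes(slo) - actual_downtime_min

def can_deploy(slo: float, actual_downtime_min: float) -> bool:
    """Gate feature deploys on having budget left."""
    return budget_remaining(slo, actual_downtime_min) > 0

error_budget_minutes(0.999)      # ≈ 43.83 minutes/month
budget_remaining(0.999, 17.5)    # ≈ 26.33 minutes left
can_deploy(0.999, 50.0)          # False: budget exhausted, freeze deploys
```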
Quick Reference Card
┌─────────────────────────────────────────────────────────────┐
│ AVAILABILITY CHEAT SHEET │
├─────────────────────────────────────────────────────────────┤
│ │
│ FORMULA: Availability = MTBF / (MTBF + MTTR) │
│ │
│ THE NINES │
│ ───────────────────────────────────────────────────────── │
│ 99% = 3.65 days/year │ 99.99% = 52.6 min/year │
│ 99.9% = 8.77 hours/year │ 99.999% = 5.26 min/year │
│ │
│ TACTICS │
│ ───────────────────────────────────────────────────────── │
│ PREVENT → Capacity planning, input validation │
│ DETECT → Health checks, monitoring, alerting │
│ RECOVER → Redundancy, failover, retry │
│ DEGRADE → Circuit breaker, fallback, load shed │
│ │
│ REDUNDANCY │
│ ───────────────────────────────────────────────────────── │
│ Series: A × B × C (multiply) │
│ Parallel: 1 - (1-A)(1-A') (complement) │
│ │
│ QUICK WINS │
│ ───────────────────────────────────────────────────────── │
│ 1. Eliminate single points of failure │
│ 2. Add health checks and monitoring │
│ 3. Implement circuit breakers │
│ 4. Automate failover │
│ 5. Define and track error budgets │
│ │
└─────────────────────────────────────────────────────────────┘
Related Topics
- Quality Attributes Overview - All quality attributes
- Performance - Performance and availability trade-offs
- Cloud Architecture - Cloud availability patterns
Sources
- Site Reliability Engineering - Google SRE Book
- Release It! - Michael Nygard
- AWS Reliability Pillar
- Azure Reliability Patterns