Cloud Architecture Patterns

Essential patterns for designing scalable, resilient, and cost-effective cloud-native applications.

TL;DR

Cloud architecture patterns solve recurring challenges in distributed, cloud-native systems: scalability, resilience, data management, and messaging. These patterns leverage cloud capabilities (elasticity, managed services, global infrastructure) while addressing cloud-specific challenges (network latency, eventual consistency, cost optimization).

Key Takeaways

Design for failure: Assume everything fails; design to recover automatically
Prefer managed services: Reduce operational burden, leverage provider expertise
Embrace elasticity: Scale out rather than up; pay for what you use
Decouple components: Use async messaging to reduce dependencies
Optimize for cost: Cloud flexibility both enables and requires cost awareness

Why This Matters

Cloud computing fundamentally changes how we architect systems. Seemingly unlimited resources, pay-per-use pricing, and managed services enable patterns that are impractical on traditional infrastructure. But the cloud also introduces new challenges: network partitions, eventual consistency, and the complexity of distributed systems. Understanding cloud patterns helps you capture the benefits while avoiding the pitfalls.

Cloud-Native

Cloud-native doesn't mean "runs in the cloud." It means designed to exploit cloud characteristics: elasticity, automation, managed services, and geographic distribution.

Pattern Categories

CLOUD ARCHITECTURE PATTERNS
├── COMPUTE PATTERNS
│   ├── Serverless
│   ├── Containers
│   └── Auto-scaling
│
├── DATA PATTERNS
│   ├── Event Sourcing
│   ├── CQRS
│   └── Polyglot Persistence
│
├── MESSAGING PATTERNS
│   ├── Queue-Based Load Leveling
│   ├── Publisher-Subscriber
│   └── Event-Driven
│
├── RESILIENCE PATTERNS
│   ├── Retry
│   ├── Circuit Breaker
│   └── Bulkhead
│
└── DEPLOYMENT PATTERNS
    ├── Blue-Green
    ├── Canary
    └── Feature Flags

Compute Patterns

What It Is

Serverless computing executes code without requiring you to manage servers. The cloud provider handles infrastructure, scaling, and availability.

When to Use

Good Fit	Poor Fit
Event-driven workloads	Long-running processes
Variable/unpredictable traffic	Consistent high throughput
Rapid prototyping	Complex stateful workflows
Scheduled tasks	Low-latency requirements (cold start)

Architecture Pattern

SERVERLESS EVENT-DRIVEN

┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐
│ API GW  │───▶│ Lambda  │───▶│ DynamoDB│    │   S3    │
└─────────┘    └─────────┘    └─────────┘    └─────────┘
                    │                             │
                    └─────────────────────────────┘
                           Trigger on upload

EVENT SOURCES
├── HTTP (API Gateway)
├── Queue (SQS)
├── Stream (Kinesis, DynamoDB Streams)
├── Schedule (EventBridge, formerly CloudWatch Events)
├── Storage (S3 events)
└── Database (change streams)
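
To make the storage event source concrete, here is a minimal sketch of a function reacting to the "trigger on upload" path in the diagram. It assumes the standard S3 event notification shape (`Records[].s3.bucket.name` and `Records[].s3.object.key`); `processUpload` is a hypothetical placeholder for real work, not part of any AWS API.

```javascript
// Extract bucket/key pairs from an S3 event notification payload.
function extractUploadedObjects(event) {
  return (event.Records ?? [])
    .filter((record) => record.eventSource === 'aws:s3')
    .map((record) => ({
      bucket: record.s3.bucket.name,
      // S3 URL-encodes object keys in notifications and uses "+" for spaces.
      key: decodeURIComponent(record.s3.object.key.replace(/\+/g, ' ')),
    }));
}

// Hypothetical placeholder for the real work (resize, scan, index, ...).
async function processUpload({ bucket, key }) {
  return `processed s3://${bucket}/${key}`;
}

// In a Lambda deployment this function would be exported as the handler.
async function handler(event) {
  return Promise.all(extractUploadedObjects(event).map(processUpload));
}
```

Keeping the parsing separate from the handler makes it easy to unit-test with a hand-written event object, without deploying anything.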

Cost Considerations

SERVERLESS COST MODEL

CHARGED FOR:
├── Number of invocations
├── Duration (GB-seconds)
└── Memory allocated

NOT CHARGED FOR:
├── Idle time
├── Infrastructure management
└── Scaling infrastructure

OPTIMIZE BY:
├── Right-size memory allocation
├── Minimize cold starts (provisioned concurrency)
├── Optimize code execution time
└── Use efficient runtimes
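
The cost model above can be turned into a rough estimator. The sketch below is illustrative only: the two price constants are assumptions chosen for the arithmetic, not a quote, and it ignores free tiers, data transfer, and provisioned concurrency charges.

```javascript
// Illustrative prices only (assumptions); check your provider's current
// price list for your region and CPU architecture.
const ASSUMED_PRICE_PER_MILLION_REQUESTS = 0.20; // USD
const ASSUMED_PRICE_PER_GB_SECOND = 0.0000166667; // USD

function estimateMonthlyCost({ invocations, avgDurationMs, memoryMb }) {
  // GB-seconds = invocations x duration in seconds x memory in GB
  const gbSeconds = invocations * (avgDurationMs / 1000) * (memoryMb / 1024);
  const requestCost = (invocations / 1_000_000) * ASSUMED_PRICE_PER_MILLION_REQUESTS;
  const computeCost = gbSeconds * ASSUMED_PRICE_PER_GB_SECOND;
  return { gbSeconds, requestCost, computeCost, total: requestCost + computeCost };
}

// 10 million invocations per month, 200 ms average, 512 MB allocated
const estimate = estimateMonthlyCost({
  invocations: 10_000_000,
  avgDurationMs: 200,
  memoryMb: 512,
});
console.log(estimate.total.toFixed(2));
```

Under these assumed prices the example comes to about 18.67 USD, of which roughly 16.67 USD is duration. That is why right-sizing memory and shortening execution time usually matter more than the per-request charge.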

Cold Starts

Functions not recently invoked require initialization (cold start), adding latency. For latency-sensitive applications, consider provisioned concurrency or keep-warm strategies.
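
A keep-warm strategy can be sketched as follows. Two assumptions are made here: a scheduled rule invokes the function with a payload of `{ warmup: true }` (a convention invented for this example, not a platform feature), and `createDbConnection` is a hypothetical stand-in for expensive setup.

```javascript
// Expensive setup runs once per execution environment (at cold start),
// not on every invocation. createDbConnection is a hypothetical stand-in.
let initCount = 0;
function createDbConnection() {
  initCount += 1;
  return { query: async (sql) => `result of ${sql}` };
}
const db = createDbConnection();

async function handler(event) {
  // Assumed convention: the scheduled keep-warm rule sends { warmup: true }.
  // Return early so warm-up pings do no real work.
  if (event && event.warmup === true) {
    return { warmed: true };
  }
  const result = await db.query('SELECT 1');
  return { warmed: false, result };
}
```

Keep-warm pings only keep a small number of environments initialized; under concurrent load new environments still cold-start, which is where provisioned concurrency is the more reliable option.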

Data Patterns

Database Selection Guide

DATA STORE SELECTION

RELATIONAL (PostgreSQL, MySQL)
├── Structured data with relationships
├── ACID transactions required
├── Complex queries with joins
└── Strong consistency needed

DOCUMENT (MongoDB, DynamoDB)
├── Semi-structured data
├── Flexible schema
├── Hierarchical data
└── Scale-out requirements

KEY-VALUE (Redis, DynamoDB)
├── Simple lookup by key
├── Session storage
├── Caching
└── High throughput, low latency

GRAPH (Neo4j, Neptune)
├── Highly connected data
├── Relationship traversal
├── Recommendations, fraud detection
└── Social networks

TIME-SERIES (InfluxDB, TimescaleDB)
├── Timestamped data
├── Metrics and monitoring
├── IoT sensor data
└── Time-based aggregations

SEARCH (Elasticsearch, OpenSearch)
├── Full-text search
├── Log analytics
├── Faceted search
└── Real-time indexing

Example Architecture

POLYGLOT PERSISTENCE EXAMPLE (E-commerce)

┌─────────────────────────────────────────────────────────────┐
│                     Application Layer                        │
└─────────────────────────────────────────────────────────────┘
         │              │              │              │
         ▼              ▼              ▼              ▼
    ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐
    │PostgreSQL│   │ MongoDB │   │  Redis  │   │Elastic- │
    │         │   │         │   │         │   │search   │
    │ Orders  │   │ Product │   │ Sessions│   │ Search  │
    │ Users   │   │ Catalog │   │ Cart    │   │ Index   │
    │ Payments│   │         │   │ Cache   │   │         │
    └─────────┘   └─────────┘   └─────────┘   └─────────┘

Each store optimized for its access patterns
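
One way the stores in the diagram can cooperate is a cache-aside read: check the cache first and fall back to the system of record on a miss. The sketch below uses in-memory `Map` objects as stand-ins for Redis and the product catalog so it runs without any database; all names are hypothetical.

```javascript
// In-memory stand-ins for the real stores (hypothetical data).
const productCatalog = new Map([
  ['sku-1', { sku: 'sku-1', name: 'Kettle' }],
]); // plays the role of the document store
const cache = new Map(); // plays the role of Redis
let catalogReads = 0; // counts trips to the system of record

async function getProduct(sku) {
  if (cache.has(sku)) {
    return cache.get(sku); // cache hit: no catalog read
  }
  catalogReads += 1;
  const product = productCatalog.get(sku) ?? null; // cache miss
  if (product) {
    cache.set(sku, product); // populate the cache for later reads
  }
  return product;
}
```

A real cache would also need an expiry or invalidation policy so that catalog updates eventually reach readers; that is omitted here to keep the read path visible.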

Messaging Patterns

Pattern Structure

WITHOUT QUEUE (Direct Coupling)
Producer ──────────────────▶ Consumer
         High load = overwhelmed

WITH QUEUE (Load Leveling)
Producer ──▶ [  Queue  ] ──▶ Consumer
              Buffer absorbs spikes
              Consumer processes at own pace

BEHAVIOR DURING SPIKE
       Incoming  │ ████████████████████
         Load    │ ████████████
                 │ █████████
                 │ ██████████████
                 └─────────────────────▶ Time

       Queue     │     ████
       Depth     │  █████████
                 │ ██████████████
                 │ █████████████████
                 └─────────────────────▶ Time
                 (Queue absorbs, drains gradually)
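
The spike behavior can be reproduced with a small simulation. This is a toy model under stated assumptions: time advances in discrete ticks, the consumer handles a fixed number of messages per tick, and the arrival numbers are made up for illustration.

```javascript
// Simulate queue depth for a bursty producer and a fixed-rate consumer.
function simulateQueue(arrivalsPerTick, consumerRatePerTick) {
  let depth = 0;
  return arrivalsPerTick.map((arrivals) => {
    depth += arrivals; // producer enqueues
    const processed = Math.min(depth, consumerRatePerTick); // consumer drains at its own pace
    depth -= processed;
    return { arrivals, processed, depth };
  });
}

// A spike of 20 messages per tick against a consumer that handles 10 per tick
const timeline = simulateQueue([5, 20, 20, 5, 0, 0, 0], 10);
console.log(timeline.map((t) => t.depth).join(' ')); // prints "0 10 20 15 5 0 0"
```

The consumer never processes more than 10 messages in a tick, while the queue depth rises during the spike and drains over the following ticks.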

Implementation

AWS SQS EXAMPLE

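Producer (Lambda/EC2), using AWS SDK for JavaScript v3:
const { SQSClient, SendMessageCommand } = require('@aws-sdk/client-sqs');
const sqs = new SQSClient({});

// Call from inside an async function
await sqs.send(new SendMessageCommand({
  QueueUrl: 'https://sqs.../my-queue',
  MessageBody: JSON.stringify(event),
  // FIFO queues only; they also need MessageDeduplicationId unless
  // content-based deduplication is enabled on the queue
  MessageGroupId: 'orders'
}));

Consumer (Lambda trigger):
exports.handler = async (event) => {
  // Lambda delivers a batch of SQS messages in event.Records
  for (const record of event.Records) {
    const message = JSON.parse(record.body);
    // An uncaught error makes the whole batch visible again for retry
    await processMessage(message);
  }
};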

CONFIGURATION
├── Visibility timeout: How long a received message stays hidden before it can be retried
├── Message retention: How long unprocessed messages are kept
├── Dead letter queue: Where messages go after repeated failures
└── Batch size: Messages processed per invocation
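
Batch size and retries interact: when a handler throws, every message in the batch is retried, including the ones that succeeded. A hedged sketch of one way around this is Lambda's partial batch response, which assumes `ReportBatchItemFailures` is enabled on the SQS event source mapping. `processMessage` here is a hypothetical stand-in for business logic.

```javascript
// Hypothetical business logic: rejects messages without an orderId.
async function processMessage(message) {
  if (!message.orderId) {
    throw new Error('missing orderId');
  }
}

// Report only the failed messages so the successful ones are not retried.
async function handler(event) {
  const batchItemFailures = [];
  for (const record of event.Records) {
    try {
      await processMessage(JSON.parse(record.body));
    } catch (err) {
      // Only this message returns to the queue after the visibility timeout.
      batchItemFailures.push({ itemIdentifier: record.messageId });
    }
  }
  return { batchItemFailures };
}
```

Messages that keep failing are moved to the dead letter queue once they exceed the queue's configured maximum receive count.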

Resilience Patterns

Implementation

EXPONENTIAL BACKOFF

Attempt 1: Immediate
Attempt 2: Wait 1 second
Attempt 3: Wait 2 seconds
Attempt 4: Wait 4 seconds
Attempt 5: Wait 8 seconds
         → Give up, return error

WITH JITTER (Recommended)
delay = base * 2^retry + random(0, base)
(retry = 0 for the first retry; with base = 1 second the un-jittered part is 1, 2, 4, 8 seconds)

Jitter prevents thundering herd when many clients
retry simultaneously after an outage.
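
The schedule and jitter formula above can be sketched as a small helper. This is a minimal illustration, not a production retry library: it retries on every error, whereas real code should retry only transient failures. The `sleep` and `random` parameters are injectable assumptions added so the behavior can be tested without waiting.

```javascript
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// delay = base * 2^retry + random(0, base), with retry = 0 for the first retry
function backoffDelayMs(retry, baseMs, random = Math.random) {
  return baseMs * 2 ** retry + random() * baseMs;
}

async function retryWithBackoff(
  operation,
  { maxAttempts = 5, baseMs = 1000, sleep = defaultSleep, random = Math.random } = {}
) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    try {
      return await operation(attempt); // first attempt runs immediately
    } catch (err) {
      lastError = err;
      if (attempt === maxAttempts) break; // give up, surface the error
      await sleep(backoffDelayMs(attempt - 1, baseMs, random));
    }
  }
  throw lastError;
}
```

With the defaults this makes up to five attempts, waiting roughly 1, 2, 4, and 8 seconds between them, plus up to one second of jitter each time.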

Configuration

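// AWS SDK for JavaScript v3: custom (not default) retry configuration
// Assumption: the StandardRetryStrategy that accepts retryDecider and
// delayDecider is the one exported by @smithy/middleware-retry (older SDK
// releases: @aws-sdk/middleware-retry); verify against your SDK version.
const { S3Client } = require('@aws-sdk/client-s3');
const { StandardRetryStrategy } = require('@smithy/middleware-retry');

const client = new S3Client({
  maxAttempts: 3,
  // 3 attempts in total: the initial call plus 2 retries
  retryStrategy: new StandardRetryStrategy(async () => 3, {
    retryDecider: (error) => {
      // Retry on throttling and on server-side faults
      return Boolean(error.$retryable?.throttling) ||
             error.$fault === 'server';
    },
    delayDecider: (delayBase, attempts) => {
      // Exponential backoff in milliseconds, without jitter
      return delayBase * Math.pow(2, attempts - 1);
    }
  })
});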

Deployment Patterns

Process

BLUE-GREEN DEPLOYMENT

BEFORE DEPLOYMENT
                    ┌─────────────────────┐
     Traffic ──────▶│   Blue (v1.0)       │ ◀── Active
                    └─────────────────────┘
                    ┌─────────────────────┐
                    │   Green (idle)      │ ◀── Inactive
                    └─────────────────────┘

DEPLOY TO GREEN
                    ┌─────────────────────┐
     Traffic ──────▶│   Blue (v1.0)       │ ◀── Active
                    └─────────────────────┘
                    ┌─────────────────────┐
        Deploy ────▶│   Green (v1.1)      │ ◀── Deploy here
                    └─────────────────────┘

SWITCH TRAFFIC
                    ┌─────────────────────┐
                    │   Blue (v1.0)       │ ◀── Standby
                    └─────────────────────┘
                    ┌─────────────────────┐
     Traffic ──────▶│   Green (v1.1)      │ ◀── Active
                    └─────────────────────┘

ROLLBACK = Switch traffic back to Blue
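
The three stages above amount to moving a single pointer between two environments. The following in-memory model is a sketch of that idea only; in a real system the "pointer" would be a load balancer target, DNS record, or router configuration, and `BlueGreenRouter` is a hypothetical name.

```javascript
class BlueGreenRouter {
  constructor(initialVersion) {
    this.environments = { blue: initialVersion, green: null };
    this.active = 'blue';
  }

  // The environment that is not receiving traffic.
  get idle() {
    return this.active === 'blue' ? 'green' : 'blue';
  }

  // Deploy the new version next to the live one; traffic is unaffected.
  deployToIdle(version) {
    this.environments[this.idle] = version;
  }

  // Cut over all traffic in one step.
  switchTraffic() {
    if (this.environments[this.idle] === null) {
      throw new Error('idle environment has no deployment');
    }
    this.active = this.idle;
  }

  // Rollback is the same switch in the other direction.
  rollback() {
    this.switchTraffic();
  }

  // Version currently serving traffic.
  route() {
    return this.environments[this.active];
  }
}

const router = new BlueGreenRouter('v1.0');
router.deployToIdle('v1.1'); // traffic still goes to v1.0
router.switchTraffic();      // traffic now goes to v1.1
router.rollback();           // traffic back on v1.0
```

Because the previous version stays deployed on standby, rollback needs no redeploy, which is what makes it close to instant and also why the pattern doubles infrastructure cost.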

Pros and Cons

Pros	Cons
Instant rollback	Double infrastructure cost
Zero downtime	Database migrations complex
Full testing before switch	Session management needed
Clean cutover	Configuration must be in sync

Quick Reference Card

┌─────────────────────────────────────────────────────────────┐
│            CLOUD ARCHITECTURE CHEAT SHEET                   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  CLOUD-NATIVE PRINCIPLES                                    │
│  ─────────────────────────────────────────────────────────  │
│  • Design for failure (everything fails eventually)         │
│  • Prefer managed services (reduce operational burden)      │
│  • Embrace elasticity (scale out, not up)                   │
│  • Decouple with messaging (async over sync)                │
│  • Automate everything (infrastructure as code)             │
│                                                             │
│  COMPUTE SELECTION                                          │
│  ─────────────────────────────────────────────────────────  │
│  Serverless → Event-driven, variable load, < 15 min         │
│  Containers → Full control, long-running, predictable       │
│  VMs        → Legacy apps, specific OS requirements         │
│                                                             │
│  DATA STORE SELECTION                                       │
│  ─────────────────────────────────────────────────────────  │
│  Relational  → ACID, complex queries, relationships         │
│  Document    → Flexible schema, hierarchical data           │
│  Key-Value   → Simple lookup, caching, sessions             │
│  Graph       → Highly connected data, traversals            │
│                                                             │
│  RESILIENCE PATTERNS                                        │
│  ─────────────────────────────────────────────────────────  │
│  Retry       → Transient failures, exponential backoff      │
│  Circuit     → Fail fast, prevent cascade                   │
│  Bulkhead    → Isolate failures, limit blast radius         │
│  Timeout     → Don't wait forever, fail gracefully          │
│                                                             │
│  DEPLOYMENT PATTERNS                                        │
│  ─────────────────────────────────────────────────────────  │
│  Blue-Green  → Instant rollback, double infrastructure      │
│  Canary      → Gradual rollout, metric-based progression    │
│  Feature Flag→ Decouple deploy from release                 │
│                                                             │
└─────────────────────────────────────────────────────────────┘

