Scalable System Design

Core Concepts for Building at Scale


Distributed Systems   High Availability   Performance


From 10 users to 10 million — the engineering decisions that matter.

Agenda

  1. Scalability Fundamentals — Vertical vs. Horizontal
  2. Load Balancing — Distributing Traffic
  3. Caching Strategies — Speed at Every Layer
  4. Database Scaling — Sharding & Replication
  5. CAP Theorem — Tradeoffs in Distributed Systems
  6. Message Queues — Async & Decoupling
  7. Rate Limiting & Throttling
  8. CDNs & Edge Caching
  9. Microservices vs. Monolith
  10. Observability — Metrics, Logs, Traces

Scalability Fundamentals

The Two Axes of Scale

Strategy                 Description                     Best For
Vertical (Scale Up)      Bigger machine, more CPU/RAM    Simpler ops, stateful services
Horizontal (Scale Out)   More machines, same size        Stateless services, web tiers

Key Definitions

  • Throughput — Requests processed per second (RPS)
  • Latency — Time to handle a single request (p50, p99)
  • Availability — uptime / total time — e.g., 99.99% = ~52 min/year downtime
  • Elasticity — Ability to auto-scale in response to load

A system is scalable if adding resources increases capacity proportionally.
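
The availability math generalizes: downtime budget = (1 − availability) × period. A minimal Python sketch that reproduces the ~52 min/year figure above:

```python
def allowed_downtime_minutes(availability: float, period_minutes: float) -> float:
    """Downtime budget for a given availability target over a period."""
    return (1.0 - availability) * period_minutes

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

# 99.99% availability -> roughly 52.6 minutes of downtime per year
print(round(allowed_downtime_minutes(0.9999, MINUTES_PER_YEAR), 1))
```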

Load Balancing

Why It Matters

A load balancer distributes incoming traffic across multiple servers to prevent any single node from becoming a bottleneck.


Algorithms

Algorithm           How It Works                                      Use Case
Round Robin         Cycles through servers in order                   Equal-capacity servers
Least Connections   Routes to server with fewest active connections   Varying request durations
IP Hash             Hashes client IP → sticky sessions                Session-sensitive apps
Weighted RR         Routes proportional to server weight              Mixed server capacities
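
Two of the algorithms above can be sketched in a few lines each. This is an illustrative single-process model (server names are placeholders), not a production balancer:

```python
import itertools

class RoundRobin:
    """Cycle through servers in order (assumes roughly equal capacity)."""
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)

class LeastConnections:
    """Route to the server with the fewest active connections."""
    def __init__(self, servers):
        self.active = {s: 0 for s in servers}

    def pick(self):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1   # caller should decrement on completion
        return server

rr = RoundRobin(["s1", "s2", "s3"])
print([rr.pick() for _ in range(4)])   # ['s1', 's2', 's3', 's1']
```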

Layers

  • L4 (Transport) — TCP/UDP level, fast, no content inspection
  • L7 (Application) — HTTP-aware, supports routing by path/header (e.g., NGINX, HAProxy, AWS ALB)

Load Balancing — Architecture

                    ┌──────────────────┐
  Clients  ──────►  │   Load Balancer  │  ◄── Health Checks
                    └────────┬─────────┘
              ┌──────────────┼──────────────┐
              ▼              ▼              ▼
        ┌──────────┐  ┌──────────┐  ┌──────────┐
        │ Server 1 │  │ Server 2 │  │ Server 3 │
        └──────────┘  └──────────┘  └──────────┘
              │              │              │
              └──────────────┼──────────────┘
                             ▼
                    ┌─────────────────┐
                    │   Shared Cache  │
                    │   + Database    │
                    └─────────────────┘

Active-Passive failover vs. Active-Active — both are valid HA strategies.

Caching Strategies

The Cache Hierarchy

Browser Cache  →  CDN  →  API Gateway Cache  →  App Cache  →  DB Cache
  (client)       (edge)     (L7 proxy)         (Redis)       (query)

Write Strategies

Pattern                Behavior                                  Tradeoff
Write-Through          Write to cache and DB simultaneously      Consistent, higher write latency
Write-Behind (Async)   Write to cache first, DB asynchronously   Faster writes, risk of data loss
Write-Around           Skip cache, write directly to DB          Avoids cache pollution

Read Strategies

  • Cache-Aside (Lazy Loading) — App checks cache first, fills on miss (most common)
  • Read-Through — Cache layer handles DB reads transparently
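
Cache-aside is straightforward to sketch. Below, `cache` is an in-process dict standing in for Redis, and `db_read` is a hypothetical stand-in for a real database query:

```python
import time

cache = {}          # in-process dict standing in for Redis
TTL_SECONDS = 60

def db_read(key):
    """Hypothetical stand-in for a real database query."""
    return f"value-for-{key}"

def get(key):
    """Cache-aside: check the cache first, fill it on a miss."""
    entry = cache.get(key)
    if entry is not None:
        value, expires_at = entry
        if time.time() < expires_at:
            return value                       # hit
    value = db_read(key)                       # miss (or expired): read the DB
    cache[key] = (value, time.time() + TTL_SECONDS)  # populate for next time
    return value

print(get("user:1"))   # first call misses and fills the cache
print(get("user:1"))   # second call is served from the cache
```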

Caching — Eviction & Pitfalls

Eviction Policies

  • LRU — Least Recently Used (general purpose)
  • LFU — Least Frequently Used (for popularity-skewed data)
  • TTL — Time To Live expiry (freshness guarantee)
  • FIFO — First In, First Out (simple, less optimal)

Common Problems

Problem             Description                                       Solution
Cache Stampede      Many requests hit DB on simultaneous cache miss   Mutex lock / probabilistic early expiry
Cache Penetration   Requests for non-existent keys bypass cache       Bloom filter + cache null results
Cache Avalanche     Many keys expire at the same time                 Jitter TTL values

Rule of thumb: Cache data that is read-heavy, expensive to compute, and tolerant of slight staleness.
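
One of those mitigations, jittering TTLs to prevent an avalanche, is nearly a one-liner. A sketch with ±10% jitter (the base TTL and spread are arbitrary choices):

```python
import random

def jittered_ttl(base=300, spread=0.1):
    """TTL with +/-10% random jitter, so keys that were cached together
    don't all expire at the same moment."""
    return base * (1 + random.uniform(-spread, spread))

# each key written at the same time still gets a slightly different lifetime
print([round(jittered_ttl()) for _ in range(3)])
```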

Database Scaling

Replication

Primary-Replica (Leader-Follower):

  • All writes go to Primary
  • Reads distributed to Replicas
  • Enables read scaling + failover

Watch out for replication lag — replicas may serve stale reads.


Sharding (Horizontal Partitioning)

Split data across multiple DB nodes by a shard key.

Strategy          How                               Risk
Range-based       Shard by ID range                 Hot spots on sequential writes
Hash-based        hash(key) % N shards              Rebalancing on resize
Directory-based   Lookup table maps keys → shards   Lookup overhead
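
A minimal hash-based sharding sketch. `md5` is used only because it is stable across processes and restarts, unlike Python's built-in `hash()`:

```python
import hashlib

NUM_SHARDS = 4

def shard_for(key, num_shards=NUM_SHARDS):
    """Hash-based sharding: stable hash of the key, modulo shard count."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

print(shard_for("user:42"))   # same key always lands on the same shard
```

Note that changing `num_shards` remaps almost every key — the rebalancing risk from the table; consistent hashing reduces how many keys move on resize.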

Database Scaling — Patterns

Indexing Tips

  • Index high-cardinality columns used in WHERE, JOIN, ORDER BY
  • Composite indexes — column order matters (leftmost prefix rule)
  • Covering indexes — include all columns needed by the query

SQL vs. NoSQL at Scale

               SQL (RDBMS)                  NoSQL
Schema         Rigid, relational            Flexible, document/KV/graph
Scaling        Vertical (primarily)         Horizontal (built-in)
Transactions   ACID                         Eventual consistency (usually)
Best for       Complex queries, integrity   High write throughput, flexibility

Don't abandon SQL prematurely. PostgreSQL + read replicas handles enormous scale.

CAP Theorem

You Can Only Guarantee Two of Three

               Consistency
                  /   \
                CP     CA
                /       \
   Partition ─────AP───── Availability
   Tolerance

Guarantee                 Meaning
Consistency (C)           Every read returns the most recent write
Availability (A)          Every request receives a response (may be stale)
Partition Tolerance (P)   System continues despite network partitions

In practice, network partitions are inevitable in distributed systems. You're choosing between CP (strong consistency) or AP (high availability).

CAP — Real World Examples

CP Systems (Consistency over Availability)

  • HBase, Zookeeper, etcd — refuse to serve stale data
  • Trade: may reject reads/writes during a partition

AP Systems (Availability over Consistency)

  • Cassandra, DynamoDB, CouchDB — always respond, may be stale
  • Trade: eventual consistency, conflict resolution needed

PACELC Extension

Even when there's no partition, there's a tradeoff between latency and consistency.

System       Partition   Else
DynamoDB     AP          EL (low latency)
PostgreSQL   CP          EC (consistent)
Cassandra    AP          EL

Message Queues & Event Streaming

Why Async Matters

Synchronous request chains create fragile dependencies — one slow service degrades everything.

Async messaging decouples producers from consumers:

Producer  ──►  Queue / Topic  ──►  Consumer(s)
  (API)        (Kafka, SQS)         (workers)

Use Cases

  • Task queues — Email sending, image processing, report generation
  • Event streaming — Audit logs, analytics pipelines, CDC
  • Fan-out — One event → many consumers (pub/sub)
  • Backpressure — Queue absorbs traffic spikes, consumers process at their own pace
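
The decoupling and backpressure ideas can be modeled in-process with a bounded queue. Here `queue.Queue` stands in for Kafka/SQS, and the email task is hypothetical:

```python
import queue
import threading

tasks = queue.Queue(maxsize=100)   # bounded queue: a full queue is backpressure
results = []

def worker():
    """Consumer: drains the queue at its own pace."""
    while True:
        job = tasks.get()
        if job is None:            # sentinel tells the worker to stop
            break
        results.append(f"sent email to {job}")   # hypothetical task
        tasks.task_done()

t = threading.Thread(target=worker)
t.start()

for user in ["alice", "bob"]:      # producer: enqueue and move on
    tasks.put(user)                # blocks only if the queue is full
tasks.put(None)
t.join()
print(results)
```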

Message Queues — Kafka vs. SQS

Feature          Apache Kafka                          AWS SQS
Model            Log-based streaming                   Queue (pull-based)
Retention        Days/weeks (configurable)             14 days max
Consumer model   Consumer groups, offset tracking      Competing consumers
Ordering         Ordered per partition                 FIFO queue option
Throughput       Extremely high                        High
Best for         Event sourcing, streaming pipelines   Task queues, decoupling

Delivery Guarantees

  • At-most-once — May lose messages, never duplicates
  • At-least-once — May duplicate, never loses (most common)
  • Exactly-once — Hard, requires idempotency on consumer side
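
Under at-least-once delivery, consumers must be idempotent. A minimal dedup sketch (in production, `processed_ids` would live in a database or Redis, not process memory):

```python
processed_ids = set()   # in production: a DB table or Redis set

def handle(message_id, payload):
    """Idempotent consumer: a redelivered message is recognized and skipped."""
    if message_id in processed_ids:
        return False                # duplicate delivery: already handled
    # ... do the actual work with payload here ...
    processed_ids.add(message_id)   # record only after the work succeeds
    return True

print(handle("msg-1", "charge $10"))   # True: first delivery does the work
print(handle("msg-1", "charge $10"))   # False: the retry is safely ignored
```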

Rate Limiting & Throttling

Why Rate Limit?

  • Protect backend services from abuse or traffic spikes
  • Enforce fair usage in multi-tenant APIs
  • Prevent DDoS amplification

Algorithms

Algorithm                How It Works                                        Pros / Cons
Token Bucket             Tokens refill at fixed rate; request consumes one   Allows bursts; smooth average rate
Leaky Bucket             Requests drain at fixed rate                        Smooth output; drops bursts
Fixed Window             Count reqs per time window                          Simple; boundary burst problem
Sliding Window Log       Log timestamps, count within rolling window         Accurate; memory-intensive
Sliding Window Counter   Hybrid of fixed windows with interpolation          Efficient + accurate

Redis + atomic INCR + EXPIRE is a common distributed rate limiter implementation.
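
A single-process token bucket sketch; a distributed Redis variant follows the same shape, with the token state kept server-side:

```python
import time

class TokenBucket:
    """Tokens refill at a fixed rate; each request consumes one.
    Bursts up to `capacity` are allowed; the long-run average is `rate`."""

    def __init__(self, rate, capacity):
        self.rate = rate                  # tokens added per second
        self.capacity = capacity
        self.tokens = capacity            # start full: allow an initial burst
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=5, capacity=10)
print(sum(bucket.allow() for _ in range(20)))   # ~10 in a tight loop: the burst capacity
```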

CDNs & Edge Caching

Content Delivery Networks

A CDN is a globally distributed network of edge servers that caches content close to users.

User (Tokyo) ──► CDN Edge (Tokyo) ──► Origin Server (US-East)
                   Cache HIT ✓            (only on miss)

What to Put on a CDN

  • ✅ Static assets: JS, CSS, images, fonts
  • ✅ Video/media streaming
  • ✅ API responses with proper Cache-Control headers
  • ❌ Authenticated, user-specific responses
  • ❌ Real-time data (unless edge computing is used)

Cache Invalidation Strategies

  • Versioned URLs — app.v3.2.1.js (recommended)
  • Purge API — Explicit CDN cache invalidation on deploy
  • Short TTL — Frequent but automatic freshness

Microservices vs. Monolith

The Spectrum

Monolith ──────────────────────────────────► Microservices
  │                   │                             │
Simple deploy    Modular Monolith           Independent services
Single codebase  (recommended starting pt)  per domain/team

When to Split

Signal                                  Action
Teams blocked by shared code            Split the bounded context
One component needs different scaling   Extract as a service
Independent deployment needed           Service boundary
Distinct data ownership                 Separate DB + service

Start with a well-structured monolith. Premature decomposition creates distributed complexity without the scale to justify it.

Microservices — Communication Patterns

Synchronous (Request/Response)

  • REST — Simple, widely supported, stateless
  • gRPC — Binary, strongly typed, efficient (ideal for internal services)
  • GraphQL — Flexible queries, reduces over-fetching

Asynchronous (Event-Driven)

  • Services emit events; others react independently
  • Enables loose coupling and temporal decoupling

Service Discovery

Service A  ──►  Service Registry  ──►  Service B
               (Consul / Eureka)       (resolved IP)

  • Client-side discovery — Service A queries registry directly
  • Server-side discovery — Load balancer queries registry
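
Client-side discovery reduces to "ask the registry, then pick an instance". A toy in-memory sketch (the addresses are made up for illustration):

```python
import random

registry = {}   # service name -> list of healthy instance addresses

def register(service, address):
    """Instances call this on startup (a real registry also health-checks)."""
    registry.setdefault(service, []).append(address)

def resolve(service):
    """Client-side discovery: ask the registry, then pick an instance."""
    return random.choice(registry[service])

register("billing", "10.0.0.5:8080")
register("billing", "10.0.0.6:8080")
print(resolve("billing"))   # one of the two registered addresses
```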

Observability: Metrics, Logs, Traces

The Three Pillars

┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│   METRICS    │  │     LOGS     │  │   TRACES     │
│              │  │              │  │              │
│ Aggregated   │  │ Timestamped  │  │ Request path │
│ numeric data │  │ event records│  │ across svcs  │
│              │  │              │  │              │
│ Prometheus   │  │ ELK / Loki   │  │ Jaeger /     │
│ Datadog      │  │ Datadog      │  │ Zipkin / OTEL│
└──────────────┘  └──────────────┘  └──────────────┘

Key Metrics to Track (USE Method)

  • Utilization — % time resource is busy (CPU, disk)
  • Saturation — Work queued but not yet served
  • Errors — Rate of failed requests

Observability — SLOs & Alerting

SLI / SLO / SLA

Term              Meaning                           Example
SLI (Indicator)   The metric being measured         Request success rate
SLO (Objective)   Internal target for the SLI       99.9% success over 30 days
SLA (Agreement)   External contractual commitment   99.5% with penalty clauses

Error Budget

Error Budget = 100% - SLO
At a 99.9% SLO over 30 days → ~43.2 minutes/month of allowed downtime.

Spend your error budget on:

  • Planned maintenance windows
  • Risky deployments
  • Experiments & chaos engineering

Putting It All Together

A Scalable Web System

         ┌─────────────────────────────────────────────────┐
         │                    CDN / Edge                   │
         └────────────────────┬────────────────────────────┘
                              ▼
                   ┌─────────────────┐
                   │  Load Balancer  │  ◄── Rate Limiting
                   └────────┬────────┘
             ┌──────────────┼──────────────┐
             ▼              ▼              ▼
       ┌──────────┐   ┌──────────┐   ┌──────────┐
       │ App Svc  │   │ App Svc  │   │ App Svc  │
       └──────────┘   └──────────┘   └──────────┘
             │              │              │
             ▼              ▼              ▼
      ┌─────────┐    ┌────────────┐  ┌──────────────┐
      │  Redis  │    │  Primary   │  │ Message Queue│
      │  Cache  │    │  DB + Reps │  │   (Kafka)    │
      └─────────┘    └────────────┘  └──────────────┘

Key Takeaways


  • Scale horizontally — stateless services + shared state layer
  • Cache aggressively — but understand eviction & invalidation
  • Decouple with queues — absorb spikes, enable async workflows
  • Choose consistency tradeoffs — CAP is real; design explicitly
  • Observe everything — metrics, logs, traces before you scale
  • Start simpler than you think — premature optimization is costly

"The goal is not to build the most complex system. It's to build the simplest system that meets the current and near-future requirements."

Reference: Back-of-Envelope Estimation

Useful Numbers

Resource                 Approximate Latency
L1 cache read            ~1 ns
L2 cache read            ~10 ns
RAM read                 ~100 ns
SSD random read          ~100 µs
HDD random read          ~10 ms
Network (same DC)        ~0.5 ms
Network (cross-region)   ~100–300 ms

Storage Units

1 KB = 10³ B  |  1 MB = 10⁶ B  |  1 GB = 10⁹ B  |  1 TB = 10¹² B

Traffic Estimation

  • 1M DAU × 10 req/day = ~116 RPS average
  • Apply 2–3× peak multiplier for capacity planning
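
The estimate above as code, so the constants are explicit:

```python
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400

def average_rps(dau, requests_per_user_per_day):
    """Back-of-envelope: total daily requests spread evenly over a day."""
    return dau * requests_per_user_per_day / SECONDS_PER_DAY

avg = average_rps(1_000_000, 10)
print(round(avg))        # ~116 RPS average
print(round(avg * 3))    # ~347 RPS to provision for, with a 3x peak multiplier
```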