System Design Interview Preparation Guide

A comprehensive, interview-ready guide covering 75 system design topics with step-by-step walkthroughs, architecture diagrams (Mermaid), production-quality code examples in Java, Python, and Go, and hypothetical interview transcripts.

Live Site: https://spawn08.github.io/system-design-interview

Code Language Conventions

| Section | Primary Languages | Rationale |
| --- | --- | --- |
| Essential Topics (`basics/`) | Java | Core CS concepts with Java idioms |
| Software System Design (`software_system_design/`) | Java, Python, Go | Multi-language production examples |
| ML System Design (`ml_system_design/`) | Python | ML ecosystem standard |
| GenAI System Design (`genai_ml_system_design/`) | Python | GenAI/LLM system design |
| GenAI/ML Fundamentals (`genai_ml_basics/`) | Python | ML/AI ecosystem standard |
| Advanced Topics (`advanced/`) | Java, Go, Python | Infrastructure-focused examples |

Coverage Summary

| Section | Topics | Status |
| --- | --- | --- |
| Essential System Design | 11 | Complete |
| Advanced Topics | 13 | Complete |
| Software System Design | 24 | Complete |
| GenAI/ML Fundamentals | 7 | Complete |
| ML System Design | 10 | Complete |
| GenAI System Design | 10 | Complete |
| Total | 75 | All complete |

The guide covers core topics and example questions for system design interviews, with an emphasis on GenAI/ML and Senior Software Engineering roles.

I. Essential System Design Topics

These topics are fundamental to system design. A strong understanding of these concepts is crucial, regardless of your specific role. Topics are ordered from foundational to advanced.

1. Interview Framework

4-Step Approach: Requirements, High-Level Design, Deep Dive, Trade-offs
Time Management: 35-45 minute interview breakdown
Back-of-Envelope Estimation: Quick reference formulas
Communication: Driving the conversation, handling "what if" questions
Common Mistakes: Anti-patterns to avoid

2. Load Balancing

Types: Round Robin, Least Connections, IP Hash, Weighted Round Robin, etc.
Hardware vs. Software Load Balancers
Session Management: Sticky Sessions
Health Checks
Pros and Cons of different algorithms
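
As a rough illustration of two of the algorithms listed above, the sketch below implements Round Robin and Least Connections in memory. The class and method names (`RoundRobinBalancer`, `LeastConnectionsBalancer`, `pick`, `acquire`, `release`) are hypothetical, and health checks and weights are left out for brevity.

```python
import itertools


class RoundRobinBalancer:
    """Cycles through backends in a fixed order (hypothetical helper)."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(list(backends))

    def pick(self):
        # Each call returns the next backend, wrapping around at the end.
        return next(self._cycle)


class LeastConnectionsBalancer:
    """Picks the backend with the fewest in-flight requests."""

    def __init__(self, backends):
        # Maps backend name -> number of requests currently in flight.
        self.active = {backend: 0 for backend in backends}

    def acquire(self):
        # Ties are broken by insertion order of the backends.
        backend = min(self.active, key=self.active.get)
        self.active[backend] += 1
        return backend

    def release(self, backend):
        # Call when the request finishes so the count stays accurate.
        self.active[backend] -= 1
```

Round Robin ignores how busy each backend is, which is fine when requests cost about the same; Least Connections tends to behave better when request durations vary widely.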

3. Caching

Cache Types: In-memory (Redis, Memcached), CDN, Browser Cache, Database Cache
Cache Eviction Policies: LRU, LFU, FIFO, TTL
Cache Invalidation Strategies
Write Policies: Write-through, Write-back, Write-around
Cache Coherency
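
To make the LRU eviction policy concrete, here is a minimal in-memory sketch built on `collections.OrderedDict`. The `LRUCache` class is a hypothetical illustration, not the implementation used by Redis or Memcached, and it is not thread-safe.

```python
from collections import OrderedDict


class LRUCache:
    """Least-recently-used cache with a fixed capacity (illustrative only)."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key, default=None):
        if key not in self._items:
            return default
        # A read counts as a use, so move the key to the "recent" end.
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            # Evict the least recently used entry (the oldest end).
            self._items.popitem(last=False)
```

Both `get` and `put` run in constant time because `OrderedDict` keeps a linked list of keys alongside the hash table.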

4. Databases

Relational Databases (SQL):
- ACID properties
- Normalization
- Indexing
- Transactions
- Sharding
- Replication
NoSQL Databases:
- Key-Value, Document, Column-family, Graph databases
- CAP Theorem, BASE properties
- Use cases for each type
Database Scaling:
- Vertical vs. Horizontal Scaling
- Read Replicas
- Primary-Replica (Master-Slave), Multi-Primary (Master-Master)
Data Modeling

5. Networking

TCP/IP, UDP
HTTP/HTTPS, REST, gRPC
DNS
Proxies: Forward and Reverse
WebSockets
Key Metrics: Latency, Bandwidth, Throughput

6. Concurrency

Threads, Processes
Locks, Mutexes, Semaphores
Deadlocks, Race Conditions
Concurrency Patterns: e.g., Producer-Consumer
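
The Producer-Consumer pattern mentioned above can be sketched with Python's standard `queue` and `threading` modules. The function name `run_pipeline` and the squaring "work" are placeholders chosen for this example; a bounded queue is assumed so that a fast producer is slowed down by slow consumers.

```python
import queue
import threading


def run_pipeline(items, num_workers=2):
    """Feeds items through a bounded queue to worker threads and returns sorted results."""
    tasks = queue.Queue(maxsize=4)   # bounded: put() blocks when consumers fall behind
    results = []
    results_lock = threading.Lock()  # protects the shared results list
    sentinel = object()              # tells a worker there is no more work

    def consumer():
        while True:
            item = tasks.get()
            if item is sentinel:
                break
            with results_lock:
                results.append(item * item)  # placeholder for real work

    workers = [threading.Thread(target=consumer) for _ in range(num_workers)]
    for worker in workers:
        worker.start()

    # Producer: enqueue all items, then one sentinel per worker.
    for item in items:
        tasks.put(item)
    for _ in workers:
        tasks.put(sentinel)

    for worker in workers:
        worker.join()
    return sorted(results)
```

Results are sorted before returning because the order in which workers finish is not deterministic.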

7. Distributed Systems Concepts

Consistency and Availability: CAP Theorem
Distributed Consensus: Paxos, Raft
Eventual Consistency
Message Queues: Kafka, RabbitMQ, SQS
Distributed Hash Tables (DHTs)
Leader Election

8. API Design

RESTful APIs
GraphQL
API Versioning
Rate Limiting
Authentication and Authorization: OAuth, JWT

9. Security

Common Vulnerabilities: SQL Injection, XSS, CSRF
Encryption: Symmetric, Asymmetric
Hashing
TLS/SSL

10. Scalability, Availability, and Reliability

Horizontal vs. Vertical Scaling
Redundancy and Failover
Monitoring and Alerting
Disaster Recovery

11. Estimation and Capacity Planning

Ability to estimate storage, bandwidth, and compute needs based on user numbers, request rates, and data sizes.
Back-of-the-envelope calculations.
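
A back-of-the-envelope estimate usually reduces to a few multiplications. The sketch below shows one way to structure it; the function name `estimate_capacity` and every input figure (10 million daily users, 20 requests per user, and so on) are illustrative assumptions, not numbers from this guide.

```python
SECONDS_PER_DAY = 86_400


def estimate_capacity(daily_active_users, requests_per_user_per_day,
                      write_fraction, bytes_per_write,
                      peak_factor=2.0, retention_days=365):
    """Returns rough QPS and storage figures from a handful of assumptions."""
    daily_requests = daily_active_users * requests_per_user_per_day
    avg_qps = daily_requests / SECONDS_PER_DAY
    daily_writes = daily_requests * write_fraction
    storage_bytes = daily_writes * bytes_per_write * retention_days
    return {
        "avg_qps": avg_qps,
        "peak_qps": avg_qps * peak_factor,      # assumed peak-to-average ratio
        "storage_tb": storage_bytes / 1e12,     # decimal terabytes
    }


# Illustrative inputs only: 10M DAU, 20 requests each, 10% writes of 1,000 bytes.
example = estimate_capacity(
    daily_active_users=10_000_000,
    requests_per_user_per_day=20,
    write_fraction=0.1,
    bytes_per_write=1_000,
)
```

With these inputs the estimate comes to roughly 2,300 requests per second on average (about 4,600 at the assumed 2x peak) and about 7.3 TB of new data per year. In an interview, rounding to one significant figure is usually enough.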

II. Advanced Topics

These topics are generally more relevant for Senior/Staff roles and specialized areas.

1. Message Queues and Stream Processing

Kafka, RabbitMQ, SQS, Pulsar
Stream Processing Frameworks: Apache Flink, Apache Spark Streaming

2. Search Systems

Inverted Indexes
Elasticsearch, Solr

3. Data Warehousing and Data Lakes

Data Warehousing Concepts: ETL, Star Schema, Snowflake Schema
Data Lake Concepts: Hadoop, Spark

4. Microservices Architecture

Service Discovery
API Gateways
Circuit Breakers
Containerization: Docker, Kubernetes
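
One simplified way to picture a circuit breaker is as a wrapper that stops calling a failing dependency for a cooldown period. The sketch below is a minimal, single-threaded version; `CircuitBreaker`, `CircuitOpenError`, and the default thresholds are hypothetical names and values chosen for this example.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock          # injectable so tests can control time
        self._failures = 0
        self._opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open; failing fast")
            # Cooldown elapsed: half-open, so let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                # Open (or re-open) the circuit and restart the cooldown.
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Failing fast while the circuit is open protects both the caller (no threads stuck on timeouts) and the struggling dependency (no extra load while it recovers).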

5. Consistency Patterns

Strong Consistency
Eventual Consistency
Causal Consistency

6. Object Storage & CDN

Object Storage: S3-compatible APIs, buckets, multi-part uploads, storage classes
CDN: Edge caching, origin pull/push, geo-routing, cache invalidation
Pre-signed URLs: Temporary access, security patterns
Edge Compute: Lambda@Edge, Cloudflare Workers

7. Distributed Locking

Redis-Based: SET NX PX, Redlock algorithm, fencing tokens
ZooKeeper-Based: Ephemeral nodes, watch mechanism
Database-Based: SELECT FOR UPDATE, advisory locks, optimistic locking
etcd-Based: Lease-based approach, compare-and-swap
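
Fencing tokens guard against a client that still believes it holds a lock after its lease has expired. The sketch below simulates the idea in memory: `LockService` and `FencedStore` are hypothetical stand-ins for a real lock service and a storage layer, and lease expiry itself is not modeled.

```python
class LockService:
    """Issues a strictly increasing fencing token with every lock grant."""

    def __init__(self):
        self._token = 0

    def acquire(self):
        self._token += 1
        return self._token


class FencedStore:
    """Storage that rejects writes carrying a token older than one already seen."""

    def __init__(self):
        self.value = None
        self._highest_token = 0

    def write(self, token, value):
        if token < self._highest_token:
            return False              # stale lock holder; reject the write
        self._highest_token = token
        self.value = value
        return True


# Scenario: client A pauses (e.g. a long GC), its lease expires, B takes over.
locks, store = LockService(), FencedStore()
token_a = locks.acquire()                 # A is granted the lock
token_b = locks.acquire()                 # later, B is granted the lock
accepted_b = store.write(token_b, "from B")
accepted_a = store.write(token_a, "from A")   # A wakes up with a stale token
```

In this scenario B's write is accepted and A's late write is rejected, so the store keeps B's value. The check only works if the storage layer itself enforces it; a lock service alone cannot stop a paused client from writing.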

8. Observability

Logging: Structured logging, ELK stack, correlation IDs
Metrics: RED/USE methods, Prometheus, time-series databases
Distributed Tracing: OpenTelemetry, Jaeger, sampling strategies
SLIs/SLOs/SLAs: Error budgets, alerting best practices

9. Event Sourcing & CQRS

Event Sourcing: Events as source of truth, event store, replay
CQRS: Separate read/write models, eventual consistency
Projections: Materialized views, rebuilding from events
Use Cases: Financial ledgers, audit trails, order lifecycle
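
The relationship between an event store, replay, and a projection can be sketched in a few lines. The example below uses a toy account ledger; `Event`, `EventStore`, and `project_balances` are hypothetical names, and the store is an in-memory list rather than a durable log.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    account_id: str
    kind: str     # "deposited" or "withdrawn"
    amount: int   # minor units (e.g. cents) to avoid float rounding


class EventStore:
    """Append-only list of events; the events are the source of truth."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def replay(self):
        # Events are returned in the order they were appended.
        return iter(self._events)


def project_balances(events):
    """Builds a read model (account -> balance) by folding over the events."""
    balances = {}
    for event in events:
        if event.kind == "deposited":
            delta = event.amount
        elif event.kind == "withdrawn":
            delta = -event.amount
        else:
            raise ValueError(f"unknown event kind: {event.kind}")
        balances[event.account_id] = balances.get(event.account_id, 0) + delta
    return balances
```

Because the projection is derived purely from the events, it can be discarded and rebuilt at any time by replaying the store, which is what makes new read models cheap to add later.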

III. GenAI/ML Specific Topics

These topics are particularly important for system design interviews focused on Generative AI and Machine Learning.

1. Model Serving

REST APIs for model inference
Batch vs. Online Prediction
Model Versioning
A/B Testing of Models
Model Monitoring: drift detection, performance metrics
Serving Frameworks: TensorFlow Serving, TorchServe, Triton Inference Server

2. Feature Stores

Centralized management of features for training and inference
Consistency between training and serving data
Feature versioning

3. Data Pipelines for ML

Data Ingestion, Transformation, and Validation
Workflow Orchestration: Airflow, Kubeflow

4. Large Language Models (LLMs)

Prompt Engineering
Fine-tuning
Retrieval-Augmented Generation (RAG)
Vector Databases: for similarity search
Model Deployment and Scaling for LLMs
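
The retrieval step of RAG boils down to similarity search over embedding vectors. The sketch below does an exact, brute-force top-k search with cosine similarity in plain Python; `cosine_similarity` and `top_k` are hypothetical helpers, the two-dimensional vectors are made up for illustration, and a real vector database would typically use an approximate index instead of a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, documents, k=2):
    """Returns the ids of the k documents most similar to the query vector."""
    scored = [(cosine_similarity(query, vector), doc_id)
              for doc_id, vector in documents.items()]
    scored.sort(reverse=True)   # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Toy embeddings; real ones would come from an embedding model.
docs = {"doc-a": [1.0, 0.0], "doc-b": [0.0, 1.0], "doc-c": [0.9, 0.1]}
nearest = top_k([1.0, 0.0], docs, k=2)
```

Here the query points in the same direction as `doc-a` and almost the same direction as `doc-c`, so those two are returned in that order. The retrieved documents would then be placed into the LLM prompt as context.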

5. Distributed Training

Data Parallelism
Model Parallelism
Parameter Servers

IV. Top 49 System Design Interview Questions

These questions are grouped by category and cover a range of difficulty levels. How you approach the problem often matters more than reaching a "perfect" solution. All questions have full walkthroughs in this guide.

General System Design (24 designs)

Design a URL Shortener (TinyURL): Hashing, databases, scaling.
Design a Rate Limiter: Algorithms (token bucket, leaky bucket), distributed systems.
Design a Key-Value Store: Consistent hashing, replication, conflict resolution.
Design a Distributed Cache: Caching strategies, consistency, eviction policies.
Design a Web Crawler: Concurrency, distributed processing, politeness policies.
Design a Notification System: Message queues, push vs. pull, scalability.
Design a Chat System: WebSockets, message ordering, presence tracking.
Design a Social Media Feed: Fan-out strategies, ranking, caching.
Design Search Autocomplete: Trie data structure, ranking, real-time updates.
Design a Voting System: Consistency, duplicate prevention, real-time results.
Design YouTube / Video Streaming: CDN, transcoding, adaptive bitrate.
Design Instagram / Photo Sharing: Object storage, feed, image processing.
Design Google Docs / Collaborative Editor: OT/CRDTs, WebSocket, conflict resolution.
Design Uber/Lyft / Ride Sharing: Geospatial indexing, matching, real-time tracking.
Design Google Drive / Cloud Storage: File sync, chunking, deduplication.
Design Ticketmaster / Event Booking: Inventory locking, virtual queues, flash crowds.
Design a Distributed Task Scheduler: Priority queues, lease-based execution, timing wheels.
Design a Payment System: Idempotency, double-entry ledger, PCI compliance.
Design a Proximity Service: Geohash, quadtree, spatial indexing.
Design a Distributed Message Queue (Kafka): Append-only log, partitioning, consumer groups, zero-copy I/O.
Design a Metrics & Monitoring System (Datadog): Time-series storage, Gorilla compression, alerting pipeline, federation.
Design an Email Delivery System (SendGrid/SES): SMTP, DKIM/SPF/DMARC, IP reputation, deliverability, bounce handling.
Design a Distributed File System (GFS/HDFS): Master-chunk architecture, replication, leases, consistency model, garbage collection.
Design an Ad Click Event Aggregator: Real-time aggregation, exactly-once counting, Flink/Kafka, click fraud detection, reconciliation.
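
As an example of the algorithms named in the Rate Limiter entry above, here is a minimal single-process token bucket. The `TokenBucket` class and its `allow` method are hypothetical names; the clock is injectable so the behavior can be tested deterministically, and a distributed limiter would need shared state (for example in Redis) that this sketch does not cover.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity` and a sustained rate of `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)   # start with a full bucket
        self._last = clock()

    def allow(self, cost=1):
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill based on elapsed time, never exceeding capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

Compared with a leaky bucket, which smooths traffic to a constant outflow, a token bucket tolerates short bursts as long as the long-run average stays under the refill rate.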

ML System Design (10 designs)

Design a Recommendation System (Netflix/Amazon): Collaborative filtering, Two-Tower models, cold start, A/B testing.
Design Real-time Fraud Detection: Feature engineering, velocity features, class imbalance, ensemble models.
Design Image Search: CLIP embeddings, vector databases, ANN indexes, re-ranking.
Design Image Caption Generation: Encoder-decoder, attention, Triton serving.
Design Search Ranking: BM25, LambdaMART, retrieval + ranking + re-ranking, NDCG.
Design Real-time Personalization: Session models, contextual bandits, multi-task ranking.
Design an Ads Ranking System (Google/Meta): CTR prediction, auction mechanics, budget pacing, calibration.
Design a Real-time Feature Platform (Feast/Tecton): Streaming features, point-in-time joins, train-serve consistency, feature monitoring.
Design a Machine Translation System (Google Translate): Transformer, multilingual NMT, quality estimation, low-resource languages, beam search.
Design a Speech Recognition System (Google STT/Whisper): CTC/RNN-T, streaming ASR, speaker diarization, mel spectrograms, language model fusion.

GenAI System Design (10 designs — with interview transcripts)

Design an LLM-Powered Chatbot: KV-cache, PagedAttention, speculative decoding, RLHF, guardrails, streaming.
Design an Enterprise RAG System: Chunking, hybrid retrieval, re-ranking, ACL-aware search, citation grounding.
Design an AI Code Assistant: Fill-in-the-middle, speculative decoding, repository-level context, telemetry.
Design an LLM Content Moderation System: Cascade architecture, adversarial robustness, human-in-the-loop, fairness.
Design an ML Training Platform: Gang scheduling, checkpointing, distributed training, GPU cluster management.
Design a Multi-Modal Search System: CLIP/SigLIP embeddings, cross-modal retrieval, ScaNN, video search.
Design an AI Agent System: ReAct pattern, tool calling, planning, memory architecture, multi-agent orchestration.
Design an LLM Gateway / AI Proxy: Multi-model routing, semantic caching, cost control, PII scrubbing.
Design a Text-to-Image Generation System: Diffusion models, latent space, CFG, safety, content provenance.
Design a Vector Database (Pinecone/Qdrant): HNSW, IVF-PQ, hybrid search, billion-scale ANN, sharding.

Senior Software Engineer System Design (Focus on Architecture & Trade-offs)

Design a system to handle a sudden surge in traffic (e.g., a viral event). Load balancing, auto-scaling, caching, circuit breakers.
You are tasked with migrating a monolithic application to a microservices architecture. Describe your approach. Service decomposition, API design, data consistency, deployment.
Design a system that needs to be highly available and fault-tolerant across multiple data centers. Replication, consistency, disaster recovery, network considerations.

V. Key Tips for System Design Interviews

Clarify Requirements: Ask clarifying questions! Don't make assumptions. Understand the scale, constraints, and non-functional requirements (availability, consistency, latency, etc.).
Start Simple: Begin with a high-level design and gradually add details.
Think Out Loud: Explain your thought process, trade-offs, and design choices.
Use Diagrams: Draw diagrams to illustrate your design.
Consider Trade-offs: There's rarely a single "right" answer. Discuss pros and cons.
Scale Incrementally: Start with a design for a smaller scale, then discuss scaling.
Handle Failure: Discuss how your system would handle failures.
Data Modeling: Pay attention to data storage and access. Choose appropriate databases.
Bottlenecks: Identify potential bottlenecks and discuss solutions.
Practice: The more you practice, the better you'll become.

Good luck with your interviews!

VI. GitHub Pages Deployment

This guide is published as a static site using Jekyll and GitHub Pages. Below are the setup and deployment instructions.

Live Site

URL: https://spawn08.github.io/system-design-interview

Technology Stack

| Component | Technology |
| --- | --- |
| Static Site Generator | Jekyll 4.3+ |
| Theme | Just the Docs v0.8.2 |
| Color Scheme | Dark |
| Diagrams | Mermaid.js (client-side rendering) |
| CI/CD | GitHub Actions |
| Hosting | GitHub Pages |

Prerequisites

Ruby 3.1+
Bundler (gem install bundler)
Git

Local Development

# Clone the repository
git clone https://github.com/spawn08/system-design-interview.git
cd system-design-interview

# Install dependencies
bundle install

# Serve locally with live reload
bundle exec jekyll serve --livereload

# Site will be available at http://localhost:4000/system-design-interview/

Project Structure

system-design-interview/
├── .github/workflows/
│   └── deploy.yml              # GitHub Actions CI/CD pipeline
├── _includes/
│   ├── footer_custom.html      # Custom footer
│   └── head_custom.html        # Fonts, Mermaid.js, custom styles
├── _sass/custom/
│   └── custom.scss             # Theme overrides and custom styles
├── basics/                     # Essential System Design Topics (11 topics)
│   ├── index.md
│   ├── interview_framework.md  # NEW - How to approach any design question
│   ├── estimation.md
│   ├── networking.md
│   ├── databases.md
│   ├── caching.md
│   ├── load_balancer.md
│   ├── api_design.md
│   ├── concurrency.md
│   ├── security.md
│   ├── scalability.md
│   └── distributed_systems.md
├── advanced/                   # Advanced Topics (9 topics, Senior/Staff level)
│   ├── index.md
│   ├── message_queues.md
│   ├── search_systems.md
│   ├── consistency_patterns.md
│   ├── microservices.md
│   ├── data_warehousing.md
│   ├── object_storage_cdn.md   # NEW
│   ├── distributed_locking.md  # NEW
│   ├── observability.md        # NEW
│   └── event_sourcing_cqrs.md  # NEW
├── software_system_design/     # System Design Problems (24 designs)
│   ├── index.md
│   ├── url_shortening.md
│   ├── rate_limiter.md
│   ├── key_value_store.md
│   ├── distributed_cache.md
│   ├── notification_system.md
│   ├── web_crawler.md
│   ├── chat_system.md
│   ├── news_feed.md
│   ├── search_autocomplete.md
│   ├── voting-system-design.md
│   ├── video_streaming.md
│   ├── photo_sharing.md
│   ├── collaborative_editor.md
│   ├── ride_sharing.md
│   ├── cloud_storage.md
│   ├── event_booking.md
│   ├── task_scheduler.md
│   ├── payment_system.md
│   ├── proximity_service.md
│   ├── message_queue.md        # NEW - Distributed Message Queue (Kafka)
│   ├── metrics_monitoring.md   # NEW - Metrics & Monitoring System (Datadog)
│   ├── email_delivery.md       # NEW - Email Delivery System (SendGrid/SES)
│   ├── distributed_file_system.md # NEW - Distributed File System (GFS/HDFS)
│   ├── ad_click_aggregator.md  # NEW - Ad Click Event Aggregator
│   └── staff_engineer_expectations.md
├── genai_ml_basics/            # GenAI/ML Fundamentals (7 building blocks)
│   ├── index.md
│   ├── model_serving.md
│   ├── feature_stores.md
│   ├── data_pipelines.md
│   ├── llm_systems.md
│   ├── distributed_training.md
│   ├── llm_evaluation.md      # NEW - LLM Evaluation & Benchmarking
│   └── rlhf_alignment.md      # NEW - RLHF / DPO Alignment
├── ml_system_design/           # ML System Design (10 designs)
│   ├── index.md
│   ├── recommendation_system.md
│   ├── fraud_detection.md
│   ├── image_search.md
│   ├── image_caption_generator.md
│   ├── search_ranking.md
│   ├── realtime_personalization.md
│   ├── ads_ranking.md          # NEW - Ads Ranking System (Google/Meta)
│   ├── feature_platform.md    # NEW - Real-time Feature Platform
│   ├── machine_translation.md # NEW - Machine Translation (Google Translate)
│   └── speech_recognition.md  # NEW - Speech Recognition (ASR)
├── genai_ml_system_design/    # GenAI System Design (10 designs)
│   ├── index.md
│   ├── llm_chatbot.md
│   ├── enterprise_rag.md
│   ├── ai_code_assistant.md
│   ├── content_moderation.md
│   ├── ml_training_platform.md
│   ├── multimodal_search.md
│   ├── ai_agent_system.md      # NEW - AI Agent System (ReAct, tools, memory)
│   ├── llm_gateway.md          # NEW - LLM Gateway / AI Proxy
│   ├── text_to_image.md        # NEW - Text-to-Image Generation (Imagen/DALL-E)
│   └── vector_database.md     # NEW - Vector Database (Pinecone/Qdrant)
├── _config.yml                 # Jekyll site configuration
├── Gemfile                     # Ruby dependencies
├── index.md                    # Home page
└── README.md                   # This file

Deployment Pipeline

The site is automatically deployed via GitHub Actions on every push to main:

Trigger: Push to main branch or manual workflow dispatch
Build: GitHub Actions checks out the code, sets up Ruby 3.1, installs dependencies via Bundler, and builds the Jekyll site
Deploy: The built site is uploaded as a GitHub Pages artifact and deployed to the github-pages environment

GitHub Actions Workflow (`.github/workflows/deploy.yml`)

The pipeline uses the following actions:

actions/checkout@v4 — checks out repository
ruby/setup-ruby@v1 — installs Ruby with bundler caching
actions/configure-pages@v4 — configures GitHub Pages
actions/upload-pages-artifact@v3 — uploads the built _site directory
actions/deploy-pages@v4 — deploys to GitHub Pages

Required GitHub Repository Settings

Go to Settings → Pages
Under Build and deployment, select GitHub Actions as the source
Ensure the repository has Pages enabled under Settings → Pages
The workflow requires these permissions (already configured in deploy.yml):
- contents: read
- pages: write
- id-token: write

Adding New Content

Create a new .md file in the appropriate directory (basics/, advanced/, genai_ml_basics/, software_system_design/, ml_system_design/, or genai_ml_system_design/)

Add the Jekyll front matter:

---
layout: default
title: Your Topic Title
parent: Fundamentals    # or "Advanced Topics", "GenAI/ML Fundamentals", "System Design Examples", "ML System Design", "GenAI System Design"
nav_order: N            # determines position in navigation
---

Use Mermaid for diagrams (rendered client-side):

```mermaid
flowchart TD
    A[Start] --> B[End]
```

Use Just the Docs callouts for emphasis:

{: .note }
> This is a note callout.

{: .tip }
> This is a tip callout.

{: .warning }
> This is a warning callout.

Commit and push to main — the site will auto-deploy in ~2 minutes

Troubleshooting

| Issue | Solution |
| --- | --- |
| Build fails on GitHub | Check the Actions tab for error logs; usually a Gemfile or front matter issue |
| Mermaid diagrams not rendering | Ensure `head_custom.html` includes the Mermaid CDN script |
| Navigation order wrong | Adjust `nav_order` in the page's front matter |
| Page not appearing | Verify `parent` in front matter matches the parent page's `title` exactly |
| Local serve fails | Run `bundle update` to update gems, ensure Ruby 3.1+ |
| Pages 404 after deploy | Verify `baseurl` in `_config.yml` matches your repo name |
| CSS/styles broken locally | Run `bundle exec jekyll clean` then rebuild |

Content Style Guide

Each system design topic follows a consistent interview-ready structure:

Software System Design pages include:

What We're Building — problem statement with real-world scale
Step 1: Requirements — functional/non-functional requirements tables, API design
Step 2: Estimation — back-of-envelope calculations
Step 3: High-Level Design — Mermaid architecture diagram
Step 4: Deep Dive — 6-10 subsections with code in Java, Python, and Go
Step 5: Scaling & Production — failure handling, monitoring, trade-offs
Interview Tips — common follow-up questions

ML System Design pages include:

What We're Building — problem statement with business impact
ML Concepts Primer — key ML concepts needed for the design
Step 1: Requirements — functional/non-functional + metrics (online and offline)
Step 2: Estimation — QPS, storage, model inference budget
Step 3: High-Level Design — online + offline pipeline diagrams
Step 4: Deep Dive — 8-10 subsections with Python code examples
Step 5: Scaling & Production — failure handling, privacy, monitoring

GenAI System Design pages include:

What We're Building — problem statement with Google-scale metrics
Key Concepts Primer — GenAI-specific concepts (KV-cache, PagedAttention, CLIP, etc.)
Step 1: Requirements — functional/non-functional with GenAI-specific NFRs
Step 2: Estimation — GPU compute, KV-cache memory, inference cost modeling
Step 3: High-Level Design — Mermaid architecture diagrams
Step 4: Deep Dive — 6-8 subsections with Python code examples
Step 5: Scaling & Production — failure handling, monitoring, trade-offs
Hypothetical Interview Transcript — full 45-minute Google-style interview simulation

Contributing

Fork the repository
Create a feature branch: git checkout -b feature/new-topic
Add your topic following the structure above
Test locally with bundle exec jekyll serve
Submit a pull request

License

This project is open source and available for educational purposes.
