A comprehensive, interview-ready guide covering 75 system design topics with step-by-step walkthroughs, architecture diagrams (Mermaid), production-quality code examples in Java, Python, and Go, and hypothetical interview transcripts.
Live Site: https://spawn08.github.io/system-design-interview
| Section | Primary Languages | Rationale |
|---|---|---|
| Essential Topics (basics/) | Java | Core CS concepts with Java idioms |
| Software System Design (software_system_design/) | Java, Python, Go | Multi-language production examples |
| ML System Design (ml_system_design/) | Python | ML ecosystem standard |
| GenAI System Design (genai_ml_system_design/) | Python | GenAI/LLM system design |
| GenAI/ML Fundamentals (genai_ml_basics/) | Python | ML/AI ecosystem standard |
| Advanced Topics (advanced/) | Java, Go, Python | Infrastructure-focused examples |
| Section | Topics | Status |
|---|---|---|
| Essential System Design | 11 | Complete |
| Advanced Topics | 13 | Complete |
| Software System Design | 24 | Complete |
| GenAI/ML Fundamentals | 7 | Complete |
| ML System Design | 10 | Complete |
| GenAI System Design | 10 | Complete |
| Total | 75 | All complete |
This guide provides a comprehensive overview of topics and example questions for system design interviews, particularly for roles in GenAI/ML and Senior Software Engineering.
These topics are fundamental to system design. A strong understanding of these concepts is crucial, regardless of your specific role. Topics are ordered from foundational to advanced.
- 4-Step Approach: Requirements, High-Level Design, Deep Dive, Trade-offs
- Time Management: 35-45 minute interview breakdown
- Back-of-Envelope Estimation: Quick reference formulas
- Communication: Driving the conversation, handling "what if" questions
- Common Mistakes: Anti-patterns to avoid
- Types: Round Robin, Least Connections, IP Hash, Weighted Round Robin, etc.
- Hardware vs. Software Load Balancers
- Session Management: Sticky Sessions
- Health Checks
- Pros and Cons of different algorithms
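
As a rough illustration of two of the algorithms listed above, here is a minimal Python sketch of Round Robin and Least Connections selection. The class and method names (`RoundRobinBalancer`, `LeastConnectionsBalancer`, `pick`, `acquire`, `release`) are hypothetical and chosen for this example only; a real load balancer would also need health checks and thread safety.

```python
import itertools


class RoundRobinBalancer:
    """Hands out backends in a fixed, repeating order."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("at least one backend is required")
        self._cycle = itertools.cycle(list(backends))

    def pick(self):
        return next(self._cycle)


class LeastConnectionsBalancer:
    """Hands out the backend with the fewest in-flight requests."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("at least one backend is required")
        # Maps backend name -> number of requests currently in flight.
        self._active = {backend: 0 for backend in backends}

    def acquire(self):
        # min() returns the first minimum, so ties break by insertion order.
        backend = min(self._active, key=self._active.get)
        self._active[backend] += 1
        return backend

    def release(self, backend):
        # Call when a request finishes so the count reflects live load.
        if self._active[backend] > 0:
            self._active[backend] -= 1
```

Round Robin ignores how busy each backend is, which works well when requests cost about the same; Least Connections adapts to uneven request durations at the price of tracking per-backend state.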
- Cache Types: In-memory (Redis, Memcached), CDN, Browser Cache, Database Cache
- Cache Eviction Policies: LRU, LFU, FIFO, TTL
- Cache Invalidation Strategies
- Write Policies: Write-through, Write-back, Write-around
- Cache Coherency
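
To make the LRU eviction policy listed above concrete, the following is a small sketch built on Python's `collections.OrderedDict`. The `LRUCache` class and its `get`/`put` methods are illustrative names, not an API from this guide, and the sketch is not thread-safe.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used key."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, key, default=None):
        if key not in self._items:
            return default
        # A read counts as a use, so move the key to the "newest" end.
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self._capacity:
            # Drop the entry at the "oldest" end.
            self._items.popitem(last=False)
```

For example, with capacity 2, inserting `a` and `b`, reading `a`, and then inserting `c` evicts `b`, because `a` was touched more recently.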
- Relational Databases (SQL):
  - ACID properties
  - Normalization
  - Indexing
  - Transactions
  - Sharding
  - Replication
- NoSQL Databases:
  - Key-Value, Document, Column-family, Graph databases
  - CAP Theorem, BASE properties
  - Use cases for each type
- Database Scaling:
  - Vertical vs. Horizontal Scaling
  - Read Replicas
  - Master-Slave, Master-Master
- Data Modeling
- TCP/IP, UDP
- HTTP/HTTPS, REST, gRPC
- DNS
- Proxies: Forward and Reverse
- WebSockets
- Key Metrics: Latency, Bandwidth, Throughput
- Threads, Processes
- Locks, Mutexes, Semaphores
- Deadlocks, Race Conditions
- Concurrency Patterns: e.g., Producer-Consumer
- Consistency and Availability: CAP Theorem
- Distributed Consensus: Paxos, Raft
- Eventual Consistency
- Message Queues: Kafka, RabbitMQ, SQS
- Distributed Hash Tables (DHTs)
- Leader Election
- RESTful APIs
- GraphQL
- API Versioning
- Rate Limiting
- Authentication and Authorization: OAuth, JWT
- Common Vulnerabilities: SQL Injection, XSS, CSRF
- Encryption: Symmetric, Asymmetric
- Hashing
- TLS/SSL
- Horizontal vs. Vertical Scaling
- Redundancy and Failover
- Monitoring and Alerting
- Disaster Recovery
- Ability to estimate storage, bandwidth, and compute needs based on user numbers, request rates, and data sizes.
- Back-of-the-envelope calculations.
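
A back-of-the-envelope estimate of the kind described above can be written down as a few lines of arithmetic. The sketch below is one possible way to do it; the function name `estimate`, its parameters, and the example numbers (10 million daily active users, 20 requests per user per day, 1 KB per write, 10% writes, a 2x peak factor, one year of retention) are all assumptions made up for illustration.

```python
SECONDS_PER_DAY = 86_400


def estimate(daily_active_users, requests_per_user_per_day, bytes_per_write,
             write_fraction, peak_factor=2.0, retention_days=365):
    """Returns rough QPS and storage figures from user-level inputs."""
    total_requests = daily_active_users * requests_per_user_per_day
    avg_qps = total_requests / SECONDS_PER_DAY
    writes_per_day = total_requests * write_fraction
    storage_bytes = writes_per_day * bytes_per_write * retention_days
    return {
        "avg_qps": avg_qps,
        "peak_qps": avg_qps * peak_factor,
        "write_qps": writes_per_day / SECONDS_PER_DAY,
        "storage_tb": storage_bytes / 1e12,  # decimal terabytes
    }


if __name__ == "__main__":
    result = estimate(
        daily_active_users=10_000_000,
        requests_per_user_per_day=20,
        bytes_per_write=1_000,
        write_fraction=0.1,
    )
    for name, value in result.items():
        print(f"{name}: {value:,.1f}")
```

With these assumed inputs the average load is roughly 2,300 requests per second and one year of writes comes to about 7.3 TB, before replication and indexes. In an interview, rounding to one or two significant figures is usually enough.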
These topics are generally more relevant for Senior/Staff roles and specialized areas.
- Kafka, RabbitMQ, SQS, Pulsar
- Stream Processing Frameworks: Apache Flink, Apache Spark Streaming
- Inverted Indexes
- Elasticsearch, Solr
- Data Warehousing Concepts: ETL, Star Schema, Snowflake Schema
- Data Lake Concepts: Hadoop, Spark
- Service Discovery
- API Gateways
- Circuit Breakers
- Containerization: Docker, Kubernetes
- Strong Consistency
- Eventual Consistency
- Causal Consistency
- Object Storage: S3-compatible APIs, buckets, multi-part uploads, storage classes
- CDN: Edge caching, origin pull/push, geo-routing, cache invalidation
- Pre-signed URLs: Temporary access, security patterns
- Edge Compute: Lambda@Edge, Cloudflare Workers
- Redis-Based: SET NX PX, Redlock algorithm, fencing tokens
- ZooKeeper-Based: Ephemeral nodes, watch mechanism
- Database-Based: SELECT FOR UPDATE, advisory locks, optimistic locking
- etcd-Based: Lease-based approach, compare-and-swap
- Logging: Structured logging, ELK stack, correlation IDs
- Metrics: RED/USE methods, Prometheus, time-series databases
- Distributed Tracing: OpenTelemetry, Jaeger, sampling strategies
- SLIs/SLOs/SLAs: Error budgets, alerting best practices
- Event Sourcing: Events as source of truth, event store, replay
- CQRS: Separate read/write models, eventual consistency
- Projections: Materialized views, rebuilding from events
- Use Cases: Financial ledgers, audit trails, order lifecycle
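
The event sourcing ideas above (events as the source of truth, replay, and projections) can be sketched in a few lines of Python. This is a toy in-memory example, assuming a hypothetical ledger with `deposited` and `withdrawn` events; the names `Event`, `EventStore`, and `project_balances` are invented for illustration, and a production event store would add persistence, ordering guarantees, and versioning.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    account_id: str
    kind: str    # "deposited" or "withdrawn" (hypothetical event types)
    amount: int  # minor currency units, e.g. cents


class EventStore:
    """Append-only log; state is never updated in place."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def read_all(self):
        # Return an immutable snapshot so callers cannot rewrite history.
        return tuple(self._events)


def project_balances(events):
    """Rebuilds a read model (balance per account) by replaying events."""
    balances = {}
    for event in events:
        if event.kind == "deposited":
            delta = event.amount
        elif event.kind == "withdrawn":
            delta = -event.amount
        else:
            raise ValueError(f"unknown event kind: {event.kind}")
        balances[event.account_id] = balances.get(event.account_id, 0) + delta
    return balances
```

Because the projection is a pure function of the event log, it can be discarded and rebuilt at any time, which is what makes separate read models in CQRS practical.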
These topics are particularly important for system design interviews focused on Generative AI and Machine Learning.
- REST APIs for model inference
- Batch vs. Online Prediction
- Model Versioning
- A/B Testing of Models
- Model Monitoring: drift detection, performance metrics
- Serving Frameworks: TensorFlow Serving, TorchServe, Triton Inference Server
- Centralized management of features for training and inference
- Consistency between training and serving data
- Feature versioning
- Data Ingestion, Transformation, and Validation
- Workflow Orchestration: Airflow, Kubeflow
- Prompt Engineering
- Fine-tuning
- Retrieval-Augmented Generation (RAG)
- Vector Databases: for similarity search
- Model Deployment and Scaling for LLMs
- Data Parallelism
- Model Parallelism
- Parameter Servers
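
To show what data parallelism means in practice, here is a deliberately tiny sketch in plain Python: each worker computes a gradient on its own shard, and the gradients are averaged before a single shared update. The one-parameter model `y_hat = w * x`, the function names, and the list-based "all-reduce" are assumptions for illustration; real systems use a framework and a collective communication library.

```python
def gradient(w, batch):
    """d/dw of mean squared error for the model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_step(w, shards, lr):
    """One SGD step where each shard stands in for one worker."""
    # Each worker computes a gradient on its local shard only.
    local_grads = [gradient(w, shard) for shard in shards]
    # Averaging stands in for an all-reduce across workers.
    avg_grad = sum(local_grads) / len(local_grads)
    # Every worker applies the same update, so replicas stay identical.
    return w - lr * avg_grad
```

When the shards are the same size, the averaged gradient equals the gradient over the full batch, so the result matches single-worker training on the combined data while the work is split across workers.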
These questions are categorized and cover a range of difficulty levels. Remember that the process of how you approach the problem is often more important than finding a "perfect" solution. All questions have full walkthroughs in this guide.
- Design a URL Shortener (TinyURL): Hashing, databases, scaling.
- Design a Rate Limiter: Algorithms (token bucket, leaky bucket), distributed systems.
- Design a Key-Value Store: Consistent hashing, replication, conflict resolution.
- Design a Distributed Cache: Caching strategies, consistency, eviction policies.
- Design a Web Crawler: Concurrency, distributed processing, politeness policies.
- Design a Notification System: Message queues, push vs. pull, scalability.
- Design a Chat System: WebSockets, message ordering, presence tracking.
- Design a Social Media Feed: Fan-out strategies, ranking, caching.
- Design Search Autocomplete: Trie data structure, ranking, real-time updates.
- Design a Voting System: Consistency, duplicate prevention, real-time results.
- Design YouTube / Video Streaming: CDN, transcoding, adaptive bitrate.
- Design Instagram / Photo Sharing: Object storage, feed, image processing.
- Design Google Docs / Collaborative Editor: OT/CRDTs, WebSocket, conflict resolution.
- Design Uber/Lyft / Ride Sharing: Geospatial indexing, matching, real-time tracking.
- Design Google Drive / Cloud Storage: File sync, chunking, deduplication.
- Design Ticketmaster / Event Booking: Inventory locking, virtual queues, flash crowds.
- Design a Distributed Task Scheduler: Priority queues, lease-based execution, timing wheels.
- Design a Payment System: Idempotency, double-entry ledger, PCI compliance.
- Design a Proximity Service: Geohash, quadtree, spatial indexing.
- Design a Distributed Message Queue (Kafka): Append-only log, partitioning, consumer groups, zero-copy I/O.
- Design a Metrics & Monitoring System (Datadog): Time-series storage, Gorilla compression, alerting pipeline, federation.
- Design an Email Delivery System (SendGrid/SES): SMTP, DKIM/SPF/DMARC, IP reputation, deliverability, bounce handling.
- Design a Distributed File System (GFS/HDFS): Master-chunk architecture, replication, leases, consistency model, garbage collection.
- Design an Ad Click Event Aggregator: Real-time aggregation, exactly-once counting, Flink/Kafka, click fraud detection, reconciliation.
- Design a Recommendation System (Netflix/Amazon): Collaborative filtering, Two-Tower models, cold start, A/B testing.
- Design Real-time Fraud Detection: Feature engineering, velocity features, class imbalance, ensemble models.
- Design Image Search: CLIP embeddings, vector databases, ANN indexes, re-ranking.
- Design Image Caption Generation: Encoder-decoder, attention, Triton serving.
- Design Search Ranking: BM25, LambdaMART, retrieval + ranking + re-ranking, NDCG.
- Design Real-time Personalization: Session models, contextual bandits, multi-task ranking.
- Design an Ads Ranking System (Google/Meta): CTR prediction, auction mechanics, budget pacing, calibration.
- Design a Real-time Feature Platform (Feast/Tecton): Streaming features, point-in-time joins, train-serve consistency, feature monitoring.
- Design a Machine Translation System (Google Translate): Transformer, multilingual NMT, quality estimation, low-resource languages, beam search.
- Design a Speech Recognition System (Google STT/Whisper): CTC/RNN-T, streaming ASR, speaker diarization, mel spectrograms, language model fusion.
- Design an LLM-Powered Chatbot: KV-cache, PagedAttention, speculative decoding, RLHF, guardrails, streaming.
- Design an Enterprise RAG System: Chunking, hybrid retrieval, re-ranking, ACL-aware search, citation grounding.
- Design an AI Code Assistant: Fill-in-the-middle, speculative decoding, repository-level context, telemetry.
- Design an LLM Content Moderation System: Cascade architecture, adversarial robustness, human-in-the-loop, fairness.
- Design an ML Training Platform: Gang scheduling, checkpointing, distributed training, GPU cluster management.
- Design a Multi-Modal Search System: CLIP/SigLIP embeddings, cross-modal retrieval, ScaNN, video search.
- Design an AI Agent System: ReAct pattern, tool calling, planning, memory architecture, multi-agent orchestration.
- Design an LLM Gateway / AI Proxy: Multi-model routing, semantic caching, cost control, PII scrubbing.
- Design a Text-to-Image Generation System: Diffusion models, latent space, CFG, safety, content provenance.
- Design a Vector Database (Pinecone/Qdrant): HNSW, IVF-PQ, hybrid search, billion-scale ANN, sharding.
- Design a system to handle a sudden surge in traffic (e.g., a viral event). Load balancing, auto-scaling, caching, circuit breakers.
- You are tasked with migrating a monolithic application to a microservices architecture. Describe your approach. Service decomposition, API design, data consistency, deployment.
- Design a system that needs to be highly available and fault-tolerant across multiple data centers. Replication, consistency, disaster recovery, network considerations.
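
The circuit breaker mentioned in the traffic-surge question above can be sketched as a small state machine. The version below is a minimal, single-threaded illustration; the names `CircuitBreaker` and `CircuitOpenError` and the default thresholds are assumptions, not values prescribed by this guide.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected without reaching the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock  # injectable so tests can control time
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout:
                raise CircuitOpenError("circuit is open; failing fast")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens it.
            if self._opened_at is not None or self._failures >= self._threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Failing fast while the circuit is open keeps threads and connections from piling up behind a struggling dependency, which is the main reason the pattern comes up when discussing surges.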
- Clarify Requirements: Ask clarifying questions! Don't make assumptions. Understand the scale, constraints, and non-functional requirements (availability, consistency, latency, etc.).
- Start Simple: Begin with a high-level design and gradually add details.
- Think Out Loud: Explain your thought process, trade-offs, and design choices.
- Use Diagrams: Draw diagrams to illustrate your design.
- Consider Trade-offs: There's rarely a single "right" answer. Discuss pros and cons.
- Scale Incrementally: Start with a design for a smaller scale, then discuss scaling.
- Handle Failure: Discuss how your system would handle failures.
- Data Modeling: Pay attention to data storage and access. Choose appropriate databases.
- Bottlenecks: Identify potential bottlenecks and discuss solutions.
- Practice: The more you practice, the better you'll become.
Good luck with your interviews!
This guide is published as a static site using Jekyll and GitHub Pages. Below are the setup and deployment instructions.
URL: https://spawn08.github.io/system-design-interview
| Component | Technology |
|---|---|
| Static Site Generator | Jekyll 4.3+ |
| Theme | Just the Docs v0.8.2 |
| Color Scheme | Dark |
| Diagrams | Mermaid.js (client-side rendering) |
| CI/CD | GitHub Actions |
| Hosting | GitHub Pages |
- Ruby 3.1+
- Bundler (gem install bundler)
- Git
# Clone the repository
git clone https://github.com/spawn08/system-design-interview.git
cd system-design-interview
# Install dependencies
bundle install
# Serve locally with live reload
bundle exec jekyll serve --livereload
# Site will be available at http://localhost:4000/system-design-interview/
├── .github/workflows/
│ └── deploy.yml # GitHub Actions CI/CD pipeline
├── _includes/
│ ├── footer_custom.html # Custom footer
│ └── head_custom.html # Fonts, Mermaid.js, custom styles
├── _sass/custom/
│ └── custom.scss # Theme overrides and custom styles
├── basics/ # Essential System Design Topics (11 topics)
│ ├── index.md
│ ├── interview_framework.md # NEW - How to approach any design question
│ ├── estimation.md
│ ├── networking.md
│ ├── databases.md
│ ├── caching.md
│ ├── load_balancer.md
│ ├── api_design.md
│ ├── concurrency.md
│ ├── security.md
│ ├── scalability.md
│ └── distributed_systems.md
├── advanced/ # Advanced Topics (9 topics, Senior/Staff level)
│ ├── index.md
│ ├── message_queues.md
│ ├── search_systems.md
│ ├── consistency_patterns.md
│ ├── microservices.md
│ ├── data_warehousing.md
│ ├── object_storage_cdn.md # NEW
│ ├── distributed_locking.md # NEW
│ ├── observability.md # NEW
│ └── event_sourcing_cqrs.md # NEW
├── software_system_design/ # System Design Problems (24 designs)
│ ├── index.md
│ ├── url_shortening.md
│ ├── rate_limiter.md
│ ├── key_value_store.md
│ ├── distributed_cache.md
│ ├── notification_system.md
│ ├── web_crawler.md
│ ├── chat_system.md
│ ├── news_feed.md
│ ├── search_autocomplete.md
│ ├── voting-system-design.md
│ ├── video_streaming.md
│ ├── photo_sharing.md
│ ├── collaborative_editor.md
│ ├── ride_sharing.md
│ ├── cloud_storage.md
│ ├── event_booking.md
│ ├── task_scheduler.md
│ ├── payment_system.md
│ ├── proximity_service.md
│ ├── message_queue.md # NEW - Distributed Message Queue (Kafka)
│ ├── metrics_monitoring.md # NEW - Metrics & Monitoring System (Datadog)
│ ├── email_delivery.md # NEW - Email Delivery System (SendGrid/SES)
│ ├── distributed_file_system.md # NEW - Distributed File System (GFS/HDFS)
│ ├── ad_click_aggregator.md # NEW - Ad Click Event Aggregator
│ └── staff_engineer_expectations.md
├── genai_ml_basics/ # GenAI/ML Fundamentals (7 building blocks)
│ ├── index.md
│ ├── model_serving.md
│ ├── feature_stores.md
│ ├── data_pipelines.md
│ ├── llm_systems.md
│ ├── distributed_training.md
│ ├── llm_evaluation.md # NEW - LLM Evaluation & Benchmarking
│ └── rlhf_alignment.md # NEW - RLHF / DPO Alignment
├── ml_system_design/ # ML System Design (10 designs)
│ ├── index.md
│ ├── recommendation_system.md
│ ├── fraud_detection.md
│ ├── image_search.md
│ ├── image_caption_generator.md
│ ├── search_ranking.md
│ ├── realtime_personalization.md
│ ├── ads_ranking.md # NEW - Ads Ranking System (Google/Meta)
│ ├── feature_platform.md # NEW - Real-time Feature Platform
│ ├── machine_translation.md # NEW - Machine Translation (Google Translate)
│ └── speech_recognition.md # NEW - Speech Recognition (ASR)
├── genai_ml_system_design/ # GenAI System Design (10 designs)
│ ├── index.md
│ ├── llm_chatbot.md
│ ├── enterprise_rag.md
│ ├── ai_code_assistant.md
│ ├── content_moderation.md
│ ├── ml_training_platform.md
│ ├── multimodal_search.md
│ ├── ai_agent_system.md # NEW - AI Agent System (ReAct, tools, memory)
│ ├── llm_gateway.md # NEW - LLM Gateway / AI Proxy
│ ├── text_to_image.md # NEW - Text-to-Image Generation (Imagen/DALL-E)
│ └── vector_database.md # NEW - Vector Database (Pinecone/Qdrant)
├── _config.yml # Jekyll site configuration
├── Gemfile # Ruby dependencies
├── index.md # Home page
└── README.md # This file
The site is automatically deployed via GitHub Actions on every push to main:
- Trigger: Push to the main branch or manual workflow dispatch
- Build: GitHub Actions checks out the code, sets up Ruby 3.1, installs dependencies via Bundler, and builds the Jekyll site
- Deploy: The built site is uploaded as a GitHub Pages artifact and deployed to the github-pages environment
The pipeline uses the following actions:
- actions/checkout@v4: checks out the repository
- ruby/setup-ruby@v1: installs Ruby with Bundler caching
- actions/configure-pages@v4: configures GitHub Pages
- actions/upload-pages-artifact@v3: uploads the built _site directory
- actions/deploy-pages@v4: deploys to GitHub Pages
- Go to Settings → Pages
- Under Build and deployment, select GitHub Actions as the source
- Ensure the repository has Pages enabled under Settings → Pages
- The workflow requires these permissions (already configured in deploy.yml):
  - contents: read
  - pages: write
  - id-token: write
- Create a new .md file in the appropriate directory (basics/, advanced/, genai_ml_basics/, software_system_design/, ml_system_design/, or genai_ml_system_design/)
- Add the Jekyll front matter:

      ---
      layout: default
      title: Your Topic Title
      parent: Fundamentals  # or "Advanced Topics", "GenAI/ML Fundamentals", "System Design Examples", "ML System Design", "GenAI System Design"
      nav_order: N          # determines position in navigation
      ---

- Use Mermaid for diagrams (rendered client-side):

  ```mermaid
  flowchart TD
      A[Start] --> B[End]
  ```

- Use Just the Docs callouts for emphasis:

      {: .note }
      > This is a note callout.

      {: .tip }
      > This is a tip callout.

      {: .warning }
      > This is a warning callout.

- Commit and push to main; the site will auto-deploy in ~2 minutes
| Issue | Solution |
|---|---|
| Build fails on GitHub | Check the Actions tab for error logs; usually a Gemfile or front matter issue |
| Mermaid diagrams not rendering | Ensure head_custom.html includes the Mermaid CDN script |
| Navigation order wrong | Adjust nav_order in the page's front matter |
| Page not appearing | Verify parent in front matter matches the parent page's title exactly |
| Local serve fails | Run bundle update to update gems, ensure Ruby 3.1+ |
| Pages 404 after deploy | Verify baseurl in _config.yml matches your repo name |
| CSS/styles broken locally | Run bundle exec jekyll clean then rebuild |
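
Several of the issues in the table above trace back to front matter, so a quick pre-commit check can help. The script below is a rough sketch, not part of this repository: it assumes every topic page should define the four keys shown in the front matter template (layout, title, parent, nav_order), and it uses a simple line-based parser rather than a full YAML library. The names `parse_front_matter` and `missing_keys` are hypothetical.

```python
REQUIRED_KEYS = ("layout", "title", "parent", "nav_order")  # assumed set


def parse_front_matter(text):
    """Returns the key/value pairs between the leading '---' delimiters.

    Handles only flat 'key: value' lines; returns None if the block
    is missing or never closed.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        if ":" in line and not line.lstrip().startswith("#"):
            key, _, value = line.partition(":")
            # Strip trailing "# comment" text from the value.
            fields[key.strip()] = value.split(" #", 1)[0].strip()
    return None  # reached end of file without a closing delimiter


def missing_keys(text):
    """Lists required keys that are absent or empty in the page text."""
    fields = parse_front_matter(text)
    if fields is None:
        return list(REQUIRED_KEYS)
    return [key for key in REQUIRED_KEYS if not fields.get(key)]
```

Running `missing_keys` over the contents of a new page before pushing can catch an absent `parent` or `nav_order` locally. It does not verify that `parent` matches an existing page title, which still needs a manual check.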
Each system design topic follows a consistent interview-ready structure:
Software System Design pages include:
- What We're Building — problem statement with real-world scale
- Step 1: Requirements — functional/non-functional requirements tables, API design
- Step 2: Estimation — back-of-envelope calculations
- Step 3: High-Level Design — Mermaid architecture diagram
- Step 4: Deep Dive — 6-10 subsections with code in Java, Python, and Go
- Step 5: Scaling & Production — failure handling, monitoring, trade-offs
- Interview Tips — common follow-up questions
ML System Design pages include:
- What We're Building — problem statement with business impact
- ML Concepts Primer — key ML concepts needed for the design
- Step 1: Requirements — functional/non-functional + metrics (online and offline)
- Step 2: Estimation — QPS, storage, model inference budget
- Step 3: High-Level Design — online + offline pipeline diagrams
- Step 4: Deep Dive — 8-10 subsections with Python code examples
- Step 5: Scaling & Production — failure handling, privacy, monitoring
GenAI System Design pages include:
- What We're Building — problem statement with Google-scale metrics
- Key Concepts Primer — GenAI-specific concepts (KV-cache, PagedAttention, CLIP, etc.)
- Step 1: Requirements — functional/non-functional with GenAI-specific NFRs
- Step 2: Estimation — GPU compute, KV-cache memory, inference cost modeling
- Step 3: High-Level Design — Mermaid architecture diagrams
- Step 4: Deep Dive — 6-8 subsections with Python code examples
- Step 5: Scaling & Production — failure handling, monitoring, trade-offs
- Hypothetical Interview Transcript — full 45-minute Google-style interview simulation
- Fork the repository
- Create a feature branch: git checkout -b feature/new-topic
- Add your topic following the structure above
- Test locally with bundle exec jekyll serve
- Submit a pull request
This project is open source and available for educational purposes.