A curated list of papers published since 2020 at top systems and networking conferences (e.g., NSDI, OSDI, SOSP, SIGCOMM) that describe the design and implementation of production systems at major tech companies (e.g., Google, Meta, Microsoft, Amazon).
- Cluster Management & Scheduling
- Distributed Storage & Databases
- Networking
- Machine Learning Infrastructure
- Serverless & Microservices
- Reliability & Debugging
- Autopilot: Workload Autoscaling at Google, EuroSys '20
[Google]- Explains how Google automatically adjusts the CPU and memory limits (and task counts) of jobs in Borg to improve utilization without manual tuning.
- Twine: A Unified Cluster Management System for Shared Infrastructure, OSDI '20
[Meta]- Meta's cluster management system for shared infrastructure, which aims to maximize hardware utilization by going beyond simple task packing to manage the entire machine lifecycle.
- Protean: VM Allocation Service at Scale, OSDI '20
[Microsoft]- Microsoft Azure's centralized allocation service that manages millions of VMs across global regions, focusing on meeting varied placement constraints and packing efficiency.
- Global Capacity Management With Flux, OSDI '23
[Meta]- Describes how Meta places services across regions at massive scale while accounting for capacity constraints, power availability, and disaster recovery.
- Millions of Tiny Databases, NSDI '20
[Amazon]- AWS describes Physalia, the configuration store behind EBS, arguing for "blast radius reduction" by using millions of tiny Paxos groups rather than one massive monolithic consensus system.
- Virtual Consensus in Delos, OSDI '20
[Meta]- Meta's control plane storage system that allows hot-swapping the underlying consensus protocol (e.g., swapping ZooKeeper for a new protocol) without downtime.
- Shard Manager: A Generic Shard Management Framework for Geo-distributed Applications, SOSP '21
[Meta]- A framework that manages the placement and migration of shards for hundreds of different stateful services at Meta, decoupling this complex logic from application code.
- HALP: Heuristic Aided Learned Preference Eviction Policy for YouTube Content Delivery Network, NSDI '23
[Google]- An ML-augmented cache eviction policy for YouTube's CDN DRAM cache that combines heuristics with learned preferences, reducing byte miss ratio by 9.1% at peak traffic with only 1.8% CPU overhead in production.
- Accessing Cloud with Disaggregated Software-Defined Router, NSDI '21
[Tencent]- Tencent's cloud gateway architecture that disaggregates router functionality into four independently scalable modules (access, forwarding, routing, SDN control), enabling rapid feature delivery while handling tens of Tbps of traffic.
- Orion: Google's Software-Defined Networking Control Plane, NSDI '21
[Google]- Google's second-generation SDN control plane built on a microservice architecture with a pub-sub database, achieving 40x faster convergence and 1.16M network updates/sec across Jupiter datacenter and B4 WAN networks.
- When Cloud Storage Meets RDMA, NSDI '21
[Alibaba]- Experience report on integrating RDMA into Alibaba's Pangu storage system, using podset-scoped RDMA with TCP fallback to halve client latency while maintaining high availability across exabyte-scale deployments.
- Evolvable Network Telemetry at Facebook, NSDI '22
[Meta]- Presents PCAT, a change-aware telemetry system that tracks and confines changes across Meta's rapidly evolving network (30+ code commits/week), preventing cascading failures in monitoring hundreds of thousands of switches and billions of counters.
- Bluebird: High-performance SDN for Bare-metal Cloud Services, NSDI '22
[Microsoft]- Azure's network virtualization system for bare-metal cloud using programmable switch ASICs with custom P4 programs, delivering full line-rate up to 100Gb/s with sub-microsecond latency per SDN switch hop.
- Cetus: Releasing P4 Programmers from the Chore of Trial and Error Compiling, NSDI '22
[Alibaba]- A compiler that automatically transforms uncompilable P4 programs into functionally equivalent compilable ones by shortening dependency chains, reducing Alibaba's P4 development cycle from days to minutes.
- Norma: Towards Practical Network Load Testing, NSDI '23
[Alibaba]- A programmable-switch-based network load tester capable of generating up to 3 Tbps of stateful TCP traffic, deployed at Alibaba for over two years to detect performance issues in production network devices.
- Empowering Azure Storage with RDMA, NSDI '23
[Microsoft]- Documents Microsoft's regional-scale RDMA deployment for Azure Storage, now carrying 70% of Azure traffic across all public regions, moving exabytes of data daily from TCP to RDMA with significant latency and CPU savings.
- Harnessing WebRTC for Large-Scale Live Streaming, SIGCOMM '25
[ByteDance]- ByteDance's production system for adapting WebRTC to large-scale live streaming, optimizing first-frame delay, startup rebuffering, audio-to-video drift, and per-session CPU usage.
- Intent-Driven Network Management with Multi-Agent LLMs: The Confucius Framework, SIGCOMM '25
[Meta]- A multi-agent LLM framework for network management that models workflows as DAGs, integrates LLMs with existing tools via RAG, and has been operational for two years with 60+ applications, saving 17 engineer-hours/week of developer time.
- Alibaba Stellar: A New Generation RDMA Network for Cloud AI, SIGCOMM '25
[Alibaba]- Alibaba's RDMA virtualization stack for cloud AI, introducing para-virtualized DMA for on-demand memory pinning, an extended memory translation table for GPUDirect RDMA, and RDMA packet spray for efficient multi-path utilization.
- Check-N-Run: a Checkpointing System for Training Deep Learning Recommendation Models, NSDI '22
[Meta]- A scalable checkpointing system for Meta's terabyte-scale recommendation models that uses incremental checkpointing and quantization to achieve 6-17x reduction in write bandwidth and 2.5-8x reduction in storage capacity.
- MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters, NSDI '22
[Alibaba]- A characterization study of Alibaba's PAI GPU cluster (6,742+ GPUs, 1,300+ users), revealing low utilization and long queueing delays, and proposing GPU sharing and reserving-and-packing scheduling policies to improve efficiency.
- MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale, OSDI '24
[Meta]- Meta's global scheduler for ML training, which places both training data and workloads across geo-distributed datacenter regions to balance GPU demand against hardware supply, so users no longer have to pick a region themselves.
- Firecracker: Lightweight Virtualization for Serverless Applications, NSDI '20
[Amazon]- AWS describes the VMM behind Lambda, a minimal KVM-based monitor that replaces QEMU and runs "MicroVMs" that boot in <125ms for strong multi-tenant isolation.
- ServiceRouter: Hyperscale and Minimal Cost Service Mesh at Meta, OSDI '23
[Meta]- Meta's take on a service mesh optimized for hyperscale, focusing on minimizing the "sidecar tax" (CPU/RAM overhead) that standard meshes incur.
- Hermes: Enhancing Layer-7 Cloud Load Balancers with Userspace-Directed I/O Event Notification, SIGCOMM '25
[Alibaba]- An eBPF-based framework that enables closed-loop scheduling between userspace workers and the kernel for L7 load balancers, processing 10M+ requests/sec on 100K CPU cores with 19% infrastructure cost reduction, deployed for over two years.
- Towards LLM-Based Failure Localization in Production-Scale Networks, SIGCOMM '25
[Alibaba]- BiAn, an LLM-based framework that processes monitoring data to rank suspect devices with explanations during network incidents, reducing time to root cause by 20.5% overall (55.2% for high-risk incidents) across 10 months of production deployment.
- SkeletonHunter: Diagnosing and Localizing Network Failures in Containerized Large Model Training, SIGCOMM '25
[Alibaba]- A network diagnosis system that exploits the intrinsic traffic sparsity in large model training to detect and localize failures, achieving 98.2% precision and 95.7% localization accuracy across 40K+ GPUs at Alibaba Cloud.