Romero027/hyperscale-systems-papers

Production Systems Reading List

A curated list of papers published since 2020 at top systems and networking conferences (e.g., NSDI, OSDI, SOSP, SIGCOMM) that describe the design and implementation of production systems at major tech companies (e.g., Google, Meta, Microsoft, Amazon).

Reading List

Cluster Management & Scheduling

Distributed Storage & Databases

Networking

  • Accessing Cloud with Disaggregated Software-Defined Router, NSDI '21 [Tencent]
    • Tencent's cloud gateway architecture that disaggregates router functionality into four independently scalable modules (access, forwarding, routing, SDN control), enabling rapid feature delivery while handling tens of Tbps of traffic.
  • Orion: Google's Software-Defined Networking Control Plane, NSDI '21 [Google]
    • Google's second-generation SDN control plane built on a microservice architecture with a pub-sub database, achieving 40x faster convergence and 1.16M network updates/sec across Jupiter datacenter and B4 WAN networks.
  • When Cloud Storage Meets RDMA, NSDI '21 [Alibaba]
    • Experience report on integrating RDMA into Alibaba's Pangu storage system, using podset-scoped RDMA with TCP fallback to halve client latency while maintaining high availability across exabyte-scale deployments.
  • Evolvable Network Telemetry at Facebook, NSDI '22 [Meta]
    • Presents PCAT, a change-aware telemetry system that tracks and confines changes across Meta's rapidly evolving network (30+ code commits/week), preventing cascading failures in monitoring hundreds of thousands of switches and billions of counters.
  • Bluebird: High-performance SDN for Bare-metal Cloud Services, NSDI '22 [Microsoft]
    • Azure's network virtualization system for bare-metal cloud using programmable switch ASICs with custom P4 programs, delivering full line rate up to 100 Gb/s with sub-microsecond latency per SDN switch hop.
  • Cetus: Releasing P4 Programmers from the Chore of Trial and Error Compiling, NSDI '22 [Alibaba]
    • A compiler that automatically transforms uncompilable P4 programs into functionally equivalent compilable ones by shortening dependency chains, reducing Alibaba's P4 development cycle from days to minutes.
  • Norma: Towards Practical Network Load Testing, NSDI '23 [Alibaba]
    • A programmable-switch-based network load tester capable of generating up to 3 Tbps of stateful TCP traffic, deployed at Alibaba for over two years to detect performance issues in production network devices.
  • Empowering Azure Storage with RDMA, NSDI '23 [Microsoft]
    • Documents Microsoft's regional-scale RDMA deployment for Azure Storage, now carrying 70% of Azure traffic across all public regions, moving exabytes of data daily from TCP to RDMA with significant latency and CPU savings.
  • Harnessing WebRTC for Large-Scale Live Streaming, SIGCOMM '25 [ByteDance]
    • ByteDance's production system for adapting WebRTC to large-scale live streaming, optimizing first-frame delay, startup rebuffering, audio-to-video drift, and per-session CPU usage.
  • Intent-Driven Network Management with Multi-Agent LLMs: The Confucius Framework, SIGCOMM '25 [ByteDance]
    • A multi-agent LLM framework for network management that models workflows as DAGs, integrates LLMs with existing tools via RAG, and has been operational for two years with 60+ applications, reducing developer time by 17 engineer-hours/week.
  • Alibaba Stellar: A New Generation RDMA Network for Cloud AI, SIGCOMM '25 [Alibaba]
    • Alibaba's RDMA virtualization stack for cloud AI, introducing para-virtualized DMA for on-demand memory pinning, an extended memory translation table for GPUDirect RDMA, and RDMA packet spray for efficient multi-path utilization.

Machine Learning Infrastructure

Serverless & Microservices

Reliability & Debugging
