**Disaggregated LLM Serving:** At the core of the system, we decouple the two computationally distinct phases of LLM inference: prompt processing (Prefill Instance) and token generation (Decode Instance). This separation is orchestrated by system-wide optimizations such as [Adrenaline (ArXiv'25)](https://arxiv.org/abs/2406.10198), which disaggregates and offloads memory-bound decode-phase attention to compute-bound prefill instances, and [TaiChi (ArXiv'25)](https://arxiv.org/abs/2508.01989), which unifies prefill-decode (PD) aggregation and disaggregation under a single serving architecture. To further accelerate the memory-bound decode phase, we propose a new dynamic sparse attention (DSA) algorithm ([PSA (ArXiv'25)](https://arxiv.org/abs/2406.10731)) and **SparseServe** (under review), a long-context LLM serving system that unlocks the parallel potential of DSA through efficient hierarchical HBM-DRAM management.
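
For illustration, here is a minimal Python sketch of the prefill/decode split described above. All class names, the toy "model," and the in-process KV-cache handoff are hypothetical stand-ins, not the Adrenaline or TaiChi implementations; a real deployment would run forward passes on GPUs and ship the KV cache across an interconnect between instances.

```python
from dataclasses import dataclass, field

@dataclass
class KVCache:
    """Toy stand-in for the per-request key/value cache built during prefill."""
    tokens: list[int] = field(default_factory=list)

class PrefillInstance:
    """Compute-bound worker: processes the full prompt in one batched pass."""
    def prefill(self, prompt_tokens: list[int]) -> KVCache:
        # A real prefill instance runs one forward pass over all prompt tokens;
        # here we only record them so the decode side has a "cache" to extend.
        return KVCache(tokens=list(prompt_tokens))

class DecodeInstance:
    """Memory-bound worker: extends the transferred KV cache token by token."""
    def decode(self, kv: KVCache, max_new_tokens: int) -> list[int]:
        generated = []
        for _ in range(max_new_tokens):
            # Placeholder "model": next token is a deterministic function of
            # cache length, standing in for an attention-based forward step.
            next_token = (len(kv.tokens) * 31) % 50_000
            kv.tokens.append(next_token)  # decode appends one KV entry per step
            generated.append(next_token)
        return generated

def serve(prompt_tokens: list[int], max_new_tokens: int = 4) -> list[int]:
    prefill_worker = PrefillInstance()
    decode_worker = DecodeInstance()
    kv = prefill_worker.prefill(prompt_tokens)  # phase 1: prompt processing
    # In a disaggregated deployment, the KV cache would now be transferred
    # (e.g., over NVLink/RDMA) from the prefill GPU to a decode GPU.
    return decode_worker.decode(kv, max_new_tokens)  # phase 2: token generation

if __name__ == "__main__":
    print(serve([101, 7592, 2088, 102]))
```

Because the two phases run on separate instances, each can be batched, scheduled, and scaled to its own bottleneck (compute for prefill, memory bandwidth and capacity for decode), which is the property the systems above exploit.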