
Commit 98bdbd1

new ICLR'26 paper
1 parent 7f3a740 commit 98bdbd1

4 files changed: 23 additions & 2 deletions


_bibliography/papers.bib

Lines changed: 13 additions & 0 deletions

```bib
@article{yuan2026dualmap,
  selected={true},
  bibtex_show={true},
  pdf={https://openreview.net/pdf?id=zCadrJ32Xn},
  code={https://github.com/ASISys/},
  title={DualMap: Enabling Both Cache Affinity and Load Balancing for Distributed LLM Serving},
  author={Ying Yuan and Pengfei Zuo* and Bo Wang and Zhangyu Chen and Zhipeng Tan* and Zhou Yu},
  journal={Proceedings of the 14th International Conference on Learning Representations (ICLR)},
  year={2026},
  abbr={ICLR}
}
```

The new entry is inserted immediately above the existing `@article{Wang2026RelayGR, ...}` entry.

_news/announcement_16.md

Lines changed: 8 additions & 0 deletions

```markdown
---
layout: post
date: 2026-01-27 00:00:00+0800
inline: true
related_posts: false
---

Our paper "[DualMap: Enabling Both Cache Affinity and Load Balancing for Distributed LLM Serving](https://openreview.net/pdf?id=zCadrJ32Xn)" was accepted by ICLR 2026. Congratulations to Ying!
```

_pages/about.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -42,7 +42,7 @@ My research focuses on AI and cloud infrastructure, with an emphasis on machine
 ### Research
 
 #### AI Systems and Algorithms
-* LLM Serving Systems: [CachedAttention (USENIX ATC'24)](https://www.usenix.org/conference/atc24/presentation/gao-bin-cost), [Adrenaline (ArXiv'25)](https://arxiv.org/abs/2406.10198), [TaiChi (ArXiv'25)](https://arxiv.org/abs/2508.01989), [SparseServe (ArXiv'25)](https://arxiv.org/pdf/2509.24626)
+* LLM Serving Systems: [CachedAttention (USENIX ATC'24)](https://www.usenix.org/conference/atc24/presentation/gao-bin-cost), [Adrenaline (ArXiv'25)](https://arxiv.org/abs/2406.10198), [TaiChi (ArXiv'25)](https://arxiv.org/abs/2508.01989), [SparseServe (ArXiv'25)](https://arxiv.org/pdf/2509.24626), [DualMap (ICLR'26)](https://openreview.net/pdf?id=zCadrJ32Xn)
 * Generative Recommendation: [RelayGR (Technical Report'26)](https://arxiv.org/abs/2601.01712)
 * AI Algorithms: [AdaSkip (AAAI'25)](https://arxiv.org/abs/2405.19583), [Progressive Sparse Attention (ArXiv'25)](https://arxiv.org/abs/2406.10731)
 * AI Hardware Architectures: [DeepSniffer (ASPLOS'20)](https://dl.acm.org/doi/10.1145/3373376.3378487), [SEAL (DAC'21)](https://dl.acm.org/doi/10.1109/DAC18074.2021.9586256), [Memory Trojaning (TCAD'21)](https://ieeexplore.ieee.org/document/9345491), [CloudMatrix384 (Technical Report'25)](https://arxiv.org/abs/2506.12708)
```

_pages/projects.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -23,7 +23,7 @@ horizontal: false
 
 This project targets the fundamental challenges of building highly efficient, scalable, and cost-effective LLM serving systems. As illustrated in the figure, we introduce key innovations across every layer of the serving stack.
 
-**Distributed Request Router:** At the top of the stack, our intelligent router manages incoming traffic. Going beyond simple load balancing, our work, **Achieving Both Cache Affinity and Load Balance** (under review), ensures that requests are routed not only to available instances but preferentially to those that may already have the required context cached, significantly reducing redundant prefill computation.
+**Distributed Request Router:** At the top of the stack, our intelligent router manages incoming traffic. Going beyond simple load balancing, our work, [DualMap (ICLR'26)](https://openreview.net/pdf?id=zCadrJ32Xn), achieves both cache affinity and load balance for distributed LLM serving. It ensures that requests are routed not only to available instances but preferentially to those that may already have the required context cached, significantly reducing redundant prefill computation.
 
 **Disaggregated LLM Serving:** At the core of the system, we decouple the computationally distinct phases of LLM inference—prompt processing (Prefill Instance) and token generation (Decode Instance). This fundamental separation is orchestrated by system-wide optimizations such as [Adrenaline (ArXiv'25)](https://arxiv.org/abs/2406.10198), which disaggregates and offloads memory-bound decode-phase attention to compute-bound prefill instances, and [TaiChi (ArXiv'25)](https://arxiv.org/abs/2508.01989), which unifies PD aggregation and disaggregation under a new serving architecture. To further accelerate the memory-bound decode phase, we propose a new dynamic sparse attention algorithm ([PSA@ArXiv'25](https://arxiv.org/abs/2406.10731)) and **SparseServe** (under review), a long-context LLM serving system that unlocks the parallel potential of DSA through efficient hierarchical HBM-DRAM management.
 
```
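
The router paragraph updated above balances two pulls: send a request to the instance most likely to already hold its prompt prefix in cache, but avoid piling work onto an instance that is already busy. DualMap's actual algorithm is not described in this commit, so the following is only a minimal Python sketch of that general idea; `Instance`, `prefix_blocks`, `affinity_weight`, `load_weight`, and the block size are all hypothetical names and values, not taken from the paper.

```python
# Minimal sketch of cache-affinity-aware routing with load balancing.
# This is NOT DualMap's algorithm (not described in this commit); all names,
# block sizes, and weights below are hypothetical illustrations only.
from dataclasses import dataclass, field


@dataclass
class Instance:
    name: str
    inflight: int = 0                                   # in-flight requests (load proxy)
    cached_prefixes: set = field(default_factory=set)   # keys of prompt blocks assumed cached


def prefix_blocks(prompt: str, block: int = 64) -> list[str]:
    """Hash each cumulative prefix of the prompt in fixed-size blocks,
    mimicking how a prefix (KV) cache is typically keyed."""
    return [str(hash(prompt[:i + block])) for i in range(0, len(prompt), block)]


def route(prompt: str, instances: list, affinity_weight: float = 1.0,
          load_weight: float = 2.0) -> Instance:
    """Pick the instance with the best trade-off between cache affinity
    (matched prefix blocks) and current load (in-flight requests)."""
    blocks = prefix_blocks(prompt)

    def score(inst: Instance) -> float:
        matched = sum(1 for b in blocks if b in inst.cached_prefixes)
        return affinity_weight * matched - load_weight * inst.inflight

    best = max(instances, key=score)
    best.inflight += 1                     # requests never complete in this toy demo
    best.cached_prefixes.update(blocks)    # the chosen instance now holds this prefix
    return best


if __name__ == "__main__":
    pool = [Instance("instance-0"), Instance("instance-1")]
    shared_context = "long shared system prompt ... " * 8
    # Requests sharing a prefix prefer the instance that cached it, until that
    # instance becomes noticeably more loaded than its peers.
    for i in range(4):
        chosen = route(shared_context + f"user question {i}", pool)
        print(chosen.name, "in-flight:", chosen.inflight)
```

In a real serving stack the load term would come from queue depth or KV-cache occupancy reported by each engine, and the affinity term from the engine's actual prefix-cache index rather than a recomputed hash; the sketch only illustrates how the two signals can be traded off in a single routing score.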
