
Commit 98bdbd1

new ICLR'26 paper
1 parent 7f3a740 commit 98bdbd1

4 files changed: 23 additions & 2 deletions


_bibliography/papers.bib

Lines changed: 13 additions & 0 deletions

```bib
@article{yuan2026dualmap,
  selected={true},
  bibtex_show={true},
  pdf={https://openreview.net/pdf?id=zCadrJ32Xn},
  code={https://github.com/ASISys/},
  title={DualMap: Enabling Both Cache Affinity and Load Balancing for Distributed LLM Serving},
  author={Ying Yuan and Pengfei Zuo* and Bo Wang and Zhangyu Chen and Zhipeng Tan* and Zhou Yu},
  journal={Proceedings of the 14th International Conference on Learning Representations (ICLR)},
  year={2026},
  abbr={ICLR}
}
```

The new entry is inserted immediately above the existing `@article{Wang2026RelayGR, ...}` entry.

_news/announcement_16.md

Lines changed: 8 additions & 0 deletions

```markdown
---
layout: post
date: 2026-01-27 00:00:00+0800
inline: true
related_posts: false
---

Our paper "[DualMap: Enabling Both Cache Affinity and Load Balancing for Distributed LLM Serving](https://openreview.net/pdf?id=zCadrJ32Xn)" was accepted by ICLR 2026. Congratulations to Ying!
```

_pages/about.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -42,7 +42,7 @@ My research focuses on AI and cloud infrastructure, with an emphasis on machine
 ### Research
 
 #### AI Systems and Algorithms
-* LLM Serving Systems: [CachedAttention (USENIX ATC'24)](https://www.usenix.org/conference/atc24/presentation/gao-bin-cost), [Adrenaline (ArXiv'25)](https://arxiv.org/abs/2406.10198), [TaiChi (ArXiv'25)](https://arxiv.org/abs/2508.01989), [SparseServe (ArXiv'25)](https://arxiv.org/pdf/2509.24626)
+* LLM Serving Systems: [CachedAttention (USENIX ATC'24)](https://www.usenix.org/conference/atc24/presentation/gao-bin-cost), [Adrenaline (ArXiv'25)](https://arxiv.org/abs/2406.10198), [TaiChi (ArXiv'25)](https://arxiv.org/abs/2508.01989), [SparseServe (ArXiv'25)](https://arxiv.org/pdf/2509.24626), [DualMap (ICLR'26)](https://openreview.net/pdf?id=zCadrJ32Xn)
 * Generative Recommendation: [RelayGR (Technical Report'26)](https://arxiv.org/abs/2601.01712)
 * AI Algorithms: [AdaSkip (AAAI'25)](https://arxiv.org/abs/2405.19583), [Progressive Sparse Attention (ArXiv'25)](https://arxiv.org/abs/2406.10731)
 * AI Hardware Architectures: [DeepSniffer (ASPLOS'20)](https://dl.acm.org/doi/10.1145/3373376.3378487), [SEAL (DAC'21)](https://dl.acm.org/doi/10.1109/DAC18074.2021.9586256), [Memory Trojaning (TCAD'21)](https://ieeexplore.ieee.org/document/9345491), [CloudMatrix384 (Technical Report'25)](https://arxiv.org/abs/2506.12708)
```

_pages/projects.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -23,7 +23,7 @@ horizontal: false
 
 This project targets the fundamental challenges of building highly efficient, scalable, and cost-effective LLM serving systems. As illustrated in the figure, we introduce key innovations across every layer of the serving stack.
 
-**Distributed Request Router:** At the top of the stack, our intelligent router manages incoming traffic. Going beyond simple load balancing, our work, **Achieving Both Cache Affinity and Load Balance** (under review), ensures that requests are routed not only to available instances but preferentially to those that may already have the required context cached, significantly reducing redundant prefill computation.
+**Distributed Request Router:** At the top of the stack, our intelligent router manages incoming traffic. Going beyond simple load balancing, our work, [DualMap (ICLR'26)](https://openreview.net/pdf?id=zCadrJ32Xn), achieves both cache affinity and load balance for distributed LLM serving. It ensures that requests are routed not only to available instances but preferentially to those that may already have the required context cached, significantly reducing redundant prefill computation.
 
 **Disaggregated LLM Serving:** At the core of the system, we decouple the computationally distinct phases of LLM inference—prompt processing (Prefill Instance) and token generation (Decode Instance). This fundamental separation is orchestrated by system-wide optimizations such as [Adrenaline (ArXiv'25)](https://arxiv.org/abs/2406.10198), which disaggregates and offloads memory-bound decode-phase attention to compute-bound prefill instances, and [TaiChi (ArXiv'25)](https://arxiv.org/abs/2508.01989), which unifies PD aggregation and disaggregation under a new serving architecture. To further accelerate the memory-bound decode phase, we propose a new dynamic sparse attention algorithm ([PSA@ArXiv'25](https://arxiv.org/abs/2406.10731)) and **SparseServe** (under review), a long-context LLM serving system that unlocks the parallel potential of DSA through efficient hierarchical HBM-DRAM management.
 
```
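
The router paragraph updated above balances two pulls: send a request to the instance most likely to already hold its prompt prefix in cache, but avoid piling work onto an instance that is already busy. DualMap's actual algorithm is not described in this commit, so the following is only a minimal Python sketch of that general idea; `Instance`, `prefix_blocks`, `affinity_weight`, `load_weight`, and the block size are all hypothetical names and values, not taken from the paper.

```python
# Minimal sketch of cache-affinity-aware routing with load balancing.
# This is NOT DualMap's algorithm (not described in this commit); all names,
# block sizes, and weights below are hypothetical illustrations only.
from dataclasses import dataclass, field


@dataclass
class Instance:
    name: str
    inflight: int = 0                                   # in-flight requests (load proxy)
    cached_prefixes: set = field(default_factory=set)   # keys of prompt blocks assumed cached


def prefix_blocks(prompt: str, block: int = 64) -> list[str]:
    """Hash each cumulative prefix of the prompt in fixed-size blocks,
    mimicking how a prefix (KV) cache is typically keyed."""
    return [str(hash(prompt[:i + block])) for i in range(0, len(prompt), block)]


def route(prompt: str, instances: list, affinity_weight: float = 1.0,
          load_weight: float = 2.0) -> Instance:
    """Pick the instance with the best trade-off between cache affinity
    (matched prefix blocks) and current load (in-flight requests)."""
    blocks = prefix_blocks(prompt)

    def score(inst: Instance) -> float:
        matched = sum(1 for b in blocks if b in inst.cached_prefixes)
        return affinity_weight * matched - load_weight * inst.inflight

    best = max(instances, key=score)
    best.inflight += 1                     # requests never complete in this toy demo
    best.cached_prefixes.update(blocks)    # the chosen instance now holds this prefix
    return best


if __name__ == "__main__":
    pool = [Instance("instance-0"), Instance("instance-1")]
    shared_context = "long shared system prompt ... " * 8
    # Requests sharing a prefix prefer the instance that cached it, until that
    # instance becomes noticeably more loaded than its peers.
    for i in range(4):
        chosen = route(shared_context + f"user question {i}", pool)
        print(chosen.name, "in-flight:", chosen.inflight)
```

In a real serving stack the load term would come from queue depth or KV-cache occupancy reported by each engine, and the affinity term from the engine's actual prefix-cache index rather than a recomputed hash; the sketch only illustrates how the two signals can be traded off in a single routing score.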
