22
33## Overview
44
5- This benchmark is designed to identify **the maximum harmonic mean of user sessions $(C,S)$ that can be kept active while maintaining a steady-state TTFT ≤ 2 s (95th percentile)**. By sweeping the concurrency ($C$) and sequential ($S$) independently, it isolates whether compute capacity or KV-cache pressure is the first limiting factor.
5+ This benchmark is designed to explore how TTFT changes across different $(C, S)$ combinations by sweeping concurrency ($C$) and session depth ($S$) independently. This helps isolate whether compute capacity or KV-cache pressure is the primary limiting factor.
66
7-
8- We highly recommend monitoring vLLM/LMCache/GPU/storage metrics at the same time.
7+ We highly recommend monitoring vLLM/LMCache/GPU/storage metrics while the benchmark runs. The JSON output from the benchmark includes metrics from vLLM/LMCache.
98
109This benchmark feeds full‑length novels to your LLM server and asks many follow‑up questions, just like a book critic. It is handy for testing long‑context handling and KV‑cache tools such as LMCache.
1110
12- The benchmark is called CxS (pronounced six for simplicity), referring to the product of Concurrent $\times$ Sequential users.
13-
14- ### Definition
15-
16- Let us define the set of candidate pairs:
17-
18- $$
19- \mathcal{D} = \{ (C_i, S_i) \mid \mathrm{TTFT}_{95}^{(i)} \leq 2 \}
20- $$
21-
22- ### Objective
23-
24- More precisely, we aim to find the pair that maximizes the harmonic mean among all candidates in $\mathcal{D}$:
25-
26-
27- $$
28- \underset{(C_i, S_i) \in \mathcal{D}}{\arg\max} \left( \frac{2 C_i S_i}{C_i + S_i} \right)
29- $$
30-
31- We use the harmonic mean to compare scores.
32- As a business metric, we report the product, CxS.
33- For example, we say "Our system can keep up to {C×S} user sessions active!"
11+ The benchmark is called CxS (pronounced six for simplicity), referring to the product of Concurrent $\times$ Session Depth.
3412
3513## Two simple knobs
3614
3715| Option | What it means |
3816| ---- | ---- |
39- | `--num-users-concurrent` (C) | How many threads run in parallel. |
40- | `--num-users-sequential` (S) | How many users each thread serves in turn. |
17+ | `--concurrent` (C) | How many threads run in parallel. |
18+ | `--session-depth` (S) | How many sessions each thread serves in turn. |
4119
4220You can:
43- * raise concurrent to test compute-side capability (higher GPU utilization; total KV footprint also rises).
44- * raise sequential to test KV-cache pressure (larger resident KV per GPU, little change in instantaneous GPU utilization).
21+ * raise $C$ to test compute-side capability (higher GPU utilization; total KV footprint also rises).
22+ * raise $S$ to test KV-cache pressure (larger resident KV per GPU, little change in instantaneous GPU utilization).
4523
4624## Execution model
4725
4826```
49- Concurrent USER : {A,B}
50- Sequential USER : {X,Y}
51- All USER : {AX,AY,BX,BY}
27+ Concurrent: {A,B}
28+ Session Depth : {X,Y}
29+ All Session : {AX,AY,BX,BY}
5230
5331Timeline
5432-------------------------------------------------
5533Thread A:
56- Turn 0 → UserAX : Q1 "Read and summarize this novel. {AX novel contents}" → Get Response
57- Turn 0 → UserAY : Q1 "Read and summarize this novel. {AY novel contents}" → Get Response
58- Turn 1 → UserAX : Q2 "Write down the author's feelings." → Get Response
59- Turn 1 → UserAY : Q2 "Write down the author's feelings." → Get Response
34+ Turn 0 → SessionAX : Q1 "Read and summarize this novel. {AX novel contents}" → Get Response
35+ Turn 0 → SessionAY : Q1 "Read and summarize this novel. {AY novel contents}" → Get Response
36+ Turn 1 → SessionAX : Q2 "Write down the author's feelings." → Get Response
37+ Turn 1 → SessionAY : Q2 "Write down the author's feelings." → Get Response
6038 ...
6139Thread B:
62- Turn 0 → UserBX : Q1 "Read and summarize this novel. {BX novel contents}" → Get Response
63- Turn 0 → UserBY : Q1 "Read and summarize this novel. {BY novel contents}" → Get Response
64- Turn 1 → UserBX : Q2 "Write down the author's feelings." → Get Response
65- Turn 1 → UserBY : Q2 "Write down the author's feelings." → Get Response
40+ Turn 0 → SessionBX : Q1 "Read and summarize this novel. {BX novel contents}" → Get Response
41+ Turn 0 → SessionBY : Q1 "Read and summarize this novel. {BY novel contents}" → Get Response
42+ Turn 1 → SessionBX : Q2 "Write down the author's feelings." → Get Response
43+ Turn 1 → SessionBY : Q2 "Write down the author's feelings." → Get Response
6644 ...
6745```
6846
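The timeline above can be sketched in Python. This is a minimal illustration of the ordering only, not the code of `multi-round-qa.py`: the names `run_thread`, `run_benchmark`, and the `ask` callback are hypothetical, and `ask` stands in for the real request to the LLM server.

```python
from concurrent.futures import ThreadPoolExecutor


def run_thread(thread_id, session_ids, questions, ask):
    """Serve every session owned by one thread, one turn at a time."""
    log = []
    for turn, question in enumerate(questions):
        # Within a turn, the thread visits its sessions one after another.
        for session_id in session_ids:
            session = f"{thread_id}{session_id}"
            ask(session, question)  # hypothetical stand-in for the HTTP request
            log.append((turn, session))
    return log


def run_benchmark(thread_ids, session_ids, questions, ask):
    """Run C threads in parallel; each thread serves S sessions in turn."""
    with ThreadPoolExecutor(max_workers=len(thread_ids)) as pool:
        futures = {
            t: pool.submit(run_thread, t, session_ids, questions, ask)
            for t in thread_ids
        }
        return {t: f.result() for t, f in futures.items()}


# C = 2 threads {A, B}, S = 2 sessions per thread {X, Y}, two turns.
logs = run_benchmark(["A", "B"], ["X", "Y"], ["Q1", "Q2"], ask=lambda s, q: None)
print(logs["A"])
```

Each thread returns to a given session only after serving its other S - 1 sessions, which is why a larger S keeps more KV cache resident between turns.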
47+ ## For system competition
48+
49+ The CxS benchmark provides a scalar score to encourage healthy competition, but its use is not mandatory.
50+
51+ ### Definition
52+
53+ Let us define the set of candidate pairs:
54+
55+ $$
56+ \mathcal{D} = {\{ (C_i, S_i) \mid \mathrm{TTFT}_{95}^{(i)} \leq 2 \}}
57+ $$
58+
59+ ### Objective
60+
61+ More precisely, we aim to find the pair that maximizes the harmonic mean among all candidates in $\mathcal{D}$:
62+
63+
64+ $$
65+ \underset{(C_i, S_i) \in \mathcal{D}}{\arg\max} \left( \frac{2 C_i S_i}{C_i + S_i} \right)
66+ $$
67+
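As a minimal sketch of the objective, the selection can be written in a few lines of Python. The function names are hypothetical and this is not the code of `plot.py`; the sample rows are a subset of the System A results listed under Getting Started.

```python
def harmonic_mean(c, s):
    """Harmonic mean of C and S: 2CS / (C + S)."""
    return 2 * c * s / (c + s)


def best_pair(results, limit_s=2.0):
    """Return the (C, S) pair in D with the largest harmonic mean, or None."""
    candidates = [(c, s) for c, s, ttft_95 in results if ttft_95 <= limit_s]
    if not candidates:
        return None
    return max(candidates, key=lambda cs: harmonic_mean(*cs))


# (C, S, TTFT_95 in seconds), taken from the System A table.
results = [
    (4, 8, 0.146159),
    (8, 8, 0.347083),
    (12, 8, 0.390896),
    (16, 8, 3.799740),
    (4, 16, 0.253737),
    (8, 16, 2.674693),
    (4, 24, 0.217809),
]

c, s = best_pair(results)
print(f"C={c}, S={s}, HarmonicMean={harmonic_mean(c, s):.2f}, CxS={c * s}")
```

The harmonic mean rewards balanced pairs: (12, 8) scores 9.60, while (4, 24) has the same product of 96 but scores only about 6.86.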
68+ ## For business metric
69+
70+ As a business metric, we report the product, CxS.
71+ For example, we say "Our system can keep up to {C×S} user sessions active!"
72+
6973## Getting Started
7074
7175``` bash
@@ -75,9 +79,9 @@ python prepare.py --output data --model Qwen/Qwen2.5-7B-Instruct-1M # Models use
7579
7680``` bash
7781# Run the benchmark many times
78- BASE_URL="http://localhost:8000/v1"
82+ BASE_URL="http://localhost:8000"
7983MODEL="Qwen/Qwen2.5-7B-Instruct-1M"
80- NUM_ROUNDS=3
84+ NUM_ROUNDS=12
8185OUTPUT_DIR="bench_dir"
8286SRC_DIR="./data/128k"
8387mkdir -p "$OUTPUT_DIR"
@@ -86,67 +90,107 @@ for c in {1..4}; do # You can change c and s to any value you like.
8690 for s in {1..4}; do
8791 TIMESTAMP=$(date +%s)
8892 OUTPUT_FILE="${OUTPUT_DIR}/bench_c${c}_s${s}_${TIMESTAMP}.json"
89- echo "Running benchmark: concurrent=${c}, sequential=${s}"
90- python multi-round-qa.py --num-users-concurrent "$c" --num-users-sequential "$s" --num-rounds "$NUM_ROUNDS" --model "$MODEL" --base-url "$BASE_URL" --output "$OUTPUT_FILE" --src-dir "$SRC_DIR"
93+ echo "Running benchmark: C=${c}, S=${s}"
94+ python multi-round-qa.py -c "$c" -s "$s" --num-rounds "$NUM_ROUNDS" --model "$MODEL" --base-url "$BASE_URL" --output "$OUTPUT_FILE" --src-dir "$SRC_DIR"
9195 done
9296done
9397```
9498
99+ For demonstration, we compare two systems. System B is System A plus a local-disk LMCache tier backed by DDN EXAScaler storage:
100+
101+ System A
102+ * Model
103+   * Qwen/Qwen2.5-7B-Instruct-1M
104+ * Dataset
105+   * 32k
106+ * CPU/GPU
107+   * NVIDIA GH200 480GB
108+ * vLLM
109+   * v0.9.0.1
110+   * enable prefix-caching
111+   * enable chunked prefill
112+ * LMCache
113+   * local_cpu: True
114+   * max_local_cpu_size: 200
115+   * pipelined_backend: True
116+   * save_decode_cache: True
117+
118+ System B
119+ * Model
120+   * Qwen/Qwen2.5-7B-Instruct-1M
121+ * Dataset
122+   * 32k
123+ * CPU/GPU
124+   * NVIDIA GH200 480GB
125+ * vLLM
126+   * v0.9.0.1
127+   * enable prefix-caching
128+   * enable chunked prefill
129+ * LMCache
130+   * local_cpu: True
131+   * max_local_cpu_size: 200
132+   * pipelined_backend: True
133+   * save_decode_cache: True
134+   * local_disk: file:///data/tmp
135+   * max_local_disk_size: 400
136+ * Storage
137+   * DDN EXAScaler 2.14.0
138+   * stripe count is 8
139+   * stripe size is 1MiB
95141``` bash
96142# Plot and Show Result
97- $ python plot.py ./bench_dir_vllm vllm.png
98- num_users_concurrent num_users_sequential ttft_95
99- 0 4 2 0.498404
100- 1 4 4 33.565437
101- 2 4 3 0.794144
102- 3 1 4 0.311046
103- 4 2 2 0.406148
104- 5 2 4 0.459704
105- 6 2 3 0.326396
106- 7 1 2 0.411317
107- 8 3 3 0.378674
108- 9 2 1 0.445499
109- 10 3 4 42.531053
110- 11 1 3 0.455651
111- 12 4 1 0.504505
112- 13 3 2 0.393902
113- 14 3 1 0.364927
114- 15 1 1 0.379049
115- Max harmonic mean (C,S) where TTFT_95 <= 2s: 3.43
116- => C=4.0, S=3.0, CxS=12.0
117- $ python plot.py ./bench_dir_lmcache lmcache.png
118- num_users_concurrent num_users_sequential ttft_95
119- 0 1 1 0.524989
120- 1 3 2 0.592148
121- 2 4 4 1.202544
122- 3 3 4 1.286755
123- 4 2 1 0.477370
124- 5 3 3 0.586793
125- 6 2 3 0.627655
126- 7 4 1 0.575724
127- 8 4 3 1.251918
128- 9 2 4 0.446477
129- 10 1 4 0.460711
130- 11 3 1 0.495073
131- 12 1 3 0.329389
132- 13 4 2 0.586223
133- 14 1 2 0.477946
134- 15 2 2 0.457463
135- Max harmonic mean (C,S) where TTFT_95 <= 2s: 4.00
136- => C=4.0, S=4.0, CxS=16.0
143+ $ python plot.py lmcache_bench_dir-1749973344 lmcache_with_cpu_200g.png
144+ c s ttft_95
145+ 0 8 16 2.674693
146+ 1 12 32 3.268448
147+ 2 4 32 2.496206
148+ 3 16 16 3.310291
149+ 4 4 8 0.146159
150+ 5 8 32 2.801732
151+ 6 12 24 3.283783
152+ 7 12 16 3.185047
153+ 8 12 8 0.390896
154+ 9 4 24 0.217809
155+ 10 16 8 3.799740
156+ 11 8 8 0.347083
157+ 12 16 24 3.171192
158+ 13 16 32 3.032414
159+ 14 8 24 3.383691
160+ 15 4 16 0.253737
161+ Best (C,S) with TTFT_95 ≤ 2 s → C=12.0, S=8.0, HarmonicMean=9.60, C×S=96.0
162+ Saved: lmcache_with_cpu_200g.png
163+ $ python plot.py lmcache_bench_dir-1749897431 lmcache_with_cpu_200g_exa_400g.png
164+ c s ttft_95
165+ 0 4 16 0.255378
166+ 1 8 24 3.213307
167+ 2 16 24 4.067904
168+ 3 4 24 0.612876
169+ 4 8 32 4.389398
170+ 5 4 8 0.158686
171+ 6 12 24 3.939205
172+ 7 12 8 0.634048
173+ 8 4 32 1.191106
174+ 9 12 32 3.475115
175+ 10 16 16 3.156051
176+ 11 8 8 0.264291
177+ 12 12 16 2.739532
178+ 13 16 32 3.853057
179+ 14 8 16 1.424959
180+ 15 16 8 3.470811
181+ Best (C,S) with TTFT_95 ≤ 2 s → C=8.0, S=16.0, HarmonicMean=10.67, C×S=128.0
182+ Saved: lmcache_with_cpu_200g_exa_400g.png
137183```
138-
139- LMCache allows 1.17x increase in the number of user sessions kept active at least.
140-
141- Note: LMCache has not yet reached its limit in this case,
142- so we can aim to further improve the score by changing C and S.
184+ This result shows that adding external storage (DDN EXAScaler) as a KV-cache tier raises the best C×S under the TTFT_95 ≤ 2 s limit from 96 (System A) to 128 (System B), about 1.33x more active sessions.
143185
144186## Viz
145187
146- vllm.png
188+ The white dashed line indicates the TTFT = 2s boundary.
189+
190+ System A result:
147191
148- ![vLLM Plot](vllm.png)
192+ ![LMCache+CPU Plot](lmcache_with_cpu_200g.png)
149193
150- lmcache.png
194+ System B result:
151195
152- ![LMCache Plot](lmcache.png)
196+ ![LMCache+CPU+Storage400g Plot](lmcache_with_cpu_200g_exa_400g.png)