Commit f4abc35

[Feat] CxS: Add new Demo (#26)
* [Misc] cxs: update param and docs
* cxs: record metrics

Signed-off-by: Sohei Koyama <skoyama@ddn.com>
1 parent 95b2939 commit f4abc35

7 files changed

Lines changed: 245 additions & 173 deletions

File tree

real-multi-round-qa/README.md

Lines changed: 137 additions & 93 deletions
@@ -2,70 +2,74 @@
22

33
## Overview
44

5-
This benchmark is designed to identify **the maximum harmonic mean of user sessions $(C,S)$ that can be kept active while maintaining a steady-state TTFT ≤ 2 s (95-th percentile)**. By sweeping the concurrency ($C$) and sequential ($S$) independently, it isolates whether compute capacity or KV-cache pressure is the first limiting factor.
5+
This benchmark is designed to explore how TTFT changes across different $(C, S)$ combinations by sweeping concurrency ($C$) and session depth ($S$) independently. This helps isolate whether compute capacity or KV-cache pressure is the primary limiting factor.
66

7-
8-
We highly recommend monitoring vLLM/LMCache/GPU/storage metrics at the same time.
7+
We highly recommend monitoring vLLM, LMCache, GPU, and storage metrics while the benchmark runs. The benchmark's JSON output also includes metrics from vLLM/LMCache.
98

109
This benchmark feeds full‑length novels to your LLM server and asks many follow‑up questions, just like a book critic. It is handy for testing long‑context handling and KV‑cache tools such as LMCache.
1110

12-
The benchmark is called CxS (pronounced six for simplicity), referring to the product of Concurrent $\times$ Sequential users.
13-
14-
### Definition
15-
16-
Let us define the set of candidate pairs:
17-
18-
$$
19-
\mathcal{D} = \{ (C_i, S_i) \mid \mathrm{TTFT}_{95}^{(i)} \leq 2 \}
20-
$$
21-
22-
### Objective
23-
24-
More precisely, we aim to find the pair that maximizes the harmonic mean among all candidates in $\mathcal{D}$:
25-
26-
27-
$$
28-
\underset{(C_i, S_i) \in \mathcal{D}}{\arg\max} \left( \frac{2 C_i S_i}{C_i + S_i} \right)
29-
$$
30-
31-
We use the harmonic mean to compare scores.
32-
As a business metric, we report the product, CxS.
33-
For example, we say "Our system can keep up to {C×S} user sessions active!"
11+
The benchmark is called CxS (pronounced "six" for simplicity), referring to the product of Concurrent $\times$ Session Depth.
3412

3513
## Two simple knobs
3614

3715
| Option | What it means |
3816
| ---- | ---- |
39-
| `--num-users-concurrent` (C) | How many threads run in parallel. |
40-
| `--num-users-sequential` (S) | How many users each thread serves in turn. |
17+
| `--concurrent` (C) | How many threads run in parallel. |
18+
| `--session-depth` (S) | How many sessions each thread serves in turn. |
4119

4220
You can:
43-
* raise concurrent to test compute-side capability (higher GPU utilization; total KV footprint also rises).
44-
* raise sequential to test KV-cache pressure (larger resident KV per GPU, little change in instantaneous GPU utilization).
21+
* raise $C$ to test compute-side capability (higher GPU utilization; total KV footprint also rises).
22+
* raise $S$ to test KV-cache pressure (larger resident KV per GPU, little change in instantaneous GPU utilization).
4523

4624
## Execution model
4725

4826
```
49-
Concurrent USER: {A,B}
50-
Sequential USER: {X,Y}
51-
All USER: {AX,AY,BX,BY}
27+
Concurrent: {A,B}
28+
Session Depth: {X,Y}
29+
All Session: {AX,AY,BX,BY}
5230
5331
Timeline
5432
-------------------------------------------------
5533
Thread A:
56-
Turn 0 → UserAX: Q1 "Read and summarize this novel. {AX novel contents}" → Get Response
57-
Turn 0 → UserAY: Q1 "Read and summarize this novel. {AY novel contents}" → Get Response
58-
Turn 1 → UserAX: Q2 "Write down the author's feelings." → Get Response
59-
Turn 1 → UserAY: Q2 "Write down the author's feelings." → Get Response
34+
Turn 0 → SessionAX: Q1 "Read and summarize this novel. {AX novel contents}" → Get Response
35+
Turn 0 → SessionAY: Q1 "Read and summarize this novel. {AY novel contents}" → Get Response
36+
Turn 1 → SessionAX: Q2 "Write down the author's feelings." → Get Response
37+
Turn 1 → SessionAY: Q2 "Write down the author's feelings." → Get Response
6038
...
6139
Thread B:
62-
Turn 0 → UserBX: Q1 "Read and summarize this novel. {BX novel contents}" → Get Response
63-
Turn 0 → UserBY: Q1 "Read and summarize this novel. {BY novel contents}" → Get Response
64-
Turn 1 → UserBX: Q2 "Write down the author's feelings." → Get Response
65-
Turn 1 → UserBY: Q2 "Write down the author's feelings." → Get Response
40+
Turn 0 → SessionBX: Q1 "Read and summarize this novel. {BX novel contents}" → Get Response
41+
Turn 0 → SessionBY: Q1 "Read and summarize this novel. {BY novel contents}" → Get Response
42+
Turn 1 → SessionBX: Q2 "Write down the author's feelings." → Get Response
43+
Turn 1 → SessionBY: Q2 "Write down the author's feelings." → Get Response
6644
...
6745
```
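The interleaving above can be sketched as a minimal asyncio driver. This is illustrative only, not the benchmark's actual API: the `run_thread`/`main` names and the `log` list are made up, and a real run would await a chat completion where the sketch only records the schedule.

```python
import asyncio

async def run_thread(thread_id: str, session_ids: list[str], num_rounds: int, log: list[str]):
    # Each thread walks its S sessions round-robin, turn by turn,
    # so every session's context stays warm between its turns.
    for turn in range(num_rounds):
        for sid in session_ids:
            # A real run would await the chat completion here.
            log.append(f"Thread {thread_id}: Turn {turn} -> Session {sid}")
            await asyncio.sleep(0)  # yield so the other threads interleave

async def main(concurrent: int = 2, session_depth: int = 2, num_rounds: int = 2) -> list[str]:
    log: list[str] = []
    threads, depths = "ABCD", "XYZW"
    tasks = [
        run_thread(threads[c],
                   [threads[c] + depths[s] for s in range(session_depth)],
                   num_rounds, log)
        for c in range(concurrent)
    ]
    await asyncio.gather(*tasks)
    return log

schedule = asyncio.run(main())
print(len(schedule))  # C * S * rounds entries in total
```

With C=2, S=2, and 2 rounds this reproduces the {AX, AY, BX, BY} timeline above: each thread issues Turn 0 for all of its sessions before moving to Turn 1.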
6846

47+
## For system competition
48+
49+
The CxS benchmark provides a scalar score to encourage healthy competition, but its use is not mandatory.
50+
51+
### Definition
52+
53+
Let us define the set of candidate pairs:
54+
55+
$$
56+
\mathcal{D} = \{ (C_i, S_i) \mid \mathrm{TTFT}_{95}^{(i)} \leq 2 \}
57+
$$
58+
59+
### Objective
60+
61+
More precisely, we aim to find the pair that maximizes the harmonic mean among all candidates in $\mathcal{D}$:
62+
63+
64+
$$
65+
\underset{(C_i, S_i) \in \mathcal{D}}{\arg\max} \left( \frac{2 C_i S_i}{C_i + S_i} \right)
66+
$$
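The selection rule can be written in a few lines of Python. The `best_pair` helper and the sample `(C, S, TTFT_95)` triples below are made up for illustration; they are not part of the benchmark's code.

```python
def best_pair(results, ttft_limit=2.0):
    """Return the (C, S) pair with the highest harmonic mean among
    pairs whose 95th-percentile TTFT stays within ttft_limit."""
    candidates = [(c, s) for c, s, ttft95 in results if ttft95 <= ttft_limit]
    if not candidates:
        raise ValueError("no (C, S) pair meets the TTFT constraint")
    # Harmonic mean 2CS/(C+S) favors balanced pairs over skewed ones.
    return max(candidates, key=lambda p: 2 * p[0] * p[1] / (p[0] + p[1]))

# (C, S, TTFT_95) triples -- illustrative numbers only
results = [(4, 8, 0.15), (8, 8, 0.35), (12, 8, 0.39), (16, 8, 3.80)]
print(best_pair(results))  # -> (12, 8), harmonic mean 9.6
```

Note that (16, 8) is excluded despite the larger product, because its TTFT_95 exceeds the 2 s limit.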
67+
68+
## For business metric
69+
70+
As a business metric, we report the product, CxS.
71+
For example, we say "Our system can keep up to {C×S} user sessions active!"
72+
6973
## Getting Started
7074

7175
```bash
@@ -75,9 +79,9 @@ python prepare.py --output data --model Qwen/Qwen2.5-7B-Instruct-1M # Models use
7579

7680
```bash
7781
# Run the benchmark many times
78-
BASE_URL="http://localhost:8000/v1"
82+
BASE_URL="http://localhost:8000"
7983
MODEL="Qwen/Qwen2.5-7B-Instruct-1M"
80-
NUM_ROUNDS=3
84+
NUM_ROUNDS=12
8185
OUTPUT_DIR="bench_dir"
8286
SRC_DIR="./data/128k"
8387
mkdir -p "$OUTPUT_DIR"
@@ -86,67 +90,107 @@ for c in {1..4}; do # You can change c and s to any value you like.
8690
for s in {1..4}; do
8791
TIMESTAMP=$(date +%s)
8892
OUTPUT_FILE="${OUTPUT_DIR}/bench_c${c}_s${s}_${TIMESTAMP}.json"
89-
echo "Running benchmark: concurrent=${c}, sequential=${s}"
90-
python multi-round-qa.py --num-users-concurrent "$c" --num-users-sequential "$s" --num-rounds "$NUM_ROUNDS" --model "$MODEL" --base-url "$BASE_URL" --output "$OUTPUT_FILE" --src-dir "$SRC_DIR"
93+
echo "Running benchmark: C=${c}, S=${s}"
94+
python multi-round-qa.py -c "$c" -s "$s" --num-rounds "$NUM_ROUNDS" --model "$MODEL" --base-url "$BASE_URL" --output "$OUTPUT_FILE" --src-dir "$SRC_DIR"
9195
done
9296
done
9397
```
9498

99+
For the demo, we compare two systems:
100+
101+
System A
* Model
  * Qwen/Qwen2.5-7B-Instruct-1M
* Dataset
  * 32k
* CPU/GPU
  * NVIDIA GH200 480GB
* vLLM
  * v0.9.0.1
  * enable prefix-caching
  * enable chunked prefill
* LMCache
  * local_cpu: True
  * max_local_cpu_size: 200
  * pipelined_backend: True
  * save_decode_cache: True

System B
* Model
  * Qwen/Qwen2.5-7B-Instruct-1M
* Dataset
  * 32k
* CPU/GPU
  * NVIDIA GH200 480GB
* vLLM
  * v0.9.0.1
  * enable prefix-caching
  * enable chunked prefill
* LMCache
  * local_cpu: True
  * max_local_cpu_size: 200
  * pipelined_backend: True
  * save_decode_cache: True
  * local_disk: file:///data/tmp
  * max_local_disk_size: 400
* Storage
  * DDN EXAScaler 2.14.0
  * stripe count is 8
  * stripe size is 1MiB
140+
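The System B settings above correspond to an LMCache configuration file along these lines. This is a sketch reconstructed from the bullet list; the exact file layout and units may differ by LMCache version, so check against your installed release:

```yaml
local_cpu: True
max_local_cpu_size: 200        # CPU-RAM KV tier size (GB per LMCache's convention)
pipelined_backend: True
save_decode_cache: True
local_disk: "file:///data/tmp" # EXAScaler mount used as the disk KV tier
max_local_disk_size: 400       # disk KV tier size (GB)
```

System A uses the same file without the `local_disk`/`max_local_disk_size` entries.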
95141
```bash
96142
# Plot and Show Result
97-
$ python plot.py ./bench_dir_vllm vllm.png
98-
num_users_concurrent num_users_sequential ttft_95
99-
0 4 2 0.498404
100-
1 4 4 33.565437
101-
2 4 3 0.794144
102-
3 1 4 0.311046
103-
4 2 2 0.406148
104-
5 2 4 0.459704
105-
6 2 3 0.326396
106-
7 1 2 0.411317
107-
8 3 3 0.378674
108-
9 2 1 0.445499
109-
10 3 4 42.531053
110-
11 1 3 0.455651
111-
12 4 1 0.504505
112-
13 3 2 0.393902
113-
14 3 1 0.364927
114-
15 1 1 0.379049
115-
Max harmonic mean (C,S) where TTFT_95 <= 2s: 3.43
116-
=> C=4.0, S=3.0, CxS=12.0
117-
$ python plot.py ./bench_dir_lmcache lmcache.png
118-
num_users_concurrent num_users_sequential ttft_95
119-
0 1 1 0.524989
120-
1 3 2 0.592148
121-
2 4 4 1.202544
122-
3 3 4 1.286755
123-
4 2 1 0.477370
124-
5 3 3 0.586793
125-
6 2 3 0.627655
126-
7 4 1 0.575724
127-
8 4 3 1.251918
128-
9 2 4 0.446477
129-
10 1 4 0.460711
130-
11 3 1 0.495073
131-
12 1 3 0.329389
132-
13 4 2 0.586223
133-
14 1 2 0.477946
134-
15 2 2 0.457463
135-
Max harmonic mean (C,S) where TTFT_95 <= 2s: 4.00
136-
=> C=4.0, S=4.0, CxS=16.0
143+
$ python plot.py lmcache_bench_dir-1749973344 lmcache_with_cpu_200g.png
144+
c s ttft_95
145+
0 8 16 2.674693
146+
1 12 32 3.268448
147+
2 4 32 2.496206
148+
3 16 16 3.310291
149+
4 4 8 0.146159
150+
5 8 32 2.801732
151+
6 12 24 3.283783
152+
7 12 16 3.185047
153+
8 12 8 0.390896
154+
9 4 24 0.217809
155+
10 16 8 3.799740
156+
11 8 8 0.347083
157+
12 16 24 3.171192
158+
13 16 32 3.032414
159+
14 8 24 3.383691
160+
15 4 16 0.253737
161+
Best (C,S) with TTFT_95 ≤ 2 s → C=12.0, S=8.0, HarmonicMean=9.60, C×S=96.0
162+
Saved: lmcache_with_cpu_200g.png
163+
$ python plot.py lmcache_bench_dir-1749897431 lmcache_with_cpu_200g_exa_400g.png
164+
c s ttft_95
165+
0 4 16 0.255378
166+
1 8 24 3.213307
167+
2 16 24 4.067904
168+
3 4 24 0.612876
169+
4 8 32 4.389398
170+
5 4 8 0.158686
171+
6 12 24 3.939205
172+
7 12 8 0.634048
173+
8 4 32 1.191106
174+
9 12 32 3.475115
175+
10 16 16 3.156051
176+
11 8 8 0.264291
177+
12 12 16 2.739532
178+
13 16 32 3.853057
179+
14 8 16 1.424959
180+
15 16 8 3.470811
181+
Best (C,S) with TTFT_95 ≤ 2 s → C=8.0, S=16.0, HarmonicMean=10.67, C×S=128.0
182+
Saved: lmcache_with_cpu_200g_exa_400g.png
137183
```
138-
139-
LMCache allows 1.17x increase in the number of user sessions kept active at least.
140-
141-
Note: LMCache has not yet reached its limit in this case,
142-
so we can aim to further improve the score by changing C and S.
184+
This result shows that adding external storage (DDN EXAScaler) as a tier in the KV cache can increase the number of active sessions.
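The `c s ttft_95` table printed by `plot.py` can be reproduced from the raw JSON files with a short aggregation sketch. The `load_ttft95` helper below is illustrative, not `plot.py` itself; it assumes the `bench_c{C}_s{S}_{timestamp}.json` naming from the loop above and that each file holds a list of per-turn records with `ttft` and `status` fields:

```python
import glob
import json
import re

import numpy as np

def load_ttft95(bench_dir: str):
    """Collect (c, s, ttft_95) rows from a directory of benchmark outputs."""
    rows = []
    for path in sorted(glob.glob(f"{bench_dir}/bench_c*_s*_*.json")):
        m = re.search(r"bench_c(\d+)_s(\d+)_\d+\.json$", path)
        if m is None:
            continue  # skip files that don't follow the naming scheme
        with open(path) as f:
            records = json.load(f)
        # Only successful turns contribute to the percentile.
        ttfts = [r["ttft"] for r in records if r.get("status") == "success"]
        rows.append((int(m.group(1)), int(m.group(2)),
                     float(np.percentile(ttfts, 95))))
    return rows
```

Feeding the rows into a function like `best_pair` from the objective section then yields the reported best (C, S).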
143185

144186
## Viz
145187

146-
vllm.png
188+
The white dashed line indicates the TTFT = 2s boundary.
189+
190+
System A result:
147191

148-
![vLLM Plot](vllm.png)
192+
![LMCache+CPU Plot](lmcache_with_cpu_200g.png)
149193

150-
lmcache.png
194+
System B result:
151195

152-
![LMCache Plot](lmcache.png)
196+
![LMCache+CPU+Storage400g Plot](lmcache_with_cpu_200g_exa_400g.png)

real-multi-round-qa/lmcache.png

-74.7 KB
Binary file not shown.

real-multi-round-qa/multi-round-qa.py

Lines changed: 17 additions & 8 deletions
@@ -8,6 +8,7 @@
88
from dataclasses import dataclass, asdict
99
from typing import List
1010
import json
11+
import requests
1112

1213
FIRST_PROMPT = "Read and summarize this novel.\n\n{}"
1314
FOLLOWUP_PROMPTS = [
@@ -57,11 +58,13 @@
5758
class Result:
5859
session_id: str
5960
turn: int
61+
start_time: float
6062
latency: float
6163
ttft: float
6264
generation_time: float
6365
prompt_tokens: int
6466
completion_tokens: int
67+
metrics: str
6568
status: str
6669

6770
class ChatSession:
@@ -98,7 +101,7 @@ def append_assistant_message(self, content):
98101
self.messages.append({"role": "assistant", "content": content})
99102
self.turns += 1
100103

101-
async def run_turn(session: ChatSession, client: openai.AsyncOpenAI) -> Result:
104+
async def run_turn(session: ChatSession, client: openai.AsyncOpenAI, base_url: str) -> Result:
102105
prompt = session.get_next_prompt()
103106
session.append_user_message(prompt)
104107

@@ -109,6 +112,10 @@ async def run_turn(session: ChatSession, client: openai.AsyncOpenAI) -> Result:
109112
prompt_tokens = 0
110113

111114
print(f"Session {session.session_id}, Turn {session.turns}: {prompt[:50]}...")
115+
116+
resp = requests.get(f"{base_url}/metrics")
117+
resp.raise_for_status()
118+
112119
response = await client.chat.completions.create(
113120
model=session.model,
114121
messages=session.messages,
@@ -137,41 +144,43 @@ async def run_turn(session: ChatSession, client: openai.AsyncOpenAI) -> Result:
137144
result = Result(
138145
session_id=session.session_id,
139146
turn=session.turns,
147+
start_time=start_time,
140148
latency=latency,
141149
ttft=ttft,
142150
generation_time=generation_time,
143151
prompt_tokens=prompt_tokens,
144152
completion_tokens=completion_tokens,
153+
metrics=resp.text,
145154
status="success",
146155
)
147-
156+
148157
session.append_assistant_message(content)
149158

150159
return result
151160

152161
async def run_group(args) -> List[Result]:
153-
client = openai.AsyncOpenAI(base_url=args.base_url, api_key="EMPTY")
154-
sessions = [ChatSession(args) for _ in range(args.num_users_sequential)]
162+
client = openai.AsyncOpenAI(base_url=f"{args.base_url}/v1", api_key="EMPTY")
163+
sessions = [ChatSession(args) for _ in range(args.session_depth)]
155164
results = []
156165

157166
while any(not s.is_finished() for s in sessions):
158167
for session in sessions:
159168
if session.is_finished():
160169
continue
161-
result = await run_turn(session, client)
170+
result = await run_turn(session, client, args.base_url)
162171
results.append(result)
163172

164173
return results
165174

166175
async def run_all_concurrent(args):
167-
tasks = [run_group(args) for _ in range(args.num_users_concurrent)]
176+
tasks = [run_group(args) for _ in range(args.concurrent)]
168177
all_results = await asyncio.gather(*tasks)
169178
return [asdict(r) for group in all_results for r in group]
170179

171180
def parse_args():
172181
parser = argparse.ArgumentParser()
173-
parser.add_argument("--num-users-concurrent", type=int, required=True)
174-
parser.add_argument("--num-users-sequential", type=int, required=True)
182+
parser.add_argument("-c", "--concurrent", type=int, required=True)
183+
parser.add_argument("-s", "--session-depth", type=int, required=True)
175184
parser.add_argument("--model", type=str, required=True)
176185
parser.add_argument("--base-url", type=str, required=True)
177186
parser.add_argument("--num-rounds", type=int, default=10)
