
Commit 73ac65d

cxs: record metrics
Signed-off-by: Sohei Koyama <skoyama@ddn.com>
1 parent: 30a006b

7 files changed

Lines changed: 189 additions & 132 deletions


real-multi-round-qa/README.md

Lines changed: 89 additions & 51 deletions
@@ -4,7 +4,7 @@
 
 This benchmark is designed to explore how TTFT changes across different $(C, S)$ combinations by sweeping concurrency ($C$) and session depth ($S$) independently. This helps isolate whether compute capacity or KV-cache pressure is the primary limiting factor.
 
-We highly recommend monitoring vLLM/LMCache/GPU/storage metrics at the same time.
+We highly recommend monitoring vLLM/LMCache/GPU/storage metrics at the same time. The JSON output from the benchmark includes metrics from vLLM/LMCache.
 
 This benchmark feeds full‑length novels to your LLM server and asks many follow‑up questions, just like a book critic. It is handy for testing long‑context handling and KV‑cache tools such as LMCache.
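The figure of merit this README sweeps is the 95th-percentile TTFT per (C, S) cell, reported as `ttft_95` by `plot.py`. A minimal sketch of how such a percentile can be computed over a cell's per-turn TTFT samples (nearest-rank method; `plot.py`'s exact interpolation may differ):

```python
import math

def ttft_p95(ttfts: list[float]) -> float:
    """95th-percentile TTFT via the nearest-rank method."""
    xs = sorted(ttfts)
    rank = math.ceil(0.95 * len(xs))  # 1-based nearest rank
    return xs[rank - 1]

# Example: 20 turns, two slow outliers; p95 lands on the first outlier
samples = [0.3] * 18 + [5.0, 6.0]
print(ttft_p95(samples))  # 5.0
```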

@@ -79,9 +79,9 @@ python prepare.py --output data --model Qwen/Qwen2.5-7B-Instruct-1M # Models use
 
 ```bash
 # Run the benchmark many times
-BASE_URL="http://localhost:8000/v1"
+BASE_URL="http://localhost:8000"
 MODEL="Qwen/Qwen2.5-7B-Instruct-1M"
-NUM_ROUNDS=3
+NUM_ROUNDS=12
 OUTPUT_DIR="bench_dir"
 SRC_DIR="./data/128k"
 mkdir -p "$OUTPUT_DIR"
@@ -96,61 +96,99 @@ for c in {1..4}; do # You can change c and s to any value you like.
 done
 ```
 
+We compare two systems for demo:
+
+System A
+* Model
+  * Qwen/Qwen2.5-7B-Instruct-1M
+* Dataset
+  * 32k
+* vLLM
+  * v0.9.0.1
+  * enable prefix-caching
+  * enable chunked prefill
+* LMCache
+  * local_cpu: True
+  * max_local_cpu_size: 200
+  * pipelined_backend: True
+  * save_decode_cache: True
+
+System B
+* Model
+  * Qwen/Qwen2.5-7B-Instruct-1M
+* Dataset
+  * 32k
+* vLLM
+  * v0.9.0.1
+  * enable prefix-caching
+  * enable chunked prefill
+* LMCache
+  * local_cpu: True
+  * max_local_cpu_size: 200
+  * pipelined_backend: True
+  * save_decode_cache: True
+  * local_disk: file:///data/tmp
+  * max_local_disk_size: 400
+* Storage
+  * DDN EXAScaler 2.14.0
+  * stripe count is 8
+  * stripe size is 1MiB
+
 ```bash
 # Plot and Show Result
-$ python plot.py ./bench_dir_vllm vllm.png
-    c  s    ttft_95
-0   4  2   0.498404
-1   4  4  33.565437
-2   4  3   0.794144
-3   1  4   0.311046
-4   2  2   0.406148
-5   2  4   0.459704
-6   2  3   0.326396
-7   1  2   0.411317
-8   3  3   0.378674
-9   2  1   0.445499
-10  3  4  42.531053
-11  1  3   0.455651
-12  4  1   0.504505
-13  3  2   0.393902
-14  3  1   0.364927
-15  1  1   0.379049
-Max harmonic mean (C,S) where TTFT_95 <= 2s: 3.43
-=> C=4.0, S=3.0, CxS=12.0
-$ python plot.py ./bench_dir_lmcache lmcache.png
-    c  s   ttft_95
-0   1  1  0.524989
-1   3  2  0.592148
-2   4  4  1.202544
-3   3  4  1.286755
-4   2  1  0.477370
-5   3  3  0.586793
-6   2  3  0.627655
-7   4  1  0.575724
-8   4  3  1.251918
-9   2  4  0.446477
-10  1  4  0.460711
-11  3  1  0.495073
-12  1  3  0.329389
-13  4  2  0.586223
-14  1  2  0.477946
-15  2  2  0.457463
-Max harmonic mean (C,S) where TTFT_95 <= 2s: 4.00
-=> C=4.0, S=4.0, CxS=16.0
+$ python plot.py lmcache_bench_dir-1749973344 lmcache_with_cpu_200g.png
+     c   s   ttft_95
+0    8  16  2.674693
+1   12  32  3.268448
+2    4  32  2.496206
+3   16  16  3.310291
+4    4   8  0.146159
+5    8  32  2.801732
+6   12  24  3.283783
+7   12  16  3.185047
+8   12   8  0.390896
+9    4  24  0.217809
+10  16   8  3.799740
+11   8   8  0.347083
+12  16  24  3.171192
+13  16  32  3.032414
+14   8  24  3.383691
+15   4  16  0.253737
+Best (C,S) with TTFT_95 ≤ 2 s → C=12.0, S=8.0, HarmonicMean=9.60, C×S=96.0
+Saved: lmcache_with_cpu_200g.png
+$ python plot.py lmcache_bench_dir-1749897431 lmcache_with_cpu_200g_exa_400g.png
+     c   s   ttft_95
+0    4  16  0.255378
+1    8  24  3.213307
+2   16  24  4.067904
+3    4  24  0.612876
+4    8  32  4.389398
+5    4   8  0.158686
+6   12  24  3.939205
+7   12   8  0.634048
+8    4  32  1.191106
+9   12  32  3.475115
+10  16  16  3.156051
+11   8   8  0.264291
+12  12  16  2.739532
+13  16  32  3.853057
+14   8  16  1.424959
+15  16   8  3.470811
+Best (C,S) with TTFT_95 ≤ 2 s → C=8.0, S=16.0, HarmonicMean=10.67, C×S=128.0
+Saved: lmcache_with_cpu_200g_exa_400g.png
 ```
+This result shows that adding external storage (DDN EXAScaler) as a tier in the KV cache can increase the number of active sessions.
 
-LMCache allows 1.17x increase in the number of user sessions kept active at least.
-
-Note: LMCache has not yet reached its limit in this case,
-so we can aim to further improve the score by changing C and S.
+Note: the metrics included in the JSON indicate that TTFT degradation was caused by the storage running out of capacity.
 
 ## Viz
 
-vllm.png
+The white dashed line indicates the TTFT = 2s boundary.
+
+System A result:
 
-![vLLM Plot](vllm.png)
+![LMCache+CPU Plot](lmcache_with_cpu_200g.png)
 
-lmcache.png
+System B result:
 
-![LMCache Plot](lmcache.png)
+![LMCache+CPU+Storage400g Plot](lmcache_with_cpu_200g_exa_400g.png)
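Both the old and new `plot.py` outputs in the diff above rank (C, S) cells by the harmonic mean of C and S, restricted to cells whose TTFT_95 stays under 2 s; the printed numbers match 2·C·S/(C+S) (e.g. C=12, S=8 → 2·96/20 = 9.60). A minimal sketch of that selection rule (function and data names are illustrative, not `plot.py`'s actual code):

```python
def best_cell(cells, ttft_limit=2.0):
    """Pick the (C, S) cell maximizing the harmonic mean of C and S
    among cells whose 95th-percentile TTFT stays under the limit."""
    ok = [(c, s) for c, s, ttft95 in cells if ttft95 <= ttft_limit]
    # Harmonic mean of two numbers: 2*c*s / (c + s)
    return max(ok, key=lambda cs: 2 * cs[0] * cs[1] / (cs[0] + cs[1]))

# Subset of the System A table above: (C, S, TTFT_95)
cells = [(12, 8, 0.390896), (4, 24, 0.217809),
         (8, 8, 0.347083), (16, 8, 3.799740)]
c, s = best_cell(cells)
print(c, s, 2 * c * s / (c + s))  # 12 8 9.6
```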

real-multi-round-qa/lmcache.png

Deleted binary file (46.1 KB), not shown.

Two added binary images (56 KB and 57.7 KB), not shown.

real-multi-round-qa/multi-round-qa.py

Lines changed: 13 additions & 4 deletions
@@ -8,6 +8,7 @@
 from dataclasses import dataclass, asdict
 from typing import List
 import json
+import requests
 
 FIRST_PROMPT = "Read and summarize this novel.\n\n{}"
 FOLLOWUP_PROMPTS = [
@@ -57,11 +58,13 @@
 class Result:
     session_id: str
     turn: int
+    start_time: float
     latency: float
     ttft: float
    generation_time: float
     prompt_tokens: int
     completion_tokens: int
+    metrics: str
     status: str
 
 class ChatSession:
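`Result.ttft` is the time from issuing the request to receiving the first streamed token. A minimal, self-contained sketch of that measurement pattern (the fake async stream below stands in for the OpenAI streaming response; delays are illustrative):

```python
import asyncio
import time

async def fake_stream(n_tokens=3, first_token_delay=0.05):
    """Stand-in for a streaming chat-completions response."""
    await asyncio.sleep(first_token_delay)  # models prefill/queueing time
    for i in range(n_tokens):
        yield f"tok{i}"
        await asyncio.sleep(0.01)

async def measure_ttft(stream):
    start = time.perf_counter()
    ttft = None
    async for _chunk in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # first token observed
    latency = time.perf_counter() - start       # full response time
    return ttft, latency

ttft, latency = asyncio.run(measure_ttft(fake_stream()))
print(f"ttft={ttft:.3f}s latency={latency:.3f}s")
```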
@@ -98,7 +101,7 @@ def append_assistant_message(self, content):
         self.messages.append({"role": "assistant", "content": content})
         self.turns += 1
 
-async def run_turn(session: ChatSession, client: openai.AsyncOpenAI) -> Result:
+async def run_turn(session: ChatSession, client: openai.AsyncOpenAI, base_url: str) -> Result:
     prompt = session.get_next_prompt()
     session.append_user_message(prompt)
 
@@ -109,6 +112,10 @@ async def run_turn(session: ChatSession, client: openai.AsyncOpenAI) -> Result:
     prompt_tokens = 0
 
     print(f"Session {session.session_id}, Turn {session.turns}: {prompt[:50]}...")
+
+    resp = requests.get(f"{base_url}/metrics")
+    resp.raise_for_status()
+
     response = await client.chat.completions.create(
         model=session.model,
         messages=session.messages,
@@ -137,28 +144,30 @@ async def run_turn(session: ChatSession, client: openai.AsyncOpenAI) -> Result:
     result = Result(
         session_id=session.session_id,
         turn=session.turns,
+        start_time=start_time,
         latency=latency,
         ttft=ttft,
         generation_time=generation_time,
         prompt_tokens=prompt_tokens,
         completion_tokens=completion_tokens,
+        metrics=resp.text,
         status="success",
     )
-
+
     session.append_assistant_message(content)
 
     return result
 
 async def run_group(args) -> List[Result]:
-    client = openai.AsyncOpenAI(base_url=args.base_url, api_key="EMPTY")
+    client = openai.AsyncOpenAI(base_url=f"{args.base_url}/v1", api_key="EMPTY")
     sessions = [ChatSession(args) for _ in range(args.session_depth)]
     results = []
 
     while any(not s.is_finished() for s in sessions):
         for session in sessions:
             if session.is_finished():
                 continue
-            result = await run_turn(session, client)
+            result = await run_turn(session, client, args.base_url)
             results.append(result)
 
     return results
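The text captured into `Result.metrics` by `requests.get(f"{base_url}/metrics")` is the Prometheus text exposition format that vLLM serves at `/metrics`. A minimal sketch of pulling numeric values out of such a snapshot (the sample payload and metric name below are illustrative; real vLLM metric names may differ):

```python
def parse_prometheus(text: str) -> dict[str, float]:
    """Parse simple `name{labels} value` lines from Prometheus text format."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE/comments
            continue
        name, _, value = line.rpartition(" ")
        try:
            values[name] = float(value)
        except ValueError:
            pass  # ignore lines that do not end in a number
    return values

# Illustrative snapshot, not a verbatim vLLM payload.
snapshot = """\
# HELP vllm:prompt_tokens_total Number of prefill tokens processed.
# TYPE vllm:prompt_tokens_total counter
vllm:prompt_tokens_total{model_name="Qwen/Qwen2.5-7B-Instruct-1M"} 12345.0
"""
print(parse_prometheus(snapshot))
```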
