Commit b32c222

cxs: record metrics

Signed-off-by: Sohei Koyama <skoyama@ddn.com>

1 parent 30a006b

7 files changed: 181 additions & 132 deletions

real-multi-round-qa/README.md (81 additions & 51 deletions)
@@ -4,7 +4,7 @@
 
 This benchmark is designed to explore how TTFT changes across different $(C, S)$ combinations by sweeping concurrency ($C$) and session depth ($S$) independently. This helps isolate whether compute capacity or KV-cache pressure is the primary limiting factor.
 
-We highly recommend monitoring vLLM/LMCache/GPU/storage metrics at the same time.
+We highly recommend monitoring vLLM/LMCache/GPU/storage metrics at the same time. The JSON output from the benchmark includes metrics from vLLM/LMCache.
 
 This benchmark feeds full‑length novels to your LLM server and asks many follow‑up questions, just like a book critic. It is handy for testing long‑context handling and KV‑cache tools such as LMCache.

@@ -79,9 +79,9 @@ python prepare.py --output data --model Qwen/Qwen2.5-7B-Instruct-1M # Models use
 
 ```bash
 # Run the benchmark many times
-BASE_URL="http://localhost:8000/v1"
+BASE_URL="http://localhost:8000"
 MODEL="Qwen/Qwen2.5-7B-Instruct-1M"
-NUM_ROUNDS=3
+NUM_ROUNDS=12
 OUTPUT_DIR="bench_dir"
 SRC_DIR="./data/128k"
 mkdir -p "$OUTPUT_DIR"
@@ -96,61 +96,91 @@ for c in {1..4}; do # You can change c and s to any value you like.
 done
 ```
 
+We compare two systems for this demo:
+
+System A
+* vLLM
+  * v0.9.0.1
+  * enable prefix-caching
+  * enable chunked prefill
+* LMCache
+  * local_cpu: True
+  * max_local_cpu_size: 200
+  * pipelined_backend: True
+  * save_decode_cache: True
+
+System B
+* vLLM
+  * v0.9.0.1
+  * enable prefix-caching
+  * enable chunked prefill
+* LMCache
+  * local_cpu: True
+  * max_local_cpu_size: 200
+  * pipelined_backend: True
+  * save_decode_cache: True
+  * local_disk: file:///data/tmp
+  * max_local_disk_size: 400
+* Storage
+  * DDN EXAScaler 2.14.0
+  * stripe count is 8
+  * stripe size is 1MiB
+
 ```bash
 # Plot and Show Result
-$ python plot.py ./bench_dir_vllm vllm.png
-     c  s    ttft_95
-0    4  2   0.498404
-1    4  4  33.565437
-2    4  3   0.794144
-3    1  4   0.311046
-4    2  2   0.406148
-5    2  4   0.459704
-6    2  3   0.326396
-7    1  2   0.411317
-8    3  3   0.378674
-9    2  1   0.445499
-10   3  4  42.531053
-11   1  3   0.455651
-12   4  1   0.504505
-13   3  2   0.393902
-14   3  1   0.364927
-15   1  1   0.379049
-Max harmonic mean (C,S) where TTFT_95 <= 2s: 3.43
-=> C=4.0, S=3.0, CxS=12.0
-$ python plot.py ./bench_dir_lmcache lmcache.png
-     c  s   ttft_95
-0    1  1  0.524989
-1    3  2  0.592148
-2    4  4  1.202544
-3    3  4  1.286755
-4    2  1  0.477370
-5    3  3  0.586793
-6    2  3  0.627655
-7    4  1  0.575724
-8    4  3  1.251918
-9    2  4  0.446477
-10   1  4  0.460711
-11   3  1  0.495073
-12   1  3  0.329389
-13   4  2  0.586223
-14   1  2  0.477946
-15   2  2  0.457463
-Max harmonic mean (C,S) where TTFT_95 <= 2s: 4.00
-=> C=4.0, S=4.0, CxS=16.0
+$ python plot.py lmcache_bench_dir-1749973344 lmcache_with_cpu_200g.png
+     c   s   ttft_95
+0    8  16  2.674693
+1   12  32  3.268448
+2    4  32  2.496206
+3   16  16  3.310291
+4    4   8  0.146159
+5    8  32  2.801732
+6   12  24  3.283783
+7   12  16  3.185047
+8   12   8  0.390896
+9    4  24  0.217809
+10  16   8  3.799740
+11   8   8  0.347083
+12  16  24  3.171192
+13  16  32  3.032414
+14   8  24  3.383691
+15   4  16  0.253737
+Best (C,S) with TTFT_95 ≤ 2 s → C=12.0, S=8.0, HarmonicMean=9.60, C×S=96.0
+Saved: lmcache_with_cpu_200g.png
+$ python plot.py lmcache_bench_dir-1749897431 lmcache_with_cpu_200g_exa_400g.png
+     c   s   ttft_95
+0    4  16  0.255378
+1    8  24  3.213307
+2   16  24  4.067904
+3    4  24  0.612876
+4    8  32  4.389398
+5    4   8  0.158686
+6   12  24  3.939205
+7   12   8  0.634048
+8    4  32  1.191106
+9   12  32  3.475115
+10  16  16  3.156051
+11   8   8  0.264291
+12  12  16  2.739532
+13  16  32  3.853057
+14   8  16  1.424959
+15  16   8  3.470811
+Best (C,S) with TTFT_95 ≤ 2 s → C=8.0, S=16.0, HarmonicMean=10.67, C×S=128.0
+Saved: lmcache_with_cpu_200g_exa_400g.png
 ```
+This result shows that adding external storage (DDN EXAScaler) as a tier in the KV cache can increase the number of active sessions.
 
-LMCache allows 1.17x increase in the number of user sessions kept active at least.
-
-Note: LMCache has not yet reached its limit in this case,
-so we can aim to further improve the score by changing C and S.
+Note: the metrics included in the JSON indicate that the TTFT degradation was caused by the storage running out of capacity.
 
 ## Viz
 
-vllm.png
+The white dashed line indicates the TTFT = 2s boundary.
+
+System A result:
 
-![vLLM Plot](vllm.png)
+![LMCache+CPU Plot](lmcache_with_cpu_200g.png)
 
-lmcache.png
+System B result:
 
-![LMCache Plot](lmcache.png)
+![LMCache+CPU+Storage400g Plot](lmcache_with_cpu_200g_exa_400g.png)
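The plot.py script itself is not part of this diff, but its printed output suggests a simple selection rule: among all $(C, S)$ points whose TTFT_95 stays within the 2 s budget, pick the pair with the maximum harmonic mean of C and S. A minimal sketch of that rule (the function name `best_point` is illustrative, not from plot.py):

```python
def best_point(rows, ttft_budget=2.0):
    """rows: iterable of (c, s, ttft_95) tuples.
    Return the feasible (c, s) with the largest harmonic mean, or None."""
    feasible = [(c, s) for c, s, ttft in rows if ttft <= ttft_budget]
    if not feasible:
        return None
    # Harmonic mean 2CS/(C+S) rewards points that are balanced in both dimensions.
    return max(feasible, key=lambda p: 2 * p[0] * p[1] / (p[0] + p[1]))

# Three rows taken from the System A table above:
rows = [(12, 8, 0.390896), (16, 8, 3.799740), (4, 8, 0.146159)]
print(best_point(rows))  # → (12, 8)
```

This reproduces the reported numbers: for System A, (C=12, S=8) gives 2·12·8/(12+8) = 9.60; for System B, (C=8, S=16) gives 2·8·16/(8+16) ≈ 10.67.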

real-multi-round-qa/lmcache.png (binary, -46.1 KB, file not shown)

Two new binary image files added (56 KB and 57.7 KB, not shown).

real-multi-round-qa/multi-round-qa.py (13 additions & 4 deletions)
@@ -8,6 +8,7 @@
 from dataclasses import dataclass, asdict
 from typing import List
 import json
+import requests
 
 FIRST_PROMPT = "Read and summarize this novel.\n\n{}"
 FOLLOWUP_PROMPTS = [
@@ -57,11 +58,13 @@
 class Result:
     session_id: str
     turn: int
+    start_time: float
     latency: float
     ttft: float
     generation_time: float
     prompt_tokens: int
     completion_tokens: int
+    metrics: str
     status: str
 
 class ChatSession:
@@ -98,7 +101,7 @@ def append_assistant_message(self, content):
         self.messages.append({"role": "assistant", "content": content})
         self.turns += 1
 
-async def run_turn(session: ChatSession, client: openai.AsyncOpenAI) -> Result:
+async def run_turn(session: ChatSession, client: openai.AsyncOpenAI, base_url: str) -> Result:
     prompt = session.get_next_prompt()
     session.append_user_message(prompt)
 
@@ -109,6 +112,10 @@ async def run_turn(session: ChatSession, client: openai.AsyncOpenAI) -> Result:
     prompt_tokens = 0
 
     print(f"Session {session.session_id}, Turn {session.turns}: {prompt[:50]}...")
+
+    resp = requests.get(f"{base_url}/metrics")
+    resp.raise_for_status()
+
     response = await client.chat.completions.create(
         model=session.model,
         messages=session.messages,
@@ -137,28 +144,30 @@ async def run_turn(session: ChatSession, client: openai.AsyncOpenAI) -> Result:
     result = Result(
         session_id=session.session_id,
         turn=session.turns,
+        start_time=start_time,
         latency=latency,
         ttft=ttft,
         generation_time=generation_time,
         prompt_tokens=prompt_tokens,
         completion_tokens=completion_tokens,
+        metrics=resp.text,
         status="success",
     )
-
+
     session.append_assistant_message(content)
 
     return result
 
 async def run_group(args) -> List[Result]:
-    client = openai.AsyncOpenAI(base_url=args.base_url, api_key="EMPTY")
+    client = openai.AsyncOpenAI(base_url=f"{args.base_url}/v1", api_key="EMPTY")
     sessions = [ChatSession(args) for _ in range(args.session_depth)]
     results = []
 
     while any(not s.is_finished() for s in sessions):
         for session in sessions:
             if session.is_finished():
                 continue
-            result = await run_turn(session, client)
+            result = await run_turn(session, client, args.base_url)
             results.append(result)
 
     return results
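The `/metrics` endpoint fetched above serves Prometheus text exposition format, and the commit stores the raw text in `Result.metrics`. When post-processing the benchmark JSON, a single sample can be pulled out of that text without extra dependencies; a minimal sketch (the metric name below is illustrative, not an actual vLLM metric name):

```python
def parse_metric(metrics_text, name):
    """Return the first sample value for `name` from Prometheus text, or None."""
    for line in metrics_text.splitlines():
        if line.startswith("#"):  # skip HELP/TYPE comment lines
            continue
        if line.startswith(name):
            return float(line.rsplit(" ", 1)[1])  # value is the last field
    return None

sample = 'example_requests_total{engine="0"} 42.0'
print(parse_metric(sample, "example_requests_total"))  # → 42.0
```

Note the prefix match is deliberately loose (it also matches labeled variants of the same metric); for anything more robust, a dedicated Prometheus parser would be the safer choice.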
