real-multi-round-qa/README.md
Lines changed: 81 additions & 51 deletions
@@ -4,7 +4,7 @@
This benchmark is designed to explore how TTFT changes across different $(C, S)$ combinations by sweeping concurrency ($C$) and session depth ($S$) independently. This helps isolate whether compute capacity or KV-cache pressure is the primary limiting factor.
-We highly recommend monitoring vLLM/LMCache/GPU/storage metrics at the same time.
+We highly recommend monitoring vLLM/LMCache/GPU/storage metrics at the same time. The JSON output from the benchmark includes metrics from vLLM/LMCache.
This benchmark feeds full‑length novels to your LLM server and asks many follow‑up questions, just like a book critic. It is handy for testing long‑context handling and KV‑cache tools such as LMCache.
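
One rough way to read the results of the $(C, S)$ sweep described in this hunk is to compare how much TTFT grows along each axis while the other is held fixed. The sketch below is only an illustration: the `ttft` values are made-up placeholders, `growth_along` is a hypothetical helper, and nothing here reflects the benchmark's actual JSON output format.

```python
from statistics import mean

# Hypothetical TTFT measurements in seconds, keyed by (C, S).
# These numbers are invented for illustration, not benchmark results.
ttft = {
    (1, 1): 0.20, (1, 2): 0.22, (1, 4): 0.25,
    (2, 1): 0.38, (2, 2): 0.41, (2, 4): 0.47,
    (4, 1): 0.80, (4, 2): 0.85, (4, 4): 0.95,
}


def growth_along(measurements, axis):
    """Mean ratio of TTFT at the largest vs. smallest value of one axis.

    axis=0 sweeps concurrency C, axis=1 sweeps session depth S;
    the other axis is held fixed for each ratio.
    """
    other = 1 - axis
    ratios = []
    for fixed in sorted({key[other] for key in measurements}):
        points = sorted(
            (key[axis], value)
            for key, value in measurements.items()
            if key[other] == fixed
        )
        ratios.append(points[-1][1] / points[0][1])
    return mean(ratios)


c_growth = growth_along(ttft, 0)  # TTFT growth as C increases
s_growth = growth_along(ttft, 1)  # TTFT growth as S increases
dominant = "concurrency (C)" if c_growth > s_growth else "session depth (S)"
print(f"C growth: {c_growth:.2f}x, S growth: {s_growth:.2f}x, dominant: {dominant}")
```

If TTFT grows mostly with $C$, compute capacity is the more likely bottleneck; if it grows mostly with $S$, KV-cache pressure is the more likely one. This is a heuristic, so it is worth cross-checking against the vLLM/LMCache/GPU/storage metrics.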
@@ -79,9 +79,9 @@ python prepare.py --output data --model Qwen/Qwen2.5-7B-Instruct-1M # Models use
```bash
# Run the benchmark many times
-BASE_URL="http://localhost:8000/v1"
+BASE_URL="http://localhost:8000"
MODEL="Qwen/Qwen2.5-7B-Instruct-1M"
-NUM_ROUNDS=3
+NUM_ROUNDS=12
OUTPUT_DIR="bench_dir"
SRC_DIR="./data/128k"
mkdir -p "$OUTPUT_DIR"
@@ -96,61 +96,91 @@ for c in {1..4}; do # You can change c and s to any value you like.