Commit f4abc35

[Feat] CxS: Add new Demo (#26)
* [Misc] cxs: update param and docs
* cxs: record metrics

Signed-off-by: Sohei Koyama <skoyama@ddn.com>
1 parent 95b2939 commit f4abc35

7 files changed

Lines changed: 245 additions & 173 deletions

File tree

real-multi-round-qa/README.md

Lines changed: 137 additions & 93 deletions
@@ -2,70 +2,74 @@
22

33
## Overview
44

5-
This benchmark is designed to identify **the maximum harmonic mean of user sessions $(C,S)$ that can be kept active while maintaining a steady-state TTFT ≤ 2 s (95-th percentile)**. By sweeping the concurrency ($C$) and sequential ($S$) independently, it isolates whether compute capacity or KV-cache pressure is the first limiting factor.
5+
This benchmark is designed to explore how TTFT changes across different $(C, S)$ combinations by sweeping concurrency ($C$) and session depth ($S$) independently. This helps isolate whether compute capacity or KV-cache pressure is the primary limiting factor.
66

7-
8-
We highly recommend monitoring vLLM/LMCache/GPU/storage metrics at the same time.
7+
We highly recommend monitoring vLLM, LMCache, GPU, and storage metrics while the benchmark runs. The benchmark's JSON output also includes metrics from vLLM/LMCache.
98

109
This benchmark feeds full‑length novels to your LLM server and asks many follow‑up questions, just like a book critic. It is handy for testing long‑context handling and KV‑cache tools such as LMCache.
1110

12-
The benchmark is called CxS (pronounced six for simplicity), referring to the product of Concurrent $\times$ Sequential users.
13-
14-
### Definition
15-
16-
Let us define the set of candidate pairs:
17-
18-
$$
19-
\mathcal{D} = \{ (C_i, S_i) \mid \mathrm{TTFT}_{95}^{(i)} \leq 2 \}
20-
$$
21-
22-
### Objective
23-
24-
More precisely, we aim to find the pair that maximizes the harmonic mean among all candidates in $\mathcal{D}$:
25-
26-
27-
$$
28-
\underset{(C_i, S_i) \in \mathcal{D}}{\arg\max} \left( \frac{2 C_i S_i}{C_i + S_i} \right)
29-
$$
30-
31-
We use the harmonic mean to compare scores.
32-
As a business metric, we report the product, CxS.
33-
For example, we say "Our system can keep up to {C×S} user sessions active!"
11+
The benchmark is called CxS (pronounced "six" for simplicity), referring to the product of Concurrent $\times$ Session Depth.
3412

3513
## Two simple knobs
3614

3715
| Option | What it means |
3816
| ---- | ---- |
39-
| `--num-users-concurrent` (C) | How many threads run in parallel. |
40-
| `--num-users-sequential` (S) | How many users each thread serves in turn. |
17+
| `--concurrent` (C) | How many threads run in parallel. |
18+
| `--session-depth` (S) | How many sessions each thread serves in turn. |
4119

4220
You can:
43-
* raise concurrent to test compute-side capability (higher GPU utilization; total KV footprint also rises).
44-
* raise sequential to test KV-cache pressure (larger resident KV per GPU, little change in instantaneous GPU utilization).
21+
* raise $C$ to test compute-side capability (higher GPU utilization; total KV footprint also rises).
22+
* raise $S$ to test KV-cache pressure (larger resident KV per GPU, little change in instantaneous GPU utilization).
4523

4624
## Execution model
4725

4826
```
49-
Concurrent USER: {A,B}
50-
Sequential USER: {X,Y}
51-
All USER: {AX,AY,BX,BY}
27+
Concurrent: {A,B}
28+
Session Depth: {X,Y}
29+
All Session: {AX,AY,BX,BY}
5230
5331
Timeline
5432
-------------------------------------------------
5533
Thread A:
56-
Turn 0 → UserAX: Q1 "Read and summarize this novel. {AX novel contents}" → Get Response
57-
Turn 0 → UserAY: Q1 "Read and summarize this novel. {AY novel contents}" → Get Response
58-
Turn 1 → UserAX: Q2 "Write down the author's feelings." → Get Response
59-
Turn 1 → UserAY: Q2 "Write down the author's feelings." → Get Response
34+
Turn 0 → SessionAX: Q1 "Read and summarize this novel. {AX novel contents}" → Get Response
35+
Turn 0 → SessionAY: Q1 "Read and summarize this novel. {AY novel contents}" → Get Response
36+
Turn 1 → SessionAX: Q2 "Write down the author's feelings." → Get Response
37+
Turn 1 → SessionAY: Q2 "Write down the author's feelings." → Get Response
6038
...
6139
Thread B:
62-
Turn 0 → UserBX: Q1 "Read and summarize this novel. {BX novel contents}" → Get Response
63-
Turn 0 → UserBY: Q1 "Read and summarize this novel. {BY novel contents}" → Get Response
64-
Turn 1 → UserBX: Q2 "Write down the author's feelings." → Get Response
65-
Turn 1 → UserBY: Q2 "Write down the author's feelings." → Get Response
40+
Turn 0 → SessionBX: Q1 "Read and summarize this novel. {BX novel contents}" → Get Response
41+
Turn 0 → SessionBY: Q1 "Read and summarize this novel. {BY novel contents}" → Get Response
42+
Turn 1 → SessionBX: Q2 "Write down the author's feelings." → Get Response
43+
Turn 1 → SessionBY: Q2 "Write down the author's feelings." → Get Response
6644
...
6745
```
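The interleaving above can be sketched as a minimal asyncio driver. This is illustrative only, not the benchmark's actual API: the `run_thread`/`main` names and the `log` list are made up, and a real run would await a chat completion where the sketch only records the schedule.

```python
import asyncio

async def run_thread(thread_id: str, session_ids: list[str], num_rounds: int, log: list[str]):
    # Each thread walks its S sessions round-robin, turn by turn,
    # so every session's context stays warm between its turns.
    for turn in range(num_rounds):
        for sid in session_ids:
            # A real run would await the chat completion here.
            log.append(f"Thread {thread_id}: Turn {turn} -> Session {sid}")
            await asyncio.sleep(0)  # yield so the other threads interleave

async def main(concurrent: int = 2, session_depth: int = 2, num_rounds: int = 2) -> list[str]:
    log: list[str] = []
    threads, depths = "ABCD", "XYZW"
    tasks = [
        run_thread(threads[c],
                   [threads[c] + depths[s] for s in range(session_depth)],
                   num_rounds, log)
        for c in range(concurrent)
    ]
    await asyncio.gather(*tasks)
    return log

schedule = asyncio.run(main())
print(len(schedule))  # C * S * rounds entries in total
```

With C=2, S=2, and 2 rounds this reproduces the {AX, AY, BX, BY} timeline above: each thread issues Turn 0 for all of its sessions before moving to Turn 1.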
6846

47+
## For system competition
48+
49+
The CxS benchmark provides a scalar score to encourage healthy competition, but its use is not mandatory.
50+
51+
### Definition
52+
53+
Let us define the set of candidate pairs:
54+
55+
$$
56+
\mathcal{D} = \{ (C_i, S_i) \mid \mathrm{TTFT}_{95}^{(i)} \leq 2 \}
57+
$$
58+
59+
### Objective
60+
61+
More precisely, we aim to find the pair that maximizes the harmonic mean among all candidates in $\mathcal{D}$:
62+
63+
64+
$$
65+
\underset{(C_i, S_i) \in \mathcal{D}}{\arg\max} \left( \frac{2 C_i S_i}{C_i + S_i} \right)
66+
$$
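The selection rule can be written in a few lines of Python. The `best_pair` helper and the sample `(C, S, TTFT_95)` triples below are made up for illustration; they are not part of the benchmark's code.

```python
def best_pair(results, ttft_limit=2.0):
    """Return the (C, S) pair with the highest harmonic mean among
    pairs whose 95th-percentile TTFT stays within ttft_limit."""
    candidates = [(c, s) for c, s, ttft95 in results if ttft95 <= ttft_limit]
    if not candidates:
        raise ValueError("no (C, S) pair meets the TTFT constraint")
    # Harmonic mean 2CS/(C+S) favors balanced pairs over skewed ones.
    return max(candidates, key=lambda p: 2 * p[0] * p[1] / (p[0] + p[1]))

# (C, S, TTFT_95) triples -- illustrative numbers only
results = [(4, 8, 0.15), (8, 8, 0.35), (12, 8, 0.39), (16, 8, 3.80)]
print(best_pair(results))  # -> (12, 8), harmonic mean 9.6
```

Note that (16, 8) is excluded despite the larger product, because its TTFT_95 exceeds the 2 s limit.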
67+
68+
## For business metric
69+
70+
As a business metric, we report the product, CxS.
71+
For example, we say "Our system can keep up to {C×S} user sessions active!"
72+
6973
## Getting Started
7074

7175
```bash
@@ -75,9 +79,9 @@ python prepare.py --output data --model Qwen/Qwen2.5-7B-Instruct-1M # Models use
7579

7680
```bash
7781
# Run the benchmark many times
78-
BASE_URL="http://localhost:8000/v1"
82+
BASE_URL="http://localhost:8000"
7983
MODEL="Qwen/Qwen2.5-7B-Instruct-1M"
80-
NUM_ROUNDS=3
84+
NUM_ROUNDS=12
8185
OUTPUT_DIR="bench_dir"
8286
SRC_DIR="./data/128k"
8387
mkdir -p "$OUTPUT_DIR"
@@ -86,67 +90,107 @@ for c in {1..4}; do # You can change c and s to any value you like.
8690
for s in {1..4}; do
8791
TIMESTAMP=$(date +%s)
8892
OUTPUT_FILE="${OUTPUT_DIR}/bench_c${c}_s${s}_${TIMESTAMP}.json"
89-
echo "Running benchmark: concurrent=${c}, sequential=${s}"
90-
python multi-round-qa.py --num-users-concurrent "$c" --num-users-sequential "$s" --num-rounds "$NUM_ROUNDS" --model "$MODEL" --base-url "$BASE_URL" --output "$OUTPUT_FILE" --src-dir "$SRC_DIR"
93+
echo "Running benchmark: C=${c}, S=${s}"
94+
python multi-round-qa.py -c "$c" -s "$s" --num-rounds "$NUM_ROUNDS" --model "$MODEL" --base-url "$BASE_URL" --output "$OUTPUT_FILE" --src-dir "$SRC_DIR"
9195
done
9296
done
9397
```
9498

99+
For the demo, we compare two systems:
100+
101+
System A
* Model
  * Qwen/Qwen2.5-7B-Instruct-1M
* Dataset
  * 32k
* CPU/GPU
  * NVIDIA GH200 480GB
* vLLM
  * v0.9.0.1
  * enable prefix-caching
  * enable chunked prefill
* LMCache
  * local_cpu: True
  * max_local_cpu_size: 200
  * pipelined_backend: True
  * save_decode_cache: True

System B
* Model
  * Qwen/Qwen2.5-7B-Instruct-1M
* Dataset
  * 32k
* CPU/GPU
  * NVIDIA GH200 480GB
* vLLM
  * v0.9.0.1
  * enable prefix-caching
  * enable chunked prefill
* LMCache
  * local_cpu: True
  * max_local_cpu_size: 200
  * pipelined_backend: True
  * save_decode_cache: True
  * local_disk: file:///data/tmp
  * max_local_disk_size: 400
* Storage
  * DDN EXAScaler 2.14.0
  * stripe count is 8
  * stripe size is 1MiB
140+
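The System B settings above correspond to an LMCache configuration file along these lines. This is a sketch reconstructed from the bullet list; the exact file layout and units may differ by LMCache version, so check against your installed release:

```yaml
local_cpu: True
max_local_cpu_size: 200        # CPU-RAM KV tier size (GB per LMCache's convention)
pipelined_backend: True
save_decode_cache: True
local_disk: "file:///data/tmp" # EXAScaler mount used as the disk KV tier
max_local_disk_size: 400       # disk KV tier size (GB)
```

System A uses the same file without the `local_disk`/`max_local_disk_size` entries.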
95141
```bash
96142
# Plot and Show Result
97-
$ python plot.py ./bench_dir_vllm vllm.png
98-
num_users_concurrent num_users_sequential ttft_95
99-
0 4 2 0.498404
100-
1 4 4 33.565437
101-
2 4 3 0.794144
102-
3 1 4 0.311046
103-
4 2 2 0.406148
104-
5 2 4 0.459704
105-
6 2 3 0.326396
106-
7 1 2 0.411317
107-
8 3 3 0.378674
108-
9 2 1 0.445499
109-
10 3 4 42.531053
110-
11 1 3 0.455651
111-
12 4 1 0.504505
112-
13 3 2 0.393902
113-
14 3 1 0.364927
114-
15 1 1 0.379049
115-
Max harmonic mean (C,S) where TTFT_95 <= 2s: 3.43
116-
=> C=4.0, S=3.0, CxS=12.0
117-
$ python plot.py ./bench_dir_lmcache lmcache.png
118-
num_users_concurrent num_users_sequential ttft_95
119-
0 1 1 0.524989
120-
1 3 2 0.592148
121-
2 4 4 1.202544
122-
3 3 4 1.286755
123-
4 2 1 0.477370
124-
5 3 3 0.586793
125-
6 2 3 0.627655
126-
7 4 1 0.575724
127-
8 4 3 1.251918
128-
9 2 4 0.446477
129-
10 1 4 0.460711
130-
11 3 1 0.495073
131-
12 1 3 0.329389
132-
13 4 2 0.586223
133-
14 1 2 0.477946
134-
15 2 2 0.457463
135-
Max harmonic mean (C,S) where TTFT_95 <= 2s: 4.00
136-
=> C=4.0, S=4.0, CxS=16.0
143+
$ python plot.py lmcache_bench_dir-1749973344 lmcache_with_cpu_200g.png
144+
c s ttft_95
145+
0 8 16 2.674693
146+
1 12 32 3.268448
147+
2 4 32 2.496206
148+
3 16 16 3.310291
149+
4 4 8 0.146159
150+
5 8 32 2.801732
151+
6 12 24 3.283783
152+
7 12 16 3.185047
153+
8 12 8 0.390896
154+
9 4 24 0.217809
155+
10 16 8 3.799740
156+
11 8 8 0.347083
157+
12 16 24 3.171192
158+
13 16 32 3.032414
159+
14 8 24 3.383691
160+
15 4 16 0.253737
161+
Best (C,S) with TTFT_95 ≤ 2 s → C=12.0, S=8.0, HarmonicMean=9.60, C×S=96.0
162+
Saved: lmcache_with_cpu_200g.png
163+
$ python plot.py lmcache_bench_dir-1749897431 lmcache_with_cpu_200g_exa_400g.png
164+
c s ttft_95
165+
0 4 16 0.255378
166+
1 8 24 3.213307
167+
2 16 24 4.067904
168+
3 4 24 0.612876
169+
4 8 32 4.389398
170+
5 4 8 0.158686
171+
6 12 24 3.939205
172+
7 12 8 0.634048
173+
8 4 32 1.191106
174+
9 12 32 3.475115
175+
10 16 16 3.156051
176+
11 8 8 0.264291
177+
12 12 16 2.739532
178+
13 16 32 3.853057
179+
14 8 16 1.424959
180+
15 16 8 3.470811
181+
Best (C,S) with TTFT_95 ≤ 2 s → C=8.0, S=16.0, HarmonicMean=10.67, C×S=128.0
182+
Saved: lmcache_with_cpu_200g_exa_400g.png
137183
```
138-
139-
LMCache allows 1.17x increase in the number of user sessions kept active at least.
140-
141-
Note: LMCache has not yet reached its limit in this case,
142-
so we can aim to further improve the score by changing C and S.
184+
This result shows that adding external storage (DDN EXAScaler) as a tier in the KV cache can increase the number of active sessions.
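The `c s ttft_95` table printed by `plot.py` can be reproduced from the raw JSON files with a short aggregation sketch. The `load_ttft95` helper below is illustrative, not `plot.py` itself; it assumes the `bench_c{C}_s{S}_{timestamp}.json` naming from the loop above and that each file holds a list of per-turn records with `ttft` and `status` fields:

```python
import glob
import json
import re

import numpy as np

def load_ttft95(bench_dir: str):
    """Collect (c, s, ttft_95) rows from a directory of benchmark outputs."""
    rows = []
    for path in sorted(glob.glob(f"{bench_dir}/bench_c*_s*_*.json")):
        m = re.search(r"bench_c(\d+)_s(\d+)_\d+\.json$", path)
        if m is None:
            continue  # skip files that don't follow the naming scheme
        with open(path) as f:
            records = json.load(f)
        # Only successful turns contribute to the percentile.
        ttfts = [r["ttft"] for r in records if r.get("status") == "success"]
        rows.append((int(m.group(1)), int(m.group(2)),
                     float(np.percentile(ttfts, 95))))
    return rows
```

Feeding the rows into a function like `best_pair` from the objective section then yields the reported best (C, S).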
143185

144186
## Viz
145187

146-
vllm.png
188+
The white dashed line indicates the TTFT = 2s boundary.
189+
190+
System A result:
147191

148-
![vLLM Plot](vllm.png)
192+
![LMCache+CPU Plot](lmcache_with_cpu_200g.png)
149193

150-
lmcache.png
194+
System B result:
151195

152-
![LMCache Plot](lmcache.png)
196+
![LMCache+CPU+Storage400g Plot](lmcache_with_cpu_200g_exa_400g.png)

real-multi-round-qa/lmcache.png

-74.7 KB
Binary file not shown.

real-multi-round-qa/multi-round-qa.py

Lines changed: 17 additions & 8 deletions
@@ -8,6 +8,7 @@
88
from dataclasses import dataclass, asdict
99
from typing import List
1010
import json
11+
import requests
1112

1213
FIRST_PROMPT = "Read and summarize this novel.\n\n{}"
1314
FOLLOWUP_PROMPTS = [
@@ -57,11 +58,13 @@
5758
class Result:
5859
session_id: str
5960
turn: int
61+
start_time: float
6062
latency: float
6163
ttft: float
6264
generation_time: float
6365
prompt_tokens: int
6466
completion_tokens: int
67+
metrics: str
6568
status: str
6669

6770
class ChatSession:
@@ -98,7 +101,7 @@ def append_assistant_message(self, content):
98101
self.messages.append({"role": "assistant", "content": content})
99102
self.turns += 1
100103

101-
async def run_turn(session: ChatSession, client: openai.AsyncOpenAI) -> Result:
104+
async def run_turn(session: ChatSession, client: openai.AsyncOpenAI, base_url: str) -> Result:
102105
prompt = session.get_next_prompt()
103106
session.append_user_message(prompt)
104107

@@ -109,6 +112,10 @@ async def run_turn(session: ChatSession, client: openai.AsyncOpenAI) -> Result:
109112
prompt_tokens = 0
110113

111114
print(f"Session {session.session_id}, Turn {session.turns}: {prompt[:50]}...")
115+
116+
resp = requests.get(f"{base_url}/metrics")
117+
resp.raise_for_status()
118+
112119
response = await client.chat.completions.create(
113120
model=session.model,
114121
messages=session.messages,
@@ -137,41 +144,43 @@ async def run_turn(session: ChatSession, client: openai.AsyncOpenAI) -> Result:
137144
result = Result(
138145
session_id=session.session_id,
139146
turn=session.turns,
147+
start_time=start_time,
140148
latency=latency,
141149
ttft=ttft,
142150
generation_time=generation_time,
143151
prompt_tokens=prompt_tokens,
144152
completion_tokens=completion_tokens,
153+
metrics=resp.text,
145154
status="success",
146155
)
147-
156+
148157
session.append_assistant_message(content)
149158

150159
return result
151160

152161
async def run_group(args) -> List[Result]:
153-
client = openai.AsyncOpenAI(base_url=args.base_url, api_key="EMPTY")
154-
sessions = [ChatSession(args) for _ in range(args.num_users_sequential)]
162+
client = openai.AsyncOpenAI(base_url=f"{args.base_url}/v1", api_key="EMPTY")
163+
sessions = [ChatSession(args) for _ in range(args.session_depth)]
155164
results = []
156165

157166
while any(not s.is_finished() for s in sessions):
158167
for session in sessions:
159168
if session.is_finished():
160169
continue
161-
result = await run_turn(session, client)
170+
result = await run_turn(session, client, args.base_url)
162171
results.append(result)
163172

164173
return results
165174

166175
async def run_all_concurrent(args):
167-
tasks = [run_group(args) for _ in range(args.num_users_concurrent)]
176+
tasks = [run_group(args) for _ in range(args.concurrent)]
168177
all_results = await asyncio.gather(*tasks)
169178
return [asdict(r) for group in all_results for r in group]
170179

171180
def parse_args():
172181
parser = argparse.ArgumentParser()
173-
parser.add_argument("--num-users-concurrent", type=int, required=True)
174-
parser.add_argument("--num-users-sequential", type=int, required=True)
182+
parser.add_argument("-c", "--concurrent", type=int, required=True)
183+
parser.add_argument("-s", "--session-depth", type=int, required=True)
175184
parser.add_argument("--model", type=str, required=True)
176185
parser.add_argument("--base-url", type=str, required=True)
177186
parser.add_argument("--num-rounds", type=int, default=10)
