Commit b32c222

cxs: record metrics

Signed-off-by: Sohei Koyama <skoyama@ddn.com>

1 parent 30a006b

7 files changed: 181 additions & 132 deletions

real-multi-round-qa/README.md (81 additions & 51 deletions)
@@ -4,7 +4,7 @@
 
 This benchmark is designed to explore how TTFT changes across different $(C, S)$ combinations by sweeping concurrency ($C$) and session depth ($S$) independently. This helps isolate whether compute capacity or KV-cache pressure is the primary limiting factor.
 
-We highly recommend monitoring vLLM/LMCache/GPU/storage metrics at the same time.
+We highly recommend monitoring vLLM/LMCache/GPU/storage metrics at the same time. The JSON output from the benchmark includes metrics from vLLM/LMCache.
 
 This benchmark feeds full‑length novels to your LLM server and asks many follow‑up questions, just like a book critic. It is handy for testing long‑context handling and KV‑cache tools such as LMCache.

@@ -79,9 +79,9 @@ python prepare.py --output data --model Qwen/Qwen2.5-7B-Instruct-1M # Models use
 
 ```bash
 # Run the benchmark many times
-BASE_URL="http://localhost:8000/v1"
+BASE_URL="http://localhost:8000"
 MODEL="Qwen/Qwen2.5-7B-Instruct-1M"
-NUM_ROUNDS=3
+NUM_ROUNDS=12
 OUTPUT_DIR="bench_dir"
 SRC_DIR="./data/128k"
 mkdir -p "$OUTPUT_DIR"
@@ -96,61 +96,91 @@ for c in {1..4}; do # You can change c and s to any value you like.
 done
 ```
 
+We compare two systems for this demo:
+
+System A
+* vLLM
+  * v0.9.0.1
+  * enable prefix-caching
+  * enable chunked prefill
+* LMCache
+  * local_cpu: True
+  * max_local_cpu_size: 200
+  * pipelined_backend: True
+  * save_decode_cache: True
+
+System B
+* vLLM
+  * v0.9.0.1
+  * enable prefix-caching
+  * enable chunked prefill
+* LMCache
+  * local_cpu: True
+  * max_local_cpu_size: 200
+  * pipelined_backend: True
+  * save_decode_cache: True
+  * local_disk: file:///data/tmp
+  * max_local_disk_size: 400
+* Storage
+  * DDN EXAScaler 2.14.0
+  * stripe count is 8
+  * stripe size is 1MiB
+
 ```bash
 # Plot and Show Result
-$ python plot.py ./bench_dir_vllm vllm.png
-     c  s    ttft_95
-0    4  2   0.498404
-1    4  4  33.565437
-2    4  3   0.794144
-3    1  4   0.311046
-4    2  2   0.406148
-5    2  4   0.459704
-6    2  3   0.326396
-7    1  2   0.411317
-8    3  3   0.378674
-9    2  1   0.445499
-10   3  4  42.531053
-11   1  3   0.455651
-12   4  1   0.504505
-13   3  2   0.393902
-14   3  1   0.364927
-15   1  1   0.379049
-Max harmonic mean (C,S) where TTFT_95 <= 2s: 3.43
-=> C=4.0, S=3.0, CxS=12.0
-$ python plot.py ./bench_dir_lmcache lmcache.png
-     c  s   ttft_95
-0    1  1  0.524989
-1    3  2  0.592148
-2    4  4  1.202544
-3    3  4  1.286755
-4    2  1  0.477370
-5    3  3  0.586793
-6    2  3  0.627655
-7    4  1  0.575724
-8    4  3  1.251918
-9    2  4  0.446477
-10   1  4  0.460711
-11   3  1  0.495073
-12   1  3  0.329389
-13   4  2  0.586223
-14   1  2  0.477946
-15   2  2  0.457463
-Max harmonic mean (C,S) where TTFT_95 <= 2s: 4.00
-=> C=4.0, S=4.0, CxS=16.0
+$ python plot.py lmcache_bench_dir-1749973344 lmcache_with_cpu_200g.png
+     c   s   ttft_95
+0    8  16  2.674693
+1   12  32  3.268448
+2    4  32  2.496206
+3   16  16  3.310291
+4    4   8  0.146159
+5    8  32  2.801732
+6   12  24  3.283783
+7   12  16  3.185047
+8   12   8  0.390896
+9    4  24  0.217809
+10  16   8  3.799740
+11   8   8  0.347083
+12  16  24  3.171192
+13  16  32  3.032414
+14   8  24  3.383691
+15   4  16  0.253737
+Best (C,S) with TTFT_95 ≤ 2 s → C=12.0, S=8.0, HarmonicMean=9.60, C×S=96.0
+Saved: lmcache_with_cpu_200g.png
+$ python plot.py lmcache_bench_dir-1749897431 lmcache_with_cpu_200g_exa_400g.png
+     c   s   ttft_95
+0    4  16  0.255378
+1    8  24  3.213307
+2   16  24  4.067904
+3    4  24  0.612876
+4    8  32  4.389398
+5    4   8  0.158686
+6   12  24  3.939205
+7   12   8  0.634048
+8    4  32  1.191106
+9   12  32  3.475115
+10  16  16  3.156051
+11   8   8  0.264291
+12  12  16  2.739532
+13  16  32  3.853057
+14   8  16  1.424959
+15  16   8  3.470811
+Best (C,S) with TTFT_95 ≤ 2 s → C=8.0, S=16.0, HarmonicMean=10.67, C×S=128.0
+Saved: lmcache_with_cpu_200g_exa_400g.png
 ```
+This result shows that adding external storage (DDN EXAScaler) as a tier in the KV cache can increase the number of active sessions.
 
-LMCache allows 1.17x increase in the number of user sessions kept active at least.
-
-Note: LMCache has not yet reached its limit in this case,
-so we can aim to further improve the score by changing C and S.
+Note: the metrics included in the JSON indicate that the TTFT degradation was caused by the storage running out of capacity.
 
 ## Viz
 
-vllm.png
+The white dashed line indicates the TTFT = 2s boundary.
+
+System A result:
 
-![vLLM Plot](vllm.png)
+![LMCache+CPU Plot](lmcache_with_cpu_200g.png)
 
-lmcache.png
+System B result:
 
-![LMCache Plot](lmcache.png)
+![LMCache+CPU+Storage400g Plot](lmcache_with_cpu_200g_exa_400g.png)
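The plot.py script itself is not part of this diff, but its printed output suggests a simple selection rule: among all $(C, S)$ points whose TTFT_95 stays within the 2 s budget, pick the pair with the maximum harmonic mean of C and S. A minimal sketch of that rule (the function name `best_point` is illustrative, not from plot.py):

```python
def best_point(rows, ttft_budget=2.0):
    """rows: iterable of (c, s, ttft_95) tuples.
    Return the feasible (c, s) with the largest harmonic mean, or None."""
    feasible = [(c, s) for c, s, ttft in rows if ttft <= ttft_budget]
    if not feasible:
        return None
    # Harmonic mean 2CS/(C+S) rewards points that are balanced in both dimensions.
    return max(feasible, key=lambda p: 2 * p[0] * p[1] / (p[0] + p[1]))

# Three rows taken from the System A table above:
rows = [(12, 8, 0.390896), (16, 8, 3.799740), (4, 8, 0.146159)]
print(best_point(rows))  # → (12, 8)
```

This reproduces the reported numbers: for System A, (C=12, S=8) gives 2·12·8/(12+8) = 9.60; for System B, (C=8, S=16) gives 2·8·16/(8+16) ≈ 10.67.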

real-multi-round-qa/lmcache.png (binary, -46.1 KB, file not shown)

Two new binary image files added (56 KB and 57.7 KB, not shown).

real-multi-round-qa/multi-round-qa.py (13 additions & 4 deletions)
@@ -8,6 +8,7 @@
 from dataclasses import dataclass, asdict
 from typing import List
 import json
+import requests
 
 FIRST_PROMPT = "Read and summarize this novel.\n\n{}"
 FOLLOWUP_PROMPTS = [
@@ -57,11 +58,13 @@
 class Result:
     session_id: str
     turn: int
+    start_time: float
     latency: float
     ttft: float
     generation_time: float
     prompt_tokens: int
     completion_tokens: int
+    metrics: str
     status: str
 
 class ChatSession:
@@ -98,7 +101,7 @@ def append_assistant_message(self, content):
         self.messages.append({"role": "assistant", "content": content})
         self.turns += 1
 
-async def run_turn(session: ChatSession, client: openai.AsyncOpenAI) -> Result:
+async def run_turn(session: ChatSession, client: openai.AsyncOpenAI, base_url: str) -> Result:
     prompt = session.get_next_prompt()
     session.append_user_message(prompt)
 
@@ -109,6 +112,10 @@ async def run_turn(session: ChatSession, client: openai.AsyncOpenAI) -> Result:
     prompt_tokens = 0
 
     print(f"Session {session.session_id}, Turn {session.turns}: {prompt[:50]}...")
+
+    resp = requests.get(f"{base_url}/metrics")
+    resp.raise_for_status()
+
     response = await client.chat.completions.create(
         model=session.model,
         messages=session.messages,
@@ -137,28 +144,30 @@ async def run_turn(session: ChatSession, client: openai.AsyncOpenAI) -> Result:
     result = Result(
         session_id=session.session_id,
         turn=session.turns,
+        start_time=start_time,
         latency=latency,
         ttft=ttft,
         generation_time=generation_time,
         prompt_tokens=prompt_tokens,
         completion_tokens=completion_tokens,
+        metrics=resp.text,
         status="success",
     )
-
+
     session.append_assistant_message(content)
 
     return result
 
 async def run_group(args) -> List[Result]:
-    client = openai.AsyncOpenAI(base_url=args.base_url, api_key="EMPTY")
+    client = openai.AsyncOpenAI(base_url=f"{args.base_url}/v1", api_key="EMPTY")
     sessions = [ChatSession(args) for _ in range(args.session_depth)]
     results = []
 
     while any(not s.is_finished() for s in sessions):
         for session in sessions:
             if session.is_finished():
                 continue
-            result = await run_turn(session, client)
+            result = await run_turn(session, client, args.base_url)
             results.append(result)
 
     return results
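The `/metrics` endpoint fetched above serves Prometheus text exposition format, and the commit stores the raw text in `Result.metrics`. When post-processing the benchmark JSON, a single sample can be pulled out of that text without extra dependencies; a minimal sketch (the metric name below is illustrative, not an actual vLLM metric name):

```python
def parse_metric(metrics_text, name):
    """Return the first sample value for `name` from Prometheus text, or None."""
    for line in metrics_text.splitlines():
        if line.startswith("#"):  # skip HELP/TYPE comment lines
            continue
        if line.startswith(name):
            return float(line.rsplit(" ", 1)[1])  # value is the last field
    return None

sample = 'example_requests_total{engine="0"} 42.0'
print(parse_metric(sample, "example_requests_total"))  # → 42.0
```

Note the prefix match is deliberately loose (it also matches labeled variants of the same metric); for anything more robust, a dedicated Prometheus parser would be the safer choice.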
