136 changes: 133 additions & 3 deletions kv_cache_benchmark/README.md
@@ -25,7 +25,8 @@ A storage benchmarking tool for Large Language Model inference systems. This ben
12. [Understanding Results](#understanding-results)
13. [Unit Testing](#unit-testing)
14. [Excel Export](#excel-export)
15. [Block-Layer Latency Tracing](#block-layer-latency-tracing)
16. [MLPerf Submission Guidelines](#mlperf-submission-guidelines)
17. [Troubleshooting](#troubleshooting)

---
@@ -1498,10 +1499,39 @@ The test suite covers 23 component categories with ~170 individual tests:
| `TestPerTierPhaseMetrics` | 7 | Per-tier (GPU/CPU/Storage) KV bytes read/written tracking during prefill/decode phases |
| `TestPerTierPhaseMetricsWithGPU` | 4 | GPU tier metrics tracking, phase-aware read/write separation (skipped without GPU) |

### Visualize User Request Flow

The `TestVisualizeUserRequestFlow` test class traces the complete I/O path of real requests through the benchmark. These are the tests to run when you want to understand exactly what the benchmark does at each step:

```bash
# Part 3: The 4 latency levels (L1-L4) with real NVMe timing
pytest tests/test_kv_cache.py::TestVisualizeUserRequestFlow::test_part3_four_latency_levels -v -s

# Part 3b: How requests become .npy files on disk
pytest tests/test_kv_cache.py::TestVisualizeUserRequestFlow::test_part3b_request_to_npy_file_mapping -v -s

# Part 3c: Multi-turn conversation I/O (triangular read pattern)
pytest tests/test_kv_cache.py::TestVisualizeUserRequestFlow::test_part3c_multi_turn_prefill_decode_file_io -v -s

# Part 3d: Multi-turn with eviction pressure (hits vs misses under LRU)
pytest tests/test_kv_cache.py::TestVisualizeUserRequestFlow::test_part3d_multi_turn_with_eviction -v -s

# Part 4: 3-tier waterfall LRU eviction cascade (GPU -> CPU -> NVMe -> DELETE)
pytest tests/test_kv_cache.py::TestVisualizeUserRequestFlow::test_part4_three_tier_waterfall_eviction -v -s

# Part 5: NVMe-only eviction (what happens when the drive fills up)
pytest tests/test_kv_cache.py::TestVisualizeUserRequestFlow::test_part5_one_tier_nvme_only_eviction -v -s

# Run all visualization tests at once
pytest tests/test_kv_cache.py::TestVisualizeUserRequestFlow -v -s
```

Use `-s` to see the printed I/O traces; without it pytest captures the output and you lose the visualization.

### Expected Runtime

- **Without GPU**: ~4-5 minutes (211 tests)
- **With GPU**: ~5-6 minutes

GPU tests are automatically skipped if CUDA is not available.

@@ -1553,6 +1583,106 @@ The Excel file contains a single row with all key metrics:

---

## Block-Layer Latency Tracing

A single flag, `--enable-latency-tracing`, adds block-layer visibility to the benchmark: no code changes, no separate tooling, and minimal overhead. It spawns bpftrace as a sudo subprocess, attaches to kernel block-layer tracepoints for the duration of the run, and on completion distills the I/O profile into structured telemetry across stdout, JSON, and XLSX.

This is the same class of telemetry that storage engineers use when characterizing production workloads; the difference is that it is fully integrated into the benchmark and the results are machine-readable.

### What It Captures

15 histograms across the full I/O stack:

| Category | Histograms | What It Tells You |
|----------|-----------|-------------------|
| Device hardware | D2C read/write | Per-NVMe-command completion time; this is what the SSD controller actually took |
| I/O scheduler | Q2D read/write | Time sitting in the Linux I/O scheduler queue before dispatch |
| Application visible | VFS read/write | Full syscall latency from the application's perspective |
| Serialization | write-to-fsync gap, fsync, fadvise-to-read gap | CPU vs device bottleneck decomposition |
| Block sizes | bssplit read/write | I/O size distribution at the kernel layer (matches MDTS splits) |
| Queue depth | In-flight at dispatch read/write | Instantaneous I/O concurrency at the moment of dispatch |
| Spatial | LBA heatmap read/write | Where on the device the I/O lands (10 GB linear buckets) |

### Usage

```bash
# Run benchmark with tracing (requires sudo for bpftrace)
kv-cache --config config.yaml --model llama3.1-8b \
--num-users 10 --duration 30 \
--gpu-mem-gb 0 --cpu-mem-gb 0 \
--max-concurrent-allocs 1 \
--generation-mode none \
--cache-dir /mnt/nvme --seed 42 \
--enable-latency-tracing \
--xlsx-output results_traced.xlsx
```

The tracing output appears at the end of the benchmark results. The XLSX gets two additional sheets: **Device Tracing** (P50/P95/P99 summary per histogram) and **Trace Histograms** (raw bucket data for charting).

### Standalone Tracing Against vLLM / llm-d

The bpftrace scripts work independently of the benchmark. Point them at any inference engine process:

```bash
# Trace vLLM and generate a fio workload
sudo ./utils/storage_latency_stack.sh vllm --fio

# Trace llm-d
sudo ./utils/storage_latency_stack.sh llm-d --fio

# Trace any process
sudo ./utils/storage_latency_stack.sh python3

# Manual distill from saved trace
python3 utils/distill_fio.py -i trace_output.txt --process vllm -o vllm_workload.ini
```

The `--fio` flag captures the bpftrace output and pipes it through `distill_fio.py` to generate a standalone fio workload file. This means you can trace vLLM on a production node, take the generated `.ini` file, and replay the captured I/O pattern on a bare-metal test rig with fio to compare drives without running the inference stack.

### fio Workload Distiller

When `--enable-latency-tracing` is used with the benchmark, or when `--fio` is passed to the shell wrapper, a fio .ini file is generated automatically. The fio config includes:

- **bssplit** from the traced block size distribution (separate read/write splits)
- **rwmixread** from the read/write I/O count ratio
- **iodepth** from the in-flight I/O histogram P50
- **thinktime** from the write-to-fsync serialization gap (idle time between I/O bursts)
- D2C latency summary and LBA hot zone in the header comments

Example generated config:
```ini
[kv-cache-traced]
ioengine=libaio
direct=1
time_based
runtime=300
rw=randrw
rwmixread=87
bssplit=4k/1:8k/1:16k/1:32k/1:64k/1:128k/100,4k/7:8k/1:16k/1:32k/4:64k/4:128k/83
iodepth=2048
iodepth_batch_submit=2048
iodepth_batch_complete_min=1
size=100G
thinktime=32
thinktime_blocks=2048
thinktime_iotime=1s
refill_buffers=1
norandommap=1
randrepeat=0
numjobs=1
group_reporting
percentile_list=50:95:99:99.9:99.99
```
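Each `bssplit` value packs `size/weight` pairs, with the comma separating the read split from the write split. A small hypothetical helper (not part of the tooling) to unpack one side for inspection:

```python
def parse_bssplit(spec):
    """Parse one side of a fio bssplit value, e.g. '4k/7:128k/83',
    into [(block_size_bytes, weight), ...]."""
    units = {"k": 1024, "m": 1024 ** 2}
    pairs = []
    for entry in spec.split(":"):
        size, weight = entry.split("/")
        if size[-1].lower() in units:
            size_bytes = int(size[:-1]) * units[size[-1].lower()]
        else:
            size_bytes = int(size)
        pairs.append((size_bytes, int(weight)))
    return pairs

reads = parse_bssplit("4k/1:8k/1:16k/1:32k/1:64k/1:128k/100")
writes = parse_bssplit("4k/7:8k/1:16k/1:32k/4:64k/4:128k/83")
# 128 KiB dominates both sides -- consistent with MDTS-sized kernel splits
print(max(reads, key=lambda p: p[1]))   # -> (131072, 100)
print(max(writes, key=lambda p: p[1]))  # -> (131072, 83)
```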

### Requirements

- Linux 5.x+ with BTF support
- bpftrace 0.14+ (`sudo apt install bpftrace`)
- sudo or CAP_BPF privileges
- If bpftrace is not available, the flag degrades gracefully; the benchmark runs normally without tracing.

---

## MLPerf Submission Guidelines

For official MLPerf v3.0 storage submissions, use these standardized commands. **These invocations have been validated through extensive discovery testing** (1,411 Fast system tests, 268 Slow system tests comparing 14,000 MB/s vs 3,000 MB/s storage).
86 changes: 86 additions & 0 deletions kv_cache_benchmark/docs/MLperf v3 KV cache proposal.md
@@ -1981,6 +1981,92 @@ data = buffer[start:start+size].reshape(kv_shape)

---

## 8.5 Block-Layer Latency Tracing & fio Workload Distiller

The benchmark includes an integrated block-layer tracing capability that decomposes storage I/O across every layer of the Linux I/O stack, from the application (VFS) down to the NVMe controller (D2C). It is enabled with a single flag, requires no code changes or separate tooling, and adds minimal overhead to the benchmark run.

### Motivation

The L4 "device" latency reported by the benchmark measures the time to read or write an entire `.npy` file through NumPy. For large KV cache entries (500 MB to 2 GB), the kernel splits each file I/O into hundreds or thousands of NVMe commands at the MDTS boundary. The resulting P95 device latency therefore reflects the total time to load a large entry: it includes both the actual NVMe hardware time and the NumPy deserialization overhead within that single `np.load()` call. Without block-layer visibility, there is no way to tell how much of that latency comes from the drive versus the host.
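To put a number on that splitting, a back-of-envelope count (128 KiB is an assumed split size here, matching the dominant bssplit bucket; the real MDTS is drive-specific):

```python
# Back-of-envelope NVMe command count for a single KV cache entry,
# assuming a 128 KiB MDTS (illustrative; the actual limit is per-drive).
MDTS_BYTES = 128 * 1024
entry_bytes = 2 * 1024 ** 3        # a 2 GiB entry at the large end
commands = entry_bytes // MDTS_BYTES
print(commands)                     # -> 16384
```

So the file-level P95 aggregates thousands of D2C samples plus host-side work, which is exactly what the D2C and VFS histograms let you separate.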

### Enabling Tracing

```bash
kv-cache --config config.yaml --model llama3.1-8b \
--num-users 10 --duration 30 \
--gpu-mem-gb 0 --cpu-mem-gb 0 \
--max-concurrent-allocs 1 \
--generation-mode none \
--cache-dir /mnt/nvme --seed 42 \
--enable-latency-tracing \
--xlsx-output results_traced.xlsx
```

The benchmark spawns bpftrace as a sudo subprocess before the run, attaches to 16 kernel tracepoints, and on completion sends SIGINT to collect the histogram data. The tracing subprocess runs in its own process group; the benchmark itself does not require root.

### Histograms Captured

| Histogram | Layer | What It Measures |
|-----------|-------|-----------------|
| D2C read/write | Device | Per-NVMe-command completion time (actual hardware latency) |
| Q2D read/write | I/O Scheduler | Time in the scheduler queue before dispatch to the NVMe driver |
| VFS read/write | Application | Full syscall time including page cache, filesystem, and block I/O |
| fsync | Device | Actual device flush latency after buffered writes |
| write-to-fsync gap | Serialization | CPU idle time between write() return and fsync() entry |
| fadvise-to-read gap | Cache mgmt | Overhead of page cache invalidation before reads |
| bssplit read/write | Block sizes | I/O size distribution at the kernel layer |
| Queue depth read/write | Concurrency | Instantaneous in-flight I/O count at the moment of dispatch |
| LBA heatmap read/write | Spatial | Where on the device the I/O lands (10 GB linear buckets) |

### fio Workload Distiller

When tracing is enabled, the benchmark automatically generates a standalone fio .ini file that reproduces the observed I/O pattern. The distiller extracts bssplit (block size distribution with separate read/write splits), rwmixread (from the I/O count ratio), iodepth (from the in-flight I/O histogram), and thinktime (from the write-to-fsync serialization gap) and writes them into a fio config that can be run independently against any device.
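A simplified sketch of that mapping, with invented counters chosen to land on the example config's values (the real distiller works from full histograms, not scalar summaries):

```python
def derive_fio_params(read_ios, write_ios, inflight_p50, fsync_gap_p50_us):
    """Map traced counters to fio job parameters (simplified sketch)."""
    total = read_ios + write_ios
    return {
        "rwmixread": round(100 * read_ios / total),  # read share of I/O count
        "iodepth": inflight_p50,                     # typical in-flight depth
        "thinktime": fsync_gap_p50_us,               # idle gap between bursts, us
    }

# Hypothetical traced counters
params = derive_fio_params(read_ios=8700, write_ios=1300,
                           inflight_p50=2048, fsync_gap_p50_us=32)
print(params)  # -> {'rwmixread': 87, 'iodepth': 2048, 'thinktime': 32}
```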

Example output from a traced benchmark run on Kingston DC3000ME:

```ini
[kv-cache-traced]
ioengine=libaio
direct=1
time_based
runtime=300
rw=randrw
rwmixread=87
bssplit=4k/1:8k/1:16k/1:32k/1:64k/1:128k/100,4k/7:8k/1:16k/1:32k/4:64k/4:128k/83
iodepth=2048
iodepth_batch_submit=2048
iodepth_batch_complete_min=1
size=100G
thinktime=32
thinktime_blocks=2048
thinktime_iotime=1s
refill_buffers=1
norandommap=1
randrepeat=0
numjobs=1
group_reporting
percentile_list=50:95:99:99.9:99.99
```

### Standalone Usage Against Inference Engines

The tracing tools work independently of the benchmark. The shell wrapper and Python distiller can be pointed at any process:

```bash
# Trace vLLM and generate fio workload
sudo ./utils/storage_latency_stack.sh vllm --fio

# Trace llm-d
sudo ./utils/storage_latency_stack.sh llm-d --fio

# Manual distill from saved trace output
python3 utils/distill_fio.py -i trace_output.txt --process vllm -o vllm_workload.ini
```

This means you can characterize the I/O profile of a real inference engine on a production node, take the generated fio .ini file to a test bench, and run it against multiple drives with fio to compare storage performance without deploying the full inference stack.

---

## 9. Common Issues & Troubleshooting

### Issue: High Host Latency