Reproducible LLM inference benchmarks (prefill vs decode throughput) to inform requirements for an intermediate memory tier (HBF-class) between HBM and SSD.
This repo uses llama.cpp's `llama-bench` because it prints stable, machine-parseable throughput:
- `ppXXX` = prefill / prompt-processing tokens/sec
- `tgXXX` = decode / token-generation tokens/sec

You can run sweeps over prompt length (p) and generation length (n) to build a performance surface, and later re-run under “tier constraints” (bandwidth/latency/QoS emulation).
- Windows 11 + WSL2 Ubuntu
- llama.cpp built from source
- Model: Qwen2.5-3B-Instruct GGUF (Q4_K_M), a good fit for 16 GB RAM
- Benchmark: `llama-bench` with `pp` and `tg` throughput

Tip: keep model files in the Linux filesystem (`~/models/`), not under `/mnt/c/...`, for consistent performance.
Typical files:
- `harness/run_llama_bench.py` — run one benchmark and write JSON (see the sketch below)
- `harness/sweep_llama_bench.py` — run a grid sweep (p×n) and write CSV + JSONs
- `harness/system_info.py` — capture system/WSL context into `results/system_info.json`
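For orientation, here is a minimal sketch of what a wrapper like `run_llama_bench.py` can do, assuming `llama-bench` is invoked with its JSON output mode (`-o json`). The field names used below (`n_prompt`, `n_gen`, `avg_ts`) are assumptions about that output; the repo's actual script may differ.

```python
#!/usr/bin/env python3
"""Sketch: run llama-bench once and save pp/tg throughput as JSON.

Assumes llama-bench's `-o json` output mode; the field names used below
(n_prompt, n_gen, avg_ts) are assumptions about that output.
"""
import argparse
import json
import os
import subprocess

ap = argparse.ArgumentParser()
ap.add_argument("--model", required=True)
ap.add_argument("--bench", default="~/work/llama.cpp/build/bin/llama-bench")
ap.add_argument("-t", type=int, default=8)
ap.add_argument("-p", type=int, default=256)
ap.add_argument("-n", type=int, default=256)
ap.add_argument("--out", required=True)
a = ap.parse_args()

cmd = [os.path.expanduser(a.bench), "-m", os.path.expanduser(a.model),
       "-t", str(a.t), "-p", str(a.p), "-n", str(a.n), "-o", "json"]
rows = json.loads(subprocess.check_output(cmd, text=True))

# One result object per test; prefill rows generate no tokens (n_gen == 0)
# and decode rows consume no prompt (n_prompt == 0) -- assumed convention.
pp = [r["avg_ts"] for r in rows if r.get("n_gen") == 0]
tg = [r["avg_ts"] for r in rows if r.get("n_prompt") == 0]

with open(a.out, "w") as f:
    json.dump({"pp_tps": pp[0] if pp else None,
               "tg_tps": tg[0] if tg else None,
               "rows": rows}, f, indent=2)
```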
V2 (HBF emulation additions):
- `emulation/tier_copy.py` — throttled “tier” copy with a bandwidth cap plus per-chunk latency (see the sketch after this list)
- `harness/sweep_hbf_weight_tier.py` — stage the model via `tier_copy.py` under different constraints, then run `llama-bench`
- `emulation/kv_spill_sim.py` — compute the decode ceiling vs tier BW/latency for an assumed KV spill volume
- `docs/hbf_emulation_v2.md` — interpretation and examples
Note: V2 emulates HBF-like tier constraints without real HBF hardware. It produces requirement curves.
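To make the emulation concrete, here is a minimal sketch of a bandwidth-capped, per-chunk-latency copy in the spirit of `tier_copy.py`. The chunk size and CLI option names are illustrative; the repo's actual script may differ.

```python
#!/usr/bin/env python3
"""Sketch: copy a file through an emulated tier with a bandwidth cap and a
fixed per-chunk latency. Chunk size and CLI names are illustrative."""
import argparse
import time

def tier_copy(src, dst, mbps, lat_ms, chunk_mb=4):
    chunk = chunk_mb * 1024 * 1024
    t0 = time.monotonic()
    copied = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            buf = fin.read(chunk)
            if not buf:
                break
            fout.write(buf)
            copied += len(buf)
            # Per-chunk latency: a fixed cost per "request" to the tier.
            if lat_ms:
                time.sleep(lat_ms / 1000.0)
            # Bandwidth cap: never run ahead of the schedule copied/mbps.
            ahead = copied / (mbps * 1e6) - (time.monotonic() - t0)
            if ahead > 0:
                time.sleep(ahead)
    return copied / 1e6 / (time.monotonic() - t0)  # effective MB/s

if __name__ == "__main__":
    ap = argparse.ArgumentParser()
    ap.add_argument("src")
    ap.add_argument("dst")
    ap.add_argument("--mbps", type=float, required=True)
    ap.add_argument("--lat_ms", type=float, default=0.0)
    a = ap.parse_args()
    print(f"effective MB/s: {tier_copy(a.src, a.dst, a.mbps, a.lat_ms):.1f}")
```

Once `lat_ms` exceeds a chunk's transfer time at the cap, latency rather than bandwidth sets the effective rate — which is exactly the regime the V2 sweeps probe.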
In PowerShell (Admin):
```powershell
wsl --install -d Ubuntu-24.04
```

Reboot if prompted.
Launch Ubuntu:
```powershell
wsl -d Ubuntu-24.04
```

Then install build dependencies:

```bash
sudo apt update && sudo apt upgrade -y
sudo apt install -y git build-essential cmake python3 python3-venv python3-pip wget unzip
```
```bash
mkdir -p ~/work ~/models
cd ~/work
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build -j 2
```

Verify:

```bash
ls -la build/bin | grep llama-bench
```

If the build runs out of memory, rebuild with fewer parallel jobs:

```bash
cmake --build build -j 1
```

For a 16GB machine, start with:
Qwen2.5-3B-Instruct GGUF → Q4_K_M
Download into the Linux filesystem:

```bash
cd ~/models
wget -O qwen2.5-3b-instruct-q4_k_m.gguf \
  "https://huggingface.co/Qwen/Qwen2.5-3B-Instruct-GGUF/resolve/main/qwen2.5-3b-instruct-q4_k_m.gguf"
```

Confirm the size:

```bash
ls -lh ~/models/qwen2.5-3b-instruct-q4_k_m.gguf
```

If the file is already in your Windows Downloads:

```bash
cp /mnt/c/Users/<WIN_USERNAME>/Downloads/qwen2.5-3b-instruct-q4_k_m.gguf ~/models/
```

To print your Windows username from WSL:

```bash
cmd.exe /c echo %USERNAME%
```

From this repo root:
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
chmod +x harness/*.py
```

(Optional) capture machine context:

```bash
./harness/system_info.py
cat results/system_info.json | head
```

Sanity-check `llama-bench` directly:

```bash
~/work/llama.cpp/build/bin/llama-bench \
  -m ~/models/qwen2.5-3b-instruct-q4_k_m.gguf \
  -t 8 \
  -p 256 \
  -n 256
```

You will see two throughput rows at the end:

- `pp256` (prefill t/s)
- `tg256` (decode t/s)
Run the same benchmark through the harness:

```bash
./harness/run_llama_bench.py \
  --model ~/models/qwen2.5-3b-instruct-q4_k_m.gguf \
  -t 8 -p 256 -n 256 \
  --mode warm \
  --out results/bench_p256_n256.json
```

Inspect:
```bash
python3 - <<'PY'
import json
d = json.load(open("results/bench_p256_n256.json"))
print("pp_tps:", d.get("pp_tps"))
print("tg_tps:", d.get("tg_tps"))
print("rows:", [(r.get("test"), r.get("tps")) for r in d.get("rows", [])])
PY
```

Example grid: p ∈ {64,128,256} and n ∈ {64,128,256}:
```bash
./harness/sweep_llama_bench.py \
  --model ~/models/qwen2.5-3b-instruct-q4_k_m.gguf \
  -t 8 \
  --mode warm \
  --p_list 64,128,256 \
  --n_list 64,128,256 \
  --repeats 1 \
  --out_dir results/sweep \
  --csv_out results/bench_sweep.csv
```

Inspect:
```bash
head -n 5 results/bench_sweep.csv
tail -n 5 results/bench_sweep.csv
ls results/sweep | head
```

Outputs:

- `results/bench_sweep.csv` (summary)
- `results/sweep/*.json` (one JSON per grid point)
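To eyeball the surface, a minimal sketch, assuming the summary CSV has columns named `p`, `n`, `pp_tps`, and `tg_tps` (check the actual header; the repo's column names may differ):

```python
#!/usr/bin/env python3
"""Sketch: pivot the sweep CSV into p-by-n throughput tables.
Column names (p, n, pp_tps, tg_tps) are assumptions; adjust to the header."""
import pandas as pd

df = pd.read_csv("results/bench_sweep.csv")
for metric in ("pp_tps", "tg_tps"):
    surface = df.pivot_table(index="p", columns="n", values=metric)
    print(f"\n{metric} (tokens/sec):")
    print(surface.round(1))
```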
Why: cold runs approximate “first-touch / not-cached” behavior and are useful for tier-sensitivity analysis.

- From PowerShell (Windows):

  ```powershell
  wsl --shutdown
  ```

- Relaunch Ubuntu:

  ```powershell
  wsl -d Ubuntu-24.04
  ```

- Run the same sweep labeled cold:

  ```bash
  cd ~/work/hbf-ready-bench
  source .venv/bin/activate
  ./harness/sweep_llama_bench.py \
    --model ~/models/qwen2.5-3b-instruct-q4_k_m.gguf \
    -t 8 \
    --mode cold \
    --p_list 64,128,256 \
    --n_list 64,128,256 \
    --repeats 3 \
    --out_dir results/sweep_cold \
    --csv_out results/bench_sweep_cold.csv
  ```

Now you can compare warm vs cold surfaces and quantify the percentage drop in `pp` vs `tg`.
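To quantify those drops, a minimal sketch, again assuming `p`, `n`, `pp_tps`, `tg_tps` columns and one row per grid point (aggregate repeats first if the CSV stores them individually):

```python
#!/usr/bin/env python3
"""Sketch: percentage drop from warm to cold per (p, n) grid point.
Column names (p, n, pp_tps, tg_tps) are assumptions about the CSVs."""
import pandas as pd

warm = pd.read_csv("results/bench_sweep.csv")
cold = pd.read_csv("results/bench_sweep_cold.csv")
m = warm.merge(cold, on=["p", "n"], suffixes=("_warm", "_cold"))
for metric in ("pp_tps", "tg_tps"):
    m[f"{metric}_drop_pct"] = 100 * (1 - m[f"{metric}_cold"] / m[f"{metric}_warm"])
print(m[["p", "n", "pp_tps_drop_pct", "tg_tps_drop_pct"]].round(1))
```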
HBF is positioned as a memory tier between HBM and SSD. Standards discussions need workload-driven targets:
- How much throughput (and stability/QoS) is required so decode (tg) doesn’t collapse?
- How much bandwidth is needed so prefill (pp) stays within X% of baseline?
This repo provides:
- A reproducible baseline surface (`pp` and `tg` over p/n)
- A harness that can be re-run under “tier constraints” (bandwidth/latency/QoS emulation) to produce degradation curves
- CSV/JSON artifacts that can be used in OCP-style discussions or future compliance tests
V1 goal: baseline + sweep + warm/cold surfaces
V2 goal: add explicit “tier constraint” emulation and plot/report deltas
To make HBF-tier constraints concrete without real HBF hardware, V2 adds tools that emulate a constrained tier and generate workload-driven requirement curves.
This models a “weights live in a tier” workflow:
- Stage (copy) the model through a bandwidth/latency-constrained path (`emulation/tier_copy.py`)
- Run `llama-bench` on the staged model
- Record:
  - staging time / effective MB/s
  - `pp` (prefill) and `tg` (decode) tokens/sec
```bash
mkdir -p ~/models_staged
./harness/sweep_hbf_weight_tier.py \
  --model ~/models/qwen2.5-3b-instruct-q4_k_m.gguf \
  --staged_dir ~/models_staged \
  --mbps_list 250,500,1000,2000,4000 \
  --lat_ms_list 0,0.05,0.2 \
  -t 8 -p 256 -n 256 \
  --mode hbf-emu \
  --out_dir results/hbf_weight_tier \
  --csv_out results/hbf_weight_tier.csv
```

Outputs:

- `results/hbf_weight_tier.csv` (summary)
- `results/hbf_weight_tier/*.json` (per-point artifacts)
- staged models under `~/models_staged/`
Interpretation: if staging dominates end-to-end latency, the tier BW/lat targets need to be higher, or the system must prefetch/hide staging.
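As a rough sanity check, staging time can be estimated from model size, bandwidth cap, and per-chunk latency. Both the ~2 GB model size and the 4 MB chunk size (mirroring the copy sketch above) are assumptions, and treating latency as purely additive makes this an upper-bound-style estimate:

```python
# Rough staging-time estimate for a ~2 GB Q4_K_M model (size is an assumption).
model_mb = 2000
chunk_mb = 4  # assumed to match tier_copy's chunk size

for mbps, lat_ms in [(250, 0.2), (1000, 0.2), (4000, 0.2)]:
    n_chunks = model_mb / chunk_mb
    t = model_mb / mbps + n_chunks * lat_ms / 1000  # transfer + per-chunk latency
    print(f"{mbps:>5} MB/s, {lat_ms} ms/chunk -> ~{t:.1f} s staging")
```

At 250 MB/s, staging alone takes roughly 8 seconds, which can easily dominate a short benchmark run — the “prefetch or hide staging” point above.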
Decode throughput is often the first to collapse when a tier has poor latency/jitter. This simulator turns “KV spill to tier” into a quantitative ceiling:
```bash
python3 emulation/kv_spill_sim.py \
  --kv_kb_per_token 256 \
  --tier_mbps_list 500,1000,2000,4000,8000 \
  --op_lat_ms 0.05 \
  --ops_per_token 2
```

This prints `tier_mbps` and `tg_tps_max`, a conservative “decode cap” imposed by the bandwidth and latency assumptions.
See `docs/hbf_emulation_v2.md` for more details.
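The ceiling presumably follows a closed form: per-token time is the KV bytes moved divided by tier bandwidth, plus the per-token operation latencies. A minimal sketch under those assumptions (the repo's exact model may differ):

```python
# Sketch: decode ceiling if every generated token moves kv_kb_per_token
# across the tier and issues ops_per_token operations at op_lat_ms each.
def tg_tps_max(kv_kb_per_token, tier_mbps, op_lat_ms, ops_per_token):
    transfer_s = kv_kb_per_token * 1024 / (tier_mbps * 1e6)  # bytes / (bytes/sec)
    latency_s = ops_per_token * op_lat_ms / 1000
    return 1.0 / (transfer_s + latency_s)

for mbps in (500, 1000, 2000, 4000, 8000):
    print(mbps, round(tg_tps_max(256, mbps, 0.05, 2), 1))
```

Note how at high bandwidth the per-op latency term dominates the ceiling, which is why latency/QoS, not just bandwidth, shows up in the requirement curves.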
Large p/n tests on CPU can take a while. Start with `-p 64 -n 64` as a smoke test.
If the build fails or gets OOM-killed:

- Use fewer parallel jobs: `cmake --build build -j 1`
- Ensure WSL has enough memory/swap.
WSL may auto-limit memory. Increase it via `C:\Users\<YOU>\.wslconfig`:

```ini
[wsl2]
memory=12GB
swap=8GB
processors=8
```

Then:

```powershell
wsl --shutdown
```

Relaunch and confirm:

```bash
free -h
```

If you plan to share this repo widely, pick a license (MIT/Apache-2.0) and add a LICENSE file.