Reproducible LLM inference benchmarks (prefill vs decode throughput) to inform requirements for an intermediate memory tier (HBF-class) between HBM and SSD.
This repo uses llama.cpp's `llama-bench` because it prints stable, machine-parseable throughput:
- `ppXXX` = prefill / prompt-processing tokens/sec
- `tgXXX` = decode / token-generation tokens/sec

You can run sweeps over prompt length (p) and generation length (n) to build a performance surface, and later re-run under “tier constraints” (bandwidth/latency/QoS emulation).
- Windows 11 + WSL2 Ubuntu
- llama.cpp built from source
- Model: Qwen2.5-3B-Instruct GGUF (Q4_K_M), a good fit for 16 GB RAM
- Benchmark: `llama-bench` with `pp` and `tg` throughput

Tip: keep model files in the Linux filesystem (`~/models/`), not under `/mnt/c/...`, for consistent performance.
Typical files:
- `harness/run_llama_bench.py` — run one benchmark and write JSON (see the sketch below)
- `harness/sweep_llama_bench.py` — run a grid sweep (p×n) and write CSV + JSONs
- `harness/system_info.py` — capture system/WSL context into `results/system_info.json`
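For orientation, here is a minimal sketch of what a wrapper like `run_llama_bench.py` can do, assuming `llama-bench` is invoked with its JSON output mode (`-o json`). The field names used below (`n_prompt`, `n_gen`, `avg_ts`) are assumptions about that output; the repo's actual script may differ.

```python
#!/usr/bin/env python3
"""Sketch: run llama-bench once and save pp/tg throughput as JSON.

Assumes llama-bench's `-o json` output mode; the field names used below
(n_prompt, n_gen, avg_ts) are assumptions about that output.
"""
import argparse
import json
import os
import subprocess

ap = argparse.ArgumentParser()
ap.add_argument("--model", required=True)
ap.add_argument("--bench", default="~/work/llama.cpp/build/bin/llama-bench")
ap.add_argument("-t", type=int, default=8)
ap.add_argument("-p", type=int, default=256)
ap.add_argument("-n", type=int, default=256)
ap.add_argument("--out", required=True)
a = ap.parse_args()

cmd = [os.path.expanduser(a.bench), "-m", os.path.expanduser(a.model),
       "-t", str(a.t), "-p", str(a.p), "-n", str(a.n), "-o", "json"]
rows = json.loads(subprocess.check_output(cmd, text=True))

# One result object per test; prefill rows generate no tokens (n_gen == 0)
# and decode rows consume no prompt (n_prompt == 0) -- assumed convention.
pp = [r["avg_ts"] for r in rows if r.get("n_gen") == 0]
tg = [r["avg_ts"] for r in rows if r.get("n_prompt") == 0]

with open(a.out, "w") as f:
    json.dump({"pp_tps": pp[0] if pp else None,
               "tg_tps": tg[0] if tg else None,
               "rows": rows}, f, indent=2)
```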
V2 (HBF emulation additions):
- `emulation/tier_copy.py` — throttled “tier” copy with a bandwidth cap plus per-chunk latency (see the sketch after this list)
- `harness/sweep_hbf_weight_tier.py` — stage the model via `tier_copy.py` under different constraints, then run `llama-bench`
- `emulation/kv_spill_sim.py` — compute the decode ceiling vs tier BW/latency for an assumed KV spill volume
- `docs/hbf_emulation_v2.md` — interpretation and examples
Note: V2 emulates HBF-like tier constraints without real HBF hardware. It produces requirement curves.
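To make the emulation concrete, here is a minimal sketch of a bandwidth-capped, per-chunk-latency copy in the spirit of `tier_copy.py`. The chunk size and CLI option names are illustrative; the repo's actual script may differ.

```python
#!/usr/bin/env python3
"""Sketch: copy a file through an emulated tier with a bandwidth cap and a
fixed per-chunk latency. Chunk size and CLI names are illustrative."""
import argparse
import time

def tier_copy(src, dst, mbps, lat_ms, chunk_mb=4):
    chunk = chunk_mb * 1024 * 1024
    t0 = time.monotonic()
    copied = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            buf = fin.read(chunk)
            if not buf:
                break
            fout.write(buf)
            copied += len(buf)
            # Per-chunk latency: a fixed cost per "request" to the tier.
            if lat_ms:
                time.sleep(lat_ms / 1000.0)
            # Bandwidth cap: never run ahead of the schedule copied/mbps.
            ahead = copied / (mbps * 1e6) - (time.monotonic() - t0)
            if ahead > 0:
                time.sleep(ahead)
    return copied / 1e6 / (time.monotonic() - t0)  # effective MB/s

if __name__ == "__main__":
    ap = argparse.ArgumentParser()
    ap.add_argument("src")
    ap.add_argument("dst")
    ap.add_argument("--mbps", type=float, required=True)
    ap.add_argument("--lat_ms", type=float, default=0.0)
    a = ap.parse_args()
    print(f"effective MB/s: {tier_copy(a.src, a.dst, a.mbps, a.lat_ms):.1f}")
```

Once `lat_ms` exceeds a chunk's transfer time at the cap, latency rather than bandwidth sets the effective rate — which is exactly the regime the V2 sweeps probe.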
In PowerShell (Admin):
```powershell
wsl --install -d Ubuntu-24.04
```

Reboot if prompted.
Launch Ubuntu:
```powershell
wsl -d Ubuntu-24.04
```

Then install build dependencies:

```bash
sudo apt update && sudo apt upgrade -y
sudo apt install -y git build-essential cmake python3 python3-venv python3-pip wget unzip
```
```bash
mkdir -p ~/work ~/models
cd ~/work
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build -j 2
```

Verify:

```bash
ls -la build/bin | grep llama-bench
```

If the build runs out of memory, rebuild with fewer parallel jobs:

```bash
cmake --build build -j 1
```

For a 16GB machine, start with:
Qwen2.5-3B-Instruct GGUF → Q4_K_M
Download into the Linux filesystem:

```bash
cd ~/models
wget -O qwen2.5-3b-instruct-q4_k_m.gguf \
  "https://huggingface.co/Qwen/Qwen2.5-3B-Instruct-GGUF/resolve/main/qwen2.5-3b-instruct-q4_k_m.gguf"
```

Confirm the size:

```bash
ls -lh ~/models/qwen2.5-3b-instruct-q4_k_m.gguf
```

If the file is already in your Windows Downloads:

```bash
cp /mnt/c/Users/<WIN_USERNAME>/Downloads/qwen2.5-3b-instruct-q4_k_m.gguf ~/models/
```

To print your Windows username from WSL:

```bash
cmd.exe /c echo %USERNAME%
```

From this repo root:
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
chmod +x harness/*.py
```

(Optional) capture machine context:

```bash
./harness/system_info.py
cat results/system_info.json | head
```

Sanity-check `llama-bench` directly:

```bash
~/work/llama.cpp/build/bin/llama-bench \
  -m ~/models/qwen2.5-3b-instruct-q4_k_m.gguf \
  -t 8 \
  -p 256 \
  -n 256
```

You will see two throughput rows at the end:

- `pp256` (prefill t/s)
- `tg256` (decode t/s)
Run the same benchmark through the harness:

```bash
./harness/run_llama_bench.py \
  --model ~/models/qwen2.5-3b-instruct-q4_k_m.gguf \
  -t 8 -p 256 -n 256 \
  --mode warm \
  --out results/bench_p256_n256.json
```

Inspect:
```bash
python3 - <<'PY'
import json
d = json.load(open("results/bench_p256_n256.json"))
print("pp_tps:", d.get("pp_tps"))
print("tg_tps:", d.get("tg_tps"))
print("rows:", [(r.get("test"), r.get("tps")) for r in d.get("rows", [])])
PY
```

Example grid: p ∈ {64,128,256} and n ∈ {64,128,256}:
```bash
./harness/sweep_llama_bench.py \
  --model ~/models/qwen2.5-3b-instruct-q4_k_m.gguf \
  -t 8 \
  --mode warm \
  --p_list 64,128,256 \
  --n_list 64,128,256 \
  --repeats 1 \
  --out_dir results/sweep \
  --csv_out results/bench_sweep.csv
```

Inspect:
```bash
head -n 5 results/bench_sweep.csv
tail -n 5 results/bench_sweep.csv
ls results/sweep | head
```

Outputs:

- `results/bench_sweep.csv` (summary)
- `results/sweep/*.json` (one JSON per grid point)
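To eyeball the surface, a minimal sketch, assuming the summary CSV has columns named `p`, `n`, `pp_tps`, and `tg_tps` (check the actual header; the repo's column names may differ):

```python
#!/usr/bin/env python3
"""Sketch: pivot the sweep CSV into p-by-n throughput tables.
Column names (p, n, pp_tps, tg_tps) are assumptions; adjust to the header."""
import pandas as pd

df = pd.read_csv("results/bench_sweep.csv")
for metric in ("pp_tps", "tg_tps"):
    surface = df.pivot_table(index="p", columns="n", values=metric)
    print(f"\n{metric} (tokens/sec):")
    print(surface.round(1))
```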
Why: cold runs approximate “first-touch / not-cached” behavior and are useful for tier-sensitivity analysis.

- From PowerShell (Windows):

  ```powershell
  wsl --shutdown
  ```

- Relaunch Ubuntu:

  ```powershell
  wsl -d Ubuntu-24.04
  ```

- Run the same sweep labeled cold:

  ```bash
  cd ~/work/hbf-ready-bench
  source .venv/bin/activate
  ./harness/sweep_llama_bench.py \
    --model ~/models/qwen2.5-3b-instruct-q4_k_m.gguf \
    -t 8 \
    --mode cold \
    --p_list 64,128,256 \
    --n_list 64,128,256 \
    --repeats 3 \
    --out_dir results/sweep_cold \
    --csv_out results/bench_sweep_cold.csv
  ```

Now you can compare warm vs cold surfaces and quantify the percentage drop in `pp` vs `tg`.
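To quantify those drops, a minimal sketch, again assuming `p`, `n`, `pp_tps`, `tg_tps` columns and one row per grid point (aggregate repeats first if the CSV stores them individually):

```python
#!/usr/bin/env python3
"""Sketch: percentage drop from warm to cold per (p, n) grid point.
Column names (p, n, pp_tps, tg_tps) are assumptions about the CSVs."""
import pandas as pd

warm = pd.read_csv("results/bench_sweep.csv")
cold = pd.read_csv("results/bench_sweep_cold.csv")
m = warm.merge(cold, on=["p", "n"], suffixes=("_warm", "_cold"))
for metric in ("pp_tps", "tg_tps"):
    m[f"{metric}_drop_pct"] = 100 * (1 - m[f"{metric}_cold"] / m[f"{metric}_warm"])
print(m[["p", "n", "pp_tps_drop_pct", "tg_tps_drop_pct"]].round(1))
```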
HBF is positioned as a memory tier between HBM and SSD. Standards discussions need workload-driven targets:
- How much throughput (and stability/QoS) is required so decode (tg) doesn’t collapse?
- How much bandwidth is needed so prefill (pp) stays within X% of baseline?
This repo provides:
- A reproducible baseline surface (`pp` and `tg` over p/n)
- A harness that can be re-run under “tier constraints” (bandwidth/latency/QoS emulation) to produce degradation curves
- CSV/JSON artifacts that can be used in OCP-style discussions or future compliance tests
V1 goal: baseline + sweep + warm/cold surfaces
V2 goal: add explicit “tier constraint” emulation and plot/report deltas
To make HBF-tier constraints concrete without real HBF hardware, V2 adds tools that emulate a constrained tier and generate workload-driven requirement curves.
This models a “weights live in a tier” workflow:
- Stage (copy) the model through a bandwidth/latency-constrained path (`emulation/tier_copy.py`)
- Run `llama-bench` on the staged model
- Record:
  - staging time / effective MB/s
  - `pp` (prefill) and `tg` (decode) tokens/sec
```bash
mkdir -p ~/models_staged
./harness/sweep_hbf_weight_tier.py \
  --model ~/models/qwen2.5-3b-instruct-q4_k_m.gguf \
  --staged_dir ~/models_staged \
  --mbps_list 250,500,1000,2000,4000 \
  --lat_ms_list 0,0.05,0.2 \
  -t 8 -p 256 -n 256 \
  --mode hbf-emu \
  --out_dir results/hbf_weight_tier \
  --csv_out results/hbf_weight_tier.csv
```

Outputs:

- `results/hbf_weight_tier.csv` (summary)
- `results/hbf_weight_tier/*.json` (per-point artifacts)
- staged models under `~/models_staged/`
Interpretation: if staging dominates end-to-end latency, the tier BW/lat targets need to be higher, or the system must prefetch/hide staging.
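As a rough sanity check, staging time can be estimated from model size, bandwidth cap, and per-chunk latency. Both the ~2 GB model size and the 4 MB chunk size (mirroring the copy sketch above) are assumptions, and treating latency as purely additive makes this an upper-bound-style estimate:

```python
# Rough staging-time estimate for a ~2 GB Q4_K_M model (size is an assumption).
model_mb = 2000
chunk_mb = 4  # assumed to match tier_copy's chunk size

for mbps, lat_ms in [(250, 0.2), (1000, 0.2), (4000, 0.2)]:
    n_chunks = model_mb / chunk_mb
    t = model_mb / mbps + n_chunks * lat_ms / 1000  # transfer + per-chunk latency
    print(f"{mbps:>5} MB/s, {lat_ms} ms/chunk -> ~{t:.1f} s staging")
```

At 250 MB/s, staging alone takes roughly 8 seconds, which can easily dominate a short benchmark run — the “prefetch or hide staging” point above.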
Decode throughput is often the first to collapse when a tier has poor latency/jitter. This simulator turns “KV spill to tier” into a quantitative ceiling:
```bash
python3 emulation/kv_spill_sim.py \
  --kv_kb_per_token 256 \
  --tier_mbps_list 500,1000,2000,4000,8000 \
  --op_lat_ms 0.05 \
  --ops_per_token 2
```

This prints `tier_mbps` and `tg_tps_max`, a conservative “decode cap” imposed by the bandwidth and latency assumptions.
See `docs/hbf_emulation_v2.md` for more details.
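The ceiling presumably follows a closed form: per-token time is the KV bytes moved divided by tier bandwidth, plus the per-token operation latencies. A minimal sketch under those assumptions (the repo's exact model may differ):

```python
# Sketch: decode ceiling if every generated token moves kv_kb_per_token
# across the tier and issues ops_per_token operations at op_lat_ms each.
def tg_tps_max(kv_kb_per_token, tier_mbps, op_lat_ms, ops_per_token):
    transfer_s = kv_kb_per_token * 1024 / (tier_mbps * 1e6)  # bytes / (bytes/sec)
    latency_s = ops_per_token * op_lat_ms / 1000
    return 1.0 / (transfer_s + latency_s)

for mbps in (500, 1000, 2000, 4000, 8000):
    print(mbps, round(tg_tps_max(256, mbps, 0.05, 2), 1))
```

Note how at high bandwidth the per-op latency term dominates the ceiling, which is why latency/QoS, not just bandwidth, shows up in the requirement curves.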
Large p/n tests on CPU can take a while. Start with `-p 64 -n 64` as a smoke test.
If the build fails or gets OOM-killed:

- Use fewer parallel jobs: `cmake --build build -j 1`
- Ensure WSL has enough memory/swap.
WSL may auto-limit memory. Increase it via `C:\Users\<YOU>\.wslconfig`:

```ini
[wsl2]
memory=12GB
swap=8GB
processors=8
```

Then:

```powershell
wsl --shutdown
```

Relaunch and confirm:

```bash
free -h
```

If you plan to share this repo widely, pick a license (MIT/Apache-2.0) and add a LICENSE file.