
Llama-3.1-8B vLLM Benchmark Harness on H200

What is this repo for?

Reproducible benchmarking and tuning harness for Llama-3.1-8B on NVIDIA H200 (MIG 71GB) using vLLM. One-knob tests, isolation runs, stacked tuning, accuracy validation, MLflow logging, and summarized results.


You can view my dissertation report here.

System Info

  • GPU: NVIDIA H200 NVL, MIG enabled, driver 570.133.20, CUDA 12.8
  • CPU: Dual Intel Xeon Platinum 8480+ (224 vCPUs total)
  • RAM: 256 GB
  • Key libraries:
    • vLLM 0.17.1
    • PyTorch 2.10.0+cu129
    • Transformers 4.57.6
    • CUDA 12.9 (nvidia-cuda-runtime-cu12 12.9.79)
    • FlashInfer 0.6.4
    • bitsandbytes 0.49.2
    • accelerate 1.13.0
    • ray 2.54.0
    • triton 3.6.0
    • huggingface_hub 0.36.2
    • safetensors 0.7.0
    • tokenizers 0.22.2
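
A quick sanity check that a fresh environment matches these pins (a minimal sketch; adjust the expected versions to your install):

import torch
import transformers
import vllm

print("vLLM:", vllm.__version__)                  # pinned: 0.17.1
print("PyTorch:", torch.__version__)              # pinned: 2.10.0+cu129
print("Transformers:", transformers.__version__)  # pinned: 4.57.6
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))  # expect an H200 MIG slice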

Key Results at a Glance

Full tables and per-experiment logs live in the result subfolders.

  • Reference baseline: BF16, BS=16, stock vLLM V1 defaults (~1082-1089 tok/s depending on run date).
  • Best certified stacked config: FP8 weights + max_model_len=2668 (1544.29 tok/s, +41.8% vs baseline).
  • BS=1024 directional result: ~2303-3322 tok/s (+112 to +205%); not yet certified due to a loadgen QPS issue.
  • Best follow-up tuned-baseline result: FLASHINFER on the stacked FP8/BS=1024 baseline (3261.60 tok/s, +7.88% vs that baseline).

Top Throughput Wins (Isolation, from OG baseline)

| Change | Tok/s | Speedup | Safe to Ship? |
|---|---|---|---|
| batch_size=1024 | ~2154-3322 | +112 to +205% | PENDING (accuracy regression -- see below) |
| FP8 weight quantization | ~1328-1456 | +34% | PENDING (accuracy run failed -- re-run needed) |
| max_model_len=2668 (right-sized KV) | ~1165 | +7.6% | YES -- confirmed accuracy-neutral |
| async_scheduling=True | ~1115 | +3-5% | YES -- low risk, no accuracy concern |
| sort_by_input_length | ~1115 | +3% | PENDING (accuracy run failed -- re-run needed) |
| FLASHINFER on tuned baseline | 3261.60 | +7.88% vs exp_00 tuned baseline | PENDING (throughput-only so far) |
| prefix_caching=True | ~4476 | +57.6% | CONDITIONAL -- workload-dependent (see below) |

Critical Accuracy Findings

| Finding | Impact | Status |
|---|---|---|
| BS=1024 causes a ~13-point ROUGE-1 regression | Hard blocker for certified runs | Open -- root cause under investigation |
| max_model_len < 2668 truncates inputs | -2.24 R1 at len=1500; fails at len=1000 | Resolved -- lock to 2668 |
| FP8 accuracy run failed to produce output | Cannot certify FP8 without this | Blocker -- re-run needed |
| APC neutral on unique queries, negative on CNN/DM | +42% from disabling on low-reuse dataset | Workload-dependent |
| compilation_config=3, chunked_prefill, dtype=fp16 | All confirmed ROUGE-neutral | Safe |

Things That Hurt -- Do Not Use

| Change | Throughput Impact | Notes |
|---|---|---|
| compilation_config=O0 | -20.82% | Torch compilation is mandatory |
| scheduler_delay_factor=0.1 | -18.57% | Worst single flag in the tuned-baseline set |
| bitsandbytes quantization | -59% (stacked) | Never substitute for fp8 |
| num_scheduler_steps=2/4 | -11 to -13% | Harmful at BS=1024 |
| FLASHINFER on stock baseline | -0.08% | Neutral in isolation; only positive on the tuned BS=1024 stack |
| gpu_memory_utilization=0.98 | -9.30% | Memory pressure causes stalls |
| fp8 KV cache alongside fp8 weights | -16% vs weights-alone | Use weights-only fp8 |

Quickstart (step-by-step)

  1. Install prerequisites
  • GPU: NVIDIA driver and a CUDA stack compatible with the pinned PyTorch / vLLM versions
  • Conda:
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-py310_23.5.2-0-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm ~/miniconda3/miniconda.sh
~/miniconda3/bin/conda init
  2. Clone the upstream MLPerf Inference repo
git clone --recurse-submodules https://github.com/mlcommons/inference.git --depth 1
cd inference
  3. Set helper paths
export ROOT=$PWD
export LLAMA_FOLDER=$ROOT/language/llama3.1-8b
export LOADGEN_FOLDER=$ROOT/loadgen
export DATASET_FOLDER=$LLAMA_FOLDER/dataset
  4. Create and activate the conda env
conda create -y -n llama3.1-8b python=3.10
conda activate llama3.1-8b
conda install -y -c conda-forge libstdcxx-ng=12
  5. Install loadgen
cd $LOADGEN_FOLDER && pip install -e . && cd -
  6. Download the dataset
mkdir -p "$DATASET_FOLDER"
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \
    https://inference.mlcommons-storage.org/metadata/llama3-1-8b-cnn-eval.uri \
    -d "$DATASET_FOLDER"
  7. Download the model (choose one)
# Scripted HF download (needs a HF token with access)
huggingface-cli login
./scripts/download_model.sh meta-llama/Meta-Llama-3.1-8B-Instruct "$LLAMA_FOLDER/model"

# Or git-xet clone (needs a HF write token)
curl -sSfL https://hf.co/git-xet/install.sh | sh
git clone https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct "$LLAMA_FOLDER/model"
  8. Clone this repo and run
cd $LLAMA_FOLDER
git clone https://github.com/ssaketh-ch/llama31b-benchmark.git
cd llama31b-benchmark

export INFERENCE_DIR=$(pwd)/llama3.1-8b
export CHECKPOINT_PATH=$INFERENCE_DIR/model
export DATASET_PATH=$INFERENCE_DIR/dataset
export MLFLOW_TRACKING_URI=http://localhost:5000

pip install -r env/requirements.txt

# Tuned baseline (FP8, TP=1, BS=1024, max_len=2668)
./scripts/run_stacked.sh exp_00

# One tuned-baseline combination test
./scripts/run_stacked.sh exp_A5

# Isolation run
./scripts/run_isolation.sh exp_00
./scripts/run_isolation.sh exp_05

run_stacked.sh targets the tuned FP8/BS=1024 baseline. run_isolation.sh targets the stock BF16/BS=16 production baseline. The plain Python examples below stay close to the upstream BF16/BS=16 reference path.


MLflow Logging

  • What is logged: run params (experiment id, batch size, flags), metrics (tokens/s, samples/s, duration, validity), latency summaries, MLPerf logs, and the SUT snapshot for each run.
  • Start a local server:
export MLFLOW_BACKEND_URI=sqlite:///$(pwd)/mlflow.db
export MLFLOW_ARTIFACT_ROOT=file:$(pwd)/mlartifacts

mlflow server \
    --backend-store-uri $MLFLOW_BACKEND_URI \
    --default-artifact-root $MLFLOW_ARTIFACT_ROOT \
    --host 0.0.0.0 --port 5000

export MLFLOW_TRACKING_URI=http://localhost:5000
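
For reference, the logging described above maps onto the standard MLflow client API roughly like this (a sketch; the harness's actual helper code, experiment name, and artifact paths may differ):

import mlflow

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("llama31b-benchmark")  # experiment name assumed

with mlflow.start_run(run_name="exp_00"):
    # Run params: experiment id, batch size, flags
    mlflow.log_params({"experiment": "exp_00", "batch_size": 16, "dtype": "bfloat16"})
    # Metrics: tokens/s, samples/s, duration, validity
    mlflow.log_metric("tokens_per_second", 1082.82)
    # Artifacts: MLPerf logs and the SUT snapshot
    mlflow.log_artifact("output/mlperf_log_summary.txt")  # path assumed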

Setup (detailed)

  • Clone upstream MLPerf repo: git clone --recurse-submodules https://github.com/mlcommons/inference.git
  • Clone this repo and cd into it.
  • Set helper vars: INFERENCE_DIR, CHECKPOINT_PATH, DATASET_PATH, MLFLOW_TRACKING_URI.
  • Create env: conda create -y -n llama3.1-8b python=3.10 && conda activate llama3.1-8b && conda install -y -c conda-forge libstdcxx-ng=12
  • Install deps: pip install -r env/requirements.txt
  • Install loadgen: cd ../inference/loadgen && pip install -e . && cd -

Get the Model

Requires Hugging Face access to meta-llama/Meta-Llama-3.1-8B-Instruct.

huggingface-cli login
./scripts/download_model.sh meta-llama/Meta-Llama-3.1-8B-Instruct "$CHECKPOINT_PATH"

Alternative with git-xet (needs a HF write token):

curl -sSfL https://hf.co/git-xet/install.sh | sh
git clone https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct

Get the Dataset (CNN/DailyMail)

# Datacenter eval set (required)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \
    https://inference.mlcommons-storage.org/metadata/llama3-1-8b-cnn-eval.uri \
    -d "$DATASET_PATH"

# 5k eval subset (optional)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \
    https://inference.mlcommons-storage.org/metadata/llama3-1-8b-sample-cnn-eval-5000.uri \
    -d "$DATASET_PATH"

# Calibration set (optional)
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) \
    https://inference.mlcommons-storage.org/metadata/llama3-1-8b-cnn-dailymail-calibration.uri \
    -d "$DATASET_PATH"

Run Performance (plain Python)

python -u llama3.1-8b/main.py --scenario Offline \
    --model-path "$CHECKPOINT_PATH" \
    --batch-size 16 \
    --dtype bfloat16 \
    --user-conf llama3.1-8b/user.conf \
    --total-sample-count 13368 \
    --dataset-path "$DATASET_PATH" \
    --output-log-dir output \
    --tensor-parallel-size 1 \
    --vllm

Server Scenario (not tested in this harness)

python -u llama3.1-8b/main.py --scenario Server \
    --model-path "$CHECKPOINT_PATH" \
    --batch-size 16 \
    --dtype bfloat16 \
    --user-conf llama3.1-8b/user.conf \
    --total-sample-count 13368 \
    --dataset-path "$DATASET_PATH" \
    --output-log-dir output \
    --tensor-parallel-size 1 \
    --vllm

The upstream reference notes that the Server SUT was not tested for GPU runs; the same applies here.


Run Accuracy (plain Python)

python -u llama3.1-8b/main.py --scenario Offline \
    --model-path "$CHECKPOINT_PATH" \
    --batch-size 16 \
    --accuracy \
    --dtype bfloat16 \
    --user-conf llama3.1-8b/user.conf \
    --total-sample-count 13368 \
    --dataset-path "$DATASET_PATH" \
    --output-log-dir output \
    --tensor-parallel-size 1 \
    --vllm

# Evaluate the accuracy log
python llama3.1-8b/evaluation.py \
    --mlperf-accuracy-file output/mlperf_log_accuracy.json \
    --dataset-file "$DATASET_PATH" \
    --dtype int32
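
The accuracy metric is ROUGE; a minimal standalone ROUGE check looks like the sketch below (assuming the evaluate and rouge_score packages; the repo's evaluation.py remains the authoritative scorer):

import evaluate  # pip install evaluate rouge_score

rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=["model-generated summary text"],
    references=["CNN/DailyMail reference summary text"],
)
print(scores)  # keys: rouge1, rouge2, rougeL, rougeLsum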

Upstream Accuracy Baselines (BF16, reference GPU run)

| Metric | Reference Value | MLPerf Target (99%) |
|---|---|---|
| ROUGE-1 | 38.7792 | 38.3914 |
| ROUGE-2 | 15.9075 | 15.7484 |
| ROUGE-L | 24.4957 | 24.2507 |
| RougeLsum | 35.7930 | 35.4551 |
| gen_len | 8,167,644 | 7,350,880 (90%) |

Our measured BF16 baseline: R1=38.80, R2=15.95, RL=24.52 -- within spec.
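
The pass/fail check is just a per-metric comparison against the targets above; as a worked example:

# Targets and measured values copied from the table and baseline above.
targets  = {"ROUGE-1": 38.3914, "ROUGE-2": 15.7484, "ROUGE-L": 24.2507}
measured = {"ROUGE-1": 38.80,   "ROUGE-2": 15.95,   "ROUGE-L": 24.52}

for metric, target in targets.items():
    verdict = "PASS" if measured[metric] >= target else "FAIL"
    print(f"{metric}: {measured[metric]:.4f} vs target {target:.4f} -> {verdict}")
# All three pass, so the BF16 baseline is within spec.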


Throughput Results

Isolation Run (vLLM v0.17.1)

Source: results/throughput/isolation_one_change/

Each run changes a single parameter relative to the production baseline (BF16, BS=16, stock vLLM V1 defaults). All speedups relative to exp_00 (1082.82 tok/s).

| Experiment | Description | Tok/s | Speedup |
|---|---|---|---|
| exp_00 | PRODUCTION BASELINE: stock vLLM V1 defaults, BS=16 | 1082.82 | baseline |
| exp_00b | RIGHT-SIZED KV: max_model_len=2668 on production baseline | 1165.24 | +7.61% |
| exp_01 | -prefix_caching=False (disable V1 default APC) | 1048.73 | -3.15% |
| exp_02 | -async_output_proc OFF (disable V1 default async) | 1006.82 | -7.02% |
| exp_03 | compilation_config=O1 (below V1 default O2) | 1062.58 | -1.87% |
| exp_04 | compilation_config=O0 (no compile vs V1 default O2) | 857.43 | -20.82% |
| exp_05 | compilation_config=O3 (above V1 default O2) | 1098.13 | +1.41% |
| exp_06 | max_num_batched_tokens=2668 (neutralise chunked prefill) | 1063.47 | -1.79% |
| exp_07 | async_scheduling=True (experimental) | 1115.30 | +3.00% |
| exp_08 | max_num_batched_tokens=8192 | 1120.72 | +3.50% |
| exp_09 | max_num_batched_tokens=65536 | 1089.07 | +0.58% |
| exp_10 | quantization=fp8 (weights only) | 1456.36 | +34.50% |
| exp_11 | kv_cache_dtype=fp8 (KV only) | 1131.84 | +4.53% |
| exp_12 | fp8 weights + fp8 KV | 1277.73 | +18.00% |
| exp_13 | +batch_size=1024 | 2302.79 | +112.67% |
| exp_14 | +batch_size=2048 | 2379.46 | +119.75% |
| exp_15 | sort_by_input_length | 1115.30 | +3.00% |
| exp_16 | gpu_memory_utilization=0.95 | 1104.48 | +2.00% |
| exp_17 | attention_backend=FLASHINFER | 1081.93 | -0.08% |
| exp_18 | skip_tokenizer_init=True | 1083.77 | +0.09% |

  • Batch size (exp_13, exp_14) is the single biggest lever: +113% at BS=1024 and +120% at BS=2048.
  • FP8 weight quantization (exp_10) gives +34.5% at BS=16. Adding fp8 KV on top (exp_12) drops this to +18% -- fp8 KV introduces overhead that partially cancels the weight quantization benefit.
  • Right-sizing KV (exp_00b) is a free +7.6% -- always set max_model_len to the dataset maximum.
  • V1 defaults are net-positive: disabling APC costs -3.15%, disabling async output costs -7%. Do not turn these off without a specific reason.
  • compilation_config=O0 is the worst single regression at -20.8%. Torch compilation is mandatory.
  • FLASHINFER and skip_tokenizer_init had zero measurable impact.
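
As a reference point, the two headline isolation wins (FP8 weights and right-sized KV) map onto vLLM's offline API like this (a minimal sketch; the harness drives vLLM through main.py, and FP8 accuracy certification is still pending):

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    quantization="fp8",      # exp_10: +34.5% (accuracy re-run still pending)
    max_model_len=2668,      # exp_00b: +7.6%, confirmed accuracy-neutral
    tensor_parallel_size=1,
)
params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["Summarize the following article: ..."], params)
print(outputs[0].outputs[0].text)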

Progressive / Stacked Tuning

Source: results/throughput/progression_full_stack/

Stacked tuning sessions that build from the production baseline toward a best-performing configuration. Each stack step adds one change on top of all previous ones.

All speedups relative to stk_00_baseline (1089.04 tok/s). Latencies in seconds.

| Stack Step | Changes Added | Tok/s | Speedup | p50 (s) | p99 (s) | Result |
|---|---|---|---|---|---|---|
| stk_00_baseline | Production baseline: BS=16, BF16, stock vLLM V1 defaults | 1089.04 | baseline | 783.92 | 1556.55 | VALID |
| stk_01_fp8 | +fp8 weight quantization | 1462.53 | +34.30% | 583.79 | 1158.98 | VALID |
| stk_03_kv2668 | +max_model_len=2668 (right-sized KV) | 1544.29 | +41.80% | 550.27 | 1097.43 | VALID |
| stk_04_sort | +sort_by_input_length | 1507.93 | +38.46% | 472.53 | 1117.14 | VALID |
| stk_05_gmu095 | +gpu_memory_utilization=0.95 | 1504.52 | +38.15% | 474.87 | 1119.70 | VALID |
| stk_06_kvcache_fp8 | +kv_cache_dtype=fp8 | 1500.22 | +37.76% | 477.08 | 1122.68 | VALID |
| stk_07_compile3 | +compilation_config=O3 | 1494.88 | +37.27% | 480.19 | 1126.89 | VALID |
| stk_08_expandable | +PYTORCH_CUDA_ALLOC_CONF=expandable_segments | 1420.80 | +30.46% | 481.81 | 1186.76 | VALID |
| stk_09_bitsandbytes | +bitsandbytes quantization (replaces fp8) | 446.14 | -59.03% | 1764.17 | 3787.78 | VALID |
| stk_02_bs1024 | +batch_size=1024 (standalone, not stacked) | 3321.76 | +205.02% | 191.41 | 509.22 | INVALID* |

*stk_02_bs1024 is INVALID: min_duration not satisfied. Increase expected QPS so loadgen pre-generates a larger coalesced query set. Throughput is directionally correct but not certified.
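
The fix suggested above can be expressed through the loadgen Python bindings (a sketch; field names follow the mlperf_loadgen API, but the QPS value itself is illustrative):

import mlperf_loadgen as lg

settings = lg.TestSettings()
settings.scenario = lg.TestScenario.Offline
settings.mode = lg.TestMode.PerformanceOnly
# Raise expected QPS so loadgen pre-generates a large enough coalesced
# query set to satisfy min_duration at BS=1024 throughput (~3300 tok/s here).
settings.offline_expected_qps = 40.0  # illustrative; set near observed samples/s
settings.min_duration_ms = 600_000    # 10-minute minimum run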

  • FP8 weights + right-sized KV (stk_01 + stk_03) is the sweet spot at +41.8% throughput and -30% p99 latency with just two changes.
  • Sort, GPU memory utilization, and fp8 KV add marginal gains but show diminishing returns once stacked. Peak is at stk_03 (1544 tok/s); subsequent steps gradually erode it.
  • expandable_segments (stk_08) drops to 1420 tok/s in the stacked context despite being positive in isolation -- likely conflicting with fp8 memory allocation patterns.
  • bitsandbytes (stk_09) is strongly negative at -59%. Never substitute for fp8.
  • BS=1024 directionally confirms +205% but requires a corrected loadgen QPS run before certification.

Quantization, APC & Chunked Prefill Deep-Dive

Source: results/throughput/Quant_APC_chunked/

Targeted isolation runs probing fp8 vs int8 quantization, chunked prefill MNBT (max_num_batched_tokens) settings, and APC cost/benefit. All speedups relative to baseline (685.88 tok/s). Latencies in seconds.

| Experiment | Description | Tok/s | Speedup | p50 (s) | p99 (s) | Result |
|---|---|---|---|---|---|---|
| baseline | BF16, no quant, prefix ON, CP ON, MNBT=16384, BS=16 | 685.88 | baseline | 1.06 | 2.45 | VALID |
| quant_fp8_weights | fp8 W8A8 weights (float16 compute) | 1380.12 | +101.22% | 0.57 | 1.22 | VALID |
| quant_int8_weights | int8 W8A8 weights | -- | N/A | -- | -- | FAILED |
| cp_off | Chunked prefill OFF (MNBT=16384) | 587.32 | -14.37% | 1.05 | 2.45 | VALID |
| cp_on_mnbt_2668 | CP ON + MNBT=2668 (~V0 scheduling) | 659.35 | -3.87% | 1.09 | 2.56 | VALID |
| cp_on_mnbt_8192 | CP ON + MNBT=8192 (docs-recommended min) | 671.70 | -2.07% | 1.17 | 2.52 | VALID |
| prefix_off | Prefix caching OFF (APC disabled) | 975.21 | +42.18% | 0.68 | 1.72 | VALID |

  • fp8 W8A8 more than doubles throughput (+101%) and halves latency. Consistent with isolation results.
  • int8 W8A8 failed entirely (exit within 20s). Likely a missing kernel -- investigate before retrying.
  • APC is strongly negative on CNN/DailyMail (+42% from disabling it) due to low prefix reuse. Workload-dependent -- beneficial for chat/RAG with long shared system prompts.
  • CP ON + MNBT=16384 is optimal. CP OFF costs -14%; lower MNBT values cost 2-4%.
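
In vLLM engine-argument terms, the winning settings from this sweep look like the sketch below (flag spellings match vLLM's engine arguments; keep prefix caching on for high-reuse workloads):

from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_prefix_caching=False,    # +42% on low-reuse CNN/DailyMail; keep True for chat/RAG
    enable_chunked_prefill=True,    # turning CP off cost -14% here
    max_num_batched_tokens=16384,   # MNBT sweet spot in this sweep
)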

Combination Results

Source: results/throughput/Combination results/

This sweep starts from an already optimized serving baseline instead of the stock BF16/BS=16 reference. The local baseline here is:

  • FP8 weights
  • batch size 1024
  • gpu_memory_utilization=0.90
  • max_model_len=2668
  • prefix caching enabled
  • async output enabled
  • compilation level O2
  • FLASH_ATTN

That baseline (exp_00) delivers 3023.34 tok/s. The sweep then tests:

  • Phase A: memory and batching knobs
  • Phase B: runtime and backend toggles
  • Phase C: speculative decoding
  • Phase D: best-of combination templates

| Experiment | Group | Tok/s | Delta vs exp_00 | Result | Notes |
|---|---|---|---|---|---|
| exp_00 | Baseline | 3023.34 | +0.00% | INVALID | FP8+BS1024+gmu0.90+len2668+APC+asyncOut+O2+FLASH_ATTN |
| exp_A1 | Phase A | 3117.49 | +3.11% | INVALID | gpu_memory_utilization=0.95 |
| exp_A3 | Phase A | 3148.96 | +4.16% | INVALID | max_num_batched_tokens=32768 |
| exp_A7 | Phase A | 3141.87 | +3.92% | INVALID | skip_tokenizer_init=True |
| exp_B1 | Phase B | 3180.24 | +5.19% | INVALID | compilation_config=3 (O3) |
| exp_B3 | Phase B | 3195.83 | +5.71% | INVALID | serial output processing |
| exp_B4 | Phase B | 3261.60 | +7.88% | INVALID | attention_backend=FLASHINFER |
| exp_C2 | Phase C | 2766.57 | N/A | VALID | n-gram spec, 5 tokens, lookup max 4 |
| exp_C3 | Phase C | 2781.44 | N/A | VALID | n-gram spec, 5 tokens, gmu=0.82 |
| exp_D2 | Phase D | 2696.25 | N/A | VALID | best A + best C template |

  • FLASHINFER is the best non-speculative result in this sweep at 3261.60 tok/s (exp_B4, +7.88% vs the tuned exp_00 baseline).
  • Phase A knobs still have modest headroom: higher GPU memory utilization, larger batched-token budget, and skipping tokenizer init all land in the +3% to +4% range.
  • Speculative decoding underperforms the tuned baseline. The only VALID runs (exp_C2, exp_C3, exp_D2) are all slower than exp_00.
  • Phase D is exploratory rather than final: the metadata marks these as EDIT SUT FIRST, so they should be treated as templates, not locked production configs.
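
For orientation, the exp_B4 and exp_C2 knobs correspond to the following vLLM surface (a sketch; the backend env var is standard, while the speculative_config dict schema follows recent vLLM releases and should be checked against the pinned version):

import os
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"  # exp_B4: +7.88% on this stack

from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    quantization="fp8",
    max_model_len=2668,
    # exp_C2-style n-gram speculation; note it was slower than the
    # tuned baseline in this sweep.
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 5,
        "prompt_lookup_max": 4,
    },
)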

See results/throughput/Combination results/README.md for the full table and per-phase breakdown.


Accuracy Results

Reference baseline (acc_baseline, BF16, BS=16): R1=38.80 | R2=15.95 | RL=24.52

Source: results/accuracy/

| Experiment | ROUGE-1 | ROUGE-L | vs Baseline | Verdict |
|---|---|---|---|---|
| acc_baseline | 38.8001 | 24.5197 | baseline | Reference |
| acc_chunked_65536 | 38.8000 | 24.5189 | ~= baseline | SAFE |
| acc_compile_cfg_3 | 38.7993 | 24.5194 | ~= baseline | SAFE |
| acc_dtype_fp16 | 38.8381 | 24.5365 | +0.04 R1 | SAFE |
| acc_fp8_quant | N/A | N/A | -- | FAILED -- re-run needed |
| acc_prefix_cache_on | 38.8006 | 24.5196 | ~= baseline | SAFE |
| acc_sched_steps_4 | 38.8007 | 24.5227 | ~= baseline | SAFE |
| acc_sort_by_len | N/A | N/A | -- | FAILED -- re-run needed |
| acc_kv_len_1500 | 36.5577 | 23.2105 | -2.24 R1 | UNSAFE |
| acc_kv_len_2668 | 38.7976 | 24.5183 | ~= baseline | SAFE |
| acc_batch_512 | 26.2134 | 16.5851 | -12.59 R1 | UNSAFE |
| acc_batch_1024 | 25.5702 | 16.2307 | -13.23 R1 | UNSAFE |
| combo_acc_fp8_kv2668 | 38.8376 | 24.5358 | ~= baseline | SAFE |
| combo_acc_full_optimal | N/A | N/A | -- | FAILED -- re-run needed |

Open Issues & Blockers

| Issue | Priority | Action |
|---|---|---|
| BS=1024 ROUGE-1 regression (~13 pts) | CRITICAL | Investigate non-determinism at large batch sizes |
| acc_fp8_quant failed -- no output | CRITICAL | Re-run; fp8 is the primary throughput optimization |
| combo_acc_full_optimal failed | CRITICAL | Re-run; final accuracy answer for the production config |
| acc_sort_by_len failed -- no output | HIGH | Re-run; expected neutral but unconfirmed |
| int8 W8A8 kernel failure | MEDIUM | Check vLLM int8 support for Llama-3.1-8B on this GPU |
| stk_02_bs1024 INVALID (loadgen) | MEDIUM | Re-run with corrected expected QPS |

Nsight Systems GPU Profiling

GPU execution traces were collected using NVIDIA Nsight Systems for two batch size configurations on the same 13,368‑sample CNN/DailyMail Offline run:

  • batch=1024
  • batch=16

Each run was stopped after two complete batches; all figures are derived from the Nsight Systems SQLite trace files.

Key Findings

  • The model is compute‑bound during active execution (95.1% of GPU active time spent on GEMM at batch=1024, 90.9% at batch=16).
  • At batch=16, the GPU sits completely idle for ≈1,121.7 ms between every batch while the CPU executes QuerySamplesComplete() and re‑queues work, wasting ≈50% of total GPU time per batch.
  • At batch=1024, the same inter‑batch gap is only 0.516 ms (≈0.002% of batch duration), a 2,174× reduction.
  • The “Other” kernel category (small dispatch and synchronisation overhead) grows from 1.4% to 5.7% of GPU active time at batch=16, confirming that fixed‑cost operations become disproportionately expensive at small batch sizes.
  • FlashAttention's share of GPU active time drops from 1.6% (batch=1024) to 1.2% (batch=16), consistent with 16‑sequence attention matrices being too small to saturate GPU warp occupancy.

Complete Data Table

| Metric | batch=1024, Batch 1 | batch=1024, Batch 2 | batch=16, Batch 1 | batch=16, Batch 2 |
|---|---|---|---|---|
| Wall time (s) | 24.111 | 27.437 | 1.093 | 1.104 |
| GPU active time (s) | 24.002 | 27.379 | 1.080 | 0.564 |
| GPU utilisation (%) | 99.5 | 99.8 | 98.8 | 51.1 |
| Kernel count | 111,242 | 111,682 | 54,965 | 24,602 |
| Avg kernel duration (ms) | 0.216 | 0.245 | 0.020 | 0.023 |
| Inter-batch idle gap (avg, ms) | 0.516 | 0.516 | 1,121.7 | 1,121.7 |
| Inter-batch idle gap (% of batch time) | 0.002% | 0.002% | 50% | 50% |
| GEMM % of GPU active time | 95.1% | 95.1% | 90.9% | 90.9% |
| Triton/Elementwise % of GPU active time | 1.8% | 1.8% | 1.8% | 1.8% |
| FlashAttention % of GPU active time | 1.6% | 1.6% | 1.2% | 1.2% |
| Other (overhead) % of GPU active time | 1.4% | 1.4% | 5.7% | 5.7% |
| Sampling/TopK % of GPU active time | 0.2% | 0.2% | 0.4% | 0.4% |
| KV cache % of GPU active time | 0.0% | 0.0% | 0.0% | 0.0% |
| Occurrences of decode kernel (Batch 1) | 33,024 | 33,024 | 16,256 | 16,256 |
| Avg decode step interval (ms) | 0.729 | 0.729 | 0.058 | 0.058 |
| Implied tokens per request (Batch 1) | 32.2 | 32.2 | 32.0 | 32.0 |

Key Ratios Explained

  • Inter‑batch gap ratio: 1,121.7 ms (batch=16) / 0.516 ms (batch=1024) ≈ 2,174×.
  • Utilisation regime:
    • batch=1024 → compute‑bound, 99.5–99.8% utilisation.
    • batch=16 → CPU‑latency‑bound, 51.1% utilisation with 50% idle time.
  • Throughput implication: no amount of compute‑side optimisation can recover the 50% idle time at batch=16; only increasing batch size eliminates the gap.
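
Since all figures come from the SQLite trace exports, the kernel-category shares can be re-derived with a short query (a sketch; convert a report first with nsys export -t sqlite, and note the table/column names follow the Nsight Systems SQLite schema):

import sqlite3

con = sqlite3.connect("1024_nsys_report.sqlite")  # exported from the .nsys-rep

# Total GPU kernel time across the trace.
total_ns = con.execute(
    "SELECT SUM(end - start) FROM CUPTI_ACTIVITY_KIND_KERNEL"
).fetchone()[0]

# Top kernels by accumulated time, resolved to readable names.
rows = con.execute("""
    SELECT s.value AS kernel, SUM(k.end - k.start) AS ns
    FROM CUPTI_ACTIVITY_KIND_KERNEL AS k
    JOIN StringIds AS s ON s.id = k.demangledName
    GROUP BY s.value
    ORDER BY ns DESC
    LIMIT 10
""").fetchall()

for kernel, ns in rows:
    print(f"{100 * ns / total_ns:5.1f}%  {kernel[:80]}")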

How to View the Traces

  • Traces live in nsight/ as 1024_nsys_report.nsys-rep and 16_nsys_report.nsys-rep.
  • Open in NVIDIA Nsight Systems and inspect:
    • NVTX process_batch ranges vs GPU kernel rows (to see the inter‑batch idle gap).
    • GPU kernel summary panel to confirm the kernel‑category percentages above.

Scenarios Tested

  • Offline: tested -- tuned FP8 baseline and all sweeps.
  • Server: reference command shown above; not tested in this harness.
  • Accuracy: Offline scripts provided and run; Server accuracy not tested.
