GPU profiling scripts for DeepSeek-R1 distilled models on vLLM.
- DeepSeek-R1-Distill-Qwen-14B
- DeepSeek-R1-Distill-Llama-8B
- DeepSeek-R1-Distill-Qwen-1.5B
bench/                          Latency benchmarking
  benchmark.py                  avg / p95 / p99 / tokens-per-sec
  perfbench.sh                  quick single-model benchmark
nsys/                           Nsight Systems timeline profiling
  nsys_prefill.py               prefill inference with NVTX markers
  nsys_decode.py                single inference for clean timeline
  prefill_nsys.sh               sweep across input sizes and all models
  decode_nsys.sh                decode phase timeline verification
ncu/                            Nsight Compute kernel-level profiling
  ncu_decode.py                 NCU driver with GEMM kernel focus
  profile_runner.py             NVTX-instrumented inference runner
  runcu_decode.sh               n-sequences sweep (decode)
  runcu_prefill.sh              targeted kernel filtering (prefill)
  kernel_count_test.py          probe kernel counts at different phases
utils/                          Shared tooling
  extract_kernels_from_nsys.py  extract + categorize kernels from NSYS reports
  generate_ncu_filters.py       build NCU kernel filters from extracted data
data/                           Kernel data
  all_kernels_sorted.json       extracted kernel lists by model + input size
  kernels.json                  kernel metadata
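
The layout above mentions NVTX-marked inference runs and avg / p95 / p99 / tokens-per-sec reporting. As a rough illustration only (not the code in `bench/benchmark.py` or `ncu/profile_runner.py`), the sketch below wraps each timed run in an NVTX range and aggregates the latencies. The helper names `nvtx_range`, `time_runs`, `percentile`, and `summarize` are hypothetical, the nearest-rank percentile and the tokens-per-sec definition (total tokens over total time) are assumptions, and `nvtx` is treated as optional so the sketch also runs without it.

```python
import math
import time
from contextlib import nullcontext

try:
    import nvtx  # optional: ranges appear on the Nsight Systems timeline
except ImportError:
    nvtx = None


def nvtx_range(name):
    """Return an NVTX range context manager, or a no-op if nvtx is missing."""
    return nvtx.annotate(name) if nvtx is not None else nullcontext()


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_s, tokens_per_run):
    """Aggregate per-run latencies (seconds) into summary metrics."""
    total = sum(latencies_s)
    return {
        "avg_s": total / len(latencies_s),
        "p95_s": percentile(latencies_s, 95),
        "p99_s": percentile(latencies_s, 99),
        # Assumed definition: all generated tokens divided by total wall time.
        "tokens_per_sec": tokens_per_run * len(latencies_s) / total,
    }


def time_runs(run_once, n_runs, label="inference"):
    """Time n_runs calls of run_once, one NVTX range per call.

    run_once must block until the work is finished, otherwise the
    measured time only covers the launch and not the GPU execution.
    """
    latencies = []
    for i in range(n_runs):
        with nvtx_range(f"{label}_{i}"):
            start = time.perf_counter()
            run_once()
            latencies.append(time.perf_counter() - start)
    return latencies
```

In practice `run_once` would be a closure around whatever inference call is being measured, for example `summarize(time_runs(run_once, 20), tokens_per_run=128)`.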
- Run NSYS sweeps to get system-level timelines
- Extract kernel lists from NSYS reports into JSON (utils/extract_kernels_from_nsys.py)
- Use extracted kernels to build targeted NCU filters (utils/generate_ncu_filters.py)
- Run NCU profiling with those filters for kernel-level metrics
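
To make the middle two steps concrete, here is a minimal sketch of turning an extracted kernel list into a filter for Nsight Compute. It is not the logic in `utils/generate_ncu_filters.py`: the JSON layout (model, then input size, then a list of kernel names), the kernel names themselves, and the substring-based `categorize` heuristic are all assumptions made for illustration. The only tool behavior relied on is that `ncu --kernel-name` accepts a `regex:<expression>` value.

```python
import json
import re


def categorize(kernel_name):
    """Rough category from substrings in the kernel name (hypothetical heuristic)."""
    lowered = kernel_name.lower()
    if "gemm" in lowered or "cutlass" in lowered:
        return "gemm"
    if "attn" in lowered or "attention" in lowered:
        return "attention"
    return "other"


def build_ncu_filter(kernel_names):
    """Build a --kernel-name value that matches any of the listed kernels exactly."""
    unique = sorted(set(kernel_names))
    pattern = "|".join(re.escape(name) for name in unique)
    return f"regex:^({pattern})$"


# Hypothetical extract: the real data/all_kernels_sorted.json may be shaped differently.
extracted = json.loads("""
{
  "example-model": {
    "128": ["gemm_kernel_a", "flash_attn_fwd", "gemm_kernel_a", "rms_norm"]
  }
}
""")

names = extracted["example-model"]["128"]
gemm_only = [n for n in names if categorize(n) == "gemm"]
gemm_filter = build_ncu_filter(gemm_only)
print(gemm_filter)
```

The resulting string would be passed as `ncu --kernel-name "<filter>" ...` so that only the kernels of interest are profiled, which keeps NCU runs short compared with profiling every kernel launch.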
All scripts assume you run from the repo root:
bash nsys/prefill_nsys.sh
bash ncu/runcu_prefill.sh
bash bench/perfbench.sh

- Batch sizes > 1 (throughput scaling)
- Decode profiling across varying input/output lengths
- bfloat16 / quantized dtype comparisons
- Multi-GPU / tensor parallel > 1
- Power and thermal metrics
- GPU memory high-water-mark aggregation
- Architecturally different model families
- NVIDIA Nsight Systems (nsys)
- NVIDIA Nsight Compute (ncu)
- vLLM
- PyTorch with CUDA
- nvtx Python package
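
Before launching a sweep it can help to confirm these requirements are visible from the current shell. The sketch below is a hypothetical helper (not part of this repo) that looks for the `nsys` and `ncu` executables on `PATH` and checks whether the Python packages can be found, without importing them. The module names `vllm`, `torch`, and `nvtx` are assumed to be the import names of the packages listed above.

```python
import importlib.util
import shutil


def check_environment(tools=("nsys", "ncu"), modules=("vllm", "torch", "nvtx")):
    """Return {name: bool} for each CLI tool on PATH and each findable module."""
    status = {}
    for tool in tools:
        status[tool] = shutil.which(tool) is not None
    for module in modules:
        # find_spec locates a top-level module without importing it.
        status[module] = importlib.util.find_spec(module) is not None
    return status


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name:6s} {'found' if ok else 'MISSING'}")
```

This only checks presence, not versions or whether a CUDA device is actually usable.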