profile

GPU profiling scripts for DeepSeek-R1 distilled models running on vLLM.

Models

DeepSeek-R1-Distill-Qwen-14B
DeepSeek-R1-Distill-Llama-8B
DeepSeek-R1-Distill-Qwen-1.5B

Structure

bench/          Latency benchmarking
  benchmark.py      avg / p95 / p99 / tokens-per-sec
  perfbench.sh      quick single-model benchmark

nsys/           Nsight Systems timeline profiling
  nsys_prefill.py   prefill inference with NVTX markers
  nsys_decode.py    single inference for clean timeline
  prefill_nsys.sh   sweep across input sizes and all models
  decode_nsys.sh    decode phase timeline verification

ncu/            Nsight Compute kernel-level profiling
  ncu_decode.py     NCU driver with GEMM kernel focus
  profile_runner.py NVTX-instrumented inference runner
  runcu_decode.sh   n-sequences sweep (decode)
  runcu_prefill.sh  targeted kernel filtering (prefill)
  kernel_count_test.py  probe kernel counts at different phases

utils/          Shared tooling
  extract_kernels_from_nsys.py  extract + categorize kernels from NSYS reports
  generate_ncu_filters.py       build NCU kernel filters from extracted data

data/           Kernel data
  all_kernels_sorted.json       extracted kernel lists by model + input size
  kernels.json                  kernel metadata
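
The latency figures listed for bench/benchmark.py (avg / p95 / p99 / tokens-per-sec) could be derived from raw per-request timings roughly as follows. This is a minimal sketch, not the script's actual code: the function names, the nearest-rank percentile method, and the assumption that requests run one at a time (so throughput is total tokens over summed latency) are all illustrative.

```python
import math
import statistics


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    # Integer product first, then divide, to avoid float rounding surprises.
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(latencies_s, tokens_generated):
    """Summarize per-request latencies (seconds) and total generated tokens.

    Assumes requests were issued sequentially (hypothetical setup), so the
    total wall time equals the sum of the individual latencies.
    """
    ordered = sorted(latencies_s)
    return {
        "avg_s": statistics.fmean(ordered),
        "p95_s": percentile(ordered, 95),
        "p99_s": percentile(ordered, 99),
        "tokens_per_sec": tokens_generated / sum(ordered),
    }


if __name__ == "__main__":
    # Synthetic timings for illustration only, not measured data.
    print(summarize([0.8, 0.9, 1.0, 1.1, 2.5], tokens_generated=640))
```

Nearest-rank is used here because it always returns an observed latency; with only a handful of runs, p95 and p99 both collapse to the slowest sample, so a larger run count is needed before the tail numbers mean much.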

Workflow

1. Run NSYS sweeps to get system-level timelines
2. Extract kernel lists from NSYS reports into JSON (utils/extract_kernels_from_nsys.py)
3. Use extracted kernels to build targeted NCU filters (utils/generate_ncu_filters.py)
4. Run NCU profiling with those filters for kernel-level metrics
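
As an illustration of the step that turns extracted kernel names into an NCU filter, the sketch below builds a single regular expression and the matching command-line arguments. It relies on the `regex:` form of Nsight Compute's `--kernel-name` option; everything else (the function names, the exact-match anchoring, the example kernel names) is hypothetical and is not taken from utils/generate_ncu_filters.py. Escaping uses Python's `re.escape`, which is assumed to be compatible with NCU's regex engine for ordinary kernel names.

```python
import re


def build_kernel_regex(kernel_names):
    """Return a regex that matches any one of the given kernel names exactly."""
    unique = sorted(set(kernel_names))  # de-duplicate and keep output stable
    return "^(" + "|".join(re.escape(name) for name in unique) + ")$"


def ncu_kernel_args(kernel_names):
    """Return argv fragments that restrict ncu to the given kernels."""
    return ["--kernel-name", "regex:" + build_kernel_regex(kernel_names)]


if __name__ == "__main__":
    # Hypothetical kernel names, for illustration only.
    names = ["gemm_kernel", "softmax", "gemm_kernel"]
    print(" ".join(ncu_kernel_args(names)))
```

Passing the arguments as a list (for example to `subprocess.run`) avoids shell quoting problems when kernel names contain characters such as `<`, `>` or `|`.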

All scripts assume you run from the repo root:

bash nsys/prefill_nsys.sh
bash ncu/runcu_prefill.sh
bash bench/perfbench.sh

Not yet covered

Batch sizes > 1 (throughput scaling)
Decode profiling across varying input/output lengths
bfloat16 / quantized dtype comparisons
Multi-GPU / tensor parallel > 1
Power and thermal metrics
GPU memory high-water-mark aggregation
Architecturally different model families

Requirements

NVIDIA Nsight Systems (nsys)
NVIDIA Nsight Compute (ncu)
vLLM
PyTorch with CUDA
nvtx Python package
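
The nvtx package is what lets the runners label phases so they show up as named ranges in the Nsight timelines. A minimal sketch of that pattern follows, assuming the `nvtx.annotate` context manager from the nvtx Python package; the range names and the `run_phases` helper are hypothetical, and the fallback lets the code run unchanged on a machine without nvtx installed.

```python
import contextlib

try:
    import nvtx  # NVTX Python bindings listed under Requirements
except ImportError:  # allow the sketch to run without the package
    nvtx = None


def nvtx_range(name):
    """Return a context manager that opens an NVTX range, or a no-op."""
    if nvtx is None:
        return contextlib.nullcontext()
    return nvtx.annotate(name)


def run_phases(prefill_fn, decode_fn):
    """Run two callables under separately named NVTX ranges."""
    with nvtx_range("prefill"):
        prefill_out = prefill_fn()
    with nvtx_range("decode"):
        return decode_fn(prefill_out)


if __name__ == "__main__":
    # Stand-in callables; a real runner would invoke the inference engine here.
    print(run_phases(lambda: [1, 2, 3], lambda tokens: len(tokens)))
```

Named ranges like these can also be used to limit what the profilers capture, which helps keep reports focused on one phase.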
