edge-inference/profile

profile

GPU profiling scripts for DeepSeek-R1 distilled models running on vLLM.

Models

  • DeepSeek-R1-Distill-Qwen-14B
  • DeepSeek-R1-Distill-Llama-8B
  • DeepSeek-R1-Distill-Qwen-1.5B

Structure

bench/          Latency benchmarking
  benchmark.py      avg / p95 / p99 / tokens-per-sec
  perfbench.sh      quick single-model benchmark

nsys/           Nsight Systems timeline profiling
  nsys_prefill.py   prefill inference with NVTX markers
  nsys_decode.py    single inference for clean timeline
  prefill_nsys.sh   sweep across input sizes and all models
  decode_nsys.sh    decode phase timeline verification

ncu/            Nsight Compute kernel-level profiling
  ncu_decode.py     NCU driver with GEMM kernel focus
  profile_runner.py NVTX-instrumented inference runner
  runcu_decode.sh   n-sequences sweep (decode)
  runcu_prefill.sh  targeted kernel filtering (prefill)
  kernel_count_test.py  probe kernel counts at different phases

utils/          Shared tooling
  extract_kernels_from_nsys.py  extract + categorize kernels from NSYS reports
  generate_ncu_filters.py       build NCU kernel filters from extracted data

data/           Kernel data
  all_kernels_sorted.json       extracted kernel lists by model + input size
  kernels.json                  kernel metadata
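
Several of the scripts listed above are described as NVTX-instrumented. As a rough illustration of what that instrumentation can look like (not the repository's actual code), the sketch below wraps two placeholder callables in named NVTX ranges using `nvtx.annotate` from the `nvtx` Python package. The names `nvtx_range`, `run_phases`, `prefill_fn`, and `decode_fn` are hypothetical, and the fallback to a no-op context manager is an assumption added so the sketch also runs on a machine without `nvtx` installed.

```python
import contextlib
import time

try:
    import nvtx  # NVIDIA's nvtx Python package (listed under Requirements)
except ImportError:
    nvtx = None  # assumption: degrade to a no-op when nvtx is unavailable


def nvtx_range(name):
    """Return an NVTX range context manager, or a no-op if nvtx is missing."""
    if nvtx is None:
        return contextlib.nullcontext()
    return nvtx.annotate(name)


def run_phases(prefill_fn, decode_fn):
    """Run two placeholder callables inside named NVTX ranges.

    Returns wall-clock seconds per phase. The callables are hypothetical
    stand-ins for whatever the real scripts invoke.
    """
    timings = {}
    for name, fn in (("prefill", prefill_fn), ("decode", decode_fn)):
        start = time.perf_counter()
        with nvtx_range(name):
            fn()
        timings[name] = time.perf_counter() - start
    return timings
```

When such a script runs under Nsight Systems, the named ranges should show up on the timeline next to the kernels launched inside them, which makes it easier to tell the phases apart.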

Workflow

  1. Run NSYS sweeps to get system-level timelines
  2. Extract kernel lists from NSYS reports into JSON (utils/extract_kernels_from_nsys.py)
  3. Use extracted kernels to build targeted NCU filters (utils/generate_ncu_filters.py)
  4. Run NCU profiling with those filters for kernel-level metrics
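
Steps 2 to 4 can be sketched roughly as follows. This is a minimal illustration, not the logic of utils/generate_ncu_filters.py: it assumes the extracted data can be reduced to a flat list of kernel names, and the helper names `build_kernel_regex` and `ncu_command`, the `gemm` keyword, and the report name are all hypothetical. The only Nsight Compute options used are `--kernel-name` with its `regex:` prefix and `-o`. Note that `re.escape` follows Python's regex rules, which may not match the dialect ncu uses for unusual kernel names.

```python
import re


def build_kernel_regex(kernel_names, keywords=("gemm",)):
    """Return a regex alternation of the kernel names that contain any keyword."""
    selected = sorted({
        name for name in kernel_names
        if any(k.lower() in name.lower() for k in keywords)
    })
    if not selected:
        raise ValueError("no kernels matched the keywords")
    return "(" + "|".join(re.escape(n) for n in selected) + ")"


def ncu_command(kernel_regex, target_cmd, report="report"):
    """Assemble an ncu invocation as an argument list (nothing is executed)."""
    return ["ncu", "--kernel-name", f"regex:{kernel_regex}",
            "-o", report, *target_cmd]


if __name__ == "__main__":
    # Made-up kernel names, for illustration only.
    names = ["ampere_sgemm_128x64_tn", "softmax_kernel", "cutlass_gemm_f16"]
    rx = build_kernel_regex(names)
    # Invoking profile_runner.py without arguments is an assumption.
    print(" ".join(ncu_command(rx, ["python", "ncu/profile_runner.py"])))
```

Returning the command as a list rather than a shell string leaves quoting to `subprocess.run`, which matters because the regex contains characters such as `(` and `|`.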

All scripts assume they are run from the repo root:

bash nsys/prefill_nsys.sh
bash ncu/runcu_prefill.sh
bash bench/perfbench.sh
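
For reference, the latency figures that bench/benchmark.py is described as reporting (avg / p95 / p99 / tokens-per-sec) can be computed along these lines. This is a sketch under stated assumptions rather than the script's actual code: it uses the nearest-rank percentile definition, takes per-run latencies in seconds, and the names `percentile` and `summarize` are hypothetical.

```python
import math
import statistics


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list (pct in 0..100)."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(latencies_s, tokens_per_run):
    """Summarize per-run latencies (seconds) and generated token counts."""
    ordered = sorted(latencies_s)
    return {
        "avg_s": statistics.fmean(ordered),
        "p95_s": percentile(ordered, 95),
        "p99_s": percentile(ordered, 99),
        # Aggregate throughput: total tokens over total wall-clock time.
        "tokens_per_sec": sum(tokens_per_run) / sum(latencies_s),
    }
```

With only a handful of runs, p95 and p99 both collapse to the slowest sample, so tail percentiles are only meaningful once the run count is well above 100.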

Not yet covered

  • Batch sizes > 1 (throughput scaling)
  • Decode profiling across varying input/output lengths
  • bfloat16 / quantized dtype comparisons
  • Multi-GPU / tensor parallel > 1
  • Power and thermal metrics
  • GPU memory high-water-mark aggregation
  • Architecturally different model families

Requirements

  • NVIDIA Nsight Systems (nsys)
  • NVIDIA Nsight Compute (ncu)
  • vLLM
  • PyTorch with CUDA
  • nvtx Python package
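
A quick way to check these requirements before starting a sweep is sketched below. It is not part of the repository. It assumes the tools are expected on `PATH` under the names `nsys` and `ncu`, and that the Python packages import as `vllm`, `torch`, and `nvtx`. The function name `check_environment` is hypothetical. The check does not confirm that PyTorch was built with CUDA.

```python
import importlib.util
import shutil


def check_environment(tools=("nsys", "ncu"), modules=("vllm", "torch", "nvtx")):
    """Return the names of CLI tools and Python modules that cannot be found."""
    missing = [t for t in tools if shutil.which(t) is None]
    missing += [m for m in modules if importlib.util.find_spec(m) is None]
    return missing


if __name__ == "__main__":
    missing = check_environment()
    print("missing: " + ", ".join(missing) if missing else "all requirements found")
```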
