Phase 1: Colonel learns to profile real vLLM by A14N77 · Pull Request #3 · A14N77/colonel

A14N77 · 2026-04-21T03:17:59Z

Motivation

Before this change, demos/vllm/ ran HuggingFace Transformers with a comment ("the kernels are the same shape as vLLM"). Real vLLM V1 could not be profiled through Colonel because:

1. V1 runs GPU work in an EngineCore subprocess; cudaProfilerStart from the parent never reaches the worker.
2. vLLM's collective_rpc refuses arbitrary callables without VLLM_ALLOW_INSECURE_SERIALIZATION=1.
3. nsys/ncu flags required for the V1 subprocess + cuda-graph decode (target-processes, profile-from-start, capture-range) were not plumbed through Colonel's evaluators.

What this adds

colonel/profiling/vllm/ package:
- profile_region(llm): context manager that calls llm.collective_rpc(cudaProfilerStart/Stop) to gate capture inside the EngineCore worker.
- enable_nvtx(): flips ObservabilityConfig.enable_layerwise_nvtx_tracing so nsys captures per-layer NVTX ranges.
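
A minimal sketch of how a `profile_region`-style context manager could gate capture, for readers who want the shape of the idea. This is not the code in `colonel/profiling/vllm/`. The helper names `_start_profiler` and `_stop_profiler` are hypothetical, and the sketch assumes `collective_rpc` accepts a callable that receives the worker as its first argument (which, per the motivation above, requires VLLM_ALLOW_INSECURE_SERIALIZATION=1).

```python
from contextlib import contextmanager


def _start_profiler(worker):
    # Hypothetical helper: runs inside the worker process, so torch is
    # imported there rather than in the parent.
    import torch
    torch.cuda.profiler.start()  # wraps cudaProfilerStart


def _stop_profiler(worker):
    # Hypothetical helper: drain queued GPU work before closing the range.
    import torch
    torch.cuda.synchronize()
    torch.cuda.profiler.stop()  # wraps cudaProfilerStop


@contextmanager
def profile_region(llm):
    """Open the profiler capture range in every worker; close it on exit."""
    llm.collective_rpc(_start_profiler)
    try:
        yield llm
    finally:
        # Runs even if generation raises, so the capture range is always closed.
        llm.collective_rpc(_stop_profiler)
```

Typical use would be `with profile_region(llm): llm.generate(prompts, params)`, so that only the wrapped calls fall inside the `cudaProfilerApi` capture range.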
--flavor {generic,vllm} CLI flag. vllm sets:
- env: VLLM_ALLOW_INSECURE_SERIALIZATION=1, VLLM_USE_V1=1, COLONEL_VLLM_ENABLE_NVTX=1.
- nsys: --capture-range cudaProfilerApi --capture-range-end stop, --trace includes nvtx.
- ncu: --section SpeedOfLight --profile-from-start off (the spike's proven section; see Known limitation below).
Evaluator metadata knobs (additive, backward-compatible):
- ncu: ncu_set, ncu_section, ncu_replay_mode, ncu_profile_from_start.
- nsys: nsys_trace, nsys_capture_range, nsys_enable_nvtx.
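
The flavor presets and metadata knobs above could be wired together roughly as follows. This is an illustrative sketch, not Colonel's evaluator code: the `FLAVORS` table, the `nsys_args` and `ncu_args` helpers, the string-valued metadata, and the exact trace value `cuda,nvtx` are all assumptions. Only the environment variables, knob names, and profiler flags come from the description.

```python
# Hypothetical preset table: flavor name -> environment + evaluator metadata.
FLAVORS = {
    "generic": {"env": {}, "metadata": {}},
    "vllm": {
        "env": {
            "VLLM_ALLOW_INSECURE_SERIALIZATION": "1",
            "VLLM_USE_V1": "1",
            "COLONEL_VLLM_ENABLE_NVTX": "1",
        },
        "metadata": {
            "nsys_trace": "cuda,nvtx",  # assumed value; the PR only says nvtx is included
            "nsys_capture_range": "cudaProfilerApi",
            "ncu_section": "SpeedOfLight",
            "ncu_profile_from_start": "off",
        },
    },
}

# Metadata knob -> ncu command-line flag.
_NCU_FLAGS = (
    ("ncu_set", "--set"),
    ("ncu_section", "--section"),
    ("ncu_replay_mode", "--replay-mode"),
    ("ncu_profile_from_start", "--profile-from-start"),
)


def nsys_args(metadata):
    """Build the nsys prefix; knobs that are absent add no flags."""
    args = ["nsys", "profile"]
    if "nsys_trace" in metadata:
        args.append(f"--trace={metadata['nsys_trace']}")
    if "nsys_capture_range" in metadata:
        args.append(f"--capture-range={metadata['nsys_capture_range']}")
        args.append("--capture-range-end=stop")
    return args


def ncu_args(metadata):
    """Build the ncu prefix from whichever ncu_* knobs are set."""
    args = ["ncu"]
    for key, flag in _NCU_FLAGS:
        if key in metadata:
            args += [flag, str(metadata[key])]
    return args
```

Because every knob is optional, an empty metadata dict yields the bare profiler command, which is one way to keep the `generic` flavor unchanged.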
demos/vllm/05_real_vllm_decode.py — first demo that actually imports vllm. Uses profile_region(). Defaults to Qwen2.5-0.5B; fits on a 16 GB A4000.
docs/vllm_contribution_prd.md — the 963-line PRD driving this effort (two-track plan: ship Colonel's real-vLLM support, then use it to contribute fixes upstream to vllm-project).

Verified on Shadeform A4000 (48 SMs, Ampere, CUDA 12.8, vLLM 0.19.1):

- `colonel run --flavor vllm --evaluator nsys --no-analyze -- python demos/vllm/05_real_vllm_decode.py --num-prompts 4 --max-tokens 16` produced session 5e27fc0e with 15 kernels captured, matching the kernel list the spike predicted (cutlass GEMM 12.8 ms × 16, flash_fwd_* 0.75 ms × 48, vllm::reshape_and_cache 0.13 ms × 48).
- `colonel run --flavor generic --evaluator nsys` showed no regressions on a plain torch matmul (session aef9f2c0).

Known limitation

`colonel run --flavor vllm --evaluator ncu` hangs at 0% GPU utilization in kernel-replay mode on graph-captured decode (enforce_eager=False). The hang is documented in demos/vllm/05_real_vllm_decode.py; the workaround is to pass `--eager`, which disables CUDA graphs. Reproducing the spike's working ncu configuration is tracked as a Phase 2 item.
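
For illustration, the `--eager` workaround could map onto vLLM's `enforce_eager` option along these lines. This is a sketch under assumptions, not the demo's actual argument handling: `build_parser` and `llm_kwargs` are hypothetical names, and the demo's other flags are omitted.

```python
import argparse


def build_parser():
    # Hypothetical parser; only the --eager flag from the description is shown.
    parser = argparse.ArgumentParser(description="real-vLLM decode demo (sketch)")
    parser.add_argument(
        "--eager",
        action="store_true",
        help="disable CUDA graphs so ncu kernel replay does not hang",
    )
    return parser


def llm_kwargs(args):
    """Translate CLI flags into keyword arguments for vllm.LLM(...)."""
    # enforce_eager=True makes vLLM skip CUDA graph capture for decode.
    return {"enforce_eager": args.eager}
```

The resulting dict would be passed as `LLM(model=..., **llm_kwargs(args))`. CUDA graphs stay enabled by default, so nsys runs keep profiling the graph-captured path.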

Non-goals in this commit

vLLM V0 support (V0 is deprecated upstream).
Windows support (vLLM doesn't run there).
Replacing demos 01–04 (deferred to Phase 2).

References

PRD: docs/vllm_contribution_prd.md
Spike reports: ~/smoke/reports/{ncu_decode.ncu-rep,vllm_decode.nsys-rep}

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

A14N77 wants to merge 1 commit into main from phase1/vllm-flavor
