Phase 1: Colonel learns to profile real vLLM #3
Open
A14N77 wants to merge 1 commit into
Motivation
----------
Before this change, `demos/vllm/` ran HuggingFace Transformers with a
comment ("the kernels are the same shape as vLLM"). Real vLLM V1 could
not be profiled through Colonel because:
1. V1 runs GPU work in an EngineCore subprocess; cudaProfilerStart
   from the parent never reaches the worker.
2. vLLM's collective_rpc refuses arbitrary callables without
   VLLM_ALLOW_INSECURE_SERIALIZATION=1.
3. nsys/ncu flags required for the V1 subprocess + cuda-graph decode
   (target-processes, profile-from-start, capture-range) were not
   plumbed through Colonel's evaluators.
What this adds
--------------
- `colonel/profiling/vllm/` package:
  * `profile_region(llm)`: context manager that calls
    `llm.collective_rpc(cudaProfilerStart/Stop)` to gate capture
    inside the EngineCore worker.
  * `enable_nvtx()`: flips
    ObservabilityConfig.enable_layerwise_nvtx_tracing so nsys
    captures per-layer NVTX ranges.
- `--flavor {generic,vllm}` CLI flag. `vllm` sets:
  * env: VLLM_ALLOW_INSECURE_SERIALIZATION=1, VLLM_USE_V1=1,
    COLONEL_VLLM_ENABLE_NVTX=1.
  * nsys: --capture-range cudaProfilerApi --capture-range-end stop,
    --trace includes nvtx.
  * ncu: --section SpeedOfLight --profile-from-start off (the
    spike's proven section; see Known limitation below).
- Evaluator metadata knobs (additive, backward-compatible):
  * ncu: ncu_set, ncu_section, ncu_replay_mode, ncu_profile_from_start.
  * nsys: nsys_trace, nsys_capture_range, nsys_enable_nvtx.
- `demos/vllm/05_real_vllm_decode.py`: first demo that actually
  imports vllm. Uses profile_region(). Defaults to Qwen2.5-0.5B;
  fits on a 16 GB A4000.
- `docs/vllm_contribution_prd.md`: the 963-line PRD driving this
  effort (two-track plan: ship Colonel's real-vLLM support, then use
  it to contribute fixes upstream to vllm-project).
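
As a rough illustration of the `profile_region(llm)` idea listed above, the
sketch below gates capture by sending start/stop callables to the workers
through `collective_rpc`. It is not the code in `colonel/profiling/vllm/`:
the helper names `_profiler_start` and `_profiler_stop` are hypothetical, and
it assumes that VLLM_ALLOW_INSECURE_SERIALIZATION=1 is set so vLLM accepts
callables and that vLLM passes the worker object as the first argument.

```python
from contextlib import contextmanager


def _profiler_start(worker):
    # Hypothetical helper: executes inside the worker process, so the
    # profiler API call lands in the process that owns the GPU work.
    import torch  # imported lazily; only the worker needs it

    torch.cuda.cudart().cudaProfilerStart()


def _profiler_stop(worker):
    # Hypothetical helper: drain pending kernels, then close the range.
    import torch

    torch.cuda.synchronize()
    torch.cuda.cudart().cudaProfilerStop()


@contextmanager
def profile_region(llm):
    """Open the capture range on entry and close it on exit, even on error."""
    llm.collective_rpc(_profiler_start)
    try:
        yield
    finally:
        llm.collective_rpc(_profiler_stop)
```

With a real engine this would wrap the region of interest, for example
`with profile_region(llm): llm.generate(prompts, params)`, so that nsys run
with `--capture-range cudaProfilerApi` records only that region.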
Verified on Shadeform A4000 (48 SMs, Ampere, CUDA 12.8, vLLM 0.19.1):
- `colonel run --flavor vllm --evaluator nsys --no-analyze -- python
demos/vllm/05_real_vllm_decode.py --num-prompts 4 --max-tokens 16`
produced session 5e27fc0e with 15 kernels captured (cutlass GEMM
12.8 ms × 16, flash_fwd_* 0.75 ms × 48, vllm::reshape_and_cache
0.13 ms × 48 — the kernel list the spike predicted).
- `colonel run --flavor generic --evaluator nsys` regression-clean on
a plain torch matmul (session aef9f2c0).
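
The settings that `--flavor vllm` applies can be pictured as a small mapping
from flavor to environment variables and profiler arguments. The sketch below
is an assumption about shape only: `FLAVORS`, `build_nsys_command`, the
`generic` entry, and the exact `cuda,nvtx` trace list are hypothetical and not
taken from Colonel's source. The environment variables and the capture-range
options are the ones listed in this description.

```python
import os

# Hypothetical table; Colonel's real data structures may differ.
FLAVORS = {
    "generic": {"env": {}, "nsys": ["--trace=cuda"]},
    "vllm": {
        "env": {
            "VLLM_ALLOW_INSECURE_SERIALIZATION": "1",
            "VLLM_USE_V1": "1",
            "COLONEL_VLLM_ENABLE_NVTX": "1",
        },
        "nsys": [
            "--trace=cuda,nvtx",
            # Capture only between cudaProfilerStart and cudaProfilerStop.
            "--capture-range=cudaProfilerApi",
            "--capture-range-end=stop",
        ],
    },
}


def build_nsys_command(flavor, target, output="report"):
    """Return (argv, env) for running `target` under nsys with a flavor."""
    cfg = FLAVORS[flavor]
    env = {**os.environ, **cfg["env"]}  # flavor values override the parent's
    cmd = ["nsys", "profile", *cfg["nsys"], "-o", output, *target]
    return cmd, env
```

The returned pair could be handed to `subprocess.run(cmd, env=env)`; keeping
the flavor data in one table makes the `generic` path easy to leave
unchanged.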
Known limitation
----------------
`colonel run --flavor vllm --evaluator ncu` hangs at 0% GPU util in
kernel-replay mode on graph-captured decode (enforce_eager=False).
Documented in demos/vllm/05_real_vllm_decode.py; workaround is
passing `--eager` to disable CUDA graphs. Tracking as a Phase 2 item
to reproduce the spike's working ncu configuration.
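
The `--eager` workaround amounts to constructing the engine with
`enforce_eager=True`, which disables CUDA graph capture. Below is a minimal
sketch of how a demo script might wire that flag. The argument names mirror
the command line shown in this description, but `parse_args`, `llm_kwargs`,
`main`, the defaults, and the model id `Qwen/Qwen2.5-0.5B` are assumptions
and are not copied from `05_real_vllm_decode.py`.

```python
import argparse


def parse_args(argv=None):
    p = argparse.ArgumentParser(description="vLLM decode demo (sketch)")
    p.add_argument("--model", default="Qwen/Qwen2.5-0.5B")  # assumed model id
    p.add_argument("--num-prompts", type=int, default=4)
    p.add_argument("--max-tokens", type=int, default=16)
    p.add_argument("--eager", action="store_true",
                   help="disable CUDA graphs (workaround for the ncu hang)")
    return p.parse_args(argv)


def llm_kwargs(args):
    # enforce_eager=True makes vLLM run decode without CUDA graph capture.
    return {"model": args.model, "enforce_eager": args.eager}


def main(argv=None):
    args = parse_args(argv)
    from vllm import LLM, SamplingParams  # imported lazily; needs a GPU host

    llm = LLM(**llm_kwargs(args))
    params = SamplingParams(max_tokens=args.max_tokens)
    llm.generate(["Hello"] * args.num_prompts, params)
```

Eager mode changes which kernels launch and how, so timings taken this way
are not directly comparable to the graph-captured decode path.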
Non-goals in this commit
------------------------
- vLLM V0 support (V0 is deprecated upstream).
- Windows support (vLLM doesn't run there).
- Replacing demos 01–04; deferred to Phase 2.
References
----------
PRD: docs/vllm_contribution_prd.md
Spike reports: ~/smoke/reports/{ncu_decode.ncu-rep,vllm_decode.nsys-rep}
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>