Phase 1: Colonel learns to profile real vLLM #3

Open
A14N77 wants to merge 1 commit into main from phase1/vllm-flavor

A14N77 commented Apr 21, 2026

Motivation

Before this change, demos/vllm/ ran HuggingFace Transformers with a comment ("the kernels are the same shape as vLLM"). Real vLLM V1 could not be profiled through Colonel because:

  1. V1 runs GPU work in an EngineCore subprocess; cudaProfilerStart from the parent never reaches the worker.
  2. vLLM's collective_rpc refuses arbitrary callables without VLLM_ALLOW_INSECURE_SERIALIZATION=1.
  3. nsys/ncu flags required for the V1 subprocess + cuda-graph decode (target-processes, profile-from-start, capture-range) were not plumbed through Colonel's evaluators.

What this adds

  • colonel/profiling/vllm/ package:
    • profile_region(llm) — context manager that calls llm.collective_rpc(cudaProfilerStart/Stop) to gate capture inside the EngineCore worker.
    • enable_nvtx() — flips ObservabilityConfig.enable_layerwise_nvtx_tracing so nsys captures per-layer NVTX ranges.
  • --flavor {generic,vllm} CLI flag. vllm sets:
    • env: VLLM_ALLOW_INSECURE_SERIALIZATION=1, VLLM_USE_V1=1, COLONEL_VLLM_ENABLE_NVTX=1.
    • nsys: --capture-range cudaProfilerApi --capture-range-end stop, --trace includes nvtx.
    • ncu: --section SpeedOfLight --profile-from-start off (the spike's proven section; see Known limitation below).
  • Evaluator metadata knobs (additive, backward-compatible):
    • ncu: ncu_set, ncu_section, ncu_replay_mode, ncu_profile_from_start.
    • nsys: nsys_trace, nsys_capture_range, nsys_enable_nvtx.
  • demos/vllm/05_real_vllm_decode.py — first demo that actually imports vllm. Uses profile_region(). Defaults to Qwen2.5-0.5B; fits on a 16 GB A4000.
  • docs/vllm_contribution_prd.md — the 963-line PRD driving this effort (two-track plan: ship Colonel's real-vLLM support, then use it to contribute fixes upstream to vllm-project).
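
To make the profile_region(llm) idea above concrete, here is a minimal sketch of how such a context manager could gate capture inside the worker. It is not the code in colonel/profiling/vllm/; the helper names `_profiler_start` and `_profiler_stop` are hypothetical, and it assumes that `llm.collective_rpc` accepts a callable taking the worker as its first argument (which, per the motivation above, requires VLLM_ALLOW_INSECURE_SERIALIZATION=1).

```python
import contextlib


def _profiler_start(worker):
    # Hypothetical helper: intended to run inside the EngineCore worker.
    # torch is imported lazily so this module loads without a GPU stack.
    import torch

    torch.cuda.synchronize()
    torch.cuda.profiler.start()  # wraps cudaProfilerStart


def _profiler_stop(worker):
    # Hypothetical helper: mirrors _profiler_start on the way out.
    import torch

    torch.cuda.synchronize()
    torch.cuda.profiler.stop()  # wraps cudaProfilerStop


@contextlib.contextmanager
def profile_region(llm):
    """Open the profiler capture range in the workers, close it on exit."""
    llm.collective_rpc(_profiler_start)
    try:
        yield llm
    finally:
        # Always close the range, even if generation raises.
        llm.collective_rpc(_profiler_stop)
```

A demo would then wrap only the calls it wants captured, for example `with profile_region(llm): llm.generate(prompts, sampling_params)`, so model load and warmup stay outside the capture range.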

Verified on Shadeform A4000 (48 SMs, Ampere, CUDA 12.8, vLLM 0.19.1):

  • colonel run --flavor vllm --evaluator nsys --no-analyze -- python demos/vllm/05_real_vllm_decode.py --num-prompts 4 --max-tokens 16 produced session 5e27fc0e with 15 kernels captured (cutlass GEMM 12.8 ms × 16, flash_fwd_* 0.75 ms × 48, vllm::reshape_and_cache 0.13 ms × 48 — the kernel list the spike predicted).
  • colonel run --flavor generic --evaluator nsys regression-clean on a plain torch matmul (session aef9f2c0).
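
For readers unfamiliar with how a flavor turns into a profiler invocation, the sketch below shows one way the env vars and flags listed under "What this adds" could be assembled into a command line. It is an illustration only: `FLAVORS` and `build_command` are hypothetical names, not Colonel's internals, and the exact `--trace` value is an assumption (the description only says the trace list includes nvtx).

```python
# Hypothetical flavor table; values are taken from the PR description,
# except the full --trace list, which is assumed.
FLAVORS = {
    "generic": {"env": {}, "nsys": [], "ncu": []},
    "vllm": {
        "env": {
            "VLLM_ALLOW_INSECURE_SERIALIZATION": "1",
            "VLLM_USE_V1": "1",
            "COLONEL_VLLM_ENABLE_NVTX": "1",
        },
        "nsys": [
            "--capture-range=cudaProfilerApi",
            "--capture-range-end=stop",
            "--trace=cuda,nvtx",  # assumed list; the source only says nvtx is included
        ],
        "ncu": ["--section", "SpeedOfLight", "--profile-from-start", "off"],
    },
}


def build_command(flavor, evaluator, target):
    """Return (extra_env, argv) for running `target` under the evaluator."""
    cfg = FLAVORS[flavor]
    if evaluator == "nsys":
        prefix = ["nsys", "profile", *cfg["nsys"]]
    elif evaluator == "ncu":
        prefix = ["ncu", *cfg["ncu"]]
    else:
        raise ValueError(f"unknown evaluator: {evaluator}")
    return dict(cfg["env"]), prefix + list(target)


if __name__ == "__main__":
    env, argv = build_command(
        "vllm", "nsys", ["python", "demos/vllm/05_real_vllm_decode.py"]
    )
    print(env)
    print(" ".join(argv))
```

Keeping the generic flavor as an empty entry in the same table is one way to make the backward-compatible path explicit: it adds no env vars and no extra flags.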

Known limitation

colonel run --flavor vllm --evaluator ncu hangs at 0% GPU util in kernel-replay mode on graph-captured decode (enforce_eager=False). Documented in demos/vllm/05_real_vllm_decode.py; workaround is passing --eager to disable CUDA graphs. Tracking as a Phase 2 item to reproduce the spike's working ncu configuration.
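
As a rough illustration of the workaround, the sketch below shows how a demo script might map an `--eager` flag onto vLLM's `enforce_eager` engine argument. This is an assumption about the demo's structure rather than a copy of demos/vllm/05_real_vllm_decode.py; `parse_args` and `engine_kwargs` are hypothetical helpers, and the defaults simply reuse the values from the verified nsys run.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="vLLM decode demo (sketch)")
    # Defaults are assumptions borrowed from the verified command line.
    parser.add_argument("--num-prompts", type=int, default=4)
    parser.add_argument("--max-tokens", type=int, default=16)
    parser.add_argument(
        "--eager",
        action="store_true",
        help="disable CUDA graphs (workaround for the ncu kernel-replay hang)",
    )
    return parser.parse_args(argv)


def engine_kwargs(args):
    """Translate CLI flags into keyword arguments for vllm.LLM(...)."""
    # enforce_eager=True makes vLLM skip CUDA graph capture.
    return {"enforce_eager": args.eager}


if __name__ == "__main__":
    args = parse_args()
    print(engine_kwargs(args))
```

The resulting dict would be passed through as `LLM(model=..., **engine_kwargs(args))`, so graph-captured decode remains the default and eager mode is opt-in for ncu runs.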

Non-goals in this commit

  • vLLM V0 support (V0 is deprecated upstream).
  • Windows support (vLLM doesn't run there).
  • Replacing demos 01–04 (deferred to Phase 2).

References

PRD: docs/vllm_contribution_prd.md
Spike reports: ~/smoke/reports/{ncu_decode.ncu-rep,vllm_decode.nsys-rep}

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>