HIPCC_VERIFY: detect IGC-dropped kernels at compile time#1240

Draft
pvelesko wants to merge 7 commits into main from missing-kernels

Collaborator

@pvelesko pvelesko commented Apr 23, 2026

Summary

Adds HIPCC_VERIFY (default ON): after every successful hipcc compile, a new post-compile tool chip-kernel-verify invokes ocloc and diffs the SPIR-V OpEntryPoint set against the kernels present in the native binary. Mismatches — silent entry-point drops by IGC — fail the compile with a named list of the missing symbols.

Tracks: intel/intel-graphics-compiler#403. On dg2, IGC's SIMD32 register-pressure retry path silently drops template instantiations from the native binary while ocloc still reports success. chipStar then fails at runtime with "Failed to find kernel via kernel name", far from the root cause. This check catches the drop at build time.
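The SPIR-V side of the comparison can be illustrated with a minimal Python sketch. It is not the tool's actual implementation (which reuses tools/spirv-extractor/); it only shows how OpEntryPoint names can be read from a SPIR-V binary, assuming a little-endian module. The function name entry_point_names is hypothetical.

```python
from __future__ import annotations

import struct

SPIRV_MAGIC = 0x07230203   # first word of every SPIR-V module
OP_ENTRY_POINT = 15        # opcode of OpEntryPoint
HEADER_WORDS = 5           # magic, version, generator, bound, schema


def entry_point_names(module: bytes) -> set[str]:
    """Return the names of all OpEntryPoint instructions (hypothetical helper).

    Assumption: the module is stored little-endian.
    """
    if len(module) < HEADER_WORDS * 4 or len(module) % 4:
        raise ValueError("not a whole number of SPIR-V words")
    words = struct.unpack(f"<{len(module) // 4}I", module)
    if words[0] != SPIRV_MAGIC:
        raise ValueError("bad SPIR-V magic (or big-endian module)")

    names = set()
    i = HEADER_WORDS
    while i < len(words):
        word_count = words[i] >> 16      # high half: instruction length in words
        opcode = words[i] & 0xFFFF       # low half: opcode
        if word_count == 0 or i + word_count > len(words):
            raise ValueError("malformed instruction stream")
        if opcode == OP_ENTRY_POINT:
            if word_count < 4:
                raise ValueError("OpEntryPoint too short")
            # Operands: execution model, entry-point id, name (nul-terminated
            # literal string padded to a word boundary), then interface ids.
            raw = struct.pack(f"<{word_count - 3}I", *words[i + 3:i + word_count])
            names.add(raw.split(b"\0", 1)[0].decode("utf-8"))
        i += word_count
    return names
```

The resulting set is what gets compared against the kernel names found in the native binary.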

What's in the box

  • tools/hipcc-verify/chip-kernel-verify: extracts the SPIR-V from a hipcc output (reusing tools/spirv-extractor/), runs ocloc compile -spirv_input -device <...> once per target in HIPCC_VERIFY_DEVICES, runs ocloc disasm to enumerate the native kernels, diffs the two kernel sets, and reports any that are missing.
  • CMake option HIPCC_VERIFY (ON) + cache var HIPCC_VERIFY_DEVICES (default dg2). Baked into hipcc.bin as HIPCC_VERIFY_DEFAULT.
  • HIPCC driver hook (paired branch: CHIP-SPV/HIPCC hipcc-verify) — after a successful compile, invokes ${HIP_PATH}/bin/chip-kernel-verify <output>. Submodule pointer bump included here.
  • Test fixture + ctest: tests/hipcc-verify/ ships a real 1.9 MB SPIR-V dump (via CHIP_DUMP_PROCESSED_SPIRV) of rocThrust set_difference_by_key. Under IGC-with-#403 the reproducer ctest fails and names the 3 dropped lookback_set_op_kernel instantiations (ff/ii/jj element types). When IGC is eventually fixed the test flips to passing — that's the signal to retire the fixture.

Env overrides

Var                   Default      Effect
HIPCC_VERIFY          built-in ON  0/off = skip; warn = print only; else = fail on mismatch
HIPCC_VERIFY_DEVICES  dg2          comma-separated ocloc -device targets
HIPCC_VERIFY_OCLOC    ocloc        binary path override

If ocloc is absent, the verifier prints a one-line notice and exits 0; a build never fails just because the tool isn't installed.
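The override rules above can be sketched in Python roughly as follows. This is an illustration, not the verifier's code: the function names and the "skip"/"warn"/"fail" labels are hypothetical, and exact, case-sensitive matching of the values is an assumption.

```python
import shutil


def verify_mode(env, built_in_default="on"):
    """Map HIPCC_VERIFY to a mode label: 0/off = skip, warn = print only, else = fail.

    Assumption: values are matched exactly (case-sensitive).
    """
    raw = env.get("HIPCC_VERIFY", built_in_default)
    if raw in ("0", "off"):
        return "skip"
    if raw == "warn":
        return "warn"
    return "fail"


def verify_devices(env, default="dg2"):
    """Split the comma-separated HIPCC_VERIFY_DEVICES list, dropping empty items."""
    raw = env.get("HIPCC_VERIFY_DEVICES", default)
    return [d.strip() for d in raw.split(",") if d.strip()]


def find_ocloc(env):
    """Resolve the ocloc binary; None means 'absent', which should not fail the build."""
    return shutil.which(env.get("HIPCC_VERIFY_OCLOC", "ocloc"))
```

A caller would typically pass os.environ as env, print the one-line notice, and exit 0 when find_ocloc returns None.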

Example failure output

[chip-kernel-verify] device=dg2: 3 kernel(s) missing from native binary
  (IGC likely dropped them; see https://github.com/intel/intel-graphics-compiler/issues/403 ):
    - ...lookback_set_op_kernel<...default_set_operations_config<10000, float, float>, true, ...>
    - ...lookback_set_op_kernel<...default_set_operations_config<10000, int,   int>,   true, ...>
    - ...lookback_set_op_kernel<...default_set_operations_config<10000, uint,  uint>,  true, ...>
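A report of this shape boils down to a set difference between the SPIR-V entry points and the native kernel names. The Python sketch below shows one way to compute and format it; the function names are hypothetical, and the names are printed exactly as given (any demangling or shortening seen in the output above is not modelled).

```python
ISSUE_URL = "https://github.com/intel/intel-graphics-compiler/issues/403"


def missing_kernels(spirv_entry_points, native_kernels):
    """Entry points declared in the SPIR-V but absent from the native binary."""
    return sorted(set(spirv_entry_points) - set(native_kernels))


def format_report(device, missing):
    """Render the missing-kernel list in the layout of the example output."""
    lines = [
        f"[chip-kernel-verify] device={device}: "
        f"{len(missing)} kernel(s) missing from native binary",
        f"  (IGC likely dropped them; see {ISSUE_URL} ):",
    ]
    lines += [f"    - {name}" for name in missing]
    return "\n".join(lines)
```

An empty result from missing_kernels means the native binary contains every entry point and nothing needs to be reported.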

Test plan

  • Standalone build of chip-kernel-verify — smoke-tested off/fail/warn modes, missing-file, non-fatbin ELF, missing-ocloc paths.
  • Full chipStar build with LLVM 21 (meatloaf). hipcc produces working binaries; verifier fires per compile.
  • End-to-end: hipcc samples/2_vecadd/VecAdd.cpp with HIPCC_VERIFY_DEVICES=nonsense-device → hipcc exits 1; same with HIPCC_VERIFY=warn → exits 0; HIPCC_VERIFY=0 → fully silent.
  • ctest -R hipcc_verify_hipcc_verify_detects_dropped_kernels FAILS under current IGC (expected: demonstrates IGC #403); hipcc_verify_off_mode_silent passes.
  • Full check.py dgpu level0 — verify no regressions from default-on verification on the existing suite.
  • Full check.py dgpu opencl.
  • Salami / pastrami (aarch64 / macOS where ocloc is absent) — confirm graceful skip.

Notes

  • Runtime Level Zero code is unchanged — this is purely a build-time check.
  • tools/spirv-extractor/spirv-extractor.hh has an unbounded 1 MB scan on its raw-data path; the verifier works around it by padding the input buffer before calling extractSPIRVModule. Separate cleanup for another PR.
  • The reproducer fixture is 1.9 MB — consistent with existing multi-MB .ll fixtures under tests/compiler/promoteInt/.

References

New tool chip-kernel-verify: extracts the SPIR-V bundle from a hipcc output,
runs ocloc compile+disasm per device in HIPCC_VERIFY_DEVICES, and diffs the
SPIR-V OpEntryPoint set against the kernels present in the native binary.
Missing names are reported with a pointer to intel-graphics-compiler#403.
Graceful when ocloc is absent.

CMake:
  HIPCC_VERIFY           ON by default; baked into hipcc.bin as the runtime
                         default via HIPCC_VERIFY_DEFAULT.
  HIPCC_VERIFY_DEVICES   default 'dg2', baked as compile-time default.
Runtime env override: HIPCC_VERIFY={0,warn,anything-else}.

The HIPCC submodule gains a post-compile hook that invokes the verifier on
the compiled output.

Test: tests/hipcc-verify/ checks in a real SPIR-V fixture from rocThrust
set_difference_by_key that reproduces IGC #403 on dg2 — IGC silently drops
three lookback_set_op_kernel template instantiations (ff/ii/jj). The
reproducer test FAILS under ctest while the bug is present; that failure is
the point. The sanity test (HIPCC_VERIFY=0) passes silently.

Fixes CI build failure where hipcc invocations during chipStar's own
bootstrap (before install tree is populated) returned exit 127 from a
non-existent chip-kernel-verify path.

ocloc returning non-zero (e.g. 'Double type is not supported on this
platform' on dg2 with rocRAND fp64 kernels) is a real but ordinary compile
error, not the silent IGC kernel-drop class this tool is designed to detect
(intel-graphics-compiler#403). Print diagnostics for visibility and continue;
only an apparent-success-with-missing-kernels result is the #403 signal that
should fail the build.
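The policy in this commit message can be written as a small decision function. The Python sketch below is illustrative only; the function name and the "skip"/"warn"/"fail" mode labels are hypothetical stand-ins for the HIPCC_VERIFY settings.

```python
def verifier_exit_code(mode, ocloc_returncode, missing):
    """Decide the verifier's exit status.

    mode: "skip", "warn", or "fail" (hypothetical labels for HIPCC_VERIFY).
    ocloc_returncode: exit status of the ocloc compile step.
    missing: kernel names present in the SPIR-V but not in the native binary.
    """
    if mode == "skip":
        return 0
    if ocloc_returncode != 0:
        # Ordinary compile error (e.g. unsupported fp64): diagnostics are
        # printed elsewhere, but this is not the silent-drop signal.
        return 0
    if missing and mode == "fail":
        # Apparent success with missing kernels: the #403 signal.
        return 1
    return 0
```

Only the combination of a zero ocloc status and a non-empty missing list produces a failing exit code.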

The reproducer test exits 1 to demonstrate the silent kernel drop, which
turns CI red. Use CTest WILL_FAIL to invert the exit-code interpretation:
while the bug is present, the test reports as Passed (verifier output naming
the dropped kernels still appears in logs); if IGC is fixed and the test
starts unexpectedly succeeding, it flips to Failed — the signal to retire
the fixture. SKIP_REGULAR_EXPRESSION handles ocloc-unavailable hosts (macOS,
aarch64) by marking the test Skipped instead of unexpectedly passed.
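To make the inversion concrete, here is a small Python model of how CTest interprets a test result under these two properties. It is a simplification for illustration: CTest uses its own regular-expression engine rather than Python's re, and the skip pattern in the usage note is a hypothetical placeholder, not the one used in the PR.

```python
import re


def ctest_status(exit_code, output, will_fail=False, skip_regex=None):
    """Model of CTest's verdict with WILL_FAIL and SKIP_REGULAR_EXPRESSION."""
    # A skip-pattern match marks the test Skipped regardless of exit code.
    if skip_regex is not None and re.search(skip_regex, output):
        return "Skipped"
    passed = exit_code == 0
    if will_fail:
        passed = not passed  # WILL_FAIL inverts the pass/fail criteria
    return "Passed" if passed else "Failed"
```

With will_fail=True, an exit code of 1 (bug present) yields "Passed", an exit code of 0 (bug fixed) yields "Failed", and output matching a placeholder pattern such as "ocloc not found" yields "Skipped".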

Hardening: when the input is an ELF, only verify if a .hip_fatbin section is
present. Intermediate -dc/-c objects carry partial offload bundles that
extractSPIRVModule's bundle-walking path crashed on (called _copyAs on a
pointer derived from arbitrary file bytes). Combined with the matching
HIPCC submodule fix that no longer invokes the verifier on intermediate
objects, this defends against future callers feeding partial inputs too.
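The guard described in this commit message amounts to walking the ELF section headers and looking for a .hip_fatbin name. The Python sketch below illustrates the idea; it is not the tool's implementation (which builds on spirv-extractor.hh), the function name has_section is hypothetical, and a 64-bit little-endian ELF layout is assumed.

```python
import struct

ELF_MAGIC = b"\x7fELF"
# Elf64_Ehdr and Elf64_Shdr layouts (assumption: 64-bit, little-endian).
EHDR = struct.Struct("<16sHHIQQQIHHHHHH")
SHDR = struct.Struct("<IIQQQQIIQQ")


def has_section(data: bytes, wanted: str = ".hip_fatbin") -> bool:
    """Return True if `data` is an ELF image containing a section named `wanted`."""
    if len(data) < EHDR.size or data[:4] != ELF_MAGIC:
        return False
    fields = EHDR.unpack_from(data, 0)
    shoff, shentsize, shnum, shstrndx = fields[6], fields[11], fields[12], fields[13]
    # Bounds-check everything derived from file bytes before using it.
    if shentsize != SHDR.size or shstrndx >= shnum:
        return False
    if shoff + shnum * shentsize > len(data):
        return False
    headers = [SHDR.unpack_from(data, shoff + i * shentsize) for i in range(shnum)]
    str_off, str_size = headers[shstrndx][4], headers[shstrndx][5]
    if str_off + str_size > len(data):
        return False
    strtab = data[str_off:str_off + str_size]
    target = wanted.encode()
    for header in headers:
        name_off = header[0]                 # sh_name: offset into the string table
        end = strtab.find(b"\0", name_off)
        if name_off < len(strtab) and end != -1 and strtab[name_off:end] == target:
            return True
    return False
```

Inputs that are not ELF, are truncated, or lack the section all return False, so a caller can skip verification instead of handing partial data to the extractor.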
…ractor.hh

macOS doesn't ship <elf.h>. spirv-extractor.hh already defines Elf64_Ehdr /
Elf64_Shdr / ELFMAG / SELFMAG portably (with __APPLE__ stubs), so reuse
those instead of pulling in the system header. Drop the EI_CLASS/ELFCLASS64
predicate that wasn't otherwise available on macOS — checking just the
ELFMAG is sufficient since Mach-O magic doesn't collide with it.

The previous commit's intent was to drop <elf.h> in favor of the portable
defs from spirv-extractor.hh, but the deletion was not staged. CI's macOS
build still hit 'fatal error: elf.h file not found'.