HIPCC_VERIFY: detect IGC-dropped kernels at compile time #1240
Draft
pvelesko wants to merge 7 commits into
New tool chip-kernel-verify: extracts the SPIR-V bundle from a hipcc output,
runs ocloc compile+disasm per device in HIPCC_VERIFY_DEVICES, and diffs the
SPIR-V OpEntryPoint set against the kernels present in the native binary.
Missing names are reported with a pointer to intel-graphics-compiler#403.
Graceful when ocloc is absent.
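The comparison described above can be sketched as follows. This is an illustrative Python model, not the tool's actual implementation: the function names entry_point_names and missing_kernels are hypothetical, and the sketch assumes a little-endian SPIR-V module and that the native kernel names have already been obtained from the ocloc disassembly.

```python
import struct

SPIRV_MAGIC = 0x07230203
OP_ENTRY_POINT = 15  # opcode of OpEntryPoint in the SPIR-V specification


def entry_point_names(module):
    """Collect the name of every OpEntryPoint in a little-endian SPIR-V module."""
    if len(module) < 20 or len(module) % 4 != 0:
        raise ValueError("not a SPIR-V module: bad length")
    words = struct.unpack(f"<{len(module) // 4}I", module)
    if words[0] != SPIRV_MAGIC:
        raise ValueError("not a little-endian SPIR-V module: bad magic")

    names = set()
    pos = 5  # the module header is five words long
    while pos < len(words):
        word_count = words[pos] >> 16
        opcode = words[pos] & 0xFFFF
        if word_count == 0 or pos + word_count > len(words):
            raise ValueError("malformed instruction stream")
        if opcode == OP_ENTRY_POINT and word_count > 3:
            # Operands: execution model, entry-point id, then the
            # nul-terminated name literal followed by interface ids.
            operands = words[pos + 3 : pos + word_count]
            raw = struct.pack(f"<{len(operands)}I", *operands)
            names.add(raw.split(b"\0", 1)[0].decode("utf-8"))
        pos += word_count
    return names


def missing_kernels(spirv_names, native_names):
    """Entry points declared in SPIR-V but absent from the native binary."""
    return sorted(set(spirv_names) - set(native_names))
```

A non-empty result from missing_kernels corresponds to the "missing names" case that the verifier reports.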
CMake:
- HIPCC_VERIFY ON by default; baked into hipcc.bin as the runtime default via HIPCC_VERIFY_DEFAULT.
- HIPCC_VERIFY_DEVICES default 'dg2', baked as compile-time default.
- Runtime env override: HIPCC_VERIFY={0,warn,anything-else}.
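One way to read the HIPCC_VERIFY override is as a three-way mode switch. The sketch below is a hypothetical Python model, not the code in hipcc.bin: the names VerifyMode, parse_verify_mode and effective_verify_mode are invented for illustration, and the case-insensitive matching, whitespace stripping, and the "1" spelling of the baked-in default are assumptions. The "off" spelling comes from the PR description's list of overrides.

```python
from enum import Enum


class VerifyMode(Enum):
    OFF = "off"    # skip verification entirely
    WARN = "warn"  # report mismatches but do not fail the compile
    FAIL = "fail"  # report mismatches and fail the compile


def parse_verify_mode(value):
    """Map a HIPCC_VERIFY value onto a mode: 0/off, warn, or anything else."""
    v = value.strip().lower()  # assumption: matching ignores case and whitespace
    if v in ("0", "off"):
        return VerifyMode.OFF
    if v == "warn":
        return VerifyMode.WARN
    return VerifyMode.FAIL


def effective_verify_mode(env, baked_default="1"):
    """Use the environment value if set, else the default baked in at build time."""
    return parse_verify_mode(env.get("HIPCC_VERIFY", baked_default))
```

For example, effective_verify_mode(os.environ) would yield FAIL on a default-ON build when the variable is unset, and WARN when HIPCC_VERIFY=warn is exported.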
The HIPCC submodule gains a post-compile hook that invokes the verifier on
the compiled output.
Test: tests/hipcc-verify/ checks in a real SPIR-V fixture from rocThrust
set_difference_by_key that reproduces IGC #403 on dg2 — IGC silently drops
three lookback_set_op_kernel template instantiations (ff/ii/jj). The
reproducer test FAILS under ctest while the bug is present; that failure is
the point. The sanity test (HIPCC_VERIFY=0) passes silently.
Fixes CI build failure where hipcc invocations during chipStar's own bootstrap (before install tree is populated) returned exit 127 from a non-existent chip-kernel-verify path.
ocloc returning non-zero (e.g. 'Double type is not supported on this platform' on dg2 with rocRAND fp64 kernels) is a real but ordinary compile error, not the silent IGC kernel-drop class this tool is designed to detect (intel-graphics-compiler#403). Print diagnostics for visibility and continue; only an apparent-success-with-missing-kernels result is the #403 signal that should fail the build.
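The policy in this commit message can be summarized as a small decision function. This is a hedged sketch in Python: verifier_exit_code and its parameters are hypothetical names, and the mode strings are placeholders for however the tool represents HIPCC_VERIFY internally.

```python
def verifier_exit_code(ocloc_found, ocloc_returncode, missing, mode="fail"):
    """Model of the exit policy: only a silent kernel drop fails the build.

    ocloc_found      -- whether an ocloc executable was located
    ocloc_returncode -- exit status of the ocloc compile step
    missing          -- kernel names present in SPIR-V but not in the native binary
    mode             -- "off", "warn", or "fail" (hypothetical spellings)
    """
    if mode == "off" or not ocloc_found:
        return 0  # verification skipped, or ocloc absent: never fail the build
    if ocloc_returncode != 0:
        # Ordinary compile error (for example fp64 on dg2): diagnostics are
        # printed for visibility, but this is not the #403 signal.
        return 0
    if missing and mode != "warn":
        return 1  # apparent success with missing kernels: the #403 signal
    return 0
```

The key design choice is that a failing ocloc run and a silently incomplete ocloc run are treated as different classes of problem, and only the second one turns the build red.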
The reproducer test exits 1 to demonstrate the silent kernel drop, which turns CI red. Use CTest WILL_FAIL to invert the exit-code interpretation: while the bug is present, the test reports as Passed (verifier output naming the dropped kernels still appears in logs); if IGC is fixed and the test starts unexpectedly succeeding, it flips to Failed — the signal to retire the fixture. SKIP_REGULAR_EXPRESSION handles ocloc-unavailable hosts (macOS, aarch64) by marking the test Skipped instead of unexpectedly passed.
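To make the inverted interpretation concrete, here is a small Python model of how CTest is documented to combine these two test properties. It is only an illustration of the semantics, not CTest code; ctest_status is a hypothetical name and the skip pattern used in the usage note is made up.

```python
import re


def ctest_status(exit_code, output, will_fail=False, skip_regex=None):
    """Model the reported status for one test run."""
    # SKIP_REGULAR_EXPRESSION: matching output marks the test Skipped,
    # regardless of the exit code.
    if skip_regex is not None and re.search(skip_regex, output):
        return "Skipped"
    succeeded = exit_code == 0
    # WILL_FAIL inverts the pass/fail criterion.
    if will_fail:
        succeeded = not succeeded
    return "Passed" if succeeded else "Failed"
```

With will_fail=True, an exit code of 1 (kernels dropped) reports Passed and an exit code of 0 (IGC fixed) reports Failed, while a run whose output matches a pattern such as "ocloc not found" reports Skipped.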
Hardening: when the input is an ELF, only verify if a .hip_fatbin section is present. Intermediate -dc/-c objects carry partial offload bundles that extractSPIRVModule's bundle-walking path crashed on (called _copyAs on a pointer derived from arbitrary file bytes). Combined with the matching HIPCC submodule fix that no longer invokes the verifier on intermediate objects, this defends against future callers feeding partial inputs too.
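The section check can be sketched as below. This is an illustrative Python version rather than the C++ in the PR: has_hip_fatbin_section is a hypothetical name, the sketch assumes a little-endian ELF64 layout, and every offset read from the file is bounds-checked before use so that arbitrary input bytes cannot drive an out-of-range access.

```python
import struct

ELF_MAGIC = b"\x7fELF"


def has_hip_fatbin_section(data):
    """Return True if data is an ELF64 image with a .hip_fatbin section."""
    if len(data) < 64 or data[:4] != ELF_MAGIC:
        return False
    (e_shoff,) = struct.unpack_from("<Q", data, 40)
    e_shentsize, e_shnum, e_shstrndx = struct.unpack_from("<HHH", data, 58)
    if e_shentsize < 64 or e_shstrndx >= e_shnum:
        return False
    if e_shoff + e_shentsize * e_shnum > len(data):
        return False  # section header table runs past the end of the file

    def section(index):
        base = e_shoff + index * e_shentsize
        (sh_name,) = struct.unpack_from("<I", data, base)
        sh_offset, sh_size = struct.unpack_from("<QQ", data, base + 24)
        return sh_name, sh_offset, sh_size

    _, str_off, str_size = section(e_shstrndx)
    if str_off + str_size > len(data):
        return False  # section-name string table is out of range
    strtab = data[str_off : str_off + str_size]

    for i in range(e_shnum):
        sh_name, _, _ = section(i)
        if sh_name >= len(strtab):
            continue  # name offset outside the string table: ignore
        end = strtab.find(b"\0", sh_name)
        name = strtab[sh_name : end if end != -1 else len(strtab)]
        if name == b".hip_fatbin":
            return True
    return False
```

Under this model an intermediate object without the section yields False and the verifier would skip it instead of walking a partial offload bundle.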
…ractor.hh
macOS doesn't ship <elf.h>. spirv-extractor.hh already defines Elf64_Ehdr / Elf64_Shdr / ELFMAG / SELFMAG portably (with __APPLE__ stubs), so reuse those instead of pulling in the system header. Drop the EI_CLASS/ELFCLASS64 predicate, which wasn't otherwise available on macOS; checking just ELFMAG is sufficient since Mach-O magic doesn't collide with it.
The previous commit's intent was to drop <elf.h> in favor of the portable defs from spirv-extractor.hh, but the deletion was not staged. CI's macOS build still hit 'fatal error: elf.h file not found'.
a1eaf10 to de33431
Summary
Adds HIPCC_VERIFY (default ON): after every successful hipcc compile, a new post-compile tool, chip-kernel-verify, invokes ocloc and diffs the SPIR-V OpEntryPoint set against the kernels present in the native binary. Mismatches (silent entry-point drops by IGC) fail the compile with a named list of the missing symbols.
Tracks: intel/intel-graphics-compiler#403. On dg2, IGC's SIMD32 register-pressure retry path silently drops template instantiations from the native binary while ocloc reports success. At runtime chipStar then fails with "Failed to find kernel via kernel name", far from the root cause. This check catches that at build time.
What's in the box
- tools/hipcc-verify/chip-kernel-verify: extracts the SPIR-V from a hipcc output (reusing tools/spirv-extractor/), runs ocloc compile -spirv_input -device <...> per target in HIPCC_VERIFY_DEVICES, then ocloc disasm to enumerate the native kernels, diffs, reports.
- CMake: HIPCC_VERIFY (ON) + cache var HIPCC_VERIFY_DEVICES (default dg2). Baked into hipcc.bin as HIPCC_VERIFY_DEFAULT.
- HIPCC submodule (hipcc-verify): after a successful compile, invokes ${HIP_PATH}/bin/chip-kernel-verify <output>. Submodule pointer bump included here.
- tests/hipcc-verify/ ships a real 1.9 MB SPIR-V dump (via CHIP_DUMP_PROCESSED_SPIRV) of rocThrust set_difference_by_key. Under IGC-with-#403 the reproducer ctest fails and names the 3 dropped lookback_set_op_kernel instantiations (ff/ii/jj element types). When IGC is eventually fixed the test flips to passing; that's the signal to retire the fixture.
Env overrides
- HIPCC_VERIFY: 0/off = skip; warn = print only; else = fail on mismatch
- HIPCC_VERIFY_DEVICES: default dg2; ocloc -device targets
- HIPCC_VERIFY_OCLOC: ocloc
If ocloc is absent, the verifier prints a one-line notice and exits 0; it never fails a build just because the tool isn't installed.
Example failure output
Test plan
- chip-kernel-verify: smoke-tested off/fail/warn modes, missing-file, non-fatbin ELF, missing-ocloc paths.
- hipcc produces working binaries; verifier fires per compile.
- hipcc samples/2_vecadd/VecAdd.cpp with HIPCC_VERIFY_DEVICES=nonsense-device → hipcc exits 1; same with HIPCC_VERIFY=warn → exits 0; HIPCC_VERIFY=0 → fully silent.
- ctest -R hipcc_verify_: hipcc_verify_detects_dropped_kernels FAILS under current IGC (expected: demonstrates IGC #403); hipcc_verify_off_mode_silent passes.
- check.py dgpu level0: verify no regressions from default-on verification on the existing suite.
- check.py dgpu opencl.
- ocloc is absent: confirm graceful skip.
Notes
- tools/spirv-extractor/spirv-extractor.hh has an unbounded 1 MB scan on its raw-data path; the verifier works around it by padding the input buffer before calling extractSPIRVModule. Separate cleanup for another PR.
- .ll fixtures under tests/compiler/promoteInt/.
References