[TIRx] Phase out flat device-intrinsic op aliases#19838
Conversation
There was a problem hiding this comment.
Code Review
This pull request refactors TVM's backend intrinsics by replacing flat names (e.g., tirx.ptx_cp_async, tirx.cuda_func_call, tirx.nki_load) with their canonical namespace-qualified equivalents (e.g., tirx.ptx.cp_async, tirx.cuda.func_call, tirx.nki.load), removing legacy flat macro registrations. The review feedback highlights a regression where phasing out these flat registrations lost the TScriptDtypePrintLocation attribute for key PTX intrinsics, which is necessary for correct TVMScript printing and parsing. The reviewer suggests extending the registration struct and macros to restore this attribute, and also recommends simplifying the redundant dual-name matching logic for ptx_elect_sync in tile_primitive_dispatch.cc.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
eabcbbb to
102b8ca
Compare
PR apache#19677 registered every CUDA / Trainium device intrinsic under two Op names: a flat `tirx.<ns>_<name>` alias plus the canonical `tirx.<ns>.<name>`. The flat aliases were a migration shim; passes and codegen that match an intrinsic had to check both spellings. The canonical namespaced form is the intended end state. Remove the flat device-intrinsic aliases, keeping only the canonical namespaced ops: - RegisterDeviceIntrinsic (backend/cuda) and RegisterNKIIntrinsic (backend/trn) register only the canonical name. - Drop the flat-only macro registrations for device intrinsics; the canonical op with all attrs is registered from the alias table. The WMMA tvm_*_sync / mma_store / mma_fill builtins and the profiling timer_*_cuda builtins keep their flat names (no namespace / canonical form, category "builtin"). - Remove the redundant flat tirx.ptx_fetch_register registration. - C++ consumers that resolved a flat op by name string now use the canonical name; the ptx_elect_sync / cuda_func_call dual-name matchers collapse to a single canonical same_as check. - Python: every device-intrinsic wrapper call site (backend/cuda/op.py, backend/trn/op.py, backend/cuda/script.py, gemm_async/tcgen05.py) now passes the canonical op name to call_intrin; no Python source names a removed flat op. _canonical_device_intrin_name is retained only as the codegen-registration bridge (basename stays underscore because it generates the C helper name) plus a fallback for user-passed flat names. The remaining underscore strings in operator/intrinsics (cp_async.py / wgmma.py CODEGEN_REGISTRY keys) are codegen variant names, not TIR ops. The flat tirx.ptx_cp_async op was not purely redundant: it carried TScriptDtypePrintLocation=kFirst so the raw InjectPTXAsyncCopy-emitted form (element dtype in Call.dtype) round-trips dtype-first, while the fork-native form (void dtype) did not. To preserve that without a flat alias, the raw form now uses a dedicated canonical op tirx.ptx.cp_async_raw (printer name ptx.cp_async, so the TVMScript surface is unchanged); the fork-native form keeps tirx.ptx.cp_async. Only this op carries dtype-first; ptx.mma / ptx.ldmatrix keep void Call.dtype and must not (printing a leading dtype would corrupt their script form). Generated CUDA is byte-identical: helper names are literals and codegen dispatches by op name, with the registry resolving the canonical name to the same helper.
102b8ca to
6fc3c15
Compare
PR #19677 registered every CUDA / Trainium device intrinsic under two Op names: a flat
tirx.<ns>_<name>alias plus the canonicaltirx.<ns>.<name>. The flat aliases were a migration shim; passes and codegen that match an intrinsic had to check both spellings (the dual-name IsOp pattern). The Python builders and TVMScript parser already canonicalize, so every real Call already carries the canonical op and the flat aliases were dead weight.This pr removes the flat device-intrinsic aliases, keeping only the canonical namespaced ops:
Generated CUDA is byte-identical: helper names are literals and codegen dispatches by op name, with the registry resolving the canonical name to the same helper.