feat: add kernelgen backend #56
Add a complete kernelgen backend that generates NKI kernels via MLIR, as an alternative to the existing HLO backend. This includes:
- Kernelgen backend with MLIR encapsulated behind a builder API
- Op implementations for kernelgen (_kernelgen_impls.py)
- Inplace update (dynamic_update_slice) support
- Unified IR interface for HLO and kernelgen backends
- Custom op interface via nki_op.nki_custom_op
- Compilation delegation to nkipy_kernelgen.compile
- Comprehensive unit and numerical tests
vgene left a comment
This part looks good to me in general. Two minor issues.
if out_idx >= self._user_return_len
}

def resolve_input_arrays(self, original_inputs):
Let's discuss this and consolidate IO tensor handling between the two backends, especially with aliasing.
I've consolidated resolve_input_arrays and get_alias_input_name into a single prepare_io_mapping function and reused as much logic as possible. For alias handling, the main difference is that the HLO backend needs to create a new variable with a .must_alias_input suffix, while kernelgen can reuse the same buffer without creating an alias variable in the graph. Another difference is that the kernelgen-generated NEFF uses automatic variable names such as "input_0" and "input_1", so it requires extra work to map them back to the original parameter names.
…ram_names list The redundant Dict[str, str] mapping NEFF input names to parameter names was fragile — it could drift out of sync with _input_specs. Replace with a List[str] positionally aligned with _input_specs, making the invariant structurally impossible to violate.
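A minimal sketch of the invariant this commit describes, not the PR's actual code. `InputSpec` and `KernelIO` are hypothetical names introduced for illustration; only `_input_specs`, `_original_param_names`, and the "input_N" naming are taken from this conversation. The point is that the NEFF-name-to-parameter-name mapping is derived from two positionally aligned lists instead of being stored separately.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class InputSpec:
    """Hypothetical stand-in for one entry of _input_specs."""
    neff_name: str            # auto-generated name, e.g. "input_0"
    shape: Tuple[int, ...]


class KernelIO:
    """Illustrative container; not the PR's actual class."""

    def __init__(self, specs: List[InputSpec], param_names: List[str]):
        # Positional alignment is checked once, at construction time.
        if len(specs) != len(param_names):
            raise ValueError("specs and param_names must have equal length")
        self._input_specs = list(specs)
        self._original_param_names = list(param_names)

    def neff_to_param(self) -> Dict[str, str]:
        # Derived on demand from the two aligned lists, so there is no
        # separately stored mapping that could drift out of sync.
        return {
            spec.neff_name: name
            for spec, name in zip(self._input_specs, self._original_param_names)
        }


io = KernelIO(
    [InputSpec("input_0", (2, 3)), InputSpec("input_1", (3,))],
    ["x", "bias"],
)
print(io.neff_to_param())
```

Because the name list is indexed the same way as the specs, looking up the parameter for input `i` is just `_original_param_names[i]`, with no second source of truth to keep updated.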
Separate the op registration system so that op definition files are backend-agnostic and each backend registers its implementations lazily through its own registration module.

Key changes:
- Add composed_impl to Op class for backend-agnostic fallback dispatch
- Extract all HLO implementations into _hlo_impls.py with lazy registration via _register_hlo.py (same pattern as kernelgen)
- Op definition files now only declare Op instances, CPU impls, and composed_impl for ops built from other dispatched ops
- Remove redundant per-backend registration of composed ops from _register_kernelgen.py (they fall through to composed_impl)

Adding a new backend now only requires registering primitive ops; all composed ops (floor_divide, tan, mean, cumsum, etc.) automatically work via the composed_impl fallback.
…ied prepare_io_mapping Add TensorPlaceholder.original_name so each backend encodes the user-facing parameter name at construction time (HLO strips the .must_alias_input suffix; KernelGen uses its _original_param_names list). This enables a single, backend-agnostic prepare_io_mapping free function that replaces two per-backend protocol methods, fixes a silent bug in KernelGen's fallback path, and moves input-count validation into one place.
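As a rough sketch of how a single backend-agnostic mapping function could work once each placeholder carries its user-facing name: the `TensorPlaceholder.original_name` field, the `.must_alias_input` suffix, and the `prepare_io_mapping` name come from this PR, while the signatures, the `hlo_placeholder` / `kernelgen_placeholders` helpers, and the "input_N" names used below are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence

ALIAS_SUFFIX = ".must_alias_input"


@dataclass(frozen=True)
class TensorPlaceholder:
    name: str            # backend-internal tensor name
    original_name: str   # user-facing parameter name


def hlo_placeholder(ir_name: str) -> TensorPlaceholder:
    # HLO-style construction: strip the alias suffix to recover the user name.
    if ir_name.endswith(ALIAS_SUFFIX):
        return TensorPlaceholder(ir_name, ir_name[: -len(ALIAS_SUFFIX)])
    return TensorPlaceholder(ir_name, ir_name)


def kernelgen_placeholders(
    neff_names: Sequence[str], original_param_names: Sequence[str]
) -> List[TensorPlaceholder]:
    # Kernelgen-style construction: pair auto-generated names positionally
    # with the recorded parameter names.
    if len(neff_names) != len(original_param_names):
        raise ValueError("name lists must have equal length")
    return [TensorPlaceholder(n, p) for n, p in zip(neff_names, original_param_names)]


def prepare_io_mapping(
    placeholders: Sequence[TensorPlaceholder], user_inputs: Dict[str, object]
) -> Dict[str, object]:
    # Input-count validation lives in one place for every backend.
    if len(placeholders) != len(user_inputs):
        raise ValueError(
            f"expected {len(placeholders)} inputs, got {len(user_inputs)}"
        )
    return {p.name: user_inputs[p.original_name] for p in placeholders}


inputs = {"x": 1.0, "y": 2.0}
hlo_map = prepare_io_mapping(
    [hlo_placeholder("x.must_alias_input"), hlo_placeholder("y")], inputs
)
kg_map = prepare_io_mapping(
    kernelgen_placeholders(["input_0", "input_1"], ["x", "y"]), inputs
)
```

The backend-specific knowledge is confined to how placeholders are constructed; the mapping function itself only reads `name` and `original_name`.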
BIR emission assigns "in_tensor_N" names during compilation regardless of caller-provided names — the old comment incorrectly attributed this to the "C++ pipeline from unnamed MLIR block arguments". Update to reflect the actual mechanism.
@vgene Updated the PR as we discussed. Please take a look when you have a moment and let me know if you have any comments.
Rename all references from kernelgen/KernelGen to nkigen/NkiGen to align with the new nkigen package name (formerly nkipy_kernelgen).
Summary
- … ComputationIR abstraction, including uniform input/output name resolution and compilation delegation
- … (_kernelgen_impls.py, _register_kernelgen.py) with support for inplace updates (dynamic_update_slice) and custom ops via nki_op.nki_custom_op
- … (test_kernelgen_backend.py), op tests (test_kernelgen_ops.py), and numerical tests (test_kernelgen_numerical.py)
Test plan
- uv run pytest tests/unit/test_kernelgen_backend.py -v -n auto
- uv run pytest tests/test_kernelgen_ops.py -v -n auto
- uv run pytest tests/test_kernelgen_numerical.py -v -n auto
- uv run pytest tests/ -n auto (full suite)