
Nki beta 2 pt2 #321

Open
latentCall145 wants to merge 38 commits into main from nki-beta-2-pt2

Conversation

@latentCall145
Collaborator

@latentCall145 latentCall145 commented Mar 6, 2026

Summary

Adds core NKI Beta 2 examples (elementwise add, tiled matmul, RoPE, attention, rmsnorm, softmax, and mlp kernels), the interpreter functions needed to handle them, and unit tests to check function parity with the NKI docs.

Test Plan

Related Issues

#296

Breaking Changes

N/A

Checklist

  • I added tests to all new functionality I added/bugs I fixed.
  • I verified that a human has reviewed all code in this PR.
  • I ran npm run build:frontend if the PR modified any TypeScript code.
  • I made sure that my code is well documented (comments explaining strange code, docstrings for functions, website modified if new functionality added).

@latentCall145 latentCall145 marked this pull request as draft March 6, 2026 18:03
@github-actions

github-actions bot commented Mar 6, 2026

Sanitizer Performance Benchmark

Benchmark                    main (min)  PR (min)  Change
simple_load_store            0.005s      0.005s    +0.3%
gemm                         0.023s      0.023s    +1.1%
gemm_oob                     0.024s      0.024s    +1.9%
indirect_load                0.077s      0.077s    +0.1%
nested_loop                  0.025s      0.025s    +0.4%
block_pointer_loop_advance   0.007s      0.007s    -0.2%
liger_jsd                    0.150s      0.152s    +1.1%
flaggems_layernorm           2.806s      2.824s    +0.6%
Total                        3.117s      3.137s    +0.6%

Iterations: 1 warmup + 40 measured

@mark14wu
Collaborator

mark14wu commented Mar 6, 2026

I will review after 4pm because my token is used up...

@latentCall145
Collaborator Author

latentCall145 commented Mar 6, 2026

it's not done yet 😮‍💨 you can skim it if you'd like but this is likely to change some amount before it's ready for review

by the way we may need to fix up code quality in another PR because jialiang needs this ready soon

@mark14wu
Collaborator

mark14wu commented Mar 8, 2026

by the way we may need to fix up code quality in another PR because jialiang needs this ready soon

Which PR?
@latentCall145

@latentCall145
Collaborator Author

latentCall145 commented Mar 8, 2026

by the way we may need to fix up code quality in another PR because jialiang needs this ready soon

Which PR? @latentCall145

oh I guess I didn't say. This PR is kinda messy, although I'm cleaning it up a bit, and I don't think we'll need a separate PR (this'll stay messy, but it's not core functionality so it's probably fine)

@mark14wu
Collaborator

mark14wu commented Mar 8, 2026

Code review

Found 4 issues:

  1. Operator precedence bug in tensor_tensor power check — Because and binds tighter than or, the condition engine == nisa.gpsimd_engine and "psum" in (...) or dst_buffer == "psum" is parsed as (engine == gpsimd_engine AND "psum" in buffers) OR (dst_buffer == "psum"). The dst_buffer == "psum" clause fires for any engine, not just gpsimd. The error message says "Gpsimd, which cannot access PSUM", suggesting it should only reject for gpsimd engine. The test only uses engine=nisa.gpsimd_engine, so the overly-broad rejection is not caught.

raise ValueError("tensor_tensor cannot read both inputs from PSUM")
if op.name == "power" and (
    engine == nisa.gpsimd_engine
    and "psum" in (dst_buffer, lhs_buffer, rhs_buffer)
    or dst_buffer == "psum"
):
    raise ValueError("nl.power uses Gpsimd, which cannot access PSUM")
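A minimal sketch of the parenthesization the review implies, with hypothetical stand-ins for the engine constants and buffer names, scoping the PSUM rejection to the gpsimd engine only:

```python
GPSIMD_ENGINE, VECTOR_ENGINE = "gpsimd", "vector"  # hypothetical stand-ins

def power_rejects_psum(engine, dst_buffer, lhs_buffer, rhs_buffer):
    # Explicit grouping: the PSUM check applies only on the gpsimd engine,
    # matching the "Gpsimd ... cannot access PSUM" error message. The
    # separate dst_buffer == "psum" clause is redundant because dst_buffer
    # is already covered by the tuple membership test.
    return engine == GPSIMD_ENGINE and (
        "psum" in (dst_buffer, lhs_buffer, rhs_buffer)
    )
```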

  2. Dead code in _apply_tensor_scalar_op — inspect.signature(op) is called on a SubOp instance, whose __call__ has signature (*args, **kwargs). The resulting parameter names will always be ('args', 'kwargs'), never ("x", "dtype"), making the unary activation branch at L764-766 unreachable. The SubOp docstring says the underlying _op should be accessed via self._op/self._run, not by inspecting the wrapper. Currently works by accident since _run slices args[:self.num_args].

def _apply_tensor_scalar_op(data, op, operand, reverse):
    """Apply tensor_scalar op, including unary ops such as nl.rsqrt."""
    if not (op is None or isinstance(op, SubOp)):
        raise ValueError(f"Unsupported op: {op}")
    param_names = tuple(inspect.signature(op).parameters)
    if param_names == ("x", "dtype"):
        applied = op(operand if reverse else data)
        return applied.data if isinstance(applied, NDArray) else np.asarray(applied)
    lhs, rhs = (operand, data) if reverse else (data, operand)
    applied = op._run(lhs, rhs)
    return applied.data if isinstance(applied, NDArray) else np.asarray(applied)

  3. Unconditional import of NKIBeta2InterpretedFunction in NKITrace.__init__ — Both .nki and .nki_beta2 are imported unconditionally regardless of the beta2 flag. Since nki_beta2.py has top-level imports of nki.isa/nki.language that raise ModuleNotFoundError if the nki package isn't installed, using backend="nki" (beta1 path) now also requires the nki package. On main, NKITrace only imported from .nki. Consider making the .nki_beta2 import conditional on beta2=True.

class NKITrace(KernelInterface, TraceInterface):
    def __init__(self, kernel, client: str | Client, beta2: bool = True) -> None:
        from .nki import NKIInterpretedFunction
        from .nki_beta2 import NKIBeta2InterpretedFunction

        nki_fn_cls = NKIBeta2InterpretedFunction if beta2 else NKIInterpretedFunction
        self.backend = "nki_beta2" if beta2 else "nki"
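A sketch of the suggested fix, with a hypothetical module name so the snippet is self-contained: moving the import inside the branch means a missing beta2 dependency only fails when the beta2 path is actually requested:

```python
def make_interpreter(kernel, beta2=True):
    # Deferred import: backend="nki" never touches the beta2 module,
    # so its top-level nki.isa / nki.language imports cannot break beta1.
    if beta2:
        # Hypothetical stand-in for `from .nki_beta2 import NKIBeta2InterpretedFunction`
        from hypothetical_nki_beta2 import NKIBeta2InterpretedFunction as fn_cls
    else:
        def fn_cls(k):  # stand-in for NKIInterpretedFunction
            return ("nki", k)
    return fn_cls(kernel)
```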

  4. Missing version constraint on the nki dependency — neuronx-cc is pinned to ==2.22.* but nki has no version constraint. Both packages come from the same AWS Neuron registry and should be version-compatible. A previous reviewer on PR [CI] Install neuronx-cc from AWS Neuron repo #255 flagged similar dependency concerns for packages from the Neuron index.

nki = [
    "neuronx-cc==2.22.*; python_version >= '3.10' and python_version <= '3.12'",
    "nki"
]
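One possible shape for the constraint the review asks for; the ==0.2.* pin and the mirrored environment marker are illustrative, not a recommendation:

```toml
nki = [
    "neuronx-cc==2.22.*; python_version >= '3.10' and python_version <= '3.12'",
    # Hypothetical pin; choose whichever nki release matches neuronx-cc 2.22.
    "nki==0.2.*; python_version >= '3.10' and python_version <= '3.12'",
]
```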

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

@mark14wu
Collaborator

mark14wu commented Mar 8, 2026

@latentCall145 Claude has given some comments but with low confidence / priority. Feel free to disregard them.

@latentCall145
Collaborator Author

suggestions were pretty reasonable btw

@latentCall145 latentCall145 marked this pull request as ready for review March 10, 2026 19:31
@latentCall145
Collaborator Author

@claude review?

@mark14wu
Collaborator

@codex review.

@mark14wu
Collaborator

[P1] Register the beta2 ops used by the new demos

When backend="nki_beta2" is used on the new beta2 kernels, ClientManager.patch_run() only wraps program_id, ndarray, and nc_matmul in triton_viz/frontends/nki_beta2.py. That leaves dma_copy, tensor_reduce, tensor_scalar, tensor_tensor, and activation uninstrumented, so examples/nki_beta2/add.py, softmax.py, rmsnorm.py, and rope.py produce nearly empty traces and the profiler/sanitizer clients never see their memory accesses. As shipped, most of the new backend appears to work while providing no meaningful analysis output.
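If the missing instrumentation were addressed, one hedged sketch (hypothetical names, not the actual ClientManager API) is a table-driven patch loop, so a newly used op is a one-line table change rather than a new wrapper call that can be forgotten:

```python
import types

# Hypothetical list: every beta2 op the demo kernels touch.
BETA2_OPS = [
    "program_id", "ndarray", "nc_matmul",
    "dma_copy", "tensor_reduce", "tensor_scalar", "tensor_tensor", "activation",
]

def patch_ops(module, wrap):
    """Wrap every op in BETA2_OPS that the module actually defines."""
    for name in BETA2_OPS:
        fn = getattr(module, name, None)
        if callable(fn):
            setattr(module, name, wrap(fn))

# Demo against a dummy module namespace:
demo = types.SimpleNamespace(dma_copy=lambda: "copy", activation=lambda: "act")
patch_ops(demo, lambda f: (lambda: "traced:" + f()))
```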

@mark14wu
Collaborator

[P1] Register the beta2 ops used by the new demos


@latentCall145 Can you verify this issue from codex-cli? Not sure if it's a hallucination or a real problem.

@mark14wu
Collaborator

Code review

Found 3 issues:

  1. Substring match "nki" in backend is inconsistent with exact matching used in patch.py (backend == "nki", backend == "nki_beta2"). A future backend containing "nki" (e.g., "nki_v3") would route through NKITrace in trace.py but raise ValueError in patch_lang. Consider using if backend in {"nki", "nki_beta2"}: for consistency.

# NKI allow Python function( NKIInterpretedFunction)
if "nki" in backend:
    return NKITrace(kernel, client, beta2=("beta2" in backend))
if isinstance(
vs exact matching in patch.py:

elif backend == "nki":
    from triton_viz.core.nki import nki_patch_lang

    scope = _LangPatchScope()
    nki_patch_lang(scope)
elif backend == "nki_beta2":
    from triton_viz.core.nki_beta2 import nki_patch_lang

    scope = _LangPatchScope()
    nki_patch_lang(scope)
else:

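A small sketch of the exact-match routing the review suggests (the function and set names are hypothetical):

```python
NKI_BACKENDS = {"nki", "nki_beta2"}

def route_backend(backend):
    # Exact membership check, consistent with patch.py's == comparisons;
    # an unknown backend such as "nki_v3" fails loudly in one place.
    if backend in NKI_BACKENDS:
        return ("NKITrace", backend == "nki_beta2")
    raise ValueError(f"Unknown backend: {backend}")
```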
  2. _LangPatchScope (private symbol with underscore prefix) is imported directly in test code. The same pattern was flagged in PR [REFACTOR] Simplify _CombinedLoopHooks by delegating hook management to ClientManager #304 when _LoopIter was imported across module boundaries -- the reviewer suggested removing the _ prefix for symbols used as public API.

try:
    from triton_viz.core.patch import _LangPatchScope

  3. nki==0.2.0 is added to the nki extras group without a python_version constraint, while the existing neuronx-cc dependency in the same group is restricted to python_version >= '3.10' and python_version <= '3.12'. On Python 3.13+, nki would install but neuronx-cc would not, resulting in an incomplete environment.

nki = [
    "neuronx-cc==2.22.*; python_version >= '3.10' and python_version <= '3.12'",
    "nki==0.2.0"
]

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

@latentCall145
Collaborator Author

[P1] Register the beta2 ops used by the new demos

@latentCall145 Can you verify this issue from codex-cli? Not sure if it's a hallucination or a real problem.

It’s correct, but we don’t trace these operations in triton either (they’re basically just elementwise ops), so not an issue for now. Also I’m planning to connect the NKI beta 2 interpreter to triton-viz in another PR


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7a4d6f7f02


Comment on lines 191 to 193

else:
-    self.interpreter_fn = NKIInterpretedFunction(kernel)
+    self.interpreter_fn = nki_fn_cls(kernel)
self.func = kernel


P1 Badge Restore GenericKernel unwrapping for nki backend

For backend="nki", this branch now wraps any non-interpreted object directly (self.interpreter_fn = nki_fn_cls(kernel)), but NKIInterpretedFunction.run expects to operate on the underlying Python kernel function (it inspects/calls self.fn as a function). Passing a compiled NKI GenericKernel here no longer uses kernel.func as before, which can bypass interpreter execution or fail argument/source handling when users trace jitted kernels; please unwrap compiled kernels to their .func before constructing the interpreter wrapper.
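A minimal sketch of the unwrapping Codex suggests, assuming (as the comment describes for main) that compiled kernels expose the original Python function as .func:

```python
def unwrap_kernel(kernel):
    # Peel a compiled wrapper (e.g. an NKI GenericKernel) back to the
    # plain Python function before building the interpreter wrapper;
    # plain functions pass through unchanged.
    return getattr(kernel, "func", kernel)
```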

Useful? React with 👍 / 👎.


kwargs.pop("warmup", None)
grid = kwargs.pop("grid", None)
nki.trace(self.func, grid=grid, platform_target="trn1").specialize(


P2 Badge Do not hard-code beta2 pre-trace to trn1

The beta2 pre-trace path always calls nki.trace(..., platform_target="trn1") and provides no way for callers to override the target, so kernels that are valid on newer NeuronCore targets can be rejected before interpreter execution when pre_trace=True (the default). This creates false failures for trn2/gen4 workflows even when the interpreter path itself supports those semantics; the platform target should be configurable or derived instead of fixed to trn1.
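One way to surface the target without changing default behaviour (a sketch; nki.trace is replaced here by a recording stub, and the wiring is hypothetical):

```python
def make_pre_trace(trace_fn):
    # trace_fn stands in for nki.trace; the old hard-coded "trn1"
    # survives only as the default, so trn2/gen4 callers can override it.
    def pre_trace(func, grid=None, platform_target="trn1"):
        return trace_fn(func, grid=grid, platform_target=platform_target)
    return pre_trace
```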

Useful? React with 👍 / 👎.

@mark14wu
Collaborator


@latentCall145 Those issues are from Claude Code with confidence of 75. Default threshold is 80 so it wasn't published previously. You may also take a look.

@mark14wu
Collaborator

@latentCall145 Did you find Claude's findings useful? If not, I'm good to stamp it.

