
Nki beta 2 pt2 #321

Open
latentCall145 wants to merge 38 commits into main from nki-beta-2-pt2

Conversation

@latentCall145
Collaborator

@latentCall145 latentCall145 commented Mar 6, 2026

Summary

Adds core NKI Beta 2 examples (elementwise add, tiled matmul, RoPE, attention, rmsnorm, softmax, and mlp kernels), the interpreter functions needed to handle them, and unit tests to check function parity with the NKI docs.

Test Plan

Related Issues

#296

Breaking Changes

N/A

Checklist

  • I added tests to all new functionality I added/bugs I fixed.
  • I verified that a human has reviewed all code in this PR.
  • I ran npm run build:frontend if the PR modified any TypeScript code.
  • I made sure that my code is well documented (comments explaining strange code, docstrings for functions, website modified if new functionality added).

@latentCall145 latentCall145 marked this pull request as draft March 6, 2026 18:03
@github-actions

github-actions bot commented Mar 6, 2026

Sanitizer Performance Benchmark

Benchmark                    main (min)  PR (min)  Change
simple_load_store            0.005s      0.005s    +0.3%
gemm                         0.023s      0.023s    +1.1%
gemm_oob                     0.024s      0.024s    +1.9%
indirect_load                0.077s      0.077s    +0.1%
nested_loop                  0.025s      0.025s    +0.4%
block_pointer_loop_advance   0.007s      0.007s    -0.2%
liger_jsd                    0.150s      0.152s    +1.1%
flaggems_layernorm           2.806s      2.824s    +0.6%
Total                        3.117s      3.137s    +0.6%

Iterations: 1 warmup + 40 measured

@mark14wu
Collaborator

mark14wu commented Mar 6, 2026

I will review after 4pm because my token is used up...

@latentCall145
Collaborator Author

latentCall145 commented Mar 6, 2026

it's not done yet 😮‍💨 you can skim it if you'd like but this is likely to change some amount before it's ready for review

by the way we may need to fix up code quality in another PR because jialiang needs this ready soon

@mark14wu
Collaborator

mark14wu commented Mar 8, 2026

by the way we may need to fix up code quality in another PR because jialiang needs this ready soon

Which PR?
@latentCall145

@latentCall145
Collaborator Author

latentCall145 commented Mar 8, 2026

by the way we may need to fix up code quality in another PR because jialiang needs this ready soon

Which PR? @latentCall145

oh I guess I didn't say. This PR is kinda messy, although I'm cleaning it up a bit, and I don't think we'll need a separate PR (this'll stay messy, but it's not core functionality so it's probably fine)

@mark14wu
Collaborator

mark14wu commented Mar 8, 2026

Code review

Found 4 issues:

  1. Operator precedence bug in tensor_tensor power check — Because and binds tighter than or, the condition engine == nisa.gpsimd_engine and "psum" in (...) or dst_buffer == "psum" is parsed as (engine == gpsimd_engine AND "psum" in buffers) OR (dst_buffer == "psum"). The dst_buffer == "psum" clause fires for any engine, not just gpsimd. The error message says "Gpsimd, which cannot access PSUM", suggesting it should only reject for gpsimd engine. The test only uses engine=nisa.gpsimd_engine, so the overly-broad rejection is not caught.

raise ValueError("tensor_tensor cannot read both inputs from PSUM")
if op.name == "power" and (
    engine == nisa.gpsimd_engine
    and "psum" in (dst_buffer, lhs_buffer, rhs_buffer)
    or dst_buffer == "psum"
):
    raise ValueError("nl.power uses Gpsimd, which cannot access PSUM")
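A minimal sketch of the parenthesization the review implies, with hypothetical stand-ins for the engine constants and buffer names, scoping the PSUM rejection to the gpsimd engine only:

```python
GPSIMD_ENGINE, VECTOR_ENGINE = "gpsimd", "vector"  # hypothetical stand-ins

def power_rejects_psum(engine, dst_buffer, lhs_buffer, rhs_buffer):
    # Explicit grouping: the PSUM check applies only on the gpsimd engine,
    # matching the "Gpsimd ... cannot access PSUM" error message. The
    # separate dst_buffer == "psum" clause is redundant because dst_buffer
    # is already covered by the tuple membership test.
    return engine == GPSIMD_ENGINE and (
        "psum" in (dst_buffer, lhs_buffer, rhs_buffer)
    )
```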

  2. Dead code in _apply_tensor_scalar_op — inspect.signature(op) is called on a SubOp instance, whose __call__ has signature (*args, **kwargs). The resulting parameter names will always be ('args', 'kwargs'), never ("x", "dtype"), making the unary activation branch at L764-766 unreachable. The SubOp docstring says the underlying _op should be accessed via self._op/self._run, not by inspecting the wrapper. Currently works by accident since _run slices args[:self.num_args].

def _apply_tensor_scalar_op(data, op, operand, reverse):
    """Apply tensor_scalar op, including unary ops such as nl.rsqrt."""
    if not (op is None or isinstance(op, SubOp)):
        raise ValueError(f"Unsupported op: {op}")
    param_names = tuple(inspect.signature(op).parameters)
    if param_names == ("x", "dtype"):
        applied = op(operand if reverse else data)
        return applied.data if isinstance(applied, NDArray) else np.asarray(applied)
    lhs, rhs = (operand, data) if reverse else (data, operand)
    applied = op._run(lhs, rhs)
    return applied.data if isinstance(applied, NDArray) else np.asarray(applied)

  3. Unconditional import of NKIBeta2InterpretedFunction in NKITrace.__init__ — Both .nki and .nki_beta2 are imported unconditionally regardless of the beta2 flag. Since nki_beta2.py has top-level imports of nki.isa/nki.language that raise ModuleNotFoundError if the nki package isn't installed, using backend="nki" (beta1 path) now also requires the nki package. On main, NKITrace only imported from .nki. Consider making the .nki_beta2 import conditional on beta2=True.

class NKITrace(KernelInterface, TraceInterface):
    def __init__(self, kernel, client: str | Client, beta2: bool = True) -> None:
        from .nki import NKIInterpretedFunction
        from .nki_beta2 import NKIBeta2InterpretedFunction

        nki_fn_cls = NKIBeta2InterpretedFunction if beta2 else NKIInterpretedFunction
        self.backend = "nki_beta2" if beta2 else "nki"
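A sketch of the suggested fix, with a hypothetical module name so the snippet is self-contained: moving the import inside the branch means a missing beta2 dependency only fails when the beta2 path is actually requested:

```python
def make_interpreter(kernel, beta2=True):
    # Deferred import: backend="nki" never touches the beta2 module,
    # so its top-level nki.isa / nki.language imports cannot break beta1.
    if beta2:
        # Hypothetical stand-in for `from .nki_beta2 import NKIBeta2InterpretedFunction`
        from hypothetical_nki_beta2 import NKIBeta2InterpretedFunction as fn_cls
    else:
        def fn_cls(k):  # stand-in for NKIInterpretedFunction
            return ("nki", k)
    return fn_cls(kernel)
```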

  4. Missing version constraint on the nki dependency — neuronx-cc is pinned to ==2.22.* but nki has no version constraint. Both packages come from the same AWS Neuron registry and should be version-compatible. A previous reviewer on PR [CI] Install neuronx-cc from AWS Neuron repo #255 flagged similar dependency concerns for packages from the Neuron index.

nki = [
    "neuronx-cc==2.22.*; python_version >= '3.10' and python_version <= '3.12'",
    "nki"
]
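One possible shape for the constraint the review asks for; the ==0.2.* pin and the mirrored environment marker are illustrative, not a recommendation:

```toml
nki = [
    "neuronx-cc==2.22.*; python_version >= '3.10' and python_version <= '3.12'",
    # Hypothetical pin; choose whichever nki release matches neuronx-cc 2.22.
    "nki==0.2.*; python_version >= '3.10' and python_version <= '3.12'",
]
```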

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

@mark14wu
Collaborator

mark14wu commented Mar 8, 2026

@latentCall145 Claude has given some comments but with low confidence / priority. Feel free to disregard them.

@latentCall145
Collaborator Author

suggestions were pretty reasonable btw

@latentCall145 latentCall145 marked this pull request as ready for review March 10, 2026 19:31
@latentCall145
Collaborator Author

@claude review?

@mark14wu
Collaborator

@codex review.

@mark14wu
Collaborator

[P1] Register the beta2 ops used by the new demos

When backend="nki_beta2" is used on the new beta2 kernels, ClientManager.patch_run() only wraps program_id, ndarray, and nc_matmul in triton_viz/frontends/nki_beta2.py. That leaves dma_copy, tensor_reduce, tensor_scalar, tensor_tensor, and activation uninstrumented, so examples/nki_beta2/add.py, softmax.py, rmsnorm.py, and rope.py produce nearly empty traces and the profiler/sanitizer clients never see their memory accesses. As shipped, most of the new backend appears to work while providing no meaningful analysis output.
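If the missing instrumentation were addressed, one hedged sketch (hypothetical names, not the actual ClientManager API) is a table-driven patch loop, so a newly used op is a one-line table change rather than a new wrapper call that can be forgotten:

```python
import types

# Hypothetical list: every beta2 op the demo kernels touch.
BETA2_OPS = [
    "program_id", "ndarray", "nc_matmul",
    "dma_copy", "tensor_reduce", "tensor_scalar", "tensor_tensor", "activation",
]

def patch_ops(module, wrap):
    """Wrap every op in BETA2_OPS that the module actually defines."""
    for name in BETA2_OPS:
        fn = getattr(module, name, None)
        if callable(fn):
            setattr(module, name, wrap(fn))

# Demo against a dummy module namespace:
demo = types.SimpleNamespace(dma_copy=lambda: "copy", activation=lambda: "act")
patch_ops(demo, lambda f: (lambda: "traced:" + f()))
```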

@mark14wu
Collaborator

[P1] Register the beta2 ops used by the new demos


@latentCall145 Can you verify this issue from codex-cli? Not sure if it's a hallucination or a real problem.

@mark14wu
Collaborator

Code review

Found 3 issues:

  1. Substring match "nki" in backend is inconsistent with exact matching used in patch.py (backend == "nki", backend == "nki_beta2"). A future backend containing "nki" (e.g., "nki_v3") would route through NKITrace in trace.py but raise ValueError in patch_lang. Consider using if backend in {"nki", "nki_beta2"}: for consistency.

# NKI allow Python function( NKIInterpretedFunction)
if "nki" in backend:
    return NKITrace(kernel, client, beta2=("beta2" in backend))
if isinstance(
vs exact matching in patch.py:

elif backend == "nki":
    from triton_viz.core.nki import nki_patch_lang

    scope = _LangPatchScope()
    nki_patch_lang(scope)
elif backend == "nki_beta2":
    from triton_viz.core.nki_beta2 import nki_patch_lang

    scope = _LangPatchScope()
    nki_patch_lang(scope)
else:

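A small sketch of the exact-match routing the review suggests (the function and set names are hypothetical):

```python
NKI_BACKENDS = {"nki", "nki_beta2"}

def route_backend(backend):
    # Exact membership check, consistent with patch.py's == comparisons;
    # an unknown backend such as "nki_v3" fails loudly in one place.
    if backend in NKI_BACKENDS:
        return ("NKITrace", backend == "nki_beta2")
    raise ValueError(f"Unknown backend: {backend}")
```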
  2. _LangPatchScope (private symbol with underscore prefix) is imported directly in test code. The same pattern was flagged in PR [REFACTOR] Simplify _CombinedLoopHooks by delegating hook management to ClientManager #304 when _LoopIter was imported across module boundaries -- the reviewer suggested removing the _ prefix for symbols used as public API.

try:
    from triton_viz.core.patch import _LangPatchScope

  3. nki==0.2.0 is added to the nki extras group without a python_version constraint, while the existing neuronx-cc dependency in the same group is restricted to python_version >= '3.10' and python_version <= '3.12'. On Python 3.13+, nki would install but neuronx-cc would not, resulting in an incomplete environment.

nki = [
    "neuronx-cc==2.22.*; python_version >= '3.10' and python_version <= '3.12'",
    "nki==0.2.0"
]

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

@latentCall145
Collaborator Author

[P1] Register the beta2 ops used by the new demos

@latentCall145 Can you verify this issue from codex-cli? Not sure if it's a hallucination or a real problem.

It’s correct, but we don’t trace these operations in triton either (they’re basically just elementwise ops), so not an issue for now. Also I’m planning to connect the NKI beta 2 interpreter to triton-viz in another PR


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7a4d6f7f02


Comment on lines 191 to 193

else:
-    self.interpreter_fn = NKIInterpretedFunction(kernel)
+    self.interpreter_fn = nki_fn_cls(kernel)
self.func = kernel


P1 Badge Restore GenericKernel unwrapping for nki backend

For backend="nki", this branch now wraps any non-interpreted object directly (self.interpreter_fn = nki_fn_cls(kernel)), but NKIInterpretedFunction.run expects to operate on the underlying Python kernel function (it inspects/calls self.fn as a function). Passing a compiled NKI GenericKernel here no longer uses kernel.func as before, which can bypass interpreter execution or fail argument/source handling when users trace jitted kernels; please unwrap compiled kernels to their .func before constructing the interpreter wrapper.
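A minimal sketch of the unwrapping Codex suggests, assuming (as the comment describes for main) that compiled kernels expose the original Python function as .func:

```python
def unwrap_kernel(kernel):
    # Peel a compiled wrapper (e.g. an NKI GenericKernel) back to the
    # plain Python function before building the interpreter wrapper;
    # plain functions pass through unchanged.
    return getattr(kernel, "func", kernel)
```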

Useful? React with 👍 / 👎.


kwargs.pop("warmup", None)
grid = kwargs.pop("grid", None)
nki.trace(self.func, grid=grid, platform_target="trn1").specialize(


P2 Badge Do not hard-code beta2 pre-trace to trn1

The beta2 pre-trace path always calls nki.trace(..., platform_target="trn1") and provides no way for callers to override the target, so kernels that are valid on newer NeuronCore targets can be rejected before interpreter execution when pre_trace=True (the default). This creates false failures for trn2/gen4 workflows even when the interpreter path itself supports those semantics; the platform target should be configurable or derived instead of fixed to trn1.
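One way to surface the target without changing default behaviour (a sketch; nki.trace is replaced here by a recording stub, and the wiring is hypothetical):

```python
def make_pre_trace(trace_fn):
    # trace_fn stands in for nki.trace; the old hard-coded "trn1"
    # survives only as the default, so trn2/gen4 callers can override it.
    def pre_trace(func, grid=None, platform_target="trn1"):
        return trace_fn(func, grid=grid, platform_target=platform_target)
    return pre_trace
```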

Useful? React with 👍 / 👎.

@mark14wu
Collaborator


@latentCall145 Those issues are from Claude Code with confidence of 75. Default threshold is 80 so it wasn't published previously. You may also take a look.

@mark14wu
Collaborator

@latentCall145 Did you find Claude's findings useful? If not, I'm good to stamp it.

