[codex] align PTODSL public surface and sync validation by Zhendong404 · Pull Request #388 · mouliangyu/PTOAS

Zhendong404 · 2026-05-21T08:27:22Z

Summary

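Complete the PTODSL implementation and align it with the user guide: add the missing frontend validation for synchronization, tighten the public surface constraints, and update the surface contract tests to match.
Add a docs-as-test framework so that the fenced code blocks in the user guide become part of an executable, regression-testable flow.
Improve documentation and naming: update the public DSL descriptions in Chapter 8 / Chapter 10, rename some ops, and remove old aliases.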
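
For readers unfamiliar with the docs-as-test idea, a minimal sketch follows. It is not the code in test/python/ptodsl_docs_as_test.py; the helper names `extract_python_blocks` and `run_doc_blocks` are hypothetical, and it assumes each fenced block is meant to run on its own in a fresh namespace.

```python
import re

# Built indirectly so this example can itself live inside a fenced block.
FENCE = "`" * 3
BLOCK_RE = re.compile(
    rf"^{FENCE}python[^\n]*\n(.*?)^{FENCE}[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def extract_python_blocks(markdown_text):
    """Yield (start_line, source) for each fenced python block."""
    for match in BLOCK_RE.finditer(markdown_text):
        # 1-based line number of the first code line, used in failure reports.
        start_line = markdown_text.count("\n", 0, match.start(1)) + 1
        yield start_line, match.group(1)


def run_doc_blocks(markdown_text, doc_name="<doc>"):
    """Execute every block in a fresh namespace and collect failures."""
    failures = []
    for start_line, source in extract_python_blocks(markdown_text):
        location = f"{doc_name}:{start_line}"
        try:
            exec(compile(source, location, "exec"), {"__name__": "__doc_block__"})
        except Exception as exc:
            # Record the failure and keep going so one bad block
            # does not hide the others.
            failures.append((location, repr(exc)))
    return failures
```

Collecting failures rather than stopping at the first one would let a run report every broken block in a guide at once.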

Validation

python3 -m unittest ptodsl.tests.test_vector_cube_ops -v
python3 test/python/ptodsl_jit_compile.py
python3 test/python/ptodsl_docs_as_test.py (stops on an existing unrelated fixture gap in 07-data-movement-ops.md:1206)

learning-chip · 2026-05-21T08:42:15Z

+    specialized_text = compiled.mlir_text()
+    expect_parse_roundtrip_and_verify(specialized_text, "flash attention specialized MLIR")
+    expect("func.func @flash_attention_kernel" in specialized_text, "direct compile should emit the flash_attention_kernel entry")
+    expect("!pto.tile_buf<mat, 64x128xf32" in specialized_text, "BLOCK_Q=64 specialization should change the physical Q tile shape")
+    expect("func.call @materialize_tile_bounds" in specialized_text, "direct compile should still route SIMT helpers through func.call")
+
+    cached = demo.flash_attention_kernel.cached_specializations()
+    expect(len(cached) >= 2, "wrapper compile plus explicit compile should populate at least two cached specializations")
+    print("ptodsl_flash_attention_demo_compile: PASS")


Does this only test Python DSL -> MLIR? It should also test the ptoas step that lowers to a binary.
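
One possible shape for such a check, as a sketch only: the helper `lower_mlir_text` is hypothetical, and the lowering command is passed in by the caller because the actual ptoas command line is not shown in this thread. `FAKE_TOOL` is a stand-in that just copies its input to its output so the example runs anywhere.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Stand-in for the real lowering tool: copies argv[1] to argv[2].
FAKE_TOOL = [
    sys.executable,
    "-c",
    "import shutil, sys; shutil.copy(sys.argv[1], sys.argv[2])",
]


def lower_mlir_text(mlir_text, command):
    """Write mlir_text to a temp file and run command + [input, output].

    `command` is a caller-supplied argv prefix; nothing is assumed here
    about the real tool's flags or argument order.
    """
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "kernel.mlir"
        out = Path(tmp) / "kernel.bin"
        src.write_text(mlir_text)
        result = subprocess.run(
            [*command, str(src), str(out)],
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            raise RuntimeError(f"lowering failed: {result.stderr.strip()}")
        if not out.exists() or out.stat().st_size == 0:
            raise RuntimeError("lowering produced no output")
        return out.read_bytes()
```

In the test above, `specialized_text` would be the input, and the assertion would be that the call returns a non-empty binary.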

learning-chip · 2026-05-21T09:43:04Z



 @pto.cube
 def qk_matmul(


Is this @pto.cube decorator necessary? Can this function just be inlined?

learning-chip · 2026-05-21T09:44:53Z

 @pto.cube
 def pv_matmul(


Same for here. It feels cumbersome that every small util function needs to be a separately-decorated function. The actual compute is only 7 lines (if inlined), but this function with argument is 20 lines...

vloncar · 2026-05-21T09:49:22Z

 | Double-buffer handoff (compute → DMA) | `rls_buf(V, id)` + `get_buf(MTE2, id)` |
 | Double-buffer handoff (DMA → compute) | `rls_buf(MTE2, id)` + `get_buf(V, id)` |
-| Core A notifies core B | `set_cross_core(B, id)` + `wait_flag_dev(A, id)` |
+| Core A notifies core B | `set_cross_flag(B, id)` + `wait_cross_flag(A, id)` |


This is a leftover from the previous design; these functions accept pipes now (only Pipe.FIX).

Did CCE change this interface recently?

learning-chip · 2026-05-21T14:41:05Z

Are those attributes like KernelRole.UKERNEL actually needed by the IR and passes? If not, we should keep the minimum needed context managers like with pto.vf():, and only keep one decorator @pto.jit, and remove the redundant decorators, to reduce the grammar noise.

PTO IR actually needs the simd/simt/cube decorators to create different regions/functions/sections. As for ukernel, I'm considering removing it.

learning-chip · 2026-05-25T14:11:58Z

+        raise ValueError("seq must be positive")
+
+    @pto.jit(
+        name=name,


Minor thing: name can be omitted; it can default to kernel.__name__ of this function object.
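
A sketch of how such a default could work in plain Python. The `jit` function here is a hypothetical stand-in, not the PTODSL implementation, and the `kernel_name` and `options` attributes are invented for illustration.

```python
import functools


def jit(fn=None, *, name=None, **options):
    """Hypothetical stand-in for a JIT decorator with an optional name."""

    def wrap(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            return func(*args, **kwargs)

        # Fall back to the Python function name when no name is given.
        wrapper.kernel_name = name if name is not None else func.__name__
        wrapper.options = options
        return wrapper

    # Support both the bare @jit form and the @jit(...) form.
    return wrap(fn) if fn is not None else wrap
```

With this pattern, a kernel defined as `def TADD_f32_16x64(...)` under `@jit(target="a5")` would get `kernel_name == "TADD_f32_16x64"` without repeating the string.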

learning-chip · 2026-05-26T07:13:05Z

+if __package__ in {None, ""}:
+    here = Path(__file__).resolve()
+    for candidate in here.parents:
+        if (candidate / "ptodsl" / "__init__.py").exists():
+            sys.path.insert(0, str(candidate))
+            break
+    else:
+        raise RuntimeError(
+            "Unable to locate the PTODSL Python package root from flash_attention_softmax_launch.py"
+        )
+
+from ptodsl import pto


We can assume the user has already run pip install for the ptodsl package, so there is no need for the extra sys.path.insert here.

learning-chip · 2026-05-26T07:17:09Z

+    def kernel(
+        scores: pto.tensor_spec(rank=2, dtype=pto.f32),
+        out: pto.tensor_spec(rank=2, dtype=pto.f32),
+    ):
+        lane_num = pto.elements_per_vreg(pto.f32)
+        physical_rows = ((rows + lane_num - 1) // lane_num) * lane_num
+        scores_tile_bytes = seq * physical_rows * pto.bytewidth(pto.f32)
+        runtime_seq = scores.shape[0]
+        runtime_rows = scores.shape[1]
+        total_elems = runtime_rows * runtime_seq
+
+        scores_view = pto.make_tensor_view(
+            scores,
+            shape=[1, 1, 1, runtime_seq, runtime_rows],
+            strides=[total_elems, total_elems, total_elems, runtime_rows, 1],
+        )
+        out_view = pto.make_tensor_view(
+            out,
+            shape=[1, 1, 1, runtime_seq, runtime_rows],
+            strides=[total_elems, total_elems, total_elems, runtime_rows, 1],
+        )


In the type declaration, scores: pto.ptr(dtype=pto.f32) is more suitable than pto.tensor_spec(rank=2, dtype=pto.f32). scores is converted to a 5D tensor by pto.make_tensor_view anyway, so the earlier rank=2 information looks useless?
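
To illustrate the point, the 5D shape and strides in the diff depend only on two integers, which a "ptr + int" signature could pass directly. This is a plain-Python sketch; `contiguous_strides` and `view_5d` are hypothetical helpers, not PTODSL functions.

```python
def contiguous_strides(shape):
    """Row-major (C-order) element strides for the given shape."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides


def view_5d(seq, rows):
    """Shape and strides of the 5D view from the diff, from two ints only."""
    shape = [1, 1, 1, seq, rows]
    return shape, contiguous_strides(shape)


shape, strides = view_5d(seq=8, rows=16)
print(shape)    # [1, 1, 1, 8, 16]
print(strides)  # [128, 128, 128, 16, 1]
```

The strides match the `[total_elems, total_elems, total_elems, runtime_rows, 1]` pattern in the diff, with `total_elems = 8 * 16 = 128`.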

learning-chip · 2026-05-26T07:24:20Z

+_DEVICE = "npu:0"
+
+
+def _make_softmax_kernel(name: str, *, rows: int, seq: int):


This uses a closure to recompile the kernel for every [rows, seq] shape. We should also test a dynamic-shape kernel that takes rows: pto.i32 as a dynamic kernel argument (not as a closure/constant).

@MirkoDeVita98 can you check whether dynamic shape works? ptodsl/examples/jit/tadd_launch.py is an easier starting point.

I updated tadd_launch.py in #418 to include a dynamic-shape TADD kernel with rows: pto.i32 as a runtime kernel argument instead of capturing it as a closure/constant. The dynamic kernel reuses the same compiled kernel for different row counts (16x64 and 32x64) and passes rows at launch time. Verified with msprof and all TADD cases pass.
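
The difference between the two approaches can be sketched with a toy compile cache. This is plain Python with hypothetical names (`KernelCache`, `static_key`, `dynamic_key`); it models only how the cache key changes, not how PTODSL actually caches specializations.

```python
class KernelCache:
    """Toy cache that counts how many times a 'compile' happens."""

    def __init__(self):
        self.compiles = 0
        self._entries = {}

    def get(self, key):
        if key not in self._entries:
            self.compiles += 1  # stand-in for an expensive compile
            self._entries[key] = f"binary<{key}>"
        return self._entries[key]


def static_key(rows, cols):
    # Shape captured as constants: every new (rows, cols) is a new key.
    return ("tadd_static", rows, cols)


def dynamic_key(cols):
    # rows is a launch-time argument, so it is absent from the key.
    return ("tadd_dynamic", cols)


cache = KernelCache()
cache.get(static_key(16, 64))
cache.get(static_key(32, 64))
print(cache.compiles)  # 2: one compile per shape

cache.get(dynamic_key(64))  # launched with rows=16
cache.get(dynamic_key(64))  # launched with rows=32, reuses the same entry
print(cache.compiles)  # 3: the dynamic kernel compiled only once
```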

learning-chip · 2026-05-26T07:34:45Z

+@pto.jit(
+    name="TADD_f32_16x64",
+    kernel_kind="vector",
+    target="a5",
+)
+def TADD_f32_16x64(
+    A: pto.tensor_spec(rank=2, dtype=pto.f32),
+    B: pto.tensor_spec(rank=2, dtype=pto.f32),
+    C: pto.tensor_spec(rank=2, dtype=pto.f32),
+):
+    _tadd_tile(A, B, C, 16, 64)


Same issues here as in flash_attention_softmax_launch.py:

rank=2 is useless & redundant information

only closure-based static shape; the dynamic dim is not tested (cc @MirkoDeVita98)

name can be omitted

sys.path.insert not needed, assuming pip installed ptodsl

Zhendong404 · 2026-05-27T02:51:30Z

Will be merged into feature-vpto-backend directly

Zhendong404 force-pushed the pto-dsl-impl branch from 1774355 to 546708a Compare May 21, 2026 08:28

Zhendong404 marked this pull request as ready for review May 21, 2026 08:29

learning-chip reviewed May 21, 2026

learning-chip approved these changes May 21, 2026

learning-chip reviewed May 21, 2026

vloncar reviewed May 21, 2026

learning-chip suggested changes May 21, 2026

mouliangyu force-pushed the feature-pto-dsl branch from f2824be to f8a71f9 Compare May 22, 2026 08:18

Zhendong404 force-pushed the pto-dsl-impl branch from bfbde22 to 87f8347 Compare May 22, 2026 08:21

Add lit for dynamic flagId

350468a

Zhendong404 force-pushed the pto-dsl-impl branch 4 times, most recently from 191e7a1 to 538529a Compare May 25, 2026 07:28

fix(vpto): expand arith floordiv before llvm export (hw-native-sys#394)

1280751

learning-chip approved these changes May 25, 2026

learning-chip reviewed May 25, 2026

learning-chip and others added 12 commits May 26, 2026 11:45

quick install script on top of MLIR docker image

cdb1193

add reference result for top->vop expansion

217dc2a

low-level python binding example to generate vpto IR

c5b540d

initial prototype of high-level dsl builder api

8697f5e

initial prototype of softmax IR builder

cf4ece0

script to check IR equal

de75cad

avoid raw MLIR Type.parse

8d1a834

more Pythonic builder style suggestions

8dc1a4a

major refactor of dsl syntax and impl

63e590a

[vpto] Add ptodsl tracing POC

1ac8d0d

[vpto] Allow structured loops without vecscope

2c2cf6d

Add user guides

16303bd

Zhendong404 and others added 12 commits May 26, 2026 11:45

Add a flash attention demo

60f4c6a

Completed the first version of PTODSL user guide

d094fa2

Complete the mlir text emission of the FA demo

d8db04e

pip install ptoas

78c4cf8

use pip install in CI (hw-native-sys#385)

8de1968

* pip install ptoas * use pip install in CI * wheels pipelines use pip install * add missing license header * fix pip setup

feature(ptodsl): align ptodsl implementation with user guide

36bb9c5

chore(ptodsl): normalize docs test headers

f17c5c7

python builder to reproduce tilelang_st/tadd.pto

ac4d5ff

Switch to new kernel surface

84e6f48

Clean up the pending docs-as-test in the user guide

5ba043d

Clarify the pto.jit kernel signature

ade5bf7

Refine the online softmax demo

c236b24

Zhendong404 force-pushed the pto-dsl-impl branch from 538529a to c236b24 Compare May 26, 2026 03:45

learning-chip reviewed May 26, 2026

learning-chip mentioned this pull request May 26, 2026

[Feature] Unify kernel entry convention to "ptr + int" #417

Open

Zhendong404 closed this May 27, 2026

7 participants
