Add VPTO flash attention runtime case by mouliangyu · Pull Request #365 · mouliangyu/PTOAS

mouliangyu · 2026-05-13T19:14:45Z

Summary

Adds a VPTO runtime Flash Attention case under test/vpto/cases/kernels/flash-attention.

Workflow Covered

Uses a mixed pto.section.cube / pto.section.vector kernel in a single kernel.pto.
Computes Q @ K^T on cube with mte_gm_l1_frac, mte_l1_l0a/l0b, mad, and mte_l0c_ub.
Runs tiled online softmax and V accumulation on vector, then stores the output to GM.
Launches two 16-row blocks for a 32-row input to cover multi-block execution.

Validation

WORK_SPACE=/tmp/pto-vpto-flash CASE_NAME='kernels/flash-attention' \
  test/vpto/scripts/run_host_vpto_validation.sh

Result: All 1 VPTO case(s) passed.

- Add detailed mode parameter documentation (mode=0 vs mode=1) - Add 'Why get_buf/rls_buf is More Programmer-Friendly' section: - No manual priming/draining for ping/pong loops - No loop peeling for complex/nested loop dependencies - Simpler mental model (buffer ID + program order) - Add quick example comparison showing set_flag overhead vs get_buf simplicity - Update Example 2 and 3b with explicit mode=0 in code - Update comparison table with 'Loop peeling' row

- set_flag/wait_flag: 2 IDs per buffer (1 forward + 1 reverse pipe-pair) - get_buf/rls_buf: 1 ID per buffer (handles both directions automatically) - 8 per pipe-pair is HW limit, not a formula

- set_flag/wait_flag: 8 IDs per pipe-pair direction (HW limit) - get_buf/rls_buf: 1 buffer ID per shared resource (HW limit: 32 global), same ID used across all pipelines

- Event ID mgmt: each buffer occupies 1 ID per direction (removed misleading 4 IDs calc) - Drain example: use concrete EVT_*_0/EVT_*_1 instead of {(N-1)%2} expressions

- 4 set_flag + 4 wait_flag (not 8) - 4 IDs = 2 pipe-pair directions × 2 ping/pong buffers

- set_flag/wait_flag: 1 MTE2 load, 8 Vector slices — must peel set/wait outside loop - get_buf/rls_buf: same pattern but acquire/release can stay inside or outside

- Acquire/release per slice inside loop - Iteration 0 blocks until MTE2 done, iterations 1-7 proceed immediately

Add the merged v0.3 PTO micro-instruction release spec document for A5, including ISA group references and updated synchronization notes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Introduce a three-pass pipeline that lowers PTO tile ops to vector-level implementations via TileLang DSL templates: - ExpandTileOp: invokes TileLang Python DSL to instantiate template functions and replaces tile ops with func.call. SpecKey covers all operands; tile_buf operands are passed through without bridging. - PTOInlineLibCall: extended to recognize tilelang instance functions via the attribute set by the DSL frontend. - FoldTileBufIntrinsics: resolves pto.tile_buf_addr / tile_valid_rows / tile_valid_cols, including dynamic valid-shape via pto.bind_tile chain tracing. - MemrefToTileBuf: recovers tile_buf types from memref + bind_tile metadata after PlanMemory/InsertSync. - PTOViewToMemref: insert pto.bind_tile anchors for tile_buf function args so MemrefToTileBuf can recover them. Adds new PTO ops (tile_buf_addr/tile_valid_rows/tile_valid_cols), ptoas pipeline wiring, design docs, and unit tests.

…rmat2 fix(pto): relax tmrgsort format2 tmp shape check

Signed-off-by: FangRui <fangrui_95@163.com>

- Add tilelang-dsl cube surface design and implementation - Add vpto lowering for cube operations - Update frontend AST, semantic analysis, and lowering passes - Add tests for tilelang-dsl v1 cube features - Add design specs and task documentation

…303) * feat: enhance verification for multiple PTO operations in A5 architecture - TLoadOp: add vec dst layout validation (ND/DN/NZ) and shape matching - TStoreOp: add vec layout consistency and alignment checks - TAddSOp: enable requireValidRowsEqual validation for A5 - TCvtOp: add A5 type conversion validation - TPartAddOp/TPartMaxOp/TPartMinOp/TPartMulOp: add verifyPartialValidPattern for A5 - TSelSOp: add mask row-major layout check - TShlOp/TShrOp: add dst element type validation in verifyShiftLikeBinaryTileOpCommon * feat: add 10 test cases and update PTO.cpp verification Test cases added to test/basic: - tadds_validrow_mismatch.pto - tcvt_type_combination.pto - tload_vec_layout_mismatch.pto - tpartadd_valid_shape.pto - tpartmax_valid_shape.pto - tpartmin_valid_shape.pto - tpartmul_valid_shape.pto - tshl_dtype_mismatch.pto - tshr_dtype_mismatch.pto - tstore_vec_alignment.pto PTO.cpp updates: - Simplify TLoadOp NZ validation - TStoreOp verification refinements * fix: update TPart*Op test cases and PTO.cpp verification * fix: update PTO.cpp and remove tcvt_type_combination test * fix: update PTO.cpp verification logic * fix: update PTO.cpp and test cases * fix: update PTO.cpp * fix: update PTO.cpp verification * fix: update PTO.cpp * fix: update PTO.cpp * fix: update PTO.cpp * fix: update PTO.cpp * revert: 回退PTO.cpp校验逻辑到正确版本回退到commit 0ec558d（最后能通过的版本）原因： - 检视意见误解了ISA编译时static_assert vs AS运行时校验的差异 - 盲目照搬ISA规则导致破坏原有设计 - 添加的错误校验导致CI反复失败（5次失败）删除的错误校验： - TLoad的ValidCol/shape匹配强制校验 - TLoad的NZ shape[3]/shape[4]强制校验 - TPart的'至少一个src==dst'校验 - TStore过度严格的对齐校验 - fp4特殊处理逻辑恢复的正确设计： - TLoad只校验布局类型匹配（ND/DN/NZ） - TStore标准对齐校验 - TPart只校验dst>=src - 保留运行时的灵活性和原有设计哲学 * fix: add missing validations for TMinS/TMulS/TMaxS and TPart ops - TMinS/TMulS/TMaxS: enable requireValidRowsEqualOnA5 to match ISA BinaryInstr behavior (ISA uses validRow for loop bounds, mismatch causes out-of-bounds access) - TPart: add validation that at least one src valid_shape matches dst valid_shape (ISA TPartInstr requires this for TPartCopyInstr/TPartOps) - TStore: improve code readability with explicit dim check and rows/cols naming * fix: correct TStore validation and revert TPart validation - TStore: use srcValid (not srcShape) for isSpecialCase and alignment checks (valid_shape represents actual data range, shape is tile allocation) - TPart: remove incorrect 'at least one src == dst' validation (ISA allows src0/src1 to be smaller in different dimensions) - Keep: TMinS/TMulS/TMaxS requireValidRowsEqualOnA5=true (valid fix) * fix: revert TStore to use srcShape for alignment checks - TStore alignment check should use srcShape (tile allocation size) not srcValid (valid data range) - Example: cols=256, v_col=255 → 256 satisfies 32-byte alignment, 255 doesn't - Hardware alignment requirement is based on tile allocation, not data range * fix: correct indentation for TStore verification code - Fix missing indentation on line 2285 (int32_t bl declaration) - Code formatting consistency with surrounding context - No functional changes, only whitespace fix * fix: update PTO.cpp * Fix TPart*Op validation: only use verifyPartialValidPatternLoose for A5 - Added verifyPartialValidPatternLoose function (checks only <= dst, not == dst) - Modified only A5 branches to use loose validation (4 locations) - A2A3 branches continue to use strict verifyPartialValidPattern (requires at least one == dst) - Fixes tpartmin/tpartmax test failures --------- Co-authored-by: User <user@example.com>

Input: kernel.pto with cube/vector kernel modules, or a single kernel module normalized into that form. Output: the final VPTO fatobj object at the requested -o path. Steps: 1. Normalize the VPTO input and lower cube/vector modules to LLVM. 2. Compile cube/vector LLVM IR to device objects and merge them. 3. Generate the host stub internally and repack it with the merged device object into the final fatobj. Tests: VPTO host validation now links the ptoas-produced fatobj directly; cases use one kernel.pto and no longer need cube.pto or hand-written stub.cpp.

Input: single VPTO module with pto.aicore functions containing pto.section.vector and pto.section.cube regions. Output: canonical VPTO container with vector/cube child modules carrying pto.kernel_kind, ready for the existing LLVM/fatobj path. Key steps: 1. Add vpto-split-cv-module to clone the input per core kind, drop the opposite section, and inline the selected section body. 2. Add vpto-normalize-container to wrap single kernel modules and verify the final container shape. 3. Run both passes in the VPTO backend pipeline and convert the mixed cbuf/ubuf SIM case to the section input form.

Input: VPTO and TileLang ST .pto testcases\nOutput: ptoas fatobj validation flow\n\n1. Remove stub.cpp and old ll/repack path\n2. Route TileLang ST and VPTO scripts through fatobj\n3. Normalize testcase inputs to the new module form

* Add HP (HIGH_PRECISION) support for TLog * Refactor tlog_template: extract precision mode into subfunctions Extract HIGH_PRECISION and DEFAULT implementations into separate inline_proc functions for better code organization per review feedback. * fix(tlog): add missing pto.aicore attribute to _hp functions --------- Co-authored-by: caojian5 <caojian5@huawei.com>

Move lit-discovered .pto tests out of test/basic into test/lit/pto or test/lit/vpto, and update related documentation paths. Relocate misplaced VPTO sim cases under the micro-op hierarchy, remove a duplicated nested sim case, and add local guidance for choosing PTOAS test frameworks.

Update VPTO lit tests to inspect explicit VPTO IR with --emit-vpto where they check VPTO-level rewrites. Route *_vpto_llvm tests through the VPTO LLVM lowering dump and check llvm.hivm calls from LLVM dialect MLIR instead of stale VPTO output.

Relocate FileCheck-based VPTO .pto tests from test/vpto to test/lit/vpto so they are discovered by the lit framework. Wrap auto-vecscope tests in a vector kernel submodule to match the current VPTO container form.

* add tcolargmax/min tileops lib implementation * fix tcolargmax/min implementations and tile_buf declaration * add init texp high precision implementation * add texp high precision implementation * remove exp high precision code in math.py(already in custom.py) * add empty lines in math.py * fix texp high precision test case's eps * fix texp test case * move exp hp from custom to exp_hp.py

* Rename VPTO MTE semantic ops Input: existing VPTO memory-transfer semantic ops and tests using dma/cube/acc_store-style names. Output: public semantic ops use mte_<src>_<dst> names, with ptr addrspace aliases for the new memory-space spelling. Key steps: update ODS/parser/lowering/expand/ptr-normalize paths; migrate lit, VPTO, and TileLang test inputs; refresh VPTO/ISA docs and legacy raw-op wrapper mappings. --------- Co-authored-by: mouliangyu <mouliangyu@huawei.com>

Lok and others added 30 commits April 28, 2026 04:36

fix: correct event ID explanation in comparison table

46f1a8a

- set_flag/wait_flag: 2 IDs per buffer (1 forward + 1 reverse pipe-pair) - get_buf/rls_buf: 1 ID per buffer (handles both directions automatically) - 8 per pipe-pair is HW limit, not a formula

fix: clarify event ID management in comparison table

3df56a7

- set_flag/wait_flag: 8 IDs per pipe-pair direction (HW limit) - get_buf/rls_buf: 1 buffer ID per shared resource (HW limit: 32 global), same ID used across all pipelines

fix: simplify event ID explanation and drain example

31540f2

- Event ID mgmt: each buffer occupies 1 ID per direction (removed misleading 4 IDs calc) - Drain example: use concrete EVT_*_0/EVT_*_1 instead of {(N-1)%2} expressions

fix: correct set_flag/wait_flag count in quick example

440925c

- 4 set_flag + 4 wait_flag (not 8) - 4 IDs = 2 pipe-pair directions × 2 ping/pong buffers

fix: add concrete 1:N example for loop peeling comparison

a7e5451

- set_flag/wait_flag: 1 MTE2 load, 8 Vector slices — must peel set/wait outside loop - get_buf/rls_buf: same pattern but acquire/release can stay inside or outside

fix: show get_buf/rls_buf inside scf loop for 1:N example

4d5dac8

- Acquire/release per slice inside loop - Iteration 0 blocks until MTE2 done, iterations 1-7 proceed immediately

refactor vpto llvm emiter

dcd8586

add online softmax (q * k is ready) case

227da17

docs: add VPTO spec v0.3 release draft

dbfe96b

Add the merged v0.3 PTO micro-instruction release spec document for A5, including ISA group references and updated synchronization notes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

add PTO-Gym submodule

a77c8b2

feat: add PTO-Gym guide skill

949a928

Add tilelang dsl implementation

495a02c

Update openspec

f5ee676

Support more syntax in tilelang dsl

47a102f

Update openspec

5ac3dba

tilelang dsl first working version

0fdf4d3

Update openspec

0d8e71e

Support more dsl syntax

5d96301

Amend

0945be0

Update docs

bd86d21

docs: add TileOp expand design and demo

11bc085

brough back the lost pass and pto op after rebase

f640129

feat: tile lang dsl

40b2c93

Update DSL user guide

203ea91

Support more tile attributes

51112bb

Support more event IDs

30dd61a

Align copy_ubuf_to_ubuf

2e63819

Support more load/store ops

f0aeccb

zhangstevenunity and others added 29 commits May 9, 2026 09:29

Merge pull request hw-native-sys#641 from Zhendong404/fix-tmrgsort-fo…

cdfe56c

…rmat2 fix(pto): relax tmrgsort format2 tmp shape check

cleanup(ptoas): drop duplicate pass registration

d482a2a

fix(dsl): support builtin vector operand matching (#340)

fbc34a9

test: add fixpipe cube-cv mixed st case

4ef828c

test: add fixpipe cc-to-gm and cc-to-ub st case

04ae949

test: add fixpipe quant ub cv st case

0cfecb2

feat: add fp_load wrapper op

f972ff3

Signed-off-by: FangRui <fangrui_95@163.com>

feat: fold fpc into acc_store wrappers

21b594a

feat: sync testcase

608a6cb

feat: enable smoke test of tilelang dsl ci

1e930e3

test: switch validation suites to fatobj

26ac78d

Input: VPTO and TileLang ST .pto testcases\nOutput: ptoas fatobj validation flow\n\n1. Remove stub.cpp and old ll/repack path\n2. Route TileLang ST and VPTO scripts through fatobj\n3. Normalize testcase inputs to the new module form

chore: fix PR386 headers for VPTO helpers

954b8c6

Remove legacy VPTO target CLI knobs

ec8a724

Add get_vms4_sr four-result support

1d16d43

Fix VPTO lit checks for fatobj workflow

a04ba13

Update VPTO lit tests to inspect explicit VPTO IR with --emit-vpto where they check VPTO-level rewrites. Route *_vpto_llvm tests through the VPTO LLVM lowering dump and check llvm.hivm calls from LLVM dialect MLIR instead of stale VPTO output.

Move misplaced VPTO lit cases

26a4b0d

Relocate FileCheck-based VPTO .pto tests from test/vpto to test/lit/vpto so they are discovered by the lit framework. Wrap auto-vecscope tests in a vector kernel submodule to match the current VPTO container form.

Fix stale PTO lit cases

7f24b72

feat: fixpipe design

581f380

Add VPTO MAD semantic ops

a26c1e5

Finalize cube semantic docs and ops

3146654

Add VPTO flash attention runtime case

b63f857

mouliangyu force-pushed the feature-vpto-backend branch from 5e223fb to 42b74f9 Compare May 14, 2026 00:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add VPTO flash attention runtime case#365

Add VPTO flash attention runtime case#365
mouliangyu wants to merge 252 commits into
feature-vpto-backendfrom
feature-vpto-flash-attention

mouliangyu commented May 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

18 participants

Conversation

mouliangyu commented May 13, 2026

Summary

Workflow Covered

Validation

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

18 participants