Skip to content

Add VPTO flash attention runtime case#365

Open
mouliangyu wants to merge 252 commits into
feature-vpto-backendfrom
feature-vpto-flash-attention
Open

Add VPTO flash attention runtime case#365
mouliangyu wants to merge 252 commits into
feature-vpto-backendfrom
feature-vpto-flash-attention

Conversation

@mouliangyu
Copy link
Copy Markdown
Owner

Summary

Adds a VPTO runtime Flash Attention case under test/vpto/cases/kernels/flash-attention.

Workflow Covered

  1. Uses a mixed pto.section.cube / pto.section.vector kernel in a single kernel.pto.
  2. Computes Q @ K^T on cube with mte_gm_l1_frac, mte_l1_l0a/l0b, mad, and mte_l0c_ub.
  3. Runs tiled online softmax and V accumulation on vector, then stores the output to GM.
  4. Launches two 16-row blocks for a 32-row input to cover multi-block execution.

Validation

WORK_SPACE=/tmp/pto-vpto-flash CASE_NAME='kernels/flash-attention' \
  test/vpto/scripts/run_host_vpto_validation.sh

Result: All 1 VPTO case(s) passed.

Lok and others added 30 commits April 28, 2026 04:36
- Add detailed mode parameter documentation (mode=0 vs mode=1)
- Add 'Why get_buf/rls_buf is More Programmer-Friendly' section:
  - No manual priming/draining for ping/pong loops
  - No loop peeling for complex/nested loop dependencies
  - Simpler mental model (buffer ID + program order)
- Add quick example comparison showing set_flag overhead vs get_buf simplicity
- Update Example 2 and 3b with explicit mode=0 in code
- Update comparison table with 'Loop peeling' row
- set_flag/wait_flag: 2 IDs per buffer (1 forward + 1 reverse pipe-pair)
- get_buf/rls_buf: 1 ID per buffer (handles both directions automatically)
- 8 per pipe-pair is HW limit, not a formula
- set_flag/wait_flag: 8 IDs per pipe-pair direction (HW limit)
- get_buf/rls_buf: 1 buffer ID per shared resource (HW limit: 32 global), same ID used across all pipelines
- Event ID mgmt: each buffer occupies 1 ID per direction (removed misleading 4 IDs calc)
- Drain example: use concrete EVT_*_0/EVT_*_1 instead of {(N-1)%2} expressions
- 4 set_flag + 4 wait_flag (not 8)
- 4 IDs = 2 pipe-pair directions × 2 ping/pong buffers
- set_flag/wait_flag: 1 MTE2 load, 8 Vector slices — must peel set/wait outside loop
- get_buf/rls_buf: same pattern but acquire/release can stay inside or outside
- Acquire/release per slice inside loop
- Iteration 0 blocks until MTE2 done, iterations 1-7 proceed immediately
Add the merged v0.3 PTO micro-instruction release spec document for A5,
including ISA group references and updated synchronization notes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
    Introduce a three-pass pipeline that lowers PTO tile ops to vector-level
    implementations via TileLang DSL templates:

    - ExpandTileOp: invokes TileLang Python DSL to instantiate template
      functions and replaces tile ops with func.call. SpecKey covers all
      operands; tile_buf operands are passed through without bridging.
    - PTOInlineLibCall: extended to recognize tilelang instance functions via
      the  attribute set by the DSL frontend.
    - FoldTileBufIntrinsics: resolves pto.tile_buf_addr / tile_valid_rows /
      tile_valid_cols, including dynamic valid-shape via pto.bind_tile chain
      tracing.
    - MemrefToTileBuf: recovers tile_buf types from memref + bind_tile
      metadata after PlanMemory/InsertSync.
    - PTOViewToMemref: insert pto.bind_tile anchors for tile_buf function
      args so MemrefToTileBuf can recover them.

    Adds new PTO ops (tile_buf_addr/tile_valid_rows/tile_valid_cols),
    ptoas pipeline wiring, design docs, and unit tests.
zhangstevenunity and others added 29 commits May 9, 2026 09:29
…rmat2

fix(pto): relax tmrgsort format2 tmp shape check
Signed-off-by: FangRui <fangrui_95@163.com>
- Add tilelang-dsl cube surface design and implementation
- Add vpto lowering for cube operations
- Update frontend AST, semantic analysis, and lowering passes
- Add tests for tilelang-dsl v1 cube features
- Add design specs and task documentation
…303)

* feat: enhance verification for multiple PTO operations in A5 architecture

- TLoadOp: add vec dst layout validation (ND/DN/NZ) and shape matching
- TStoreOp: add vec layout consistency and alignment checks
- TAddSOp: enable requireValidRowsEqual validation for A5
- TCvtOp: add A5 type conversion validation
- TPartAddOp/TPartMaxOp/TPartMinOp/TPartMulOp: add verifyPartialValidPattern for A5
- TSelSOp: add mask row-major layout check
- TShlOp/TShrOp: add dst element type validation in verifyShiftLikeBinaryTileOpCommon

* feat: add 10 test cases and update PTO.cpp verification

Test cases added to test/basic:
- tadds_validrow_mismatch.pto
- tcvt_type_combination.pto
- tload_vec_layout_mismatch.pto
- tpartadd_valid_shape.pto
- tpartmax_valid_shape.pto
- tpartmin_valid_shape.pto
- tpartmul_valid_shape.pto
- tshl_dtype_mismatch.pto
- tshr_dtype_mismatch.pto
- tstore_vec_alignment.pto

PTO.cpp updates:
- Simplify TLoadOp NZ validation
- TStoreOp verification refinements

* fix: update TPart*Op test cases and PTO.cpp verification

* fix: update PTO.cpp and remove tcvt_type_combination test

* fix: update PTO.cpp verification logic

* fix: update PTO.cpp and test cases

* fix: update PTO.cpp

* fix: update PTO.cpp verification

* fix: update PTO.cpp

* fix: update PTO.cpp

* fix: update PTO.cpp

* fix: update PTO.cpp

* revert: 回退PTO.cpp校验逻辑到正确版本

回退到commit 0ec558d(最后能通过的版本)

原因:
- 检视意见误解了ISA编译时static_assert vs AS运行时校验的差异
- 盲目照搬ISA规则导致破坏原有设计
- 添加的错误校验导致CI反复失败(5次失败)

删除的错误校验:
- TLoad的ValidCol/shape匹配强制校验
- TLoad的NZ shape[3]/shape[4]强制校验
- TPart的'至少一个src==dst'校验
- TStore过度严格的对齐校验
- fp4特殊处理逻辑

恢复的正确设计:
- TLoad只校验布局类型匹配(ND/DN/NZ)
- TStore标准对齐校验
- TPart只校验dst>=src
- 保留运行时的灵活性和原有设计哲学

* fix: add missing validations for TMinS/TMulS/TMaxS and TPart ops

- TMinS/TMulS/TMaxS: enable requireValidRowsEqualOnA5 to match ISA BinaryInstr behavior
  (ISA uses validRow for loop bounds, mismatch causes out-of-bounds access)
- TPart: add validation that at least one src valid_shape matches dst valid_shape
  (ISA TPartInstr requires this for TPartCopyInstr/TPartOps)
- TStore: improve code readability with explicit dim check and rows/cols naming

* fix: correct TStore validation and revert TPart validation

- TStore: use srcValid (not srcShape) for isSpecialCase and alignment checks
  (valid_shape represents actual data range, shape is tile allocation)
- TPart: remove incorrect 'at least one src == dst' validation
  (ISA allows src0/src1 to be smaller in different dimensions)
- Keep: TMinS/TMulS/TMaxS requireValidRowsEqualOnA5=true (valid fix)

* fix: revert TStore to use srcShape for alignment checks

- TStore alignment check should use srcShape (tile allocation size)
  not srcValid (valid data range)
- Example: cols=256, v_col=255 → 256 satisfies 32-byte alignment, 255 doesn't
- Hardware alignment requirement is based on tile allocation, not data range

* fix: correct indentation for TStore verification code

- Fix missing indentation on line 2285 (int32_t bl declaration)
- Code formatting consistency with surrounding context
- No functional changes, only whitespace fix

* fix: update PTO.cpp

* Fix TPart*Op validation: only use verifyPartialValidPatternLoose for A5

- Added verifyPartialValidPatternLoose function (checks only <= dst, not == dst)
- Modified only A5 branches to use loose validation (4 locations)
- A2A3 branches continue to use strict verifyPartialValidPattern (requires at least one == dst)
- Fixes tpartmin/tpartmax test failures

---------

Co-authored-by: User <user@example.com>
Input: kernel.pto with cube/vector kernel modules, or a single kernel module normalized into that form.

Output: the final VPTO fatobj object at the requested -o path.

Steps:

1. Normalize the VPTO input and lower cube/vector modules to LLVM.

2. Compile cube/vector LLVM IR to device objects and merge them.

3. Generate the host stub internally and repack it with the merged device object into the final fatobj.

Tests: VPTO host validation now links the ptoas-produced fatobj directly; cases use one kernel.pto and no longer need cube.pto or hand-written stub.cpp.
Input: single VPTO module with pto.aicore functions containing pto.section.vector and pto.section.cube regions.

Output: canonical VPTO container with vector/cube child modules carrying pto.kernel_kind, ready for the existing LLVM/fatobj path.

Key steps:

1. Add vpto-split-cv-module to clone the input per core kind, drop the opposite section, and inline the selected section body.

2. Add vpto-normalize-container to wrap single kernel modules and verify the final container shape.

3. Run both passes in the VPTO backend pipeline and convert the mixed cbuf/ubuf SIM case to the section input form.
Input: VPTO and TileLang ST .pto testcases\nOutput: ptoas fatobj validation flow\n\n1. Remove stub.cpp and old ll/repack path\n2. Route TileLang ST and VPTO scripts through fatobj\n3. Normalize testcase inputs to the new module form
* Add HP (HIGH_PRECISION) support for TLog

* Refactor tlog_template: extract precision mode into subfunctions

Extract HIGH_PRECISION and DEFAULT implementations into separate
inline_proc functions for better code organization per review feedback.

* fix(tlog): add missing pto.aicore attribute to _hp functions

---------

Co-authored-by: caojian5 <caojian5@huawei.com>
Move lit-discovered .pto tests out of test/basic into test/lit/pto or test/lit/vpto, and update related documentation paths.

Relocate misplaced VPTO sim cases under the micro-op hierarchy, remove a duplicated nested sim case, and add local guidance for choosing PTOAS test frameworks.
Update VPTO lit tests to inspect explicit VPTO IR with --emit-vpto where they check VPTO-level rewrites.

Route *_vpto_llvm tests through the VPTO LLVM lowering dump and check llvm.hivm calls from LLVM dialect MLIR instead of stale VPTO output.
Relocate FileCheck-based VPTO .pto tests from test/vpto to test/lit/vpto so they are discovered by the lit framework.

Wrap auto-vecscope tests in a vector kernel submodule to match the current VPTO container form.
* add tcolargmax/min tileops lib implementation

* fix tcolargmax/min implementations and tile_buf declaration

* add init texp high precision implementation

* add texp high precision implementation

* remove exp high precision code in math.py(already in custom.py)

* add empty lines in math.py

* fix texp high precision test case's eps

* fix texp test case

* move exp hp from custom to exp_hp.py
* Rename VPTO MTE semantic ops

Input: existing VPTO memory-transfer semantic ops and tests using dma/cube/acc_store-style names.

Output: public semantic ops use mte_<src>_<dst> names, with ptr addrspace aliases for the new memory-space spelling.

Key steps: update ODS/parser/lowering/expand/ptr-normalize paths; migrate lit, VPTO, and TileLang test inputs; refresh VPTO/ISA docs and legacy raw-op wrapper mappings.

---------

Co-authored-by: mouliangyu <mouliangyu@huawei.com>
@mouliangyu mouliangyu force-pushed the feature-vpto-backend branch from 5e223fb to 42b74f9 Compare May 14, 2026 00:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.