diff --git a/docs/PTO_IR_manual.md b/docs/PTO_IR_manual.md index a32de72f7..1050009dc 100644 --- a/docs/PTO_IR_manual.md +++ b/docs/PTO_IR_manual.md @@ -8346,11 +8346,21 @@ frontend/framework generated IR. The detailed design document is: function. - `slot_size` is expressed in bytes and uses the pre-split logical pipe-entry size. +- `slot_num` is an optional compile-time integer attribute on + `pto.aic_initialize_pipe` / `pto.aiv_initialize_pipe`. It controls the pipe + FIFO depth. The `effective_slot_num` is the explicit value when present, or + the default value: `8` for `dir_mask = 1/2` or `4` for `dir_mask = 3`. - `local_slot_num` is an optional compile-time integer attribute on `pto.aic_initialize_pipe` / `pto.aiv_initialize_pipe`. On A2/A3 it overrides the default consumer-side local FIFO slot count only when the pipe uses a local consumer FIFO buffer. Global-only GM FIFO pipes omit it. +- `pto.reserve_buffer.size` is the byte size of the consumer-side local FIFO + buffer. For A2/A3 local FIFO pipes, it should be + `slot_size * effective_local_slot_num`, where `effective_local_slot_num` is + the explicit `local_slot_num` when present or the effective `slot_num` + otherwise. For A5 local FIFO pipes, `local_slot_num` is not configurable and + the reserved byte size should be `slot_size * effective_slot_num`. - `nosplit` is an optional compile-time boolean attribute on `pto.aic_initialize_pipe` / `pto.aiv_initialize_pipe`. - `split` is a compile-time attribute, not a runtime SSA operand. @@ -8369,9 +8379,10 @@ frontend/framework generated IR. The detailed design document is: (`pto.initialize_l2g2l_pipe`). It does not implicitly execute `pto.tstore` or `pto.tload`; callers move data explicitly before `tpush` or after `tpop`. - When every transfer op bound to one pipe id uses a global entry, the pipe is - a global-only GM FIFO. 
Its frontend initialize op carries only - `gm_slot_tensor`; `gm_slot_buffer`, `c2v_consumer_buf`, `v2c_consumer_buf`, `local_slot_num`, - `pto.reserve_buffer`, and `pto.import_reserved_buffer` are not used. + a global-only GM FIFO. Its frontend initialize op carries `gm_slot_tensor` + and may carry `slot_num`; `gm_slot_buffer`, `c2v_consumer_buf`, + `v2c_consumer_buf`, `local_slot_num`, `pto.reserve_buffer`, and + `pto.import_reserved_buffer` are not used. - For global entries, the matched initialize op's `gm_slot_tensor` describes one FIFO slot entry, not the full multi-slot FIFO buffer. Its dtype, shape, stride, and layout must match the `tensor_view` returned by `talloc` / @@ -8441,7 +8452,10 @@ When the address is already fixed in the input IR: **Arguments:** - `name`: string attribute identifying the logical reserved buffer -- `size`: reserved buffer size in bytes +- `size`: reserved buffer size in bytes. For A2/A3 local FIFO pipes this is + `slot_size * effective_local_slot_num`; for A5 local FIFO pipes this is + `slot_size * effective_slot_num`. Global-only GM FIFO pipes do not use + `pto.reserve_buffer`. - `location`: local address-space attribute, typically `vec` or `mat` - `auto`: boolean allocation-mode flag in textual IR - `base`: optional explicit local base address @@ -8452,6 +8466,9 @@ When the address is already fixed in the input IR: - Multiple `pto.reserve_buffer` ops are allowed in one function, but `name` must be unique within that function +- `size` must be greater than `0`; PTOAS allocates exactly the requested byte + size, so it should match the local FIFO sizing rule of the pipe that consumes + this buffer - `location` must be a supported local address space - Op-level verification requires: - `auto = false` must provide `base` @@ -8505,7 +8522,7 @@ this op. 
```mlir // A2/A3 (with GM slot buffer): -pto.aic_initialize_pipe {id = 0, dir_mask = 1, slot_size = 1024, local_slot_num = 1} +pto.aic_initialize_pipe {id = 0, dir_mask = 1, slot_size = 1024, slot_num = 2, local_slot_num = 1} (gm_slot_buffer = %gm_buf : !pto.ptr, c2v_consumer_buf = %c2v_import : i32, v2c_consumer_buf = %c0_i32 : i32) @@ -8529,6 +8546,8 @@ pto.aic_initialize_pipe {id = 0, dir_mask = 1, slot_size = 1024, nosplit = true} the same function - `dir_mask`: communication direction encoding - `slot_size`: logical slot size in bytes +- `slot_num`: optional GM ring FIFO slot count; omitted defaults to `8` for + `dir_mask = 1/2` or `4` for `dir_mask = 3` - `local_slot_num`: optional A2/A3-only local FIFO slot count override for the lowered `pto.initialize_l2g2l_pipe`; omitted for global-only GM FIFO - `nosplit`: optional compile-time boolean controlling no-split pipe mode @@ -8551,12 +8570,16 @@ pto.aic_initialize_pipe {id = 0, dir_mask = 1, slot_size = 1024, nosplit = true} - Must appear in Cube kernels - Multiple `pto.aic_initialize_pipe` ops are allowed in one Cube function, but `id` must be unique among frontend initialize ops in that function +- If `slot_num` is present, it must be greater than `0` - If `local_slot_num` is present, it must be greater than `0` and no greater - than the legacy slot count implied by `dir_mask` - (`8` for `dir_mask = 1/2`, `4` for `dir_mask = 3`) + than the effective `slot_num` +- On A5, `local_slot_num` must be omitted; A5 frontend pipes lower to + `pto.initialize_l2l_pipe`, which does not use a local FIFO slot-count + template parameter. 
Its consumer-side `pto.reserve_buffer.size` should be + `slot_size * effective_slot_num` - A global-only GM FIFO initialize carries only `gm_slot_tensor`; it must not carry `gm_slot_buffer`, `local_slot_num`, `c2v_consumer_buf`, or - `v2c_consumer_buf` + `v2c_consumer_buf`; it may carry `slot_num` - For global-only GM FIFO, `slot_size` must match the byte size of `gm_slot_tensor` - Global-entry `talloc` / `tpush` / `tpop` / `tfree` entry types must match the @@ -8576,7 +8599,7 @@ pto.aic_initialize_pipe {id = 0, dir_mask = 1, slot_size = 1024, nosplit = true} ```mlir // A2/A3 (with GM slot buffer): -pto.aiv_initialize_pipe {id = 0, dir_mask = 1, slot_size = 1024, local_slot_num = 1} +pto.aiv_initialize_pipe {id = 0, dir_mask = 1, slot_size = 1024, slot_num = 2, local_slot_num = 1} (gm_slot_buffer = %gm_buf : !pto.ptr, c2v_consumer_buf = %c2v_local : i32, v2c_consumer_buf = %c0_i32 : i32) diff --git a/docs/designs/ptoas-tpush-tpop-design.md b/docs/designs/ptoas-tpush-tpop-design.md index ffb7eb623..13526419c 100644 --- a/docs/designs/ptoas-tpush-tpop-design.md +++ b/docs/designs/ptoas-tpush-tpop-design.md @@ -380,7 +380,14 @@ func.func @vector_kernel(%gm_slot_buffer : !pto.ptr, - 单函数允许多条 `import_reserved_buffer` - `DIR_MASK` 只允许 `1`、`2`、`3` - `SLOT_SIZE > 0` -- 使用 consumer 侧 local FIFO buffer 时,`reserve_buffer.size == SLOT_SIZE * SLOT_NUM` +- 使用 consumer 侧 local FIFO buffer 时,`reserve_buffer.size` 表示该 + consumer FIFO 实际预留的本地字节数。A2/A3 GM FIFO 路径要求 + `reserve_buffer.size == SLOT_SIZE * EFFECTIVE_LOCAL_SLOT_NUM`,其中 + `EFFECTIVE_LOCAL_SLOT_NUM` 为显式 `local_slot_num`,缺省时为有效 + `slot_num`。A5 L2L 路径不支持 `local_slot_num`,要求 + `reserve_buffer.size == SLOT_SIZE * EFFECTIVE_SLOT_NUM`。这里的 + `EFFECTIVE_SLOT_NUM` 为显式 `slot_num`,缺省时 `DIR_MASK=1/2` 为 `8`、 + `DIR_MASK=3` 为 `4` - 使用 consumer 侧 local FIFO buffer 时,C2V consumer 的 `reserve_buffer.location` 必须是 `VEC` - 使用 consumer 侧 local FIFO buffer 时,V2C consumer 的 `reserve_buffer.location` 必须是 `MAT` - `reserve_buffer.name` 在本函数内必须唯一 @@ 
-390,7 +397,7 @@ func.func @vector_kernel(%gm_slot_buffer : !pto.ptr, - 启用 local address planning 的编译流程:`reserve_buffer` 只允许 `auto = true` - 跳过 local address planning 的编译流程:`reserve_buffer` 只允许 `auto = false` 且显式提供 `base` - `import_reserved_buffer` 必须能在 `peer_func` 中找到同名 `reserve_buffer` -- global-only GM FIFO 的 initialize 只提供 `gm_slot_tensor`,不提供 `gm_slot_buffer`、`local_slot_num`、`c2v_consumer_buf`、`v2c_consumer_buf`,且不要求成对的 `reserve_buffer` / `import_reserved_buffer` +- global-only GM FIFO 的 initialize 只提供 `gm_slot_tensor`(可附带 `slot_num`),不提供 `gm_slot_buffer`、`local_slot_num`、`c2v_consumer_buf`、`v2c_consumer_buf`,且不要求成对的 `reserve_buffer` / `import_reserved_buffer` ## 4. 核心约定 @@ -515,8 +522,10 @@ DIR_BOTH 示例: `pto.aic_initialize_pipe` / `pto.aiv_initialize_pipe` 提供并在 A2/A3 lowering 时转发 - 表示 GM 路径下 consumer 侧 local slot buffer 的槽数,仅在存在 local FIFO buffer 的 tile-entry 路径有意义 - 仅在通过 GM 传递时对底层 `TPipe` 模板参数有意义,不改变 GM FIFO 的 `slot_num` + - A2/A3 consumer 侧 `reserve_buffer.size` 应按 + `slot_size * effective_local_slot_num` 预留 - 存在 local FIFO buffer 且缺省时,默认值等于该内部 pipe 的 `slot_num` - - 因此当前固定规则下: + - 因此前端未显式指定 `slot_num` 时: - `DIR_MASK=1/2` 直接 lowering 时,`local_slot_num = 8` - `DIR_MASK=3` 单条 DIR_BOTH pipe,`local_slot_num = 4` - global-only GM FIFO 不携带 `local_slot_num` @@ -658,20 +667,24 @@ pto.tfree(%entry, %pipe : !pto.tensor_view<128x512xf32>, !pto.pipe) {split = 0} #### A2/A3 - `pto.aic_initialize_pipe` 和 `pto.aiv_initialize_pipe` lower 为 `pto.initialize_l2g2l_pipe` -- 若前端 init 只提供 `gm_slot_tensor`,则 lower 为只携带 `gm_slot_tensor` 的 global-only GM FIFO;不补 `local_slot_num`,不生成 local consumer address operand,也不依赖 `reserve_buffer` / `import_reserved_buffer` +- 若前端 init 只提供 `gm_slot_tensor`(可附带 `slot_num`),则 lower 为只携带 `gm_slot_tensor` 的 global-only GM FIFO;不补 `local_slot_num`,不生成 local consumer address operand,也不依赖 `reserve_buffer` / `import_reserved_buffer` - 若前端提供了 consumer 侧 local FIFO buffer,且提供了 `local_slot_num`,则直接转发到 lowered `pto.initialize_l2g2l_pipe` -- 若前端提供了 consumer 
侧 local FIFO buffer 但未提供更具体信息,lowering 默认补上 `local_slot_num = slot_num` +- 若前端提供了 consumer 侧 local FIFO buffer 但未提供 `local_slot_num`,lowering 默认补上 `local_slot_num = slot_num` #### A5 - `pto.aic_initialize_pipe` 和 `pto.aiv_initialize_pipe` lower 为 `pto.initialize_l2l_pipe` +- A5 不支持 `local_slot_num`;前端 init 若显式携带该属性,verifier 会报错 +- A5 的 consumer 侧 `reserve_buffer.size` 不由 `local_slot_num` 决定;A5 + L2L pipe 本地 FIFO 地址按 `slot_num` 取模,按 + `slot_size * effective_slot_num` 预留本地 FIFO buffer ### 6.2 `DIR_MASK=1/2` - 只生成一条内部 pipe -- `slot_num = 8` -- 对带 consumer 侧 local FIFO buffer 的 `initialize_l2g2l_pipe`,默认 `local_slot_num = 8` +- `slot_num` 缺省为 `8`,也可由前端显式指定 +- 对带 consumer 侧 local FIFO buffer 的 `initialize_l2g2l_pipe`,默认 `local_slot_num = slot_num` - 若前端显式提供 `local_slot_num`,则使用显式值 - global-only GM FIFO 不携带 `local_slot_num`,地址/descriptor 操作数只有 `gm_slot_tensor` @@ -679,8 +692,8 @@ pto.tfree(%entry, %pipe : !pto.tensor_view<128x512xf32>, !pto.pipe) {split = 0} 前端一个 init op 生成**单条** DIR_BOTH 内部 pipe: -- `%pipe`:`dir_mask = 3`,`slot_num = 4` -- 若 lowering 为带 consumer 侧 local FIFO buffer 的 `initialize_l2g2l_pipe`,默认 `local_slot_num = 4` +- `%pipe`:`dir_mask = 3`,`slot_num` 缺省为 `4`,也可由前端显式指定 +- 若 lowering 为带 consumer 侧 local FIFO buffer 的 `initialize_l2g2l_pipe`,默认 `local_slot_num = slot_num` - 若前端显式提供 `local_slot_num`,则使用显式值 地址选择规则: @@ -977,13 +990,16 @@ pass 在模块级按两步执行: ### 9.1 前端 verifier -前端 verifier 负责检查: +前端 IR 需满足以下约束: - 每个函数 init op 数量是否合法 - 每个函数 `reserve_buffer` / `import_reserved_buffer` 数量是否合法 - `DIR_MASK` 取值是否合法 - `SLOT_SIZE > 0` -- 使用 consumer 侧 local FIFO buffer 时,`reserve_buffer.size == SLOT_SIZE * SLOT_NUM` +- 使用 consumer 侧 local FIFO buffer 时,`reserve_buffer.size` 必须匹配对应 + pipe 的本地 FIFO 字节数:A2/A3 GM FIFO 路径为 + `SLOT_SIZE * EFFECTIVE_LOCAL_SLOT_NUM`,A5 L2L 路径为 + `SLOT_SIZE * EFFECTIVE_SLOT_NUM` - 使用 consumer 侧 local FIFO buffer 时,`reserve_buffer.location` 与 consumer 函数类型匹配 - `reserve_buffer.name` 在函数内唯一 - `import_reserved_buffer` 的 `(name, peer_func)` 在函数内唯一 @@ 
-995,7 +1011,7 @@ pass 在模块级按两步执行: - 方向相关 op 只能出现在合法 kernel 中 - 前端数据传输 op 的 `split` 必须是合法的编译期常量属性 - `global` entry 形式的 `talloc_to_*` / `tpush_to_*` / `tpop_from_*` / `tfree_from_*` 只能绑定到 GM FIFO pipe(A2/A3 `initialize_l2g2l_pipe` 路径) -- 绑定到 global-only GM FIFO 的 initialize 只允许携带 `gm_slot_tensor`,不得携带 `gm_slot_buffer`、`local_slot_num`、`c2v_consumer_buf`、`v2c_consumer_buf`;该路径不要求 `reserve_buffer` / `import_reserved_buffer` +- 绑定到 global-only GM FIFO 的 initialize 只允许携带 `gm_slot_tensor`(可附带 `slot_num`),不得携带 `gm_slot_buffer`、`local_slot_num`、`c2v_consumer_buf`、`v2c_consumer_buf`;该路径不要求 `reserve_buffer` / `import_reserved_buffer` - `gm_slot_tensor` 本身描述单个 slot entry;其字节数必须匹配 `slot_size` - `talloc_to_*` / `tpop_from_*` 返回的 `tensor_view` 类型必须匹配 `gm_slot_tensor` - `global` entry 的 dtype、shape 与 stride/layout 必须足以生成底层 `GlobalTensor` 类型 @@ -1008,11 +1024,12 @@ pass 在模块级按两步执行: 内部 verifier 负责检查: - `slot_size > 0` -- `slot_num` 只允许 `8` 或 `4` -- `DIR_MASK=1/2` 时,`slot_num` 必须与单向/双向 lowering 规则一致 +- `slot_num >= 1` +- legacy 前端 `pto.aic_initialize_pipe` / `pto.aiv_initialize_pipe` 可显式提供 + `slot_num`;缺省时 `DIR_MASK=1/2` 使用 `8`,`DIR_MASK=3` 使用 `4` - `local_slot_num` 若出现,可出现在 `pto.initialize_l2g2l_pipe` 或 legacy 前端 `pto.aic_initialize_pipe` / `pto.aiv_initialize_pipe` 上,且必须大于 `0` - 且不大于其对应 lowering 规则下的 `slot_num`;global-only GM FIFO 不携带 `local_slot_num` + 且不大于其有效 `slot_num`;A5 和 global-only GM FIFO 不携带 `local_slot_num` - `flag_base` 若出现,必须满足基本合法性;是否已填写以及具体分配值由 flag 分配保证 - `pto.initialize_l2g2l_pipe` 必须提供 `gm_addr` 或 `gm_slot_tensor`;只有存在 consumer 侧 local FIFO buffer 时才提供 `local_addr` / `peer_local_addr` - `pto.initialize_l2l_pipe` 必须提供 `local_addr` diff --git a/include/PTO/IR/PTOOps.td b/include/PTO/IR/PTOOps.td index 344b399a6..5a7dd18f5 100644 --- a/include/PTO/IR/PTOOps.td +++ b/include/PTO/IR/PTOOps.td @@ -1505,6 +1505,7 @@ def AicInitializePipeOp : PTO_Op<"aic_initialize_pipe", DefaultValuedOptionalAttr:$id, I8Attr:$dir_mask, I32Attr:$slot_size, + OptionalAttr:$slot_num, 
OptionalAttr:$local_slot_num, OptionalAttr:$nosplit, Optional:$gm_slot_buffer, @@ -1526,6 +1527,7 @@ def AivInitializePipeOp : PTO_Op<"aiv_initialize_pipe", DefaultValuedOptionalAttr:$id, I8Attr:$dir_mask, I32Attr:$slot_size, + OptionalAttr:$slot_num, OptionalAttr:$local_slot_num, OptionalAttr:$nosplit, Optional:$gm_slot_buffer, diff --git a/lib/PTO/IR/PTO.cpp b/lib/PTO/IR/PTO.cpp index 5427ac36e..37d0dd8e3 100644 --- a/lib/PTO/IR/PTO.cpp +++ b/lib/PTO/IR/PTO.cpp @@ -11457,6 +11457,7 @@ static ParseResult parseFrontendInitializePipeOp(OpAsmParser &parser, bool sawId = false; bool sawDirMask = false; bool sawSlotSize = false; + bool sawSlotNum = false; bool sawLocalSlotNum = false; bool sawNoSplit = false; @@ -11495,6 +11496,15 @@ static ParseResult parseFrontendInitializePipeOp(OpAsmParser &parser, "slot_size", attrs)) return failure(); sawSlotSize = true; + } else if (keyword == "slot_num") { + if (sawSlotNum) + return parser.emitError(parser.getCurrentLocation(), + "duplicate 'slot_num' clause"); + IntegerAttr slotNumAttr; + if (parser.parseAttribute(slotNumAttr, parser.getBuilder().getI32Type(), + "slot_num", attrs)) + return failure(); + sawSlotNum = true; } else if (keyword == "local_slot_num") { if (sawLocalSlotNum) return parser.emitError(parser.getCurrentLocation(), @@ -11632,6 +11642,8 @@ static void printFrontendInitializePipeOp(InitOpT op, OpAsmPrinter &p) { printClause("id", op.getId()); printClause("dir_mask", static_cast(op.getDirMask())); printClause("slot_size", op.getSlotSize()); + if (auto slotNumAttr = op.getSlotNumAttr()) + printClause("slot_num", slotNumAttr.getInt()); if (auto localSlotNumAttr = op.getLocalSlotNumAttr()) printClause("local_slot_num", localSlotNumAttr.getInt()); if (auto noSplitAttr = op.getNosplitAttr()) @@ -11658,7 +11670,8 @@ static void printFrontendInitializePipeOp(InitOpT op, OpAsmPrinter &p) { p << ")"; p.printOptionalAttrDict( op->getAttrs(), - /*elidedAttrs=*/{"id", "dir_mask", "slot_size", "local_slot_num", + 
/*elidedAttrs=*/{"id", "dir_mask", "slot_size", "slot_num", + "local_slot_num", "nosplit", "operandSegmentSizes"}); } @@ -11744,6 +11757,13 @@ static LogicalResult verifyFrontendInitCommon(InitOpT op, return op.emitOpError("expects 'dir_mask' to be 1, 2, or 3"); if (op.getSlotSize() <= 0) return op.emitOpError("expects 'slot_size' to be greater than 0"); + int32_t slotNum = dirMask == 3 ? 4 : 8; + if (auto slotNumAttr = op.getSlotNumAttr()) { + slotNum = slotNumAttr.getInt(); + if (slotNum <= 0) + return op.emitOpError("expects 'slot_num' to be greater than 0"); + } + PTOArch arch = getTargetArch(op.getOperation()); bool hasGlobalSlotTensor = static_cast(op.getGmSlotTensor()); bool hasC2vConsumerBuf = static_cast(op.getC2vConsumerBuf()); @@ -11757,7 +11777,7 @@ static LogicalResult verifyFrontendInitCommon(InitOpT op, if (op.getLocalSlotNumAttr()) return op.emitOpError( "globaltensor pipe init does not use 'local_slot_num'"); - if (getTargetArch(op.getOperation()) == PTOArch::A5) { + if (arch == PTOArch::A5) { return op.emitOpError( "globaltensor pipe entries are supported for a2/a3 l2g2l pipes"); } @@ -11776,14 +11796,16 @@ static LogicalResult verifyFrontendInitCommon(InitOpT op, } if (auto localSlotNumAttr = op.getLocalSlotNumAttr()) { + if (arch == PTOArch::A5) + return op.emitOpError( + "'local_slot_num' is only supported for a2/a3 frontend pipe lowering"); int32_t localSlotNum = localSlotNumAttr.getInt(); if (localSlotNum <= 0) return op.emitOpError("expects 'local_slot_num' to be greater than 0"); - int32_t loweredSlotNum = dirMask == 3 ? 
4 : 8; - if (localSlotNum > loweredSlotNum) { + if (localSlotNum > slotNum) { return op.emitOpError() - << "expects 'local_slot_num' to be less than or equal to " - << loweredSlotNum << " for dir_mask = " << static_cast(dirMask); + << "expects 'local_slot_num' to be less than or equal to slot_num (" + << slotNum << ") for dir_mask = " << static_cast(dirMask); } } @@ -12060,8 +12082,8 @@ static LogicalResult verifyPipeShape(Operation *op, int8_t dirMask, int32_t slot return op->emitOpError("expects 'dir_mask' to be 1, 2, or 3"); if (slotSize <= 0) return op->emitOpError("expects 'slot_size' to be greater than 0"); - if (slotNum != 4 && slotNum != 8) - return op->emitOpError("expects 'slot_num' to be 4 or 8"); + if (slotNum <= 0) + return op->emitOpError("expects 'slot_num' to be greater than 0"); if (flagBase && *flagBase < 0) return op->emitOpError("expects 'flag_base' to be non-negative when present"); if (flagBase) { diff --git a/lib/PTO/Transforms/GraphSyncSolver/SyncSolver.cpp b/lib/PTO/Transforms/GraphSyncSolver/SyncSolver.cpp index 23a4032a6..e4c9ff8e3 100644 --- a/lib/PTO/Transforms/GraphSyncSolver/SyncSolver.cpp +++ b/lib/PTO/Transforms/GraphSyncSolver/SyncSolver.cpp @@ -128,7 +128,8 @@ bool Solver::checkSkipParallelLoop(Occurrence *occ1, Occurrence *occ2) { auto [parOcc1, parOcc2] = Occurrence::getLCAPair(occ1, occ2); assert(parOcc1 != nullptr && parOcc2 != nullptr); auto *parentLCALoopOcc = Occurrence::getParentloop(parOcc1); - assert(parentLCALoopOcc != nullptr); + if (parentLCALoopOcc == nullptr) + return false; auto *parentLCALoopOp = llvm::cast(parentLCALoopOcc->op); return parentLCALoopOp->isParallel; } diff --git a/lib/PTO/Transforms/PTOLowerFrontendPipeOpsPass.cpp b/lib/PTO/Transforms/PTOLowerFrontendPipeOpsPass.cpp index 162e7e9b5..e54047b09 100644 --- a/lib/PTO/Transforms/PTOLowerFrontendPipeOpsPass.cpp +++ b/lib/PTO/Transforms/PTOLowerFrontendPipeOpsPass.cpp @@ -65,6 +65,15 @@ static void propagateFrontendIdAttr(InitOpT initOp, Operation 
*pipeOp, rewriter.getI32IntegerAttr(initOp.getId())); } +template +static int32_t getFrontendSlotNum(InitOpT initOp) { + if (auto slotNumAttr = initOp.getSlotNumAttr()) + return slotNumAttr.getInt(); + return initOp.getDirMask() == kBidirectionalDirMask + ? kBidirectionalSlotNum + : kSingleDirectionSlotNum; +} + static std::optional getStaticIndexLikeValue(Value value) { if (auto cst = value.getDefiningOp()) return cst.value(); @@ -166,9 +175,10 @@ static FailureOr lowerSingleDirectionFrontendInit(InitOpT initOp, IRRewriter &rewriter, PTOArch arch, Type pipeTy, int8_t dirMask, Value localAddr) { + int32_t slotNum = getFrontendSlotNum(initOp); auto pipeOr = - createFrontendPipe(initOp, rewriter, arch, pipeTy, dirMask, - kSingleDirectionSlotNum, localAddr); + createFrontendPipe(initOp, rewriter, arch, pipeTy, dirMask, slotNum, + localAddr); if (failed(pipeOr)) return failure(); @@ -190,9 +200,9 @@ template static FailureOr lowerBidirectionalFrontendInit(InitOpT initOp, IRRewriter &rewriter, PTOArch arch, Type pipeTy) { + int32_t slotNum = getFrontendSlotNum(initOp); auto pipeOr = createFrontendPipe(initOp, rewriter, arch, pipeTy, - kBidirectionalDirMask, - kBidirectionalSlotNum, + kBidirectionalDirMask, slotNum, initOp.getC2vConsumerBuf(), initOp.getV2cConsumerBuf()); if (failed(pipeOr)) diff --git a/test/lit/pto/tpush_tpop_frontend_local_slot_num_a5_invalid.pto b/test/lit/pto/tpush_tpop_frontend_local_slot_num_a5_invalid.pto new file mode 100644 index 000000000..06b9d25a3 --- /dev/null +++ b/test/lit/pto/tpush_tpop_frontend_local_slot_num_a5_invalid.pto @@ -0,0 +1,14 @@ +// RUN: not ptoas --pto-arch=a5 %s 2>&1 | FileCheck %s + +module { + func.func @cube_kernel() + attributes {pto.kernel_kind = #pto.kernel_kind} { + %c0_i32 = arith.constant 0 : i32 + pto.aic_initialize_pipe {id = 0, dir_mask = 1, slot_size = 1024, local_slot_num = 1} + (c2v_consumer_buf = %c0_i32 : i32, + v2c_consumer_buf = %c0_i32 : i32) + return + } +} + +// CHECK: error: 'pto.aic_initialize_pipe' 
op 'local_slot_num' is only supported for a2/a3 frontend pipe lowering diff --git a/test/lit/pto/tpush_tpop_frontend_local_slot_num_invalid.pto b/test/lit/pto/tpush_tpop_frontend_local_slot_num_invalid.pto index 67084f2ac..6f0f76cc6 100644 --- a/test/lit/pto/tpush_tpop_frontend_local_slot_num_invalid.pto +++ b/test/lit/pto/tpush_tpop_frontend_local_slot_num_invalid.pto @@ -12,4 +12,4 @@ module { } } -// CHECK: error: 'pto.aic_initialize_pipe' op expects 'local_slot_num' to be less than or equal to 4 for dir_mask = 3 +// CHECK: error: 'pto.aic_initialize_pipe' op expects 'local_slot_num' to be less than or equal to slot_num (4) for dir_mask = 3 diff --git a/test/lit/pto/tpush_tpop_frontend_slot_num_a3.pto b/test/lit/pto/tpush_tpop_frontend_slot_num_a3.pto new file mode 100644 index 000000000..4195e7544 --- /dev/null +++ b/test/lit/pto/tpush_tpop_frontend_slot_num_a3.pto @@ -0,0 +1,49 @@ +// RUN: ptoas --pto-arch=a3 %s 2>&1 | FileCheck %s --check-prefix=A3 + +module { + func.func @cube_kernel(%gm_slot_buffer: !pto.ptr) + attributes {pto.kernel_kind = #pto.kernel_kind} { + %c0_i32 = arith.constant 0 : i32 + %v2c_local = pto.reserve_buffer { + name = "v2c_fifo", + size = 2048, + location = #pto.address_space, + auto = true + } -> i32 + pto.aic_initialize_pipe {id = 0, dir_mask = 2, slot_size = 1024, slot_num = 2} + (gm_slot_buffer = %gm_slot_buffer : !pto.ptr, + c2v_consumer_buf = %c0_i32 : i32, + v2c_consumer_buf = %v2c_local : i32) + + %recv_tile = pto.tpop_from_aiv {id = 0, split = 0} + -> !pto.tile_buf + pto.tfree_from_aiv {id = 0, split = 0} + return + } + + func.func @vector_kernel(%gm_slot_buffer: !pto.ptr) + attributes {pto.kernel_kind = #pto.kernel_kind} { + %c0_i32 = arith.constant 0 : i32 + %v2c_import = pto.import_reserved_buffer { + name = "v2c_fifo", + peer_func = @cube_kernel + } -> i32 + pto.aiv_initialize_pipe {id = 0, dir_mask = 2, slot_size = 1024, slot_num = 2} + (gm_slot_buffer = %gm_slot_buffer : !pto.ptr, + c2v_consumer_buf = %c0_i32 : i32, + 
v2c_consumer_buf = %v2c_import : i32) + + %vec_tile = pto.alloc_tile : !pto.tile_buf + pto.tpush_to_aic(%vec_tile : !pto.tile_buf) {id = 0, split = 0} + return + } +} + +// A3-LABEL: AICORE void cube_kernel(__gm__ float* +// A3: auto {{v[0-9]+}} = TPipe<0, Direction::DIR_V2C, 1024, 2, 2, true>( +// A3: TPOP +// A3: TFREE, TileSplitAxis::TILE_NO_SPLIT>( + +// A3-LABEL: AICORE void vector_kernel(__gm__ float* +// A3: auto {{v[0-9]+}} = TPipe<0, Direction::DIR_V2C, 1024, 2, 2, true>( +// A3: TPUSH diff --git a/test/lit/pto/tpush_tpop_frontend_slot_num_invalid.pto b/test/lit/pto/tpush_tpop_frontend_slot_num_invalid.pto new file mode 100644 index 000000000..7648f291b --- /dev/null +++ b/test/lit/pto/tpush_tpop_frontend_slot_num_invalid.pto @@ -0,0 +1,15 @@ +// RUN: not ptoas --pto-arch=a3 %s 2>&1 | FileCheck %s + +module { + func.func @cube_kernel(%gm_slot_buffer: !pto.ptr) + attributes {pto.kernel_kind = #pto.kernel_kind} { + %c0_i32 = arith.constant 0 : i32 + pto.aic_initialize_pipe {id = 0, dir_mask = 1, slot_size = 1024, slot_num = 0} + (gm_slot_buffer = %gm_slot_buffer : !pto.ptr, + c2v_consumer_buf = %c0_i32 : i32, + v2c_consumer_buf = %c0_i32 : i32) + return + } +} + +// CHECK: error: 'pto.aic_initialize_pipe' op expects 'slot_num' to be greater than 0 diff --git a/test/lit/pto/tpush_tpop_frontend_slot_num_local_invalid.pto b/test/lit/pto/tpush_tpop_frontend_slot_num_local_invalid.pto new file mode 100644 index 000000000..3f7a3da25 --- /dev/null +++ b/test/lit/pto/tpush_tpop_frontend_slot_num_local_invalid.pto @@ -0,0 +1,15 @@ +// RUN: not ptoas --pto-arch=a3 %s 2>&1 | FileCheck %s + +module { + func.func @cube_kernel(%gm_slot_buffer: !pto.ptr) + attributes {pto.kernel_kind = #pto.kernel_kind} { + %c0_i32 = arith.constant 0 : i32 + pto.aic_initialize_pipe {id = 0, dir_mask = 2, slot_size = 1024, slot_num = 2, local_slot_num = 3} + (gm_slot_buffer = %gm_slot_buffer : !pto.ptr, + c2v_consumer_buf = %c0_i32 : i32, + v2c_consumer_buf = %c0_i32 : i32) + return + } +} 
+ +// CHECK: error: 'pto.aic_initialize_pipe' op expects 'local_slot_num' to be less than or equal to slot_num (2) for dir_mask = 2 diff --git a/test/lit/pto/tpush_tpop_internal_slot_num_a3.pto b/test/lit/pto/tpush_tpop_internal_slot_num_a3.pto new file mode 100644 index 000000000..2d314ad3f --- /dev/null +++ b/test/lit/pto/tpush_tpop_internal_slot_num_a3.pto @@ -0,0 +1,20 @@ +// RUN: ptoas --pto-arch=a3 %s 2>&1 | FileCheck %s --check-prefix=A3 + +module { + func.func @cube_kernel(%gm_slot_buffer: memref<256xf32, #pto.address_space>, + %c2v_consumer_buf: i32) + attributes {pto.kernel_kind = #pto.kernel_kind} { + %pipe = pto.initialize_l2g2l_pipe { + dir_mask = 1, + slot_size = 1024, + slot_num = 2, + local_slot_num = 1, + flag_base = 0 + }(%gm_slot_buffer : memref<256xf32, #pto.address_space>, + %c2v_consumer_buf : i32) -> !pto.pipe + return + } +} + +// A3-LABEL: AICORE void cube_kernel( +// A3: auto {{v[0-9]+}} = TPipe<0, Direction::DIR_C2V, 1024, 2, 1, false>(
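
Reviewer note: the sizing rule this patch documents for `pto.reserve_buffer.size` can be summarized in a few lines. The following is a minimal sketch, not code from the patch. The names `Arch`, `defaultSlotNum`, `effectiveSlotNum`, and `expectedReserveBytes` are hypothetical helpers introduced here for illustration. It assumes the defaults stated in the docs (`8` for `dir_mask = 1/2`, `4` for `dir_mask = 3`) and that A5 never consults `local_slot_num`, since the verifier rejects it there.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical stand-in for the target architecture; not the real PTOArch enum.
enum class Arch { A2A3, A5 };

// Default FIFO depth when `slot_num` is omitted: 4 for dir_mask = 3, else 8.
constexpr int32_t defaultSlotNum(int dirMask) { return dirMask == 3 ? 4 : 8; }

// Effective slot_num: the explicit attribute when present, else the default.
constexpr int32_t effectiveSlotNum(int dirMask, std::optional<int32_t> slotNum) {
  return slotNum.value_or(defaultSlotNum(dirMask));
}

// Expected consumer-side local FIFO size in bytes, per the documented rule:
//   A2/A3: slot_size * (local_slot_num if present, else effective slot_num)
//   A5:    slot_size * effective slot_num (local_slot_num is ignored here
//          because the verifier rejects it on A5)
constexpr int64_t expectedReserveBytes(Arch arch, int dirMask, int32_t slotSize,
                                       std::optional<int32_t> slotNum,
                                       std::optional<int32_t> localSlotNum) {
  const int32_t slots = effectiveSlotNum(dirMask, slotNum);
  if (arch == Arch::A5)
    return static_cast<int64_t>(slotSize) * slots;
  return static_cast<int64_t>(slotSize) * localSlotNum.value_or(slots);
}
```

Under these assumptions, the new `tpush_tpop_frontend_slot_num_a3.pto` test (`dir_mask = 2`, `slot_size = 1024`, `slot_num = 2`, no `local_slot_num`) yields 2048 bytes, which agrees with the `size = 2048` it passes to `pto.reserve_buffer`.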