Pythonic interface and JIT compiler for PTO-ISA
PTO-DSL provides a programming abstraction similar to cuTile, but native to NPU.
Key features:
- Automatic software pipelining without manual synchronization
- Easy interfacing with torch-npu
- Lightweight, open-source compiler stack using PTO Assembler
See docker/README.md for full reproducible dependencies on NPU.
Then, install this lightweight DSL package itself:
# install latest commit
pip install git+https://github.com/huawei-csl/pto-dsl.git
# or stable tag
pip install git+https://github.com/huawei-csl/pto-dsl.git@0.1.0
For in-place development:
git clone https://github.com/huawei-csl/pto-dsl.git
cd pto-dsl
pip install -e .
PTO-DSL aims for low-level, explicit, NPU-native primitives that can match the performance of programming with hardware intrinsics. Compared to other (also very good) kernel programming frameworks, it has a slightly different scope by design:
- vs tilelang-ascend: tilelang can also use PTO-ISA as a codegen backend. PTO-DSL intentionally exposes lower-level control: for example, L2 swizzling is a one-liner T.use_swizzle in tilelang, but is a user-defined custom function in PTO-DSL (see this matmul optimization example). Once PTO-DSL is more stable, it might serve as a component like the CuteDSL backend for tilelang.
- vs triton-ascend: Both frameworks automate software pipelining based on MLIR dialects for NPU. PTO-DSL exposes more of the NPU-native memory hierarchy, such as L0/L1/UB. Also, pto.load/pto.store always map to native, efficient DMA instructions, while tl.load/tl.store try to do GPU-style memory coalescing.
- vs Catlass: Catlass provides expert-optimized template collections, while PTO-DSL is more like the CuteDSL layer of Cutlass, offering explicit low-level primitives.
- vs PyPTO: PyPTO is a full MPMD dynamic runtime stack, which also uses PTO-ISA as its lowest-level primitive. PyPTO's Tensor API abstraction is closer to the PyTorch/JAX level, while a PTO-DSL kernel is still SPMD and is closer to the CuTile/CuteDSL level.
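To make the "user-defined custom function" point about L2 swizzling concrete, here is a minimal sketch of what such a function might compute. It is plain Python and does not use any PTO-DSL API. The name swizzle_block_index, its parameters, and the grouped-ordering scheme are assumptions for illustration; the matmul optimization example referenced above may use a different mapping.

```python
def swizzle_block_index(pid, num_blocks_m, num_blocks_n, group_m):
    """Map a linear block id to a (block_m, block_n) output-tile coordinate.

    Hypothetical helper, not part of PTO-DSL. Blocks are visited in groups
    of `group_m` tile rows, column by column inside each group, so that
    consecutively scheduled blocks reuse the same input tiles and are more
    likely to hit in L2.
    """
    blocks_per_group = group_m * num_blocks_n
    group_id = pid // blocks_per_group
    first_m = group_id * group_m
    # The last group is shorter when num_blocks_m is not a multiple of group_m.
    rows_in_group = min(num_blocks_m - first_m, group_m)
    local = pid % blocks_per_group
    block_m = first_m + local % rows_in_group
    block_n = local // rows_in_group
    return block_m, block_n


def launch_order(num_blocks_m, num_blocks_n, group_m):
    """Return the tile coordinates in the order the blocks would visit them."""
    total = num_blocks_m * num_blocks_n
    return [
        swizzle_block_index(pid, num_blocks_m, num_blocks_n, group_m)
        for pid in range(total)
    ]
```

In a real kernel, the block id would come from the launch grid and the returned coordinates would select which tiles to load. Writing this as a custom function means the mapping can be tuned per kernel, at the cost of writing it by hand.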