git clone <repo-url>
cd simpler
The pto-isa dependency will be automatically cloned when you first run an example that needs it.
The pto-isa repository provides header files needed for kernel compilation on the a2a3 (hardware) platform.
The test framework automatically handles PTO_ISA_ROOT setup:
- Checks if PTO_ISA_ROOT is already set
- If not, clones pto-isa to examples/scripts/_deps/pto-isa on first run
- Passes the resolved path to the kernel compiler
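The three steps above can be sketched in a few lines of Python. This is only an illustration of the described behavior, not the framework's actual code: the function name `resolve_pto_isa_root` and the injected `clone` callback are hypothetical, and only the `PTO_ISA_ROOT` variable and the default directory come from this README.

```python
from pathlib import Path
from typing import Callable, Mapping

# Default checkout location, as documented above.
DEFAULT_PTO_ISA_DIR = Path("examples/scripts/_deps/pto-isa")


def resolve_pto_isa_root(
    env: Mapping[str, str],
    clone: Callable[[Path], None],
    default_dir: Path = DEFAULT_PTO_ISA_DIR,
) -> Path:
    """Return the pto-isa path, cloning into default_dir only when needed.

    `clone` is a hypothetical callback that performs the actual git clone.
    """
    # Step 1: an explicitly set PTO_ISA_ROOT always wins.
    explicit = env.get("PTO_ISA_ROOT")
    if explicit:
        return Path(explicit)
    # Step 2: clone on first run; later runs reuse the existing checkout.
    if not default_dir.exists():
        default_dir.parent.mkdir(parents=True, exist_ok=True)
        clone(default_dir)
    # Step 3: the caller passes this path on to the kernel compiler.
    return default_dir
```

A caller would typically pass `os.environ` as `env`. Injecting the clone step keeps the lookup logic testable without network access.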
Automatic Setup (Recommended): Just run your example - pto-isa will be cloned automatically on first run:
python examples/scripts/run_example.py -k examples/a2a3/host_build_graph/vector_example/kernels \
-g examples/a2a3/host_build_graph/vector_example/golden.py \
-p a2a3sim
By default, the auto-clone uses SSH (git@github.com:...). In CI or environments without SSH keys, use --clone-protocol https:
python examples/scripts/run_example.py -k examples/a2a3/host_build_graph/vector_example/kernels \
-g examples/a2a3/host_build_graph/vector_example/golden.py \
-p a2a3sim --clone-protocol https
Manual Setup (if auto-setup fails or you prefer manual control):
mkdir -p examples/scripts/_deps
git clone --branch main git@github.com:PTO-ISA/pto-isa.git examples/scripts/_deps/pto-isa
# Or use HTTPS
git clone --branch main https://github.com/PTO-ISA/pto-isa.git examples/scripts/_deps/pto-isa
# Set environment variable (optional - auto-detected if in standard location)
export PTO_ISA_ROOT=$(pwd)/examples/scripts/_deps/pto-isa
Using a Different Location:
export PTO_ISA_ROOT=/path/to/your/pto-isa
Troubleshooting:
- If git is not available: Clone pto-isa manually and set PTO_ISA_ROOT
- If clone fails due to network: Try again or clone manually
- If SSH clone fails (e.g., in CI): Use --clone-protocol https or clone manually with HTTPS
Note: For the simulation platform (a2a3sim), PTO ISA headers are optional and only needed if your kernels use PTO ISA intrinsics.
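If you script the manual setup yourself, the SSH-then-HTTPS advice from the troubleshooting list can be automated. The sketch below is an assumption about how one might do that, not part of run_example.py: `clone_command` and `clone_with_fallback` are hypothetical helper names, and only the two repository URLs and the `--branch main` option are taken from the manual setup commands.

```python
import subprocess
from typing import Callable, List, Sequence

# Repository URLs as given in the manual setup commands.
CLONE_URLS = {
    "ssh": "git@github.com:PTO-ISA/pto-isa.git",
    "https": "https://github.com/PTO-ISA/pto-isa.git",
}


def clone_command(protocol: str, dest: str) -> List[str]:
    """Build the git argument list for one protocol ("ssh" or "https")."""
    return ["git", "clone", "--branch", "main", CLONE_URLS[protocol], dest]


def _run_git(argv: Sequence[str]) -> bool:
    """Run git; report failure if git is missing or exits non-zero."""
    try:
        return subprocess.run(argv, check=False).returncode == 0
    except FileNotFoundError:  # git is not installed
        return False


def clone_with_fallback(
    dest: str, run: Callable[[Sequence[str]], bool] = _run_git
) -> str:
    """Try SSH first, then HTTPS; return the protocol that succeeded."""
    for protocol in ("ssh", "https"):
        if run(clone_command(protocol, dest)):
            return protocol
    raise RuntimeError("clone failed; clone pto-isa manually and set PTO_ISA_ROOT")
```

Calling `clone_with_fallback("examples/scripts/_deps/pto-isa")` tries SSH and only falls back to HTTPS if the SSH clone fails.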
- CMake 3.15+
- CANN toolkit with:
  - ccec compiler (AICore Bisheng CCE)
  - Cross-compiler for AICPU (aarch64-target-linux-gnu-gcc/g++)
- Standard C/C++ compiler (gcc/g++) for host
- Python 3 with development headers
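Before building, it can help to check that the tools listed above are on PATH. The snippet below is a minimal sketch: the helper name `find_missing_tools` is hypothetical, the executable names are assumed from the list (with `cmake`, `gcc` and `g++` as the usual command names), and it assumes the CANN tools become visible on PATH after sourcing the toolkit environment script. Treating CMake as required for a2a3sim is also an assumption.

```python
import shutil
from typing import Callable, List, Optional

# Executables per platform. The simulation platform needs no Ascend SDK,
# so only the host tools are checked there (an assumption of this sketch).
REQUIRED_TOOLS = {
    "a2a3": [
        "cmake",
        "ccec",
        "aarch64-target-linux-gnu-gcc",
        "aarch64-target-linux-gnu-g++",
        "gcc",
        "g++",
    ],
    "a2a3sim": ["cmake", "gcc", "g++"],
}


def find_missing_tools(
    platform: str,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the required executables that cannot be found on PATH."""
    return [tool for tool in REQUIRED_TOOLS[platform] if which(tool) is None]


if __name__ == "__main__":
    for platform in REQUIRED_TOOLS:
        missing = find_missing_tools(platform)
        print(f"{platform}: {'ok' if not missing else 'missing ' + ', '.join(missing)}")
```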
source /usr/local/Ascend/ascend-toolkit/latest/bin/setenv.bash
export ASCEND_HOME_PATH=/usr/local/Ascend/ascend-toolkit/latest
The RuntimeCompiler class handles compilation of all three components separately:
from runtime_compiler import RuntimeCompiler
# For real Ascend hardware (requires CANN toolkit)
compiler = RuntimeCompiler(platform="a2a3")
# For simulation (no Ascend SDK needed)
compiler = RuntimeCompiler(platform="a2a3sim")
# Compile each component to independent binaries
aicore_binary = compiler.compile("aicore", include_dirs, source_dirs) # → .o file
aicpu_binary = compiler.compile("aicpu", include_dirs, source_dirs) # → .so file
host_binary = compiler.compile("host", include_dirs, source_dirs) # → .so file
Toolchains used:
- AICore: Bisheng CCE (ccec compiler) → .o object file (a2a3 only)
- AICPU: aarch64 cross-compiler → .so shared object (a2a3 only)
- Host: Standard gcc/g++ → .so shared library
- HostSim: Standard gcc/g++ for all targets (a2a3sim)
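Because each component is compiled independently, a small driver can loop over the three targets. The helper below is a sketch only: `compile_all` and the per-component `dirs` mapping are hypothetical, and the one thing it relies on is the `compile(target, include_dirs, source_dirs)` call shape shown in the RuntimeCompiler example. Whether each component really needs its own directories depends on your runtime layout.

```python
from typing import Any, Dict, Mapping, Sequence, Tuple

# Component names as passed to compile() in the RuntimeCompiler example.
COMPONENTS = ("aicore", "aicpu", "host")

# (include_dirs, source_dirs) for one component.
Dirs = Tuple[Sequence[str], Sequence[str]]


def compile_all(compiler: Any, dirs: Mapping[str, Dirs]) -> Dict[str, Any]:
    """Compile each component on its own and collect the results by name."""
    missing = [name for name in COMPONENTS if name not in dirs]
    if missing:
        raise KeyError(f"no directories given for: {', '.join(missing)}")
    binaries = {}
    for name in COMPONENTS:
        include_dirs, source_dirs = dirs[name]
        binaries[name] = compiler.compile(name, include_dirs, source_dirs)
    return binaries
```

With a real compiler this would be called as `compile_all(RuntimeCompiler(platform="a2a3sim"), dirs)`, where `dirs` maps each component name to its include and source directories.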
# Simulation platform (no hardware required)
python examples/scripts/run_example.py \
-k examples/a2a3/host_build_graph/vector_example/kernels \
-g examples/a2a3/host_build_graph/vector_example/golden.py \
-p a2a3sim
# Hardware platform (requires Ascend device)
python examples/scripts/run_example.py \
-k examples/a2a3/host_build_graph/vector_example/kernels \
-g examples/a2a3/host_build_graph/vector_example/golden.py \
-p a2a3
Expected output:
=== Building Runtime: host_build_graph (platform: a2a3sim) ===
...
=== Comparing Results ===
Comparing f: shape=(16384,), dtype=float32
f: PASS (16384/16384 elements matched)
============================================================
TEST PASSED
============================================================
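To make the PASS line above easier to read, here is a rough sketch of an element-wise comparison that produces a message in the same style. It is not the framework's comparison code: the function name `compare_output` and the tolerances are assumptions, and it uses plain Python lists instead of NumPy arrays so that it runs without extra dependencies.

```python
import math
from typing import Sequence, Tuple


def compare_output(
    name: str,
    actual: Sequence[float],
    expected: Sequence[float],
    rel_tol: float = 1e-5,  # assumed tolerance, not taken from the framework
    abs_tol: float = 1e-5,  # assumed tolerance, not taken from the framework
) -> Tuple[bool, str]:
    """Compare two sequences element-wise and build a PASS/FAIL summary line."""
    if len(actual) != len(expected):
        return False, f"{name}: FAIL (length {len(actual)} != {len(expected)})"
    # Count the elements that agree within the tolerances.
    matched = sum(
        math.isclose(a, e, rel_tol=rel_tol, abs_tol=abs_tol)
        for a, e in zip(actual, expected)
    )
    total = len(expected)
    ok = matched == total
    status = "PASS" if ok else "FAIL"
    return ok, f"{name}: {status} ({matched}/{total} elements matched)"
```

A mismatch in any element turns the line into `FAIL` and shows how many elements still matched, which is usually the first thing to look at when a golden check fails.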
from task_interface import ChipWorker, CallConfig
from runtime_builder import RuntimeBuilder
# Build or locate pre-built runtime binaries
builder = RuntimeBuilder(platform="a2a3sim")
binaries = builder.get_binaries("tensormap_and_ringbuffer")
# Create worker and initialize with platform binaries
worker = ChipWorker()
worker.init(
    device_id=0,
    host_path=str(binaries.host_path),
    aicpu_binary=binaries.aicpu_path.read_bytes(),
    aicore_binary=binaries.aicore_path.read_bytes(),
)
# Execute callable on device
worker.run(chip_callable, orch_args, CallConfig(block_dim=24))
# Cleanup
worker.reset()
In src/{arch}/runtime/host_build_graph/runtime/runtime.h:
#define RUNTIME_MAX_TASKS 131072 // Maximum number of tasks
#define RUNTIME_MAX_ARGS 16 // Maximum arguments per task
#define RUNTIME_MAX_FANOUT 512 // Maximum successors per task
runner.init(
    device_id=0, # Device ID (0-15)
    num_cores=3, # Number of cores for handshake
    aicpu_binary=..., # AICPU .so binary
    aicore_binary=..., # AICore .o binary
    pto_isa_root="/path/to/pto-isa" # PTO-ISA headers location
)
- Device IDs: 0-15 (typically device 9 used for examples)
- Handshake cores: Usually 3 (1c2v configuration: 1 core, 2 vector units)
- Kernel compilation: Requires ASCEND_HOME_PATH environment variable
- Memory management: MemoryAllocator automatically tracks allocations
- Python requirement: NumPy for efficient array operations
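The constraints in the notes above are easy to check before calling into the runtime. The following sketch is illustrative only: `check_runner_config` is a hypothetical helper, and requiring ASCEND_HOME_PATH only on the a2a3 platform is an inference from the statement that simulation needs no Ascend SDK.

```python
from typing import List, Mapping

MAX_DEVICE_ID = 15  # device IDs are 0-15, per the notes above


def check_runner_config(
    device_id: int,
    num_cores: int,
    env: Mapping[str, str],
    platform: str = "a2a3",
) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if not 0 <= device_id <= MAX_DEVICE_ID:
        problems.append(f"device_id {device_id} is outside 0-{MAX_DEVICE_ID}")
    if num_cores < 1:
        problems.append(f"num_cores must be at least 1, got {num_cores}")
    # Assumption: only hardware builds need the CANN toolkit location.
    if platform == "a2a3" and not env.get("ASCEND_HOME_PATH"):
        problems.append("ASCEND_HOME_PATH is not set (needed for kernel compilation)")
    return problems
```

For example, `check_runner_config(9, 3, os.environ)` returns an empty list on a machine where the toolkit environment has been sourced.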
Device logs are written to ~/ascend/log/debug/device-<id>/
Kernel code uses these logging macros:
- DEV_INFO: Informational messages
- DEV_DEBUG: Debug messages
- DEV_WARN: Warnings
- DEV_ERROR: Error messages
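When debugging, it is handy to jump straight to the most recent log files for a device. The helpers below are a small sketch built only on the directory layout stated above (~/ascend/log/debug/device-<id>/). The function names are hypothetical, and nothing is assumed about the names or format of the log files themselves.

```python
from pathlib import Path
from typing import List, Optional


def device_log_dir(device_id: int, home: Optional[Path] = None) -> Path:
    """Return the log directory for one device under the given home directory."""
    base = home if home is not None else Path.home()
    return base / "ascend" / "log" / "debug" / f"device-{device_id}"


def newest_logs(
    device_id: int, count: int = 3, home: Optional[Path] = None
) -> List[Path]:
    """Return up to `count` files from the device log directory, newest first."""
    log_dir = device_log_dir(device_id, home)
    if not log_dir.is_dir():
        return []  # nothing has been logged for this device yet
    files = [p for p in log_dir.iterdir() if p.is_file()]
    files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return files[:count]
```

`newest_logs(9)` lists the latest files for device 9, which you can then search for DEV_ERROR or DEV_WARN messages.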