Revert "Feat: AICPU launch via dispatcher bootstrap and per-task rtsLaunchCpuKernel" #867

Merged
ChaoWao merged 1 commit into main from revert-537-feat/issue-356-aicpu-launch-new-interface on May 27, 2026
Conversation

@ChaoWao ChaoWao (Collaborator) commented May 27, 2026

Reverts #537

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request simplifies the AICPU kernel loading and launch path by removing the transient bootstrap-only dispatcher (libsimpler_aicpu_dispatcher.so) and the host-side LoadAicpuOp loader. The simpler_init API has been updated across all platforms to remove dispatcher-related parameters, and AICPU kernels are now launched directly using rtAicpuKernelLaunchExWithArgs. Feedback on these changes highlights several opportunities to add defensive null-pointer checks for input arguments in launch_aicpu_kernel and simpler_init to prevent potential segmentation faults.

Comment on lines 1218 to +1228
 int DeviceRunner::launch_aicpu_kernel(rtStream_t stream, KernelArgs *k_args, const char *kernel_name, int aicpu_num) {
-    // kernel_name is host::KernelNames::InitName / RunName — the runtime SO's
-    // actual exported symbol (simpler_aicpu_init / simpler_aicpu_exec). The
-    // Mode A type 2 launch in LaunchBuiltInOp embeds it in the args struct
-    // for the main aicpu_scheduler to dlsym.
-    return load_aicpu_op_.LaunchBuiltInOp(stream, k_args, aicpu_num, kernel_name);
+    struct Args {
+        KernelArgs k_args;
+        char kernel_name[32];
+        const char so_name[32] = {"libaicpu_extend_kernels.so"};
+        const char op_name[32] = {""};
+    } args;
+
+    args.k_args = *k_args;
+    std::strncpy(args.kernel_name, kernel_name, sizeof(args.kernel_name) - 1);
+    args.kernel_name[sizeof(args.kernel_name) - 1] = '\0';

high

Defensively validate k_args and kernel_name pointers before dereferencing them to prevent potential null pointer dereferences and segmentation faults.

int DeviceRunner::launch_aicpu_kernel(rtStream_t stream, KernelArgs *k_args, const char *kernel_name, int aicpu_num) {
    if (k_args == nullptr || kernel_name == nullptr) {
        LOG_ERROR("%s", "Invalid arguments: k_args or kernel_name is null");
        return -1;
    }

    struct Args {
        KernelArgs k_args;
        char kernel_name[32];
        const char so_name[32] = {"libaicpu_extend_kernels.so"};
        const char op_name[32] = {""};
    } args;

    args.k_args = *k_args;
    std::strncpy(args.kernel_name, kernel_name, sizeof(args.kernel_name) - 1);
    args.kernel_name[sizeof(args.kernel_name) - 1] = '\0';
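
The suggestion above guards against null pointers, but `std::strncpy` would still silently truncate a kernel name longer than 31 characters. The following is a minimal sketch of a stricter variant, not the project's code: `LaunchNames` and `pack_kernel_name` are hypothetical names, the `KernelArgs` member is omitted, and returning -2 on over-long names is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for the launch argument block shown in the diff.
// The real struct also carries a KernelArgs member, omitted here.
struct LaunchNames {
    char kernel_name[32];
    char so_name[32];
};

// Returns 0 on success, -1 if an argument is null, -2 if the name does not fit.
int pack_kernel_name(LaunchNames *out, const char *kernel_name) {
    if (out == nullptr || kernel_name == nullptr) {
        return -1;
    }
    const std::size_t len = std::strlen(kernel_name);
    if (len >= sizeof(out->kernel_name)) {
        return -2;  // refuse rather than silently truncate a symbol name
    }
    static const char kSoName[] = "libaicpu_extend_kernels.so";
    static_assert(sizeof(kSoName) <= sizeof(LaunchNames::so_name), "so_name buffer too small");

    std::memset(out, 0, sizeof(*out));                 // zero-fill so both strings stay terminated
    std::memcpy(out->kernel_name, kernel_name, len);   // len < 32, terminator already in place
    std::memcpy(out->so_name, kSoName, sizeof(kSoName));
    return 0;
}
```

A truncated name would make the scheduler look up a symbol that does not exist, so failing early on the host side is likely easier to debug than a device-side lookup failure.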

Comment on lines 1024 to +1034
 int DeviceRunner::launch_aicpu_kernel(rtStream_t stream, KernelArgs *k_args, const char *kernel_name, int aicpu_num) {
-    // kernel_name is host::KernelNames::InitName / RunName — the runtime SO's
-    // actual exported symbol (simpler_aicpu_init / simpler_aicpu_exec). The
-    // Mode A type 2 launch in LaunchBuiltInOp embeds it in the args struct
-    // for the main aicpu_scheduler to dlsym.
-    return load_aicpu_op_.LaunchBuiltInOp(stream, k_args, aicpu_num, kernel_name);
+    struct Args {
+        KernelArgs k_args;
+        char kernel_name[32];
+        const char so_name[32] = {"libaicpu_extend_kernels.so"};
+        const char op_name[32] = {""};
+    } args;
+
+    args.k_args = *k_args;
+    std::strncpy(args.kernel_name, kernel_name, sizeof(args.kernel_name) - 1);
+    args.kernel_name[sizeof(args.kernel_name) - 1] = '\0';

high

Defensively validate k_args and kernel_name pointers before dereferencing them to prevent potential null pointer dereferences and segmentation faults.

int DeviceRunner::launch_aicpu_kernel(rtStream_t stream, KernelArgs *k_args, const char *kernel_name, int aicpu_num) {
    if (k_args == nullptr || kernel_name == nullptr) {
        LOG_ERROR("%s", "Invalid arguments: k_args or kernel_name is null");
        return -1;
    }

    struct Args {
        KernelArgs k_args;
        char kernel_name[32];
        const char so_name[32] = {"libaicpu_extend_kernels.so"};
        const char op_name[32] = {""};
    } args;

    args.k_args = *k_args;
    std::strncpy(args.kernel_name, kernel_name, sizeof(args.kernel_name) - 1);
    args.kernel_name[sizeof(args.kernel_name) - 1] = '\0';

Comment on lines 227 to 231
 int simpler_init(
     DeviceContextHandle ctx, int device_id, const uint8_t *aicpu_binary, size_t aicpu_size,
-    const uint8_t *aicore_binary, size_t aicore_size, const uint8_t *dispatcher_binary, size_t dispatcher_size
+    const uint8_t *aicore_binary, size_t aicore_size
 ) {
     if (ctx == NULL) return -1;

high

Defensively validate aicpu_binary and aicore_binary pointers before constructing vectors from them to prevent potential null pointer dereferences.

Suggested change
-int simpler_init(
-    DeviceContextHandle ctx, int device_id, const uint8_t *aicpu_binary, size_t aicpu_size,
-    const uint8_t *aicore_binary, size_t aicore_size
-) {
-    if (ctx == NULL) return -1;
+int simpler_init(
+    DeviceContextHandle ctx, int device_id, const uint8_t *aicpu_binary, size_t aicpu_size,
+    const uint8_t *aicore_binary, size_t aicore_size
+) {
+    if (ctx == nullptr || aicpu_binary == nullptr || aicore_binary == nullptr) return -1;
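
As a rough illustration of why the review asks for these checks, the sketch below validates a (pointer, size) pair before copying it into a `std::vector`. `copy_binary` is a hypothetical helper, not part of the project, and treating a zero-length binary as invalid is an assumption that goes beyond what the review comment requests.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: copies the bytes into *out only when the inputs are usable.
// Returns false (leaving *out untouched) for a null pointer or an empty binary.
bool copy_binary(const std::uint8_t *data, std::size_t size, std::vector<std::uint8_t> *out) {
    if (out == nullptr || data == nullptr || size == 0) {
        return false;
    }
    // Forming data + size is only valid once data is known to be non-null.
    out->assign(data, data + size);
    return true;
}
```

Without the check, constructing a vector from `nullptr` and `nullptr + size` with a non-zero size is undefined behavior, which is the crash the review comment is trying to rule out.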

Comment on lines 296 to 300
 int simpler_init(
     DeviceContextHandle ctx, int device_id, const uint8_t *aicpu_binary, size_t aicpu_size,
-    const uint8_t *aicore_binary, size_t aicore_size, const uint8_t *dispatcher_binary, size_t dispatcher_size
+    const uint8_t *aicore_binary, size_t aicore_size
 ) {
     if (ctx == NULL) return -1;

high

Defensively validate aicpu_binary and aicore_binary pointers before constructing vectors from them to prevent potential null pointer dereferences.

Suggested change
-int simpler_init(
-    DeviceContextHandle ctx, int device_id, const uint8_t *aicpu_binary, size_t aicpu_size,
-    const uint8_t *aicore_binary, size_t aicore_size
-) {
-    if (ctx == NULL) return -1;
+int simpler_init(
+    DeviceContextHandle ctx, int device_id, const uint8_t *aicpu_binary, size_t aicpu_size,
+    const uint8_t *aicore_binary, size_t aicore_size
+) {
+    if (ctx == nullptr || aicpu_binary == nullptr || aicore_binary == nullptr) return -1;

Comment on lines 218 to 222
 int simpler_init(
     DeviceContextHandle ctx, int device_id, const uint8_t *aicpu_binary, size_t aicpu_size,
-    const uint8_t *aicore_binary, size_t aicore_size, const uint8_t *dispatcher_binary, size_t dispatcher_size
+    const uint8_t *aicore_binary, size_t aicore_size
 ) {
-    // Sim has no AICPU dispatcher (the simulator runs AICPU in-process). Accept
-    // the parameters for ABI parity with the onboard implementation and ignore
-    // them — callers that pass dispatcher bytes get the same shape as onboard,
-    // and Mode B path on sim isn't taken anyway.
-    (void)dispatcher_binary;
-    (void)dispatcher_size;
-
     if (ctx == NULL) return -1;

high

Defensively validate aicpu_binary and aicore_binary pointers before constructing vectors from them to prevent potential null pointer dereferences.

Suggested change
-int simpler_init(
-    DeviceContextHandle ctx, int device_id, const uint8_t *aicpu_binary, size_t aicpu_size,
-    const uint8_t *aicore_binary, size_t aicore_size
-) {
-    if (ctx == NULL) return -1;
+int simpler_init(
+    DeviceContextHandle ctx, int device_id, const uint8_t *aicpu_binary, size_t aicpu_size,
+    const uint8_t *aicore_binary, size_t aicore_size
+) {
+    if (ctx == nullptr || aicpu_binary == nullptr || aicore_binary == nullptr) return -1;

Comment on lines 218 to 222
 int simpler_init(
     DeviceContextHandle ctx, int device_id, const uint8_t *aicpu_binary, size_t aicpu_size,
-    const uint8_t *aicore_binary, size_t aicore_size, const uint8_t *dispatcher_binary, size_t dispatcher_size
+    const uint8_t *aicore_binary, size_t aicore_size
 ) {
-    // Sim has no AICPU dispatcher (the simulator runs AICPU in-process). See
-    // a2a3 sim sibling for rationale; parameters accepted for ABI parity.
-    (void)dispatcher_binary;
-    (void)dispatcher_size;
-
     if (ctx == NULL) return -1;

high

Defensively validate aicpu_binary and aicore_binary pointers before constructing vectors from them to prevent potential null pointer dereferences.

Suggested change
-int simpler_init(
-    DeviceContextHandle ctx, int device_id, const uint8_t *aicpu_binary, size_t aicpu_size,
-    const uint8_t *aicore_binary, size_t aicore_size
-) {
-    if (ctx == NULL) return -1;
+int simpler_init(
+    DeviceContextHandle ctx, int device_id, const uint8_t *aicpu_binary, size_t aicpu_size,
+    const uint8_t *aicore_binary, size_t aicore_size
+) {
+    if (ctx == nullptr || aicpu_binary == nullptr || aicore_binary == nullptr) return -1;

@ChaoWao ChaoWao merged commit c7bc732 into main May 27, 2026
15 checks passed
@ChaoWao ChaoWao deleted the revert-537-feat/issue-356-aicpu-launch-new-interface branch May 27, 2026 05:56
ChaoWao added a commit that referenced this pull request May 27, 2026
…handle (#870)

Re-applies PR #537 (reverted in PR #867 because the prepared_callable
TestPreparedCallableHbgA5 suite OOM'd at first AICore launch on a5/onboard)
on top of a fix for the underlying leak that PR #537 exposed.

## The bug PR #537 surfaced (and this PR fixes)

`launch_aicore_kernel` was calling `rtRegisterAllKernel` on every run,
binding the returned `bin_handle` to a stack-local that vanished at
function exit. CANN has no public `rtUnregisterAllKernel`, so each
register pinned another device-side copy of the AICore ELF (~365 KB on
a5/hbg) and there was no path to ever release it. The leak was pre-PR-537
too but masked by lower steady-state HBM use. PR #537 made
`rtsBinaryLoadFromFile` keep the AICPU SO loaded for the DeviceRunner
lifetime — enough extra resident HBM that the very first AICore launch
on a5/hbg tipped into 207001 (ACL_ERROR_RT_MEMORY_ALLOCATION) and the
broken driver state cascaded into 507899 at the next `rtStreamCreate`.

a2a3 stayed lucky because its AICore ELF is ~5x smaller: 78 KB vs 365 KB
on a5. The a5 binary is MIX-mode and carries heavier debug info; its
`.text` is 10.8 KB vs 2.7 KB on a2a3.

## Fix

Cache the AICore `rtRegisterAllKernel` handle in `aicore_bin_handle_` and
register lazily on first `launch_aicore_kernel`. Reset to nullptr in
`finalize()`; CANN releases the device-side state implicitly when the
device context tears down. Applied symmetrically to a2a3 and a5 — a2a3
had the same latent leak, fixing only a5 would leave it as a time-bomb
the next time HBM headroom shrinks elsewhere.
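
The lazy-registration pattern described in the fix can be sketched as follows. This is an illustration under stated assumptions, not the actual DeviceRunner code: `fake_register_all_kernel` is a hypothetical stand-in that only counts calls and does not mirror the `rtRegisterAllKernel` signature, and `RunnerSketch` is an invented class that borrows only the `aicore_bin_handle_` member name from the commit message.

```cpp
#include <cassert>

// Hypothetical stand-in for the runtime registration call. Each call models
// one more device-side copy of the AICore ELF being pinned.
static int g_register_calls = 0;
static int g_fake_binary = 0;

void *fake_register_all_kernel() {
    ++g_register_calls;
    return &g_fake_binary;
}

class RunnerSketch {
public:
    // Registers on the first launch only; later launches reuse the cached handle.
    int launch_aicore_kernel() {
        if (aicore_bin_handle_ == nullptr) {
            aicore_bin_handle_ = fake_register_all_kernel();
        }
        return aicore_bin_handle_ != nullptr ? 0 : -1;
    }

    // Drops the cached handle so a later launch registers again. Per the commit
    // message, the device-side state itself is released at context teardown.
    void finalize() { aicore_bin_handle_ = nullptr; }

private:
    void *aicore_bin_handle_ = nullptr;
};
```

With the handle cached, a hundred launches register once, where the pre-fix code would have registered (and pinned a copy of the ELF) a hundred times.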

## What's the same as PR #537

Everything else: dispatcher SO build (libsimpler_aicpu_dispatcher.so
per-arch), LoadAicpuOp bootstrap + per-task rtsLaunchCpuKernel,
content-fingerprinted simpler_inner_<fp>.so preinstall write,
process-level fingerprint cache, RuntimeBinaries.dispatcher_path
threading.
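
The commit message does not say how the content fingerprint in `simpler_inner_<fp>.so` is computed. Purely as an illustration of the idea, the sketch below derives a file name from the binary's bytes using 64-bit FNV-1a; the hash choice, the 16-digit hex format, and the helper names `fnv1a64` and `fingerprinted_so_name` are all assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// 64-bit FNV-1a over the binary contents (assumed hash, chosen for brevity).
std::uint64_t fnv1a64(const std::uint8_t *data, std::size_t size) {
    std::uint64_t h = 14695981039346656037ULL;  // FNV offset basis
    for (std::size_t i = 0; i < size; ++i) {
        h ^= data[i];
        h *= 1099511628211ULL;                  // FNV prime
    }
    return h;
}

// Builds a name of the form simpler_inner_<fp>.so from the content hash.
std::string fingerprinted_so_name(const std::uint8_t *data, std::size_t size) {
    char hex[17];
    std::snprintf(hex, sizeof(hex), "%016llx",
                  static_cast<unsigned long long>(fnv1a64(data, size)));
    return std::string("simpler_inner_") + hex + ".so";
}
```

Because the name depends only on the bytes, identical binaries map to the same file and a changed binary gets a new name, which is what makes a process-level cache keyed on the fingerprint safe to reuse.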

## Verification

Built locally on a5, ran on device 2:
  - tests/st/a5/host_build_graph: 7 passed (incl. all 5
    TestPreparedCallableHbgA5 cases that originally failed)
  - tests/st/a5/tensormap_and_ringbuffer + examples: 22 passed (the 2
    sim-only failures are pre-existing g++-15 env issues unrelated to
    this change)

Fixes #356 (closes the gap that caused #867).