Skip to content

Add incore-profiling skill: msprof op-simulator kernel profiler#1386

Draft
Hzfengsy wants to merge 4 commits into
hw-native-sys:mainfrom
Hzfengsy:incore-profiling-skill
Draft

Add incore-profiling skill: msprof op-simulator kernel profiler#1386
Hzfengsy wants to merge 4 commits into
hw-native-sys:mainfrom
Hzfengsy:incore-profiling-skill

Conversation

@Hzfengsy
Copy link
Copy Markdown
Member

Summary

  • Add an incore-profiling project skill: cycle-accurate, single-AI-core profiling of every kernel in a PyPTO build via the Ascend msprof op simulator.
  • New self-contained .claude/skills/incore-profiling/ folder — incore_profile.py (the profiler driver), a vendored generate_testcase.py harness generator with its .in templates, and SKILL.md.
  • Environment-agnostic: a single --target {a2a3,a5} knob plus CANN and camodel-SoC auto-discovery, so the tool runs on any Ascend setup without hardcoded paths. No PTOAS source checkout required (the generator is vendored).

Testing

  • Full 17/17 kernel export verified on a Qwen3 decode build (--target a2a3, CANN / SoC / arch all auto-resolved).
  • All pre-commit hooks pass — ruff, ruff-format, pyright, check-headers, markdownlint.

Related Issues

None.

Hzfengsy added 4 commits May 17, 2026 16:47
Relocate the msprof op-simulator kernel profiler into a self-contained
.claude/skills/incore-profiling/ folder: incore_profile.py plus a vendored
generate_testcase.py harness generator. Add a --target {a2a3,a5} knob with
CANN and camodel-SoC auto-discovery so the tool runs on any Ascend setup.
Copilot AI review requested due to automatic review settings May 17, 2026 08:48
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 17, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

This PR introduces a complete in-core profiling skill for PyPTO kernels using CANN's msprof op simulator. It includes skill documentation, a Python orchestration script that discovers kernels and runs msprof collection, and supporting code-generation templates for golden reference generation, output comparison, and build automation.

Changes

In-Core Profiling Skill for PyPTO Kernels

Layer / File(s) Summary
Skill Documentation
.claude/skills/incore-profiling/SKILL.md
Explains when to use the profiling skill, lists prerequisites (CANN environment, case layout, ptoas-bin, testcase generator), provides quick-start CLI examples with --target/--build-dir/--case/--func options, describes output artifacts (trace.json, per-core CSV), troubleshoots common issues (target mismatch, missing camodel, environment paths), and documents the per-kernel workflow.
CLI Parsing and Toolchain Validation
.claude/skills/incore-profiling/incore_profile.py (lines 617–760)
Defines argument groups for build input, toolchain selection, output naming, and profiling parameters; parses extra args after -- for case script passthrough; validates msprof and generate_testcase availability after sourcing CANN environment; implements build directory resolution via case command or existing directory.
Environment and Camodel Discovery
.claude/skills/incore-profiling/incore_profile.py (lines 1–187)
Establishes target-family-to-SoC mapping, auto-discovers CANN set_env.sh and camodel SoC versions, provides SoC selection with explicit override, introduces StepError for error signaling, and supplies foundational utilities for logging, path handling, environment sourcing, command execution with logging, and safe shell quoting.
Build and Output Directory Orchestration
.claude/skills/incore-profiling/incore_profile.py (lines 189–266)
Wraps command invocation with combined stdout/stderr logging and optional exception raising; constructs run commands from --run-cmd or --case with --run-env prefixes; discovers existing build output directories; detects valid case-build layouts; selects newest matching build post-run.
PTOAS Source and Kernel Symbol Discovery
.claude/skills/incore-profiling/incore_profile.py (lines 268–366)
Discovers PTOAS .cpp sources from explicit or default locations; extracts kernel entrypoint names via regex; detects host triplets from ASCEND libraries; constructs LD_LIBRARY_PATH with devlib and simulator paths; resolves kernel symbols from compiled libraries via nm/c++filt with preferred-name matching and detailed failure reporting.
Artifact Collection and Golden Reference
.claude/skills/incore-profiling/incore_profile.py (lines 368–414)
Runs golden.py in case directory; locates msprof export dump directories; propagates pc_start_addr.txt when needed; collects artifact counts and first-found paths from simulator export trace/visualization directories.
Per-Function Export Pipeline
.claude/skills/incore-profiling/incore_profile.py (lines 416–615)
Iterates over kernel sources to generate testcases, build in isolation, resolve symbols, run golden baseline, execute msprof collection, conditionally run msprof export when collection skipped artifacts, collect artifact metadata, log step failures with tails and return codes, and convert uncaught exceptions into failed results.
Main Entry Point and Orchestration
.claude/skills/incore-profiling/incore_profile.py (lines 762–864)
Validates AICore arch defaults; decides toolchain validation scope (skipping for list-only ops); resolves build directory; discovers and filters sources; optionally lists kernel names; creates run root with pointer file; writes README with metadata; iterates export_one with per-func indexing; writes outputs after each iteration; enforces early-stop or keep-going on failure; converts StepError to exit status.
Code Generation Templates
.claude/skills/incore-profiling/vendor/templates/compare_template.py.in, golden_template.py.in, main_template.cpp.in, run_sh_template.sh.in
compare_template.py.in: BF16 conversion and array comparison with ULP-distance and NaN/zero handling; generic numeric comparison via np.allclose with configurable tolerance; packed predicate mask exact comparison; strict/non-strict failure gating. golden_template.py.in: Python template with deterministic RNG seed and placeholder for generated input code. main_template.cpp.in: ACL-based C++ runtime with device selection, stream lifecycle, error logging, and cleanup. run_sh_template.sh.in: Build orchestration, CANN environment sourcing, simulator/NPU library path handling, CMake/Make execution, golden mode control flow (sim/npu/skip), and strict comparison enforcement.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • lyfne123

Poem

🐇 Behold! A skill to profile kernels bright,
With msprof traces shining in the night,
From source to symbols, testcases to gold,
Each cycle counted—stories to be told!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 20.59% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: adding an incore-profiling skill for msprof op-simulator kernel profiling, which is the primary focus of this PR.
Description check ✅ Passed The description clearly relates to the changeset, explaining the new incore-profiling skill, its components (incore_profile.py, vendored generate_testcase.py, templates, SKILL.md), design approach (environment-agnostic with --target option), and testing verification.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@Hzfengsy Hzfengsy marked this pull request as draft May 17, 2026 08:50
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new in-core kernel profiling skill for PyPTO using the Ascend msprof op-simulator, including an automation script and several test harness templates. Feedback focused on improving the robustness of environment variable parsing to avoid shell noise and addressing security vulnerabilities related to command injection and insecure library path configurations in both the Python script and shell templates.

Comment on lines +168 to +181
cmd = f"source {sh_quote(str(set_env_path))} >/dev/null && env -0"
cp = subprocess.run(
["bash", "-lc", cmd],
capture_output=True,
check=False,
)
if cp.returncode != 0:
raise StepError(cp.stderr.decode("utf-8", errors="replace"))
env = base_env.copy()
for item in cp.stdout.split(b"\0"):
if not item or b"=" not in item:
continue
key, val = item.split(b"=", 1)
env[key.decode()] = val.decode("utf-8", errors="replace")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using bash -lc to source the environment can capture unexpected output from shell profiles (e.g., .bashrc or /etc/profile) into stdout. This noise can corrupt the environment parsing logic, especially if the noise contains = or if it merges with the first environment variable. Using a unique marker to separate the shell noise from the env -0 output makes the parsing more robust. Additionally, ensure that failures in the external process are handled explicitly with informative error messages to avoid silent failures and aid debugging.

Suggested change
cmd = f"source {sh_quote(str(set_env_path))} >/dev/null && env -0"
cp = subprocess.run(
["bash", "-lc", cmd],
capture_output=True,
check=False,
)
if cp.returncode != 0:
raise StepError(cp.stderr.decode("utf-8", errors="replace"))
env = base_env.copy()
for item in cp.stdout.split(b"\0"):
if not item or b"=" not in item:
continue
key, val = item.split(b"=", 1)
env[key.decode()] = val.decode("utf-8", errors="replace")
marker = b"---ENV_START---"
cmd = f"source {sh_quote(str(set_env_path))} >/dev/null && echo -n {marker.decode()} && env -0"
cp = subprocess.run(
["bash", "-lc", cmd],
capture_output=True,
check=False,
)
if cp.returncode != 0:
raise StepError(cp.stderr.decode("utf-8", errors="replace"))
if marker not in cp.stdout:
raise StepError(f"CANN set_env.sh failed or produced no environment: {set_env_path}")
env = base_env.copy()
_, env_data = cp.stdout.split(marker, 1)
for item in env_data.split(b"\0"):
if not item or b"=" not in item:
continue
key, val = item.split(b"=", 1)
env[key.decode()] = val.decode("utf-8", errors="replace")
References
  1. When running external processes, handle cases where the process fails with a non-zero exit code but produces no output. Log an informative error message to prevent silent failures and aid debugging.

if "=" not in item:
raise StepError(f"--run-env expects KEY=VALUE, got: {item}")
prefix.append(item)
base_cmd = " ".join(prefix + ["python", sh_quote(str(case))] + [sh_quote(x) for x in case_args])
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

The items in prefix (derived from --run-env) are joined into base_cmd without quoting. If a user provides an environment variable with spaces (e.g., --run-env "VAR=val with spaces"), the resulting shell command will be malformed. Quoting each item ensures the environment variables are correctly interpreted by the shell and prevents potential command injection.

Suggested change
base_cmd = " ".join(prefix + ["python", sh_quote(str(case))] + [sh_quote(x) for x in case_args])
base_cmd = " ".join([sh_quote(p) for p in prefix] + ["python", sh_quote(str(case))] + [sh_quote(x) for x in case_args])


# Improve runtime linking robustness.
if [[ -n "${ASCEND_HOME_PATH:-}" ]]; then
export LD_LIBRARY_PATH="${ASCEND_HOME_PATH}/lib64:${LD_LIBRARY_PATH:-}"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

Appending :${LD_LIBRARY_PATH:-} when LD_LIBRARY_PATH is empty or unset results in a trailing colon. In many shells, a trailing colon in LD_LIBRARY_PATH is interpreted as including the current directory (.), which is a security risk and can lead to unexpected behavior. Use the ${VAR:+:${VAR}} pattern to safely append the separator only when the variable is non-empty.

  export LD_LIBRARY_PATH="${ASCEND_HOME_PATH}/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"

"${ASCEND_HOME_PATH}/aarch64-linux/simulator/${SIM_SOC_VERSION}/lib" \
"${ASCEND_HOME_PATH}/simulator/${SIM_SOC_VERSION}/lib" \
"${ASCEND_HOME_PATH}/tools/simulator/${SIM_SOC_VERSION}/lib"; do
[[ -d "$d" ]] && LD_LIBRARY_PATH_SIM="$d:${LD_LIBRARY_PATH_SIM}"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

Similar to the previous issue, prepending $d: when LD_LIBRARY_PATH_SIM is empty results in a trailing colon, which includes the current directory in the library search path. Use the ${VAR:+:${VAR}} pattern for safe concatenation.

    [[ -d "$d" ]] && LD_LIBRARY_PATH_SIM="$d${LD_LIBRARY_PATH_SIM:+:${LD_LIBRARY_PATH_SIM}}"

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new .claude project skill for profiling PyPTO PTOAS kernels with Ascend msprof op simulator, including the profiling driver, documentation, and a vendored testcase generator/templates so profiling can run without a PTOAS source checkout.

Changes:

  • Adds incore_profile.py to discover builds/kernels, generate standalone simulator cases, run msprof, and collect trace artifacts.
  • Adds SKILL.md usage documentation for the new in-core profiling workflow.
  • Vendors generate_testcase.py plus CMake/run/compare/golden/main templates used to build and validate simulator testcases.

Reviewed changes

Copilot reviewed 2 out of 7 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
.claude/skills/incore-profiling/SKILL.md Documents when and how to use the new profiling skill.
.claude/skills/incore-profiling/incore_profile.py Implements build discovery, toolchain setup, testcase generation, simulator profiling, and artifact collection.
.claude/skills/incore-profiling/vendor/generate_testcase.py Vendored testcase generator adapted to local templates and profiling use cases.
.claude/skills/incore-profiling/vendor/templates/compare_template.py.in Generates output comparison helpers for simulator/NPU validation.
.claude/skills/incore-profiling/vendor/templates/golden_template.py.in Generates input/golden data creation script.
.claude/skills/incore-profiling/vendor/templates/main_template.cpp.in Generates ACL host runner source for standalone kernels.
.claude/skills/incore-profiling/vendor/templates/run_sh_template.sh.in Generates shell runner for building and validating standalone cases.

Comment on lines +98 to +102
candidates += [Path(ascend_home) / "set_env.sh", Path(ascend_home).parent / "set_env.sh"]
for root in ("/usr/local/Ascend", str(Path.home() / "Ascend"), "/opt/Ascend"):
candidates += [
Path(root) / "ascend-toolkit/set_env.sh",
Path(root) / "ascend-toolkit/latest/set_env.sh",
Comment on lines +330 to +339
for p in [ascend_home / "lib64", ascend_home / "devlib"]:
if p.exists():
parts.append(str(p))
for triplet in detect_host_triplets(ascend_home):
for p in [
ascend_home / triplet / "devlib",
ascend_home / triplet / "simulator" / soc_version / "lib",
]:
if p.exists():
parts.append(str(p))
if golden.shape != output.shape:
print(f"[ERROR] Shape mismatch: {golden_path} {golden.shape} vs {output_path} {output.shape}")
return False
if not np.allclose(golden, output, atol=eps, rtol=eps, equal_nan=True):
# Maps a user-facing NPU target to its compile arch and the SoC-name keywords
# used to pick a camodel simulator directory.
TARGET_PROFILES: dict[str, dict[str, object]] = {
"a2a3": {"aicore_arch": "dav-c220", "soc_keywords": ("910b",)},
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.claude/skills/incore-profiling/incore_profile.py:
- Around line 108-112: The code in discover_camodel_socs treats a missing
ASCEND_HOME_PATH as Path(""), which resolves to the current directory; change
the logic to first read the raw value = env.get("ASCEND_HOME_PATH") and if not
value (None or empty string) return [] immediately, otherwise construct
ascend_home = Path(value). Update the same pattern in the other functions that
currently use Path(env.get("ASCEND_HOME_PATH", "")) (the block around lines
326-339) so they all check for a truthy env value before creating a Path.
- Around line 683-687: The current --aicore-arch argument uses
os.environ.get("AICORE_ARCH") as its default which lets a stale env var override
the target-derived default; change the CLI handling so the add_argument for
"--aicore-arch" sets default=None (or no env fallback) and after parsing compute
args.aicore_arch from args.target using the existing target->arch mapping, only
using os.environ.get("AICORE_ARCH") as a last-resort fallback when target is not
provided; apply the same change where the other occurrence (lines around the
second add_argument) sets the default.

In @.claude/skills/incore-profiling/SKILL.md:
- Around line 88-90: The doc text claiming "`--target` must currently be passed
explicitly" is out of sync with the CLI which defaults --target to "a2a3";
update the SKILL.md text to state that --target is optional and defaults to
"a2a3" (and remains an override), or alternatively change the CLI default in the
argument parser in incore_profile.py (the add_argument for "--target" /
parse_args) to require explicit passing; pick one fix so the documentation and
the --target behavior (incore_profile.py's argument definition) match.

In @.claude/skills/incore-profiling/vendor/templates/compare_template.py.in:
- Around line 315-370: The code confuses whether cols is "predicate bit count"
or "buffer byte width" in compare_packed_pred_mask; fix by treating cols as the
number of predicate bits: compute row_bytes = ((cols + 63) // 64) * 8 (remove
the min(row_bytes, cols) cap), then set need = rows * row_bytes (instead of rows
* cols), read/reshape using that need and slice golden/output with [:row_bytes]
(golden_sel/output_sel use those bytes); update the docstring to state cols is
predicate bit count and keep variable names rows and cols, or rename cols to
num_pred_bits if you prefer.

In @.claude/skills/incore-profiling/vendor/templates/main_template.cpp.in:
- Around line 9-16: Remove the redundant copyright block: delete the second
documentation comment block (the /** ... */ block that contains the 2025 Huawei
copyright text) in main_template.cpp.in so only the original 2026 header
remains; locate the duplicate block by searching for the /** ... */ comment
containing the 2025 date and remove it, leaving the first header intact.

In @.claude/skills/incore-profiling/vendor/templates/run_sh_template.sh.in:
- Around line 122-133: The npu branch currently runs golden.py, the NPU
executable, copy_outputs_as_golden, then golden.py + executable again before
running compare.py, which makes the two runs use different regenerated inputs;
decide the intent and adjust: for determinism testing remove the second
invocation of golden.py so both runs use identical inputs (change the sequence
around the golden.py call referenced as golden.py and the second golden.py
call), or for golden-vs-test ensure copy_outputs_as_golden runs after the
initial golden.py/generation step and before regenerating inputs (i.e., move or
reorder the copy_outputs_as_golden call relative to the golden.py calls), and
add a short comment describing the chosen intent near GOLDEN_MODE=npu, RUN_MODE
check and around copy_outputs_as_golden, golden.py and compare.py invocations to
make the behavior explicit.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 18ce266c-bbf3-453c-9af5-5f68f34dd206

📥 Commits

Reviewing files that changed from the base of the PR and between 95f793f and 0e6f34b.

📒 Files selected for processing (7)
  • .claude/skills/incore-profiling/SKILL.md
  • .claude/skills/incore-profiling/incore_profile.py
  • .claude/skills/incore-profiling/vendor/generate_testcase.py
  • .claude/skills/incore-profiling/vendor/templates/compare_template.py.in
  • .claude/skills/incore-profiling/vendor/templates/golden_template.py.in
  • .claude/skills/incore-profiling/vendor/templates/main_template.cpp.in
  • .claude/skills/incore-profiling/vendor/templates/run_sh_template.sh.in

Comment on lines +108 to +112
def discover_camodel_socs(env: dict[str, str]) -> list[str]:
"""List simulator SoC names that ship libruntime_camodel.so in the CANN install."""
ascend_home = Path(env.get("ASCEND_HOME_PATH", ""))
if not ascend_home.is_dir():
return []
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Treat missing ASCEND_HOME_PATH as unset instead of current directory.

Using Path("") makes discovery/path assembly operate on . when ASCEND_HOME_PATH is absent. That can pick unintended local directories and produce invalid SoC/LD paths.

Suggested fix
 def discover_camodel_socs(env: dict[str, str]) -> list[str]:
     """List simulator SoC names that ship libruntime_camodel.so in the CANN install."""
-    ascend_home = Path(env.get("ASCEND_HOME_PATH", ""))
-    if not ascend_home.is_dir():
+    ascend_home_raw = env.get("ASCEND_HOME_PATH")
+    if not ascend_home_raw:
+        return []
+    ascend_home = Path(ascend_home_raw)
+    if not ascend_home.is_dir():
         return []
@@
 def make_ld_library_path(build_dir: Path, env: dict[str, str], soc_version: str) -> str:
-    ascend_home = Path(env.get("ASCEND_HOME_PATH", ""))
+    ascend_home_raw = env.get("ASCEND_HOME_PATH")
     parts = [str(build_dir)]
-    if ascend_home:
+    if ascend_home_raw:
+        ascend_home = Path(ascend_home_raw)
         for p in [ascend_home / "lib64", ascend_home / "devlib"]:
             if p.exists():
                 parts.append(str(p))

Also applies to: 326-339

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/skills/incore-profiling/incore_profile.py around lines 108 - 112,
The code in discover_camodel_socs treats a missing ASCEND_HOME_PATH as Path(""),
which resolves to the current directory; change the logic to first read the raw
value = env.get("ASCEND_HOME_PATH") and if not value (None or empty string)
return [] immediately, otherwise construct ascend_home = Path(value). Update the
same pattern in the other functions that currently use
Path(env.get("ASCEND_HOME_PATH", "")) (the block around lines 326-339) so they
all check for a truthy env value before creating a Path.

Comment on lines +683 to +687
"--aicore-arch",
default=os.environ.get("AICORE_ARCH"),
help="AICore arch for generate_testcase.py; defaults from --target "
"(a2a3 -> dav-c220, a5 -> dav-c310)",
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

AICORE_ARCH currently overrides target-derived defaults unexpectedly.

Help text says the arch defaults from --target, but current defaults pull from AICORE_ARCH first. If that env var is stale, kernels can be generated for the wrong arch.

Suggested fix
-    tool_group.add_argument(
-        "--aicore-arch",
-        default=os.environ.get("AICORE_ARCH"),
-        help="AICore arch for generate_testcase.py; defaults from --target "
-        "(a2a3 -> dav-c220, a5 -> dav-c310)",
-    )
+    tool_group.add_argument(
+        "--aicore-arch",
+        default=None,
+        help="AICore arch for generate_testcase.py; defaults from --target "
+        "(a2a3 -> dav-c220, a5 -> dav-c310)",
+    )
...
-    if not args.aicore_arch:
+    if args.aicore_arch is None:
         args.aicore_arch = TARGET_PROFILES[args.target]["aicore_arch"]

Also applies to: 770-771

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/skills/incore-profiling/incore_profile.py around lines 683 - 687,
The current --aicore-arch argument uses os.environ.get("AICORE_ARCH") as its
default which lets a stale env var override the target-derived default; change
the CLI handling so the add_argument for "--aicore-arch" sets default=None (or
no env fallback) and after parsing compute args.aicore_arch from args.target
using the existing target->arch mapping, only using
os.environ.get("AICORE_ARCH") as a last-resort fallback when target is not
provided; apply the same change where the other occurrence (lines around the
second add_argument) sets the default.

Comment on lines +88 to +90
`--target` must currently be passed explicitly. Once the build pipeline records
the backend arch into the build folder, `incore_profile.py` can auto-detect it
and `--build-dir` alone will suffice. `--target` will remain as an override.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Align --target requirement text with actual CLI behavior.

This section says --target must be passed explicitly, but the CLI currently defaults --target to a2a3. Please update either the docs or argument default so they match.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/skills/incore-profiling/SKILL.md around lines 88 - 90, The doc text
claiming "`--target` must currently be passed explicitly" is out of sync with
the CLI which defaults --target to "a2a3"; update the SKILL.md text to state
that --target is optional and defaults to "a2a3" (and remains an override), or
alternatively change the CLI default in the argument parser in incore_profile.py
(the add_argument for "--target" / parse_args) to require explicit passing; pick
one fix so the documentation and the --target behavior (incore_profile.py's
argument definition) match.

Comment on lines +315 to +370
def compare_packed_pred_mask(golden_path, output_path, rows, cols):
"""
Compare outputs of pto.tcmp / pto.tcmps.

These ops produce a *packed predicate mask* and do not define every byte in
the logical u8 tile buffer. In practice, only the first N bytes of each row
are meaningful (packed as 64-bit chunks). Ignore the rest to avoid flaky
compares caused by undefined bytes.
"""
if not os.path.exists(output_path):
print(f"[ERROR] Output missing: {output_path}")
return False
if not os.path.exists(golden_path):
print(f"[ERROR] Golden missing: {golden_path}")
return False
try:
rows = int(rows)
cols = int(cols)
except Exception:
print(f"[ERROR] Invalid rows/cols for packed mask compare: rows={rows} cols={cols}")
return False
if rows <= 0 or cols <= 0:
print(f"[ERROR] Invalid rows/cols for packed mask compare: rows={rows} cols={cols}")
return False

golden = np.fromfile(golden_path, dtype=np.uint8)
output = np.fromfile(output_path, dtype=np.uint8)

need = rows * cols
if golden.size < need or output.size < need:
print(
f"[ERROR] Packed mask buffer too small: need={need} bytes, "
f"golden={golden.size}, out={output.size}"
)
return False

golden = golden[:need].reshape(rows, cols)
output = output[:need].reshape(rows, cols)

# Packed mask layout: 1 predicate bit per element, packed into 64-bit words
# per row (so 8 bytes per 64 columns). For cols <= 64 we still use one word.
row_bytes = ((cols + 63) // 64) * 8
row_bytes = min(row_bytes, cols)

golden_sel = golden[:, :row_bytes].reshape(-1)
output_sel = output[:, :row_bytes].reshape(-1)

if not np.array_equal(golden_sel, output_sel):
diff = np.nonzero(golden_sel != output_sel)[0]
idx = int(diff[0]) if diff.size else 0
print(
f"[ERROR] Mismatch (packed mask): {golden_path} vs {output_path}, first diff at idx={idx} "
f"(golden={int(golden_sel[idx])}, out={int(output_sel[idx])}, rows={rows}, cols={cols}, row_bytes={row_bytes})"
)
return False
return True
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clarify the rows/cols parameters in compare_packed_pred_mask.

The function signature and internal logic create ambiguity about what cols represents:

  • Line 343 (need = rows * cols) treats cols as the buffer byte-width per row
  • Line 356 (row_bytes = ((cols + 63) // 64) * 8) treats cols as the number of predicate bits
  • The comment on line 355 mentions "8 bytes per 64 columns" but doesn't clarify the unit

For small cols values (e.g., cols=5 predicate bits), line 356 computes row_bytes = 8 (one 64-bit word), but line 357 caps it to min(8, 5) = 5, which would under-read the packed data.

Consider either:

  1. Renaming parameters to distinguish buffer dimensions from predicate count (e.g., buffer_cols_bytes and num_predicate_bits), or
  2. Adding a docstring clarifying that cols must equal both the buffer byte-width and the number of predicate bits, or
  3. Removing line 357 if cols strictly represents predicate bits and the buffer is always sized accordingly
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/skills/incore-profiling/vendor/templates/compare_template.py.in
around lines 315 - 370, The code confuses whether cols is "predicate bit count"
or "buffer byte width" in compare_packed_pred_mask; fix by treating cols as the
number of predicate bits: compute row_bytes = ((cols + 63) // 64) * 8 (remove
the min(row_bytes, cols) cap), then set need = rows * row_bytes (instead of rows
* cols), read/reshape using that need and slice golden/output with [:row_bytes]
(golden_sel/output_sel use those bytes); update the docstring to state cols is
predicate bit count and keep variable names rows and cols, or rename cols to
num_pred_bits if you prefer.

Comment on lines +9 to +16
/**
Copyright (c) 2025 Huawei Technologies Co., Ltd.
This program is free software, you can redistribute it and/or modify it under the terms and conditions of
Please refer to the License for details. You may not use this file except in compliance with the License.
THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
See LICENSE in the root of the software repository for the full text of the License.
*/
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Remove duplicate copyright header.

Lines 1-8 already contain a 2026 copyright header. The second header on lines 9-16 (with a 2025 date) is redundant and should be removed.

🧹 Proposed fix
 // See LICENSE in the root of the software repository for the full text of the License.
 
-/**
-Copyright (c) 2025 Huawei Technologies Co., Ltd.
-This program is free software, you can redistribute it and/or modify it under the terms and conditions of
-Please refer to the License for details. You may not use this file except in compliance with the License.
-THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
-INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
-See LICENSE in the root of the software repository for the full text of the License.
-*/
-
 `#include` "test_common.h"
 `#include` "acl/acl.h"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/skills/incore-profiling/vendor/templates/main_template.cpp.in around
lines 9 - 16, Remove the redundant copyright block: delete the second
documentation comment block (the /** ... */ block that contains the 2025 Huawei
copyright text) in main_template.cpp.in so only the original 2026 header
remains; locate the duplicate block by searching for the /** ... */ comment
containing the 2025 date and remove it, leaving the first header intact.

Comment on lines +122 to +133
npu)
if [[ "${RUN_MODE}" != "npu" ]]; then
echo "[ERROR] GOLDEN_MODE=npu requires RUN_MODE=npu" >&2
exit 2
fi
python3 "${ROOT_DIR}/golden.py"
LD_LIBRARY_PATH="${LD_LIBRARY_PATH_NPU}" "${ROOT_DIR}/${BUILD_DIR}/@EXECUTABLE@"
copy_outputs_as_golden
python3 "${ROOT_DIR}/golden.py"
LD_LIBRARY_PATH="${LD_LIBRARY_PATH_NPU}" "${ROOT_DIR}/${BUILD_DIR}/@EXECUTABLE@"
COMPARE_STRICT=1 python3 "${ROOT_DIR}/compare.py"
;;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clarify the redundant execution sequence in GOLDEN_MODE=npu.

The npu mode runs the following sequence:

  1. golden.py (line 20, script startup)
  2. golden.py (line 127)
  3. NPU executable (line 128)
  4. copy_outputs_as_golden (line 129)
  5. golden.py (line 130)
  6. NPU executable (line 131)
  7. compare.py (line 132)

This copies the first run's outputs (step 3) as golden (step 4), then regenerates inputs (step 5) and runs again (step 6), effectively comparing two separate NPU runs with potentially different inputs. If the intent is determinism testing (same inputs, two runs), step 5 should be removed. If the intent is golden-vs-test comparison, step 4 should happen before step 2.

Consider adding a comment explaining the rationale, or restructuring to make the intent clearer.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/skills/incore-profiling/vendor/templates/run_sh_template.sh.in
around lines 122 - 133, The npu branch currently runs golden.py, the NPU
executable, copy_outputs_as_golden, then golden.py + executable again before
running compare.py, which makes the two runs use different regenerated inputs;
decide the intent and adjust: for determinism testing remove the second
invocation of golden.py so both runs use identical inputs (change the sequence
around the golden.py call referenced as golden.py and the second golden.py
call), or for golden-vs-test ensure copy_outputs_as_golden runs after the
initial golden.py/generation step and before regenerating inputs (i.e., move or
reorder the copy_outputs_as_golden call relative to the golden.py calls), and
add a short comment describing the chosen intent near GOLDEN_MODE=npu, RUN_MODE
check and around copy_outputs_as_golden, golden.py and compare.py invocations to
make the behavior explicit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

2 participants