Move measurement out of kernel.py into the harness for 5 tasks by sdubagun-amd · Pull Request #36 · AMD-AGI/AgentKernelArena

sdubagun-amd · 2026-05-06T06:11:13Z

Closes the "kernel.py contains the timer" attack surface in geak_eval tasks where test_kernel_harness.py imported benchmark_config (or run_benchmark/run_correctness/run_profile) from kernel.py. An agent optimising kernel.py could legitimately rewrite the imported timer function and inflate the reported speedup with no actual GPU change (observed previously on refk_identity at 12.07x).

For each of the 5 in-scope tasks the harness now follows the fused_rms_fp8 structural convention:

refk_identity, refk_fp8_blockwise_mm, ff_backward, lean_atten_paged, gemm_a16wfp4

Per task:

kernel.py: deleted benchmark_config / check_correctness / run_* / evaluate / main blocks, the in-kernel WARMUP/ITERATIONS module constants, the torch_op wrapper, and any *_pytorch reference.
test_kernel_harness.py: rewritten as a self-contained script that
- imports only kernel callables, get_inputs, shape configs, and tolerance constants from kernel.py;
- carries the moved PyTorch reference and uses it ONLY in run_correctness via torch.testing.assert_close;
- times only the candidate kernel via torch.cuda.Event median;
- exposes the standard --correctness / --benchmark / --full-benchmark / --profile modes plus --warmup (default 50) and --iterations (default 200);
- emits GEAK_SHAPES_USED=[...] and GEAK_RESULT_LATENCY_MS=;
- does NOT compute a speedup in-process (no GEAK_RESULT_GEOMEAN_SPEEDUP). Speedup is now the orchestrator's job: run the harness twice (unmodified vs patched kernel) and divide.

Notes:

gemm_a16wfp4: kernel.py was already kernel-only. Harness rewritten to swap perf_counter timing for cuda.Event median and standardise the iteration defaults to 50/200 (was 5/10 and 5/20).
refk_fp8_blockwise_mm: the new benchmark loop allocates the output c once per cfg (the OLD benchmark_config did c.clone() per iteration, which incorrectly timed an allocation alongside the kernel). The baseline absolute ms will drop slightly on this task as a result.
lean_atten_paged: --correctness uses CORRECTNESS_CONFIGS (not HARNESS_SHAPES) to match the kernel.py-side convention.

Out of scope (intentionally): orchestrator file-allowlist on patches, which is the sufficient condition to fully defend against an agent that patches the harness file directly.

Closes the "kernel.py contains the timer" attack surface in geak_eval tasks where test_kernel_harness.py imported benchmark_config (or run_benchmark/run_correctness/run_profile) from kernel.py. An agent optimising kernel.py could legitimately rewrite the imported timer function and inflate the reported speedup with no actual GPU change (observed previously on refk_identity at 12.07x). For each of the 5 in-scope tasks the harness now follows the fused_rms_fp8 structural convention: - refk_identity, refk_fp8_blockwise_mm, ff_backward, lean_atten_paged, gemm_a16wfp4 Per task: - kernel.py: deleted benchmark_config / check_correctness / run_* / evaluate / __main__ blocks, the in-kernel WARMUP/ITERATIONS module constants, the torch_op wrapper, and any *_pytorch reference. - test_kernel_harness.py: rewritten as a self-contained script that * imports only kernel callables, get_inputs, shape configs, and tolerance constants from kernel.py; * carries the moved PyTorch reference and uses it ONLY in run_correctness via torch.testing.assert_close; * times only the candidate kernel via torch.cuda.Event median; * exposes the standard --correctness / --benchmark / --full-benchmark / --profile modes plus --warmup (default 50) and --iterations (default 200); * emits GEAK_SHAPES_USED=[...] and GEAK_RESULT_LATENCY_MS=<x>; * does NOT compute a speedup in-process (no GEAK_RESULT_GEOMEAN_SPEEDUP). Speedup is now the orchestrator's job: run the harness twice (unmodified vs patched kernel) and divide. Notes: - gemm_a16wfp4: kernel.py was already kernel-only. Harness rewritten to swap perf_counter timing for cuda.Event median and standardise the iteration defaults to 50/200 (was 5/10 and 5/20). - refk_fp8_blockwise_mm: the new benchmark loop allocates the output c once per cfg (the OLD benchmark_config did c.clone() per iteration, which incorrectly timed an allocation alongside the kernel). The baseline absolute ms will drop slightly on this task as a result. - lean_atten_paged: --correctness uses CORRECTNESS_CONFIGS (not HARNESS_SHAPES) to match the kernel.py-side convention. Out of scope (intentionally): orchestrator file-allowlist on patches, which is the sufficient condition to fully defend against an agent that patches the harness file directly. Co-authored-by: Cursor <cursoragent@cursor.com>

irvineoy · 2026-05-06T18:18:56Z

One issue I noticed with the sampled harnesses: GEAK_SHAPES_USED no longer reports the original ALL_SHAPES indices for tasks that downsample.

For example:

refk_fp8_blockwise_mm/test_kernel_harness.py builds HARNESS_SHAPES from _harness_indices, but later prints GEAK_SHAPES_USED={list(range(len(shapes)))}.
gemm_a16wfp4/test_kernel_harness.py does the same.

On this PR, the actual sampled source indices are different from the emitted values:

refk_fp8_blockwise_mm: sampled [0, 1, 2, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 18, 19, 20, 21, 22, 23, 24, 26, 27, 28], but reports [0..24].
gemm_a16wfp4: sampled [0, 2, 4, 7, 9, 11, 14, 16, 18, 20, 22, 25, 27, 29, 32, 34, 36, 38, 40, 43, 45, 47, 50, 52, 54], but reports [0..24].

If the orchestrator trusts GEAK_SHAPES_USED for attribution or replaying base-vs-patched comparisons, this can point it at the wrong shapes. Could we preserve/pass the selected source indices and print those instead of the local positions?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Move measurement out of kernel.py into the harness for 5 tasks#36

Move measurement out of kernel.py into the harness for 5 tasks#36
sdubagun-amd wants to merge 1 commit intoAMD-AGI:geak-triton-common-benchmarkfrom
sdubagun-amd:standardize-exploit-relevant-harnesses

sdubagun-amd commented May 6, 2026

Uh oh!

irvineoy commented May 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

sdubagun-amd commented May 6, 2026

Uh oh!

irvineoy commented May 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants