Skip to content

Move measurement out of kernel.py into the harness for 5 tasks#36

Open
sdubagun-amd wants to merge 1 commit intoAMD-AGI:geak-triton-common-benchmarkfrom
sdubagun-amd:standardize-exploit-relevant-harnesses
Open

Move measurement out of kernel.py into the harness for 5 tasks#36
sdubagun-amd wants to merge 1 commit intoAMD-AGI:geak-triton-common-benchmarkfrom
sdubagun-amd:standardize-exploit-relevant-harnesses

Conversation

@sdubagun-amd
Copy link
Copy Markdown
Collaborator

Closes the "kernel.py contains the timer" attack surface in geak_eval tasks where test_kernel_harness.py imported benchmark_config (or run_benchmark/run_correctness/run_profile) from kernel.py. An agent optimising kernel.py could legitimately rewrite the imported timer function and inflate the reported speedup with no actual GPU change (observed previously on refk_identity at 12.07x).

For each of the 5 in-scope tasks the harness now follows the fused_rms_fp8 structural convention:

  • refk_identity, refk_fp8_blockwise_mm, ff_backward, lean_atten_paged, gemm_a16wfp4

Per task:

  • kernel.py: deleted benchmark_config / check_correctness / run_* / evaluate / main blocks, the in-kernel WARMUP/ITERATIONS module constants, the torch_op wrapper, and any *_pytorch reference.
  • test_kernel_harness.py: rewritten as a self-contained script that
    • imports only kernel callables, get_inputs, shape configs, and tolerance constants from kernel.py;
    • carries the moved PyTorch reference and uses it ONLY in run_correctness via torch.testing.assert_close;
    • times only the candidate kernel via torch.cuda.Event median;
    • exposes the standard --correctness / --benchmark / --full-benchmark / --profile modes plus --warmup (default 50) and --iterations (default 200);
    • emits GEAK_SHAPES_USED=[...] and GEAK_RESULT_LATENCY_MS=;
    • does NOT compute a speedup in-process (no GEAK_RESULT_GEOMEAN_SPEEDUP). Speedup is now the orchestrator's job: run the harness twice (unmodified vs patched kernel) and divide.

Notes:

  • gemm_a16wfp4: kernel.py was already kernel-only. Harness rewritten to swap perf_counter timing for cuda.Event median and standardise the iteration defaults to 50/200 (was 5/10 and 5/20).
  • refk_fp8_blockwise_mm: the new benchmark loop allocates the output c once per cfg (the OLD benchmark_config did c.clone() per iteration, which incorrectly timed an allocation alongside the kernel). The baseline absolute ms will drop slightly on this task as a result.
  • lean_atten_paged: --correctness uses CORRECTNESS_CONFIGS (not HARNESS_SHAPES) to match the kernel.py-side convention.

Out of scope (intentionally): orchestrator file-allowlist on patches, which is the sufficient condition to fully defend against an agent that patches the harness file directly.

Closes the "kernel.py contains the timer" attack surface in
geak_eval tasks where test_kernel_harness.py imported benchmark_config
(or run_benchmark/run_correctness/run_profile) from kernel.py.  An agent
optimising kernel.py could legitimately rewrite the imported timer
function and inflate the reported speedup with no actual GPU change
(observed previously on refk_identity at 12.07x).

For each of the 5 in-scope tasks the harness now follows the
fused_rms_fp8 structural convention:

- refk_identity, refk_fp8_blockwise_mm, ff_backward, lean_atten_paged,
  gemm_a16wfp4

Per task:

- kernel.py: deleted benchmark_config / check_correctness / run_*
  / evaluate / __main__ blocks, the in-kernel WARMUP/ITERATIONS
  module constants, the torch_op wrapper, and any *_pytorch reference.
- test_kernel_harness.py: rewritten as a self-contained script that
    * imports only kernel callables, get_inputs, shape configs, and
      tolerance constants from kernel.py;
    * carries the moved PyTorch reference and uses it ONLY in
      run_correctness via torch.testing.assert_close;
    * times only the candidate kernel via torch.cuda.Event median;
    * exposes the standard --correctness / --benchmark /
      --full-benchmark / --profile modes plus --warmup (default 50)
      and --iterations (default 200);
    * emits GEAK_SHAPES_USED=[...] and GEAK_RESULT_LATENCY_MS=<x>;
    * does NOT compute a speedup in-process (no
      GEAK_RESULT_GEOMEAN_SPEEDUP).  Speedup is now the orchestrator's
      job: run the harness twice (unmodified vs patched kernel) and
      divide.

Notes:

- gemm_a16wfp4: kernel.py was already kernel-only.  Harness rewritten
  to swap perf_counter timing for cuda.Event median and standardise
  the iteration defaults to 50/200 (was 5/10 and 5/20).
- refk_fp8_blockwise_mm: the new benchmark loop allocates the output
  c once per cfg (the OLD benchmark_config did c.clone() per iteration,
  which incorrectly timed an allocation alongside the kernel).  The
  baseline absolute ms will drop slightly on this task as a result.
- lean_atten_paged: --correctness uses CORRECTNESS_CONFIGS (not
  HARNESS_SHAPES) to match the kernel.py-side convention.

Out of scope (intentionally): orchestrator file-allowlist on patches,
which is the sufficient condition to fully defend against an agent
that patches the harness file directly.

Co-authored-by: Cursor <cursoragent@cursor.com>
@irvineoy
Copy link
Copy Markdown
Collaborator

irvineoy commented May 6, 2026

One issue I noticed with the sampled harnesses: GEAK_SHAPES_USED no longer reports the original ALL_SHAPES indices for tasks that downsample.

For example:

  • refk_fp8_blockwise_mm/test_kernel_harness.py builds HARNESS_SHAPES from _harness_indices, but later prints GEAK_SHAPES_USED={list(range(len(shapes)))}.
  • gemm_a16wfp4/test_kernel_harness.py does the same.

On this PR, the actual sampled source indices are different from the emitted values:

  • refk_fp8_blockwise_mm: sampled [0, 1, 2, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 18, 19, 20, 21, 22, 23, 24, 26, 27, 28], but reports [0..24].
  • gemm_a16wfp4: sampled [0, 2, 4, 7, 9, 11, 14, 16, 18, 20, 22, 25, 27, 29, 32, 34, 36, 38, 40, 43, 45, 47, 50, 52, 54], but reports [0..24].

If the orchestrator trusts GEAK_SHAPES_USED for attribution or replaying base-vs-patched comparisons, this can point it at the wrong shapes. Could we preserve/pass the selected source indices and print those instead of the local positions?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants