Move measurement out of kernel.py into the harness for 5 tasks#36
Open
sdubagun-amd wants to merge 1 commit intoAMD-AGI:geak-triton-common-benchmarkfrom
Open
Conversation
Closes the "kernel.py contains the timer" attack surface in
geak_eval tasks where test_kernel_harness.py imported benchmark_config
(or run_benchmark/run_correctness/run_profile) from kernel.py. An agent
optimising kernel.py could legitimately rewrite the imported timer
function and inflate the reported speedup with no actual GPU change
(observed previously on refk_identity at 12.07x).
For each of the 5 in-scope tasks the harness now follows the
fused_rms_fp8 structural convention:
- refk_identity, refk_fp8_blockwise_mm, ff_backward, lean_atten_paged,
gemm_a16wfp4
Per task:
- kernel.py: deleted benchmark_config / check_correctness / run_*
/ evaluate / __main__ blocks, the in-kernel WARMUP/ITERATIONS
module constants, the torch_op wrapper, and any *_pytorch reference.
- test_kernel_harness.py: rewritten as a self-contained script that
* imports only kernel callables, get_inputs, shape configs, and
tolerance constants from kernel.py;
* carries the moved PyTorch reference and uses it ONLY in
run_correctness via torch.testing.assert_close;
* times only the candidate kernel via torch.cuda.Event median;
* exposes the standard --correctness / --benchmark /
--full-benchmark / --profile modes plus --warmup (default 50)
and --iterations (default 200);
* emits GEAK_SHAPES_USED=[...] and GEAK_RESULT_LATENCY_MS=<x>;
* does NOT compute a speedup in-process (no
GEAK_RESULT_GEOMEAN_SPEEDUP). Speedup is now the orchestrator's
job: run the harness twice (unmodified vs patched kernel) and
divide.
Notes:
- gemm_a16wfp4: kernel.py was already kernel-only. Harness rewritten
to swap perf_counter timing for cuda.Event median and standardise
the iteration defaults to 50/200 (was 5/10 and 5/20).
- refk_fp8_blockwise_mm: the new benchmark loop allocates the output
c once per cfg (the OLD benchmark_config did c.clone() per iteration,
which incorrectly timed an allocation alongside the kernel). The
baseline absolute ms will drop slightly on this task as a result.
- lean_atten_paged: --correctness uses CORRECTNESS_CONFIGS (not
HARNESS_SHAPES) to match the kernel.py-side convention.
Out of scope (intentionally): orchestrator file-allowlist on patches,
which is the sufficient condition to fully defend against an agent
that patches the harness file directly.
Co-authored-by: Cursor <cursoragent@cursor.com>
Collaborator
|
One issue I noticed with the sampled harnesses: For example:
On this PR, the actual sampled source indices are different from the emitted values:
If the orchestrator trusts |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes the "kernel.py contains the timer" attack surface in geak_eval tasks where test_kernel_harness.py imported benchmark_config (or run_benchmark/run_correctness/run_profile) from kernel.py. An agent optimising kernel.py could legitimately rewrite the imported timer function and inflate the reported speedup with no actual GPU change (observed previously on refk_identity at 12.07x).
For each of the 5 in-scope tasks the harness now follows the fused_rms_fp8 structural convention:
Per task:
Notes:
Out of scope (intentionally): orchestrator file-allowlist on patches, which is the sufficient condition to fully defend against an agent that patches the harness file directly.