Fix Princeton cross-entropy replay exploit via phase-specific inputs by Ammaar-Alam · Pull Request #142 · gpu-mode/reference-kernels

Ammaar-Alam · 2026-04-08T20:30:31Z

Summary

Fixes a benchmark replay exploit in problems/princeton/cross_entropy_py/eval.py where a submission can exploit the evaluator's reuse of the exact same (logits, targets, grad_output) tensors across warmup and timed phases.

In the current evaluator, a benchmark-aware submission can:

compute correct outputs during warmup or earlier timed phases
arm a replay path keyed on input identity / benchmark phase structure
skip the real computation during the ranked combined phase while still returning already-populated correct outputs

This is especially severe for the Princeton task because the leaderboard ranks by median combined time only.

Changes

add a private per-run seed schedule for benchmark inputs
generate distinct inputs for each benchmark phase:
- warmup
- forward-only timing
- backward-only timing
- combined timing
reuse the same private seed schedule for both the baseline and the submission under test so the speedup comparison still sees matched workloads

Why this fix

The exploit depends on the ranked combined loop reusing the same exact tensors that were already seen during warmup or earlier timing loops.

This patch breaks that assumption while keeping input generation outside the measured CUDA event region:

the combined timed inputs are no longer identical to warmup inputs
the combined timed inputs are no longer identical to forward-only or backward-only timed inputs
each combined timed iteration gets its own distinct inputs

That removes the replay surface without turning the benchmark into a measurement of evaluator overhead.

Why this approach instead of per-call sync timing

PR #104 discusses a stronger per-call timing / sync approach for grouped GEMM exploits. That style of fix is useful when call deferral across a window is the main threat, but it also adds meaningful evaluator overhead.

For this Princeton task, the primary vulnerability is simpler: the evaluator reuses the exact same tensors across warmup and the ranked combined loop. Distinct phase-specific inputs are enough to close that loophole while preserving the current low-overhead timing structure.

Maintainer note

After merge, the Princeton leaderboard should be resynced / redeployed so the hosted evaluator picks up the patched eval.py.

Ammaar-Alam · 2026-04-08T20:33:32Z

(this pr was automated entirely w/ codex so double check carefully pls)

S1ro1

LGTM.

Fix Princeton cross-entropy replay exploit

5a11ebd

S1ro1 approved these changes Apr 8, 2026

View reviewed changes

S1ro1 merged commit 16022b0 into gpu-mode:main Apr 8, 2026

Ammaar-Alam deleted the fix/princeton-cross-entropy-replay-exploit branch April 8, 2026 21:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix Princeton cross-entropy replay exploit via phase-specific inputs#142

Fix Princeton cross-entropy replay exploit via phase-specific inputs#142
S1ro1 merged 1 commit intogpu-mode:mainfrom
Ammaar-Alam:fix/princeton-cross-entropy-replay-exploit

Ammaar-Alam commented Apr 8, 2026

Uh oh!

Ammaar-Alam commented Apr 8, 2026

Uh oh!

S1ro1 left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Ammaar-Alam commented Apr 8, 2026

Summary

Changes

Why this fix

Why this approach instead of per-call sync timing

Maintainer note

Uh oh!

Ammaar-Alam commented Apr 8, 2026

Uh oh!

S1ro1 left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants