Fix Princeton cross-entropy replay exploit via phase-specific inputs#142
Merged
S1ro1 merged 1 commit intogpu-mode:mainfrom Apr 8, 2026
Merged
Conversation
Contributor
Author
|
(this pr was automated entirely w/ codex so double check carefully pls) |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes a benchmark replay exploit in
problems/princeton/cross_entropy_py/eval.pywhere a submission can exploit the evaluator's reuse of the exact same(logits, targets, grad_output)tensors across warmup and timed phases.In the current evaluator, a benchmark-aware submission can:
This is especially severe for the Princeton task because the leaderboard ranks by median combined time only.
Changes
Why this fix
The exploit depends on the ranked combined loop reusing the same exact tensors that were already seen during warmup or earlier timing loops.
This patch breaks that assumption while keeping input generation outside the measured CUDA event region:
That removes the replay surface without turning the benchmark into a measurement of evaluator overhead.
Why this approach instead of per-call sync timing
PR #104 discusses a stronger per-call timing / sync approach for grouped GEMM exploits. That style of fix is useful when call deferral across a window is the main threat, but it also adds meaningful evaluator overhead.
For this Princeton task, the primary vulnerability is simpler: the evaluator reuses the exact same tensors across warmup and the ranked combined loop. Distinct phase-specific inputs are enough to close that loophole while preserving the current low-overhead timing structure.
Maintainer note
After merge, the Princeton leaderboard should be resynced / redeployed so the hosted evaluator picks up the patched
eval.py.