Clarify and harden cleanroom rules around reference executable instrumentation

Hi ProgramBench team, thanks for releasing the benchmark.

I wanted to ask for clarification about the intended cleanroom boundary for the provided reference executable during inference.

The current ProgramBench / mini-SWE-agent instructions are very clear that agents should infer behavior only by running the provided executable and reading bundled documentation. The default ProgramBench config also disallows internet access (`--network none`), runs as a non-root user, and drops `SYS_PTRACE`. The prompt explicitly forbids source lookup, wrapping/reusing the original binary, decompilation/disassembly, and `strace`/`ltrace` or similar instrumentation.

My question is whether the following behaviors, which I observed during inference with the model I tested, should also be explicitly considered cleanroom violations:

- executing the reference with a polluted environment, e.g. `PATH=/tmp:$PATH ./executable ...` to make it call agent-written fake dependencies
- using loader instrumentation such as `LD_PRELOAD`, `LD_LIBRARY_PATH`, or dynamic linker tricks against `./executable`
- inspecting runtime process state via `/proc/$pid/{maps,fd,environ,cmdline}` or core/memory dumps
- changing the working directory, temp files, or config files specifically to observe implementation-level effects rather than normal public CLI behavior

These are different from normal allowed black-box probing, such as running `./executable --help`, passing inputs, and observing stdout/stderr/exit codes/filesystem side effects.
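To make the allowed side of that boundary concrete, here is a minimal sketch of what I would consider a plain black-box probe: run the executable with given arguments and stdin in a fresh working directory, then record only stdout, stderr, the exit code, and the files it created. The names `probe` and `Observation` are hypothetical and not part of ProgramBench or mini-SWE-agent; the executable should be passed as an absolute path because the working directory changes.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Observation:
    """Everything a black-box probe is allowed to look at (hypothetical type)."""
    stdout: bytes
    stderr: bytes
    exit_code: int
    created_files: list


def probe(argv, stdin=b"", timeout=10.0):
    """Run argv in a fresh temp directory and record only public outputs."""
    with tempfile.TemporaryDirectory() as workdir:
        result = subprocess.run(
            argv,
            input=stdin,
            capture_output=True,
            cwd=workdir,
            timeout=timeout,
        )
        # Filesystem side effects: files left behind in the working directory.
        created = sorted(
            str(p.relative_to(workdir))
            for p in Path(workdir).rglob("*")
            if p.is_file()
        )
    return Observation(result.stdout, result.stderr, result.returncode, created)
```

Nothing here touches the loader, the process table, or the environment of the child beyond what it inherits by default, which is the distinction I am trying to draw.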

The current recommended Docker setup appears to run the agent and the reference executable in the same container. This blocks important classes of abuse (`--network none`, non-root user, `SYS_PTRACE` dropped), but it does not fully isolate the true reference executable from agent-controlled environment variables, writable `/tmp`/workspace state, PATH/cwd pollution, or same-container process observation.
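Since the same-container setup cannot prevent these patterns, one option short of a new harness would be to audit agent command logs after the fact. The following is a rough sketch that assumes each agent command is available as a shell string; the rule names and regular expressions are my own illustration, not an official rule set, and they would both miss obfuscated commands and flag some legitimate ones. The fourth behavior in my list (changing cwd/tmp/config files to observe implementation-level effects) depends on intent and is not covered by pattern matching.

```python
import re

# Hypothetical rules, one per mechanically detectable category.
RULES = {
    "polluted PATH": re.compile(r"\bPATH=\S+"),
    "loader instrumentation": re.compile(r"\bLD_(PRELOAD|LIBRARY_PATH|AUDIT)="),
    "process state inspection": re.compile(
        r"/proc/(\$\w+|\$\{\w+\}|\d+|self)/(maps|fd|environ|cmdline|mem)\b"
    ),
}


def flag_command(command):
    """Return the names of all rules that a shell command string matches."""
    return [name for name, pattern in RULES.items() if pattern.search(command)]


if __name__ == "__main__":
    for cmd in [
        "./executable --help",
        "PATH=/tmp:$PATH ./executable run",
        "LD_PRELOAD=/tmp/hook.so ./executable",
        "cat /proc/$pid/maps",
    ]:
        print(cmd, "->", flag_command(cmd))
```

A flagged command would only be a candidate for manual review, not proof of a violation.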

Would you consider either:

1. documenting the above behaviors explicitly as disallowed instrumentation/wrapping of the oracle, and/or
2. adding an optional hardened inference harness where the true reference executable runs behind a separate sanitized oracle process/container, with the agent only able to issue structured black-box execution requests?
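As a rough illustration of option 2, the oracle side could accept a small JSON request and return only public outputs. This is a sketch under my own assumptions: the request fields (`args`, `stdin`), the fixed environment, and the function name `handle_request` are all hypothetical, and a real harness would run this handler in a separate container rather than in the agent's process.

```python
import json
import subprocess
import sys
import tempfile

# Assumption: a fixed, minimal environment chosen by the harness, not the agent.
FIXED_ENV = {"PATH": "/usr/bin:/bin", "LANG": "C"}
ALLOWED_KEYS = {"args", "stdin"}


def handle_request(raw_request, executable, timeout=10.0):
    """Execute one structured black-box request and return a JSON response.

    `executable` is the argv prefix of the reference program (absolute path).
    """
    request = json.loads(raw_request)
    unknown = set(request) - ALLOWED_KEYS
    if unknown:
        # Fields such as "env" or "cwd" are rejected rather than ignored.
        return json.dumps({"error": "unsupported fields: %s" % sorted(unknown)})
    args = request.get("args", [])
    if not isinstance(args, list) or not all(isinstance(a, str) for a in args):
        return json.dumps({"error": "args must be a list of strings"})
    with tempfile.TemporaryDirectory() as workdir:
        try:
            result = subprocess.run(
                [*executable, *args],
                input=request.get("stdin", ""),
                capture_output=True,
                text=True,
                env=dict(FIXED_ENV),  # agent-controlled variables never reach the oracle
                cwd=workdir,          # fresh working directory per request
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return json.dumps({"error": "timeout"})
    return json.dumps({
        "stdout": result.stdout,
        "stderr": result.stderr,
        "exit_code": result.returncode,
    })
```

Because the agent can only choose arguments and stdin, variables like `LD_PRELOAD` or a modified `PATH` set on the agent side have no way to reach the reference executable in this design.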

This is not intended as a security vulnerability report. It is a benchmark semantics / reproducibility question: the goal is to ensure scores measure behavioral inference from the public interface, not implementation-level oracle instrumentation.

Clarify and harden cleanroom rules around reference executable instrumentation #44