Hi ProgramBench team, thanks for releasing the benchmark.
I wanted to ask for clarification about the intended cleanroom boundary for the provided reference executable during inference.
The current ProgramBench / mini-SWE-agent instructions are very clear that agents should infer behavior only by running the provided executable and reading bundled documentation. The default ProgramBench config also disallows internet access (--network none), runs as a non-root user, and drops SYS_PTRACE. The prompt explicitly forbids source lookup, wrapping/reusing the original binary, decompilation/disassembly, and strace/ltrace or similar instrumentation.
My question is whether the following behaviors, which I have observed from the model I tested during inference, should also be explicitly considered cleanroom violations:
- executing the reference with a polluted environment, e.g. PATH=/tmp:$PATH ./executable ... to make it call agent-written fake dependencies
- using loader instrumentation such as LD_PRELOAD, LD_LIBRARY_PATH, or dynamic linker tricks against ./executable
- inspecting runtime process state via /proc/$pid/{maps,fd,environ,cmdline} or core/memory dumps
- changing cwd/tmp/config files specifically to observe implementation-level effects rather than normal public CLI behavior
These are different from normal allowed black-box probing, such as running ./executable --help, passing inputs, and observing stdout/stderr/exit codes/filesystem side effects.
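To make the distinction concrete, here is a rough Python sketch of how a trajectory audit could flag the first three behaviors in agent-issued shell commands. The pattern names and the flag_command helper are hypothetical and not part of ProgramBench or mini-SWE-agent, and the regexes are heuristics that a determined agent could evade.

```python
import re

# Hypothetical heuristics, one per behavior listed above.
# These are illustrative only and not part of ProgramBench.
SUSPECT_PATTERNS = {
    # PATH=<something>:$PATH placed in front of a command
    "path_pollution": re.compile(r"\bPATH=\S*\$\{?PATH\b"),
    # loader variables named in the list above
    "loader_instrumentation": re.compile(r"\b(LD_PRELOAD|LD_LIBRARY_PATH)="),
    # /proc/<pid>/maps, fd, environ, cmdline, mem (with or without brace expansion)
    "proc_inspection": re.compile(
        r"/proc/(\$\w+|\$\{\w+\}|\d+|self)/\{?(maps|fd|environ|cmdline|mem)\b"
    ),
}


def flag_command(command):
    """Return the sorted names of all suspect patterns found in one shell command."""
    return sorted(
        name for name, pattern in SUSPECT_PATTERNS.items() if pattern.search(command)
    )


if __name__ == "__main__":
    for cmd in [
        "./executable --help",
        "PATH=/tmp:$PATH ./executable input.txt",
        "LD_PRELOAD=/tmp/hook.so ./executable",
        "cat /proc/$pid/{maps,fd,environ,cmdline}",
    ]:
        print(cmd, "->", flag_command(cmd))
```

The fourth behavior (changing cwd/tmp/config files to observe implementation-level effects) depends on intent and is not something a simple pattern match like this can decide.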
The current recommended Docker setup appears to run the agent and the reference executable in the same container. This blocks important classes of abuse (--network none, non-root user, SYS_PTRACE dropped), but it does not fully isolate the true reference executable from agent-controlled environment variables, writable /tmp and workspace state, PATH/cwd pollution, or same-container process observation.
Would you consider either:
- documenting the above behaviors explicitly as disallowed instrumentation/wrapping of the oracle, and/or
- adding an optional hardened inference harness where the true reference executable runs behind a separate sanitized oracle process/container, with the agent only able to issue structured black-box execution requests?
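To illustrate the second option, here is a minimal Python sketch of the in-process part of such an oracle: it runs the reference with a fixed environment and a fresh working directory, and returns only stdout, stderr, and the exit code. The function name run_oracle, the SANITIZED_ENV values, and the request/response shape are my own assumptions, not an existing ProgramBench interface. It also does not provide the separate-container isolation on its own; in a real harness this would need to run in a process or container the agent cannot reach.

```python
import os
import subprocess
import tempfile

# Assumed fixed environment; which variables a real harness needs is a guess.
SANITIZED_ENV = {"PATH": "/usr/bin:/bin", "LANG": "C", "HOME": "/nonexistent"}


def run_oracle(executable, args, stdin_data=b"", timeout=10.0):
    """Run the reference once and return only its black-box observables.

    The caller's environment (PATH, LD_PRELOAD, ...) is not inherited,
    and each call gets an empty temporary working directory.
    """
    exe = os.path.abspath(executable)  # resolve before changing cwd
    with tempfile.TemporaryDirectory() as workdir:
        proc = subprocess.run(
            [exe, *args],
            input=stdin_data,
            env=SANITIZED_ENV,      # replaces, rather than extends, os.environ
            cwd=workdir,
            capture_output=True,
            timeout=timeout,        # raises subprocess.TimeoutExpired on overrun
        )
    return {
        "exit_code": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }
```

A call such as run_oracle("./executable", ["--help"]) would then be the only way the agent observes the reference. Filesystem side effects are discarded with the temporary directory in this sketch, so a real design would need an explicit way to pass input files in and report created files back.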
This is not intended as a security vulnerability report. It is a benchmark semantics / reproducibility question: the goal is to ensure scores measure behavioral inference from the public interface, not implementation-level oracle instrumentation.