Clarify and harden cleanroom rules around reference executable instrumentation #44

@BearCleverProud

Hi ProgramBench team, thanks for releasing the benchmark.

I wanted to ask for clarification about the intended cleanroom boundary for the provided reference executable during inference.

The current ProgramBench / mini-SWE-agent instructions are very clear that agents should infer behavior only by running the provided executable and reading bundled documentation. The default ProgramBench config also disallows internet access (--network none), runs as a non-root user, and drops SYS_PTRACE. The prompt explicitly forbids source lookup, wrapping/reusing the original binary, decompilation/disassembly, and strace/ltrace or similar instrumentation.

My question is whether the following behaviors, which I have observed during inference with the model I tested, should also be explicitly considered cleanroom violations:

  • executing the reference with a polluted environment, e.g. PATH=/tmp:$PATH ./executable ... to make it call agent-written fake dependencies
  • using loader instrumentation such as LD_PRELOAD, LD_LIBRARY_PATH, or dynamic linker tricks against ./executable
  • inspecting runtime process state via /proc/$pid/{maps,fd,environ,cmdline} or core/memory dumps
  • changing cwd/tmp/config files specifically to observe implementation-level effects rather than normal public CLI behavior

These are different from normal allowed black-box probing, such as running ./executable --help, passing inputs, and observing stdout/stderr/exit codes/filesystem side effects.
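To make the distinction concrete, here is a rough sketch of how a trajectory audit could flag the first three patterns above. It assumes the agent's shell commands are available as plain strings. The names `SUSPECT_PATTERNS` and `flag_command` and the regexes themselves are my own illustration, not part of ProgramBench or mini-SWE-agent.

```python
import re

# Hypothetical heuristics for auditing agent shell commands.
# These are illustrative only and not part of ProgramBench.
SUSPECT_PATTERNS = {
    # PATH overridden on the command line or via export
    "env_pollution": re.compile(r"\bPATH="),
    # dynamic loader variables set for a command
    "loader_instrumentation": re.compile(r"\bLD_(PRELOAD|LIBRARY_PATH|AUDIT)="),
    # reads of per-process state under /proc
    "proc_inspection": re.compile(
        r"/proc/(\$\w+|\$\{\w+\}|\d+|self)/(maps|fd|environ|cmdline|mem)\b"
    ),
}


def flag_command(command: str) -> list[str]:
    """Return the sorted names of all suspect patterns found in one command."""
    return sorted(
        name for name, pattern in SUSPECT_PATTERNS.items() if pattern.search(command)
    )
```

A text match like this is only a first pass. It will flag legitimate uses (for example, setting `PATH` for the agent's own build) and miss obfuscated ones, so flagged commands would still need manual review.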

The current recommended Docker setup appears to run the agent and the reference executable in the same container. This blocks important classes of abuse (--network none, non-root user, SYS_PTRACE dropped), but it does not fully isolate the true reference executable from agent-controlled environment variables, writable /tmp/workspace state, PATH/cwd pollution, or same-container process observation.
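As an illustration of the environment part of this gap, a harness-controlled launcher could rebuild the environment from an allowlist instead of inheriting the agent's. This is a minimal sketch. `SAFE_PATH`, `ALLOWED_VARS`, and `sanitized_env` are hypothetical names, and the allowlist is an assumption rather than anything ProgramBench specifies.

```python
import os

# Assumed values for illustration; a real harness would choose its own.
SAFE_PATH = "/usr/local/bin:/usr/bin:/bin"
ALLOWED_VARS = ("HOME", "LANG", "LC_ALL", "TERM", "TZ")


def sanitized_env(parent_env=None):
    """Build an environment from an allowlist and a fixed PATH.

    Anything not listed in ALLOWED_VARS (LD_PRELOAD, LD_LIBRARY_PATH,
    a modified PATH, ...) is dropped rather than inherited.
    """
    source = os.environ if parent_env is None else parent_env
    env = {key: source[key] for key in ALLOWED_VARS if key in source}
    env["PATH"] = SAFE_PATH
    return env
```

This only helps if the harness, not the agent, launches the reference. While both share a container, the agent can still invoke `./executable` directly with any environment it likes.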

Would you consider either:

  1. documenting the above behaviors explicitly as disallowed instrumentation/wrapping of the oracle, and/or
  2. adding an optional hardened inference harness where the true reference executable runs behind a separate sanitized oracle process/container, with the agent only able to issue structured black-box execution requests?
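To sketch what option 2 could look like: the agent would submit a structured request (arguments, stdin, input files), and a separate oracle process would run the reference in a fresh directory with a fixed environment. Everything below (`ExecRequest`, `ExecResult`, `run_oracle`, the fixed `PATH`) is a hypothetical interface of my own, not an existing ProgramBench API. The demo uses the Python interpreter as a stand-in for the reference executable.

```python
import os
import subprocess
import sys
import tempfile
from dataclasses import dataclass, field


@dataclass
class ExecRequest:
    """Hypothetical black-box request: only public CLI inputs."""
    argv: list[str]
    stdin: bytes = b""
    files: dict[str, bytes] = field(default_factory=dict)  # flat file name -> content


@dataclass
class ExecResult:
    exit_code: int
    stdout: bytes
    stderr: bytes


def run_oracle(executable: str, req: ExecRequest, timeout: float = 10.0) -> ExecResult:
    """Run the executable in a fresh temp dir with a fixed environment."""
    env = {"PATH": "/usr/local/bin:/usr/bin:/bin"}  # assumed fixed PATH
    with tempfile.TemporaryDirectory() as workdir:
        for name, data in req.files.items():
            # Accept only plain file names so requests cannot escape workdir.
            if name in ("", ".", "..") or os.path.basename(name) != name:
                raise ValueError(f"invalid file name: {name!r}")
            with open(os.path.join(workdir, name), "wb") as handle:
                handle.write(data)
        proc = subprocess.run(
            [executable, *req.argv],
            input=req.stdin,
            capture_output=True,
            cwd=workdir,
            env=env,
            timeout=timeout,
        )
    return ExecResult(proc.returncode, proc.stdout, proc.stderr)


if __name__ == "__main__":
    # Stand-in demo: the Python interpreter plays the role of the reference.
    demo = ExecRequest(argv=["-c", "print(open('in.txt').read())"],
                       files={"in.txt": b"data"})
    print(run_oracle(sys.executable, demo))
```

The sketch leaves out filesystem side effects (files the reference writes), which a real harness would also need to return, and it does not itself provide the process or container separation. That would come from where `run_oracle` is hosted.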

This is not intended as a security vulnerability report. It is a benchmark semantics / reproducibility question: the goal is to ensure scores measure behavioral inference from the public interface, not implementation-level oracle instrumentation.
