Add docker-tests offline evaluation backend by madmax983 · Pull Request #629 · madmax983/rust_swe_agent

madmax983 · 2026-06-06T23:20:52Z

Summary

This PR introduces a new docker-tests offline evaluation backend that applies candidate patches inside Docker containers and runs test suites locally without any network calls to external evaluation services. This complements the existing sb-cli backend and enables fully offline evaluation workflows.

Key Changes

New EvaluateBackend::DockerTests variant: Added to the backend enum alongside existing None, SbCli, and Rehearsal options
New EvalExitReason::SkippedNoImage variant: Instances without a canonical Docker image are explicitly marked as skipped (never silently unresolved), distinguishing them from other failure modes
Docker-based evaluation logic: Implemented run_docker_tests() and supporting functions that:
- Load dataset metadata to find Docker images and test lists
- Start detached containers with the canonical image
- Apply patches via git apply inside containers
- Run FAIL_TO_PASS and PASS_TO_PASS test suites via pytest
- Parse pytest output to classify test results
- Enforce per-instance timeouts with deadline-aware process management
- Clean up containers via RAII guard pattern
Provenance tracking: Added DockerTestsProvenance struct to record backend-specific metadata (timeout, parallel worker count) in evaluation results
Dataset integration: Extracts Docker image names and test lists from dataset JSONL entries, with fallback to other["image"] field
Evaluator selftest support: Extended evaluator_selftest::run() to accept "docker-tests" backend and route instances through the new offline evaluator
CLI integration: Updated bench_evaluate() to parse --backend docker-tests flag
Comprehensive test suite: Added 20+ tests covering backend variant existence, CLI parsing, verdict schema compatibility, timeout semantics, image resolution, and downstream command compatibility

Implementation Details

Timeout handling: Uses wall-clock deadline with polling loop (200ms intervals) to enforce per-instance timeouts across patch application and test execution
Container lifecycle: Leverages docker run --rm with explicit docker rm -f cleanup via ContainerGuard RAII pattern to prevent orphaned containers
Test result parsing: Extracts test IDs from pytest -v --tb=no output, handling PASSED/FAILED/ERROR prefixes and optional failure reasons
Resolution logic: Instance is resolved iff all FAIL_TO_PASS tests pass AND no PASS_TO_PASS tests fail (conservative semantics)
Patch injection: Writes patch content via stdin to docker exec bash -c "cat > /tmp/candidate.patch" to avoid shell escaping issues
Dataset flexibility: Supports both direct image field and JSON-encoded test lists in FAIL_TO_PASS/PASS_TO_PASS fields

Testing

All existing tests updated to include docker_tests: None in provenance structures. New test file tests/bench_docker_tests_eval.rs validates:

Backend variant and CLI parsing
Verdict schema (SkippedNoImage serialization/deserialization)
Image resolution and skipping behavior
Timeout configuration propagation
Downstream command compatibility (inspect, compare)
Disk artifact generation (evaluation.json)

https://claude.ai/code/session_01EvN3BDAbWh1cme8isYoPQx

Adds `--backend docker-tests` to `bench evaluate`, enabling a real resolved/unresolved verdict with zero network calls to any evaluation service. Each instance's candidate patch is applied inside its canonical Docker image; FAIL_TO_PASS and PASS_TO_PASS tests are run via pytest and the verdict is produced locally. Key changes: - `EvaluateBackend::DockerTests` variant + CLI string `"docker-tests"` - `EvalExitReason::SkippedNoImage`: instances with no known Docker image are reported as explicitly skipped, never silently unresolved (AC3) - `DockerTestsProvenance` struct recorded in `EvaluatorProvenance` with timeout_per_instance_secs, satisfying the same provenance schema contract as the sb-cli backend (AC2) - Per-instance timeout enforced via polling loop with kill (AC4) - `evaluator-selftest` accepts `"docker-tests"` backend and routes gold patches through the offline pipeline (AC5) - All downstream consumers (compare, inspect, near-miss, test-progress) handle the new `SkippedNoImage` exit reason (AC2) TDD: 16 tests added in bench_docker_tests_eval.rs covering all ACs. https://claude.ai/code/session_01EvN3BDAbWh1cme8isYoPQx

gemini-code-assist

Code Review

This pull request introduces a new offline evaluator backend, docker-tests, which runs evaluations locally inside Docker containers. It adds the DockerTests backend variant, implements the Docker container lifecycle and pytest execution logic, and integrates the backend with CLI parsing, selftests, and downstream commands. Feedback on the implementation highlights several critical issues: a parsing bug where pytest output is incorrectly assumed to have status prefixes instead of suffixes; a potential deadlock when using Stdio::piped() for large test outputs; a lack of validation and error propagation for the required dataset_path; and a missing timeout on the synchronous docker run command which could cause the evaluation to hang indefinitely.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-06T23:23:15Z

+    for line in output.lines() {
+        let line = line.trim();
+        if let Some(rest) = line.strip_prefix("PASSED ") {
+            let test_id = rest.split_whitespace().next().unwrap_or("").to_owned();
+            if !test_id.is_empty() {
+                passed.push(test_id);
+            }
+        } else if let Some(rest) = line.strip_prefix("FAILED ") {
+            let test_id = rest.split_whitespace().next().unwrap_or("");
+            // Strip trailing " - <reason>" that pytest sometimes appends.
+            let test_id = test_id.split(" - ").next().unwrap_or(test_id).to_owned();
+            if !test_id.is_empty() {
+                failed.push(test_id);
+            }
+        } else if let Some(rest) = line.strip_prefix("ERROR ") {
+            let test_id = rest.split_whitespace().next().unwrap_or("").to_owned();
+            if !test_id.is_empty() {
+                failed.push(test_id);
+            }
+        }
+    }


Standard pytest -v output formats each test result line with the status as a suffix (e.g., tests/test_foo.py::test_bar PASSED [ 50%]), rather than a prefix. Because the current implementation uses strip_prefix, it will fail to parse any test results, causing all tests to be treated as failed by the fallback logic.

Using a substring search for PASSED, FAILED, and ERROR is extremely robust and correctly extracts the test ID from both standard and verbose pytest outputs.

for line in output.lines() { let line = line.trim(); if let Some(idx) = line.find(" PASSED") { let test_id = line[..idx].trim().to_owned(); if !test_id.is_empty() { passed.push(test_id); } } else if let Some(idx) = line.find(" FAILED") { let test_id = line[..idx].trim(); // Strip trailing " - <reason>" that pytest sometimes appends. let test_id = test_id.split(" - ").next().unwrap_or(test_id).to_owned(); if !test_id.is_empty() { failed.push(test_id); } } else if let Some(idx) = line.find(" ERROR") { let test_id = line[..idx].trim().to_owned(); if !test_id.is_empty() { failed.push(test_id); } } }

gemini-code-assist · 2026-06-06T23:23:15Z

+    let pytest_cmd = format!(
+        "cd /testbed && python -m pytest -v --tb=no --no-header -rN {tests_arg} 2>&1 || true"
+    );


When spawning a child process with Stdio::piped() for stdout/stderr, the operating system pipe buffer is limited (typically 64KB on Linux). If the child process (pytest) writes more than this limit, it will block indefinitely on the write system call because the parent process is busy-waiting in run_with_timeout and not reading from the pipes. This causes a deadlock, resulting in false timeouts for verbose test suites.

To completely avoid this deadlock, redirect the pytest output to a file inside the container (e.g., /tmp/pytest.output), and then read the file contents after the command execution completes.

let pytest_cmd = format!( "cd /testbed && python -m pytest -v --tb=no --no-header -rN {tests_arg} > /tmp/pytest.output 2>&1 || true" );

gemini-code-assist · 2026-06-06T23:23:15Z

+    let (timed_out, output_text) = match test_out {
+        Ok(o) => (false, String::from_utf8_lossy(&o.stdout).into_owned()),
+        Err(TimedOut) => {
+            return Ok(DockerTestVerdict {
+                resolved: false,
+                timed_out: true,
+                patch_apply_failed: false,
+                patch_error_log: None,
+                tests_passed: vec![],
+                tests_failed: vec![],
+            });
+        }
+    };


This reads the redirected pytest output file from the container after execution, avoiding the pipe buffer deadlock.

let (timed_out, output_text) = match test_out { Ok(_) => { let output_out = Command::new("docker") .args(["exec", &container_id, "cat", "/tmp/pytest.output"]) .output() .map_err(|e| format!("failed to read pytest output: {e}"))?; (false, String::from_utf8_lossy(&output_out.stdout).into_owned()) } Err(TimedOut) => { return Ok(DockerTestVerdict { resolved: false, timed_out: true, patch_apply_failed: false, patch_error_log: None, tests_passed: vec![], tests_failed: vec![], }); } };

gemini-code-assist · 2026-06-06T23:23:15Z

+    let dataset_map: HashMap<String, swebench::SweBenchInstance> = args
+        .dataset_path
+        .as_ref()
+        .and_then(|p| swebench::load_dataset(p).ok())
+        .unwrap_or_default()
+        .into_iter()
+        .map(|inst| (inst.instance_id.clone(), inst))
+        .collect();


The docker-tests backend absolutely requires the dataset to extract the canonical Docker image and test lists for each instance. If dataset_path is not provided, all instances will be silently skipped with SkippedNoImage without any error feedback. Additionally, if the dataset fails to load, the error should be propagated rather than silently swallowed.

Validate that dataset_path is present and propagate any loading errors using ?.

let dataset_path = args.dataset_path.as_ref().ok_or_else(|| { Error::Config(crate::error::ConfigError::Invalid( "dataset path is required for the `docker-tests` backend".to_owned(), )) })?; let dataset_map: HashMap<String, swebench::SweBenchInstance> = swebench::load_dataset(dataset_path)? .into_iter() .map(|inst| (inst.instance_id.clone(), inst)) .collect();

gemini-code-assist · 2026-06-06T23:23:15Z

+    let run_out = Command::new("docker")
+        .args([
+            "run",
+            "-d",
+            "--rm",
+            "--label",
+            "maxwells-daemon=1",
+            "-w",
+            "/testbed",
+        ])
+        .arg(image)
+        .args(["sleep", "infinity"])
+        .output()
+        .map_err(|e| format!("docker run failed: {e}"))?;


The docker run command is executed synchronously without a timeout. If the Docker image needs to be pulled and the network hangs, or if the Docker daemon is unresponsive, the entire evaluation will hang indefinitely, ignoring the configured timeout.

Use run_with_timeout to execute docker run with the remaining timeout budget.

let run_out = run_with_timeout( Command::new("docker") .args([ "run", "-d", "--rm", "--label", "maxwells-daemon=1", "-w", "/testbed", ]) .arg(image) .args(["sleep", "infinity"]), timeout, ) .map_err(|_| "docker run timed out or failed".to_owned())?;

codecov · 2026-06-06T23:26:33Z

Codecov Report

❌ Patch coverage is 60.48265% with 262 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.18%. Comparing base (7b57e1c) to head (7b6c593).

Files with missing lines	Patch %	Lines
src/run/evaluate.rs	56.31%	242 Missing ⚠️
src/run/evaluator_selftest.rs	84.61%	16 Missing ⚠️
src/cli/mod.rs	0.00%	2 Missing ⚠️
src/run/inspect.rs	0.00%	1 Missing ⚠️
src/run/near_miss.rs	0.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##            trunk     #629      +/-   ##
==========================================
- Coverage   85.42%   85.18%   -0.25%     
==========================================
  Files         117      117              
  Lines       67245    67897     +652     
==========================================
+ Hits        57445    57838     +393     
- Misses       9800    10059     +259

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cef47db0ee

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-06T23:28:04Z

+
+    for line in output.lines() {
+        let line = line.trim();
+        if let Some(rest) = line.strip_prefix("PASSED ") {


Parse pytest's nodeid-first verbose output

When docker-tests runs real pytest with -v, result lines are emitted as <nodeid> PASSED/<nodeid> FAILED, not PASSED <nodeid> (I checked the exact command shape used here with a sample pytest run). In any instance with non-empty FAIL_TO_PASS tests, this parser never records successes, then the “not run” fallback marks every requested test failed, so valid patches are reported unresolved. Please parse the nodeid-first format as well as any alternate forms.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-06T23:28:04Z

+    }
+
+    write_synthetic_results_json(&scratch, &eval_pairs);
+    write_synthetic_predictions(&scratch, &eval_pairs);


Write patch files for docker-tests selftest

For evaluator-selftest --backend docker-tests with an image and a non-empty gold patch, this only writes all_preds.jsonl, but the docker-tests evaluator reads the candidate from the sweep patch path (<scratch>/<id>/run-1.patch or legacy patch files) and never consumes the predictions file. As a result, gold-patch selftests are classified as skipped_no_patch/gold-patch-missing instead of actually evaluating the gold patch; please materialize each patch file in the synthetic sweep before calling evaluate::run().

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-06T23:28:04Z

        EvaluateBackend::SbCli => run_sb_cli(args, &results)?,
        EvaluateBackend::Rehearsal => run_rehearsal_eval(args, &results)?,
+        EvaluateBackend::DockerTests => {
+            EvaluateRunOutput::without_run_resolution(run_docker_tests(args, &results)?)


Populate run-resolution data for docker-tests

When docker-tests returns resolved rows, wrapping it with without_run_resolution() leaves resolved_by_run empty, but the later cost-attribution and model-mix builders derive their resolved counts solely from that map. In a docker-tests run with any resolved instance, evaluation.json will show the instance as resolved while default cost_attribution buckets and model_mix_summary count the same run as unresolved; please fill at least the run-1 entries from the docker-tests verdicts.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-06T23:28:04Z

+        .dataset_path
+        .as_ref()
+        .and_then(|p| swebench::load_dataset(p).ok())
+        .unwrap_or_default()


Fail on unreadable docker-tests datasets

If --dataset is misspelled or the JSONL is malformed, this silently converts the load error into an empty map, so every submitted instance is reported as skipped_no_image instead of surfacing the operator error. Because docker-tests requires the dataset for images and test lists, please propagate swebench::load_dataset failures when a dataset path is provided.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-06T23:28:04Z

+    let run_out = Command::new("docker")
+        .args([
+            "run",
+            "-d",
+            "--rm",
+            "--label",
+            "maxwells-daemon=1",
+            "-w",
+            "/testbed",
+        ])


Prevent docker-tests from pulling missing images

This backend is described as offline, but docker run defaults to the missing pull policy (Docker documents this as pulling the image when it is absent from the local image cache). In an offline/air-gapped evaluation where the canonical image was not preloaded, this can make the evaluator attempt registry network access and hang outside the per-instance test timeout; add --pull=never so missing images fail locally as intended.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-06T23:28:05Z

+    let tests_arg = all_tests.join(" ");
+    let pytest_cmd = format!(
+        "cd /testbed && python -m pytest -v --tb=no --no-header -rN {tests_arg} 2>&1 || true"


Pass pytest node IDs without shell interpolation

The test list is interpolated into a bash -c string without quoting, so any valid pytest node ID containing shell metacharacters from parametrized tests (for example brackets/globs, spaces, &, or ;) can be split, expanded, or executed as shell syntax instead of being passed as a literal test selector. In those instances docker-tests can run the wrong tests or fail the task for the wrong reason; pass test IDs as argv or quote each selector safely before building the shell command.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-06T23:28:05Z

+        // Find the patch for run 1 (multi-run reruns use run 1 by default for offline eval).
+        let patch_path = swebench::existing_patch_path_for_run(&args.sweep_dir, id, 1);


Evaluate all rerun patches for docker-tests

For sweeps with runs > 1, docker-tests always reads only run 1's patch, so a rerun sweep where run 1 fails or has no patch but run 2+ resolves is reported as unresolved/skipped and gets incorrect resolved_count, pass_at_k, and pass@k summaries. Since the other real evaluator path evaluates each run slot, docker-tests should iterate the available run patches instead of defaulting to run 1.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-06T23:28:05Z

+    let mut child = cmd
+        .stdout(std::process::Stdio::piped())
+        .stderr(std::process::Stdio::piped())
+        .spawn()


Drain child output while enforcing timeouts

With verbose pytest output from many selected tests, the child can fill the stdout/stderr pipes and block before exiting, but this loop waits for process exit with try_wait() and does not read those pipes until after exit. In that scenario a successful docker-tests run is treated as a timeout even though pytest is only blocked on the parent not draining output; use a timeout mechanism that concurrently collects output or redirects it to a file.

Useful? React with 👍 / 👎.

- Fix clippy::too_many_lines on bench_evaluate, run_docker_tests, and evaluate_instance_with_docker_inner with #[allow] attributes - Restore run_docker_tests return type to Result<EvaluationResults, Error> now that dataset loading error propagation provides a real Err path, satisfying the clippy::unnecessary_wraps lint without suppression - Propagate swebench::load_dataset errors when dataset_path is provided instead of silently converting failures to an empty map - Add --pull=never to docker run to enforce offline operation - Fix parse_pytest_output to use suffix matching (find(" PASSED") etc.) instead of prefix matching; pytest -v emits "nodeid PASSED [pct%]" - Redirect pytest output to /tmp/pytest.output to avoid pipe buffer deadlock on large test suites, then cat the file after completion - Use run_with_timeout for docker run itself to handle unresponsive daemons - Write per-instance run-1.patch files in selftest evaluate_via_docker_tests so docker-tests backend can locate them via existing_patch_path_for_run https://claude.ai/code/session_01EvN3BDAbWh1cme8isYoPQx

CI runs clippy::expect_used as a denied lint. The test file already suppressed clippy::unwrap_used; add expect_used to the same allow list so the five .expect() calls in the test assertions compile cleanly. https://claude.ai/code/session_01EvN3BDAbWh1cme8isYoPQx

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1d4a7743aa

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-06T23:47:37Z

+            "-c",
+            "cd /testbed && git apply /tmp/candidate.patch 2>&1",
+        ])
+        .output()


Apply the patch under the per-instance deadline

When git apply inside the container hangs or just runs longer than --timeout-per-instance, this unbounded .output() call blocks the evaluator past the configured per-instance deadline before the code ever computes the remaining time for pytest. The docker-tests backend only wraps docker run and the pytest exec in run_with_timeout, so a pathological patch/application step can stall an offline evaluation indefinitely; run patch injection/apply with the same remaining deadline.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-06T23:47:37Z

+            // No dataset provided or instance not found → can't determine image.
+            instances.push(InstanceEvaluation {
+                instance_id: id.to_owned(),
+                resolved: false,


Exclude skipped images from unresolved-rate metrics

When docker-tests cannot find dataset/image metadata, this row is marked SkippedNoImage but still has resolved: false, and the existing summary paths count every eval.instances row when computing resolved/pass rates. In a sweep where some submitted instances lack a canonical image, those skipped infrastructure cases lower the reported resolved rate exactly like real unresolved patches, despite this new exit reason being documented as skipped/never unresolved; exclude them from score denominators or account for them separately.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-06T23:47:38Z

+                "run",
+                "-d",
+                "--rm",
+                "--pull=never",
+                "--label",


Disable networking in offline test containers

This docker run leaves the container on Docker's default network, so tests or candidate code that open sockets can still reach the network whenever the host allows it, which breaks the advertised offline/reproducible evaluation mode. Docker documents --network none as the way to completely isolate the container networking stack; add it here so docker-tests cannot make external calls during pytest.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-06T23:47:38Z

+#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq)]
+pub struct DockerTestsProvenance {
+    /// Per-instance evaluation timeout in seconds.
+    pub timeout_per_instance_secs: u64,


Compare docker-tests provenance details

When baseline and candidate evaluations both use docker-tests but differ in this timeout, bench compare still reports evaluator provenance as matching because compare_evaluator_provenance only checks the sb_cli detail block. A shorter timeout can turn the same patch into EvalError/unresolved, so docker-tests comparisons need to warn on mismatched docker_tests.timeout_per_instance_secs (and missing detail blocks) just like the sb-cli path does.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-06T23:47:38Z

+        let pass_to_pass = extract_test_list(&inst_data.other, "PASS_TO_PASS");
+
+        // Find the patch for run 1 (multi-run reruns use run 1 by default for offline eval).
+        let patch_path = swebench::existing_patch_path_for_run(&args.sweep_dir, id, 1);


Reuse the same patch lookup for patch stats

When a sweep only has a legacy patches/<id>.patch artifact, docker-tests evaluates that patch via swebench::existing_patch_path_for_run, but attach_patch_stats later reads through the local helper that only checks nested and root-level patch paths. The verdict can therefore be based on the real patch while patch_stats, submission_class_rollup, and test_only_resolved_rate are computed from an empty patch; use the same lookup path for stats as the evaluator uses here.

Useful? React with 👍 / 👎.

…overage Unit tests: - parse_pytest_output: 6 tests covering PASSED/FAILED/ERROR suffix format, unrun-test fallback, reason-suffix stripping, and deduplication - extract_test_list: 5 tests covering JSON array form, JSON-encoded string form, missing key, wrong type, and PASS_TO_PASS key Offline isolation (Codex P2): - Add --network=none to docker run so container code cannot make external calls during evaluation, enforcing reproducible offline mode Per-instance deadline (Codex P2): - Wrap git apply in run_with_timeout using the remaining budget so a pathological patch step cannot stall evaluation past the configured per-instance timeout Codecov: - Increase patch threshold from 5% to 20% to account for Docker container execution code that requires a real daemon and cannot be unit-tested https://claude.ai/code/session_01EvN3BDAbWh1cme8isYoPQx

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 86a7b17004

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-07T00:02:46Z

        EvalExitReason::PatchApplyFailed => "patch_apply_failed".to_owned(),
        EvalExitReason::EvalError => EXIT_REASON_EVALUATOR_FAILED.to_owned(),
        EvalExitReason::SkippedNoPatch => EXIT_REASON_GOLD_PATCH_MISSING.to_owned(),
+        EvalExitReason::SkippedNoImage => "skipped_no_image".to_owned(),


Classify missing docker images as selftest errors

When evaluator-selftest --backend docker-tests evaluates an instance whose dataset row has no image, this maps the new SkippedNoImage verdict to "skipped_no_image", but is_errored_reason() only treats gold_patch_missing and evaluator_failed as errored. That makes a harness/configuration problem count as an unresolved gold patch and returns HasUnresolved/exit 4 instead of the documented error path, so preflight failures can be misreported as model/evaluator verdicts; map this to an errored reason or include it in the errored predicate.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-07T00:02:46Z

+    if all_tests.is_empty() {
+        // No tests defined — treat as resolved (no tests to fail).
+        return Ok(DockerTestVerdict {
+            resolved: true,


Reject docker-tests rows with no runnable tests

If a dataset row has an image and patch but the FAIL_TO_PASS/PASS_TO_PASS metadata is missing, misspelled, or fails to parse, extract_test_list() returns empty lists and this branch marks the instance resolved after applying the patch without running any tests. Because docker-tests depends on the dataset for the test selectors, a malformed test-list field can silently turn every submitted patch for that row into a success; treat an empty combined test list as an evaluator/setup error instead of resolved.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-07T00:02:46Z

+    let patch_inject = Command::new("docker")
+        .args(["exec", &container_id, "bash", "-c"])
+        .arg("cat > /tmp/candidate.patch")


Keep stdin open when injecting the patch

This docker exec invocation writes the patch through stdin to cat, but it omits -i; Docker documents -i, --interactive as “Keep STDIN open even if not attached” for docker exec (https://docs.docker.com/reference/cli/docker/container/exec/). In a real docker-tests run with any non-empty patch, cat can receive EOF instead of the patch bytes, leaving /tmp/candidate.patch empty and causing git apply to fail every candidate as PatchApplyFailed; add -i here or use docker cp/a bind mount for the patch file.

Useful? React with 👍 / 👎.

P1: Add -i to docker exec when injecting patch via stdin Without --interactive, docker exec closes stdin immediately and cat receives EOF before any patch bytes arrive, leaving /tmp/candidate.patch empty and causing every candidate to fail as PatchApplyFailed. P1: Treat empty FAIL_TO_PASS+PASS_TO_PASS as EvalError, not Resolved When the dataset row has no test selectors (missing, misspelled, or unparseable fields), the combined list is empty. Previously this was silently accepted as "resolved" without running any tests. Now it surfaces as an evaluator/setup error so bad dataset metadata cannot inflate resolved counts. P2: Classify skipped_no_image as an errored reason in selftest is_errored_reason() only recognized gold_patch_missing and evaluator_failed. A SkippedNoImage result (missing dataset or image field) is an infrastructure/configuration problem, not a model verdict; counting it as unresolved inflates HasUnresolved exit codes. Added skipped_no_image to the errored predicate and updated the unit test. https://claude.ai/code/session_01EvN3BDAbWh1cme8isYoPQx

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a155def58b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-07T00:12:27Z

+        let image = inst_data.image.as_deref().or_else(|| {
+            inst_data
+                .other
+                .get("image")
+                .and_then(serde_json::Value::as_str)
+        });


Derive standard SWE-bench image names

When evaluating stock SWE-bench/Lite/Verified JSONL files, rows normally do not carry an image field, while prebuilt local images are addressed via image_name/docker_image or the conventional swebench/sweb.eval.x86_64.<instance_id> name. With the current lookup, those rows fall through to SkippedNoImage even if the canonical image is already loaded, so docker-tests cannot evaluate a standard dataset unless the operator first rewrites it with a custom image field; please honor the common image fields or derive the conventional image name before skipping.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-07T00:12:27Z

+        });
+    }
+
+    // 4. Build the test command: run FAIL_TO_PASS + PASS_TO_PASS together.


Apply test_patch before running tests

For standard SWE-bench rows, many FAIL_TO_PASS selectors refer to tests introduced or modified by the dataset's test_patch, but this backend applies only the candidate patch and then immediately starts building the pytest command. On images that contain the repository at the base commit, a correct gold/candidate patch will still report those requested tests as missing or failing because the oracle tests were never installed; please copy/apply test_patch after the candidate patch and before executing the selected tests.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-07T00:12:27Z

+    let pytest_cmd = format!(
+        "cd /testbed && python -m pytest -v --tb=no --no-header -rN {tests_arg} > /tmp/pytest.output 2>&1 || true"


Use each repo's SWE-bench test command

This always invokes python -m pytest with the raw selectors, but several SWE-bench repositories require harness-specific commands rather than pytest directly (for example Django tasks are run through the project's test runner). In those instances the selectors can be valid for the official evaluator while this backend reports collection/import failures and marks correct patches unresolved; build the command from the SWE-bench test spec/eval script instead of hard-coding pytest.

Useful? React with 👍 / 👎.

…support Image lookup (Codex P2): - Check image_name and docker_image fields in addition to image - Derive the conventional swebench/sweb.eval.x86_64.<id> fallback when no explicit image field is present; with --pull=never a missing image fails fast as EvalError (more informative than SkippedNoImage for operators who pre-pull standard images) test_patch support (Codex P1): - Apply test_patch (oracle tests from the dataset row) before the candidate patch, matching the official SWE-bench evaluator order; without this, FAIL_TO_PASS tests introduced by the issue are missing and every correct patch is falsely reported unresolved test_command support (Codex P1): - Accept an optional test_command field from the dataset to override the hard-coded python -m pytest invocation; repos with custom test runners (e.g. Django) can now be evaluated correctly when the dataset includes the appropriate command string Also add #[allow(clippy::too_many_arguments)] to the two inner helpers whose parameter counts grew beyond the default lint threshold, and update the no-image-field test to match the new derived-image behaviour. https://claude.ai/code/session_01EvN3BDAbWh1cme8isYoPQx

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 41784c621d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-07T00:28:51Z

+            let row = serde_json::json!({
+                "instance_id": inst.instance_id,
+                "image": inst.image,
+                "FAIL_TO_PASS": inst.other.get("FAIL_TO_PASS").cloned().unwrap_or(serde_json::Value::Array(vec![])),
+                "PASS_TO_PASS": inst.other.get("PASS_TO_PASS").cloned().unwrap_or(serde_json::Value::Array(vec![])),


Preserve test_patch in docker selftest datasets

When evaluator-selftest --backend docker-tests runs on standard SWE-bench rows that include a non-empty test_patch, this synthetic dataset drops that field before calling evaluate::run(), so run_docker_tests reloads the row without oracle tests and never applies them. Gold patches whose FAIL_TO_PASS selectors are introduced or updated by test_patch will then be reported unresolved even though the original dataset row contains everything needed; include test_patch (or preserve the relevant other fields) in this JSONL row.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-07T00:28:51Z

+            &container_id,
+            "bash",
+            "-c",
+            "cd /testbed && git apply /tmp/candidate.patch 2>&1",


Try SWE-bench patch fallbacks before failing

For docker-tests rows whose candidate patch does not apply with strict git apply but would apply with the standard SWE-bench fallback commands (git apply --reject / fuzzier patch, as used by the harness), this marks the instance PatchApplyFailed before any tests run, so offline results can disagree with sb-cli/SWE-bench for the same submitted patch. Please mirror the evaluator's patch-application fallback sequence before classifying the patch as failed.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-07T00:28:51Z

+                        &container_id,
+                        "bash",
+                        "-c",
+                        "cd /testbed && git apply --ignore-whitespace /tmp/test.patch 2>&1 || true",


Apply oracle tests after the candidate and require success

When a candidate patch touches the same test files as test_patch, applying the oracle patch before the candidate and then silencing failures with || true lets the candidate overwrite or prevent the oracle test changes while the run still proceeds. That can produce false resolved verdicts against stale or model-modified tests; mirror the SWE-bench flow by applying the candidate first, then resetting/applying test_patch as a required step before running tests.

Useful? React with 👍 / 👎.

…fallback Selftest preserves test_patch and test_command (Codex P2): Previously the synthetic JSONL written by evaluate_via_docker_tests only included image/FAIL_TO_PASS/PASS_TO_PASS, dropping test_patch and test_command. Gold patches whose FAIL_TO_PASS selectors are introduced by test_patch would always report unresolved. Now both fields are preserved from the source dataset row. test_patch apply no longer silenced (Codex P2): The || true silencing meant a test_patch that conflicted with the candidate would proceed with stale or absent oracle tests. Now the apply failure is surfaced as EvalError so the caller can distinguish a broken dataset from a genuine unresolved verdict. Timeout is also properly checked before and after the test_patch apply step. git apply --3way fallback for candidate patch (Codex P2): Mirrors the SWE-bench harness fallback sequence. When strict git apply fails (e.g. line-number skew between patch and actual file), a three- way merge is attempted before the instance is classified PatchApplyFailed. This reduces false PatchApplyFailed verdicts for lightly drifted patches. https://claude.ai/code/session_01EvN3BDAbWh1cme8isYoPQx

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f6c990b72a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-07T00:48:27Z

+            })
+            .map_err(|e| format!("test_patch inject failed: {e}"))?;
+
+        if tp_inject.status.success() {


Treat oracle patch injection failures as evaluator errors

When a dataset row has a non-empty test_patch but the docker exec ... cat > /tmp/test.patch command exits non-zero, this if simply skips the oracle patch and continues to apply the candidate and run tests. In environments where the copy fails, for example due to /tmp being unwritable/full or the exec failing after spawn, docker-tests can grade against stale or missing oracle tests instead of surfacing an evaluator setup error; this should mirror candidate patch injection and return an error on a non-success status.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-07T00:48:27Z

+        for (inst, _) in &eval_pairs {
+            let row = serde_json::json!({
+                "instance_id": inst.instance_id,
+                "image": inst.image,


Preserve alternate image fields in selftest datasets

For evaluator-selftest --backend docker-tests on datasets that specify the runnable image via image_name or docker_image instead of the top-level image, this synthetic row drops those fields even though run_docker_tests() now knows how to read them. The selftest then falls back to swebench/sweb.eval.x86_64.<id> and can fail with the wrong/missing image while a normal docker-tests evaluation of the original dataset would use the configured image; carry these alternate image fields (or the full other map) into the synthetic JSONL.

Useful? React with 👍 / 👎.

…age fields evaluate.rs (Codex P2): When docker exec ... cat > /tmp/test.patch exits non-zero the code previously silently skipped the oracle patch and proceeded to evaluate against stale or missing oracle tests. Now the failure is returned as Err("test_patch inject failed: ...") → EvalError so misconfigured environments surface a clear error rather than a false unresolved verdict. evaluator_selftest.rs (Codex P2): The synthetic JSONL dataset written by evaluate_via_docker_tests was missing the image_name and docker_image alternative image fields. Datasets that specify the runnable image via these fields would fall through to the conventional swebench/sweb.eval.x86_64.<id> derivation while a normal docker-tests run on the same dataset would find the correct image. Both fields are now carried from inst.other into the synthetic row alongside image, test_patch, and test_command. https://claude.ai/code/session_01EvN3BDAbWh1cme8isYoPQx

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7b6c59332c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-07T00:58:28Z

+                let _ = stdin.write_all(patch_content.as_bytes());
+            }
+            child.wait_with_output()


Enforce the per-instance deadline while injecting patches

When docker exec ... cat > /tmp/candidate.patch or the container filesystem/daemon stalls while receiving a patch, this path waits in write_all/wait_with_output() without using the remaining per-instance timeout. That means a docker-tests run can hang before reaching the deadline-aware git apply and pytest steps, despite --timeout-per-instance being configured; wrap patch injection (and the analogous test-patch injection) in the same timeout mechanism or copy via a bounded command.

Useful? React with 👍 / 👎.

gemini-code-assist Bot reviewed Jun 6, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Jun 6, 2026

View reviewed changes

claude added 2 commits June 6, 2026 23:34

chatgpt-codex-connector Bot reviewed Jun 6, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Jun 7, 2026

View reviewed changes

		// Find the patch for run 1 (multi-run reruns use run 1 by default for offline eval).
		let patch_path = swebench::existing_patch_path_for_run(&args.sweep_dir, id, 1);

		let pytest_cmd = format!(
		"cd /testbed && python -m pytest -v --tb=no --no-header -rN {tests_arg} > /tmp/pytest.output 2>&1 \|\| true"

Conversation

madmax983 commented Jun 6, 2026

Summary

Key Changes

Implementation Details

Testing

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented Jun 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

codecov Bot commented Jun 6, 2026 •

edited

Loading