Skip to content

Add docker-tests offline evaluation backend#629

Open
madmax983 wants to merge 8 commits into
trunkfrom
claude/trusting-cori-VKgBH
Open

Add docker-tests offline evaluation backend#629
madmax983 wants to merge 8 commits into
trunkfrom
claude/trusting-cori-VKgBH

Conversation

@madmax983
Copy link
Copy Markdown
Owner

Summary

This PR introduces a new docker-tests offline evaluation backend that applies candidate patches inside Docker containers and runs test suites locally without any network calls to external evaluation services. This complements the existing sb-cli backend and enables fully offline evaluation workflows.

Key Changes

  • New EvaluateBackend::DockerTests variant: Added to the backend enum alongside existing None, SbCli, and Rehearsal options
  • New EvalExitReason::SkippedNoImage variant: Instances without a canonical Docker image are explicitly marked as skipped (never silently unresolved), distinguishing them from other failure modes
  • Docker-based evaluation logic: Implemented run_docker_tests() and supporting functions that:
    • Load dataset metadata to find Docker images and test lists
    • Start detached containers with the canonical image
    • Apply patches via git apply inside containers
    • Run FAIL_TO_PASS and PASS_TO_PASS test suites via pytest
    • Parse pytest output to classify test results
    • Enforce per-instance timeouts with deadline-aware process management
    • Clean up containers via RAII guard pattern
  • Provenance tracking: Added DockerTestsProvenance struct to record backend-specific metadata (timeout, parallel worker count) in evaluation results
  • Dataset integration: Extracts Docker image names and test lists from dataset JSONL entries, with fallback to other["image"] field
  • Evaluator selftest support: Extended evaluator_selftest::run() to accept "docker-tests" backend and route instances through the new offline evaluator
  • CLI integration: Updated bench_evaluate() to parse --backend docker-tests flag
  • Comprehensive test suite: Added 20+ tests covering backend variant existence, CLI parsing, verdict schema compatibility, timeout semantics, image resolution, and downstream command compatibility

Implementation Details

  • Timeout handling: Uses wall-clock deadline with polling loop (200ms intervals) to enforce per-instance timeouts across patch application and test execution
  • Container lifecycle: Leverages docker run --rm with explicit docker rm -f cleanup via ContainerGuard RAII pattern to prevent orphaned containers
  • Test result parsing: Extracts test IDs from pytest -v --tb=no output, handling PASSED/FAILED/ERROR prefixes and optional failure reasons
  • Resolution logic: Instance is resolved iff all FAIL_TO_PASS tests pass AND no PASS_TO_PASS tests fail (conservative semantics)
  • Patch injection: Writes patch content via stdin to docker exec bash -c "cat > /tmp/candidate.patch" to avoid shell escaping issues
  • Dataset flexibility: Supports both direct image field and JSON-encoded test lists in FAIL_TO_PASS/PASS_TO_PASS fields

Testing

All existing tests updated to include docker_tests: None in provenance structures. New test file tests/bench_docker_tests_eval.rs validates:

  • Backend variant and CLI parsing
  • Verdict schema (SkippedNoImage serialization/deserialization)
  • Image resolution and skipping behavior
  • Timeout configuration propagation
  • Downstream command compatibility (inspect, compare)
  • Disk artifact generation (evaluation.json)

https://claude.ai/code/session_01EvN3BDAbWh1cme8isYoPQx

Adds `--backend docker-tests` to `bench evaluate`, enabling a real
resolved/unresolved verdict with zero network calls to any evaluation
service. Each instance's candidate patch is applied inside its canonical
Docker image; FAIL_TO_PASS and PASS_TO_PASS tests are run via pytest
and the verdict is produced locally.

Key changes:
- `EvaluateBackend::DockerTests` variant + CLI string `"docker-tests"`
- `EvalExitReason::SkippedNoImage`: instances with no known Docker image
  are reported as explicitly skipped, never silently unresolved (AC3)
- `DockerTestsProvenance` struct recorded in `EvaluatorProvenance` with
  timeout_per_instance_secs, satisfying the same provenance schema contract
  as the sb-cli backend (AC2)
- Per-instance timeout enforced via polling loop with kill (AC4)
- `evaluator-selftest` accepts `"docker-tests"` backend and routes gold
  patches through the offline pipeline (AC5)
- All downstream consumers (compare, inspect, near-miss, test-progress)
  handle the new `SkippedNoImage` exit reason (AC2)

TDD: 16 tests added in bench_docker_tests_eval.rs covering all ACs.

https://claude.ai/code/session_01EvN3BDAbWh1cme8isYoPQx
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new offline evaluator backend, docker-tests, which runs evaluations locally inside Docker containers. It adds the DockerTests backend variant, implements the Docker container lifecycle and pytest execution logic, and integrates the backend with CLI parsing, selftests, and downstream commands. Feedback on the implementation highlights several critical issues: a parsing bug where pytest output is incorrectly assumed to have status prefixes instead of suffixes; a potential deadlock when using Stdio::piped() for large test outputs; a lack of validation and error propagation for the required dataset_path; and a missing timeout on the synchronous docker run command which could cause the evaluation to hang indefinitely.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread src/run/evaluate.rs
Comment on lines +1690 to +1710
for line in output.lines() {
let line = line.trim();
if let Some(rest) = line.strip_prefix("PASSED ") {
let test_id = rest.split_whitespace().next().unwrap_or("").to_owned();
if !test_id.is_empty() {
passed.push(test_id);
}
} else if let Some(rest) = line.strip_prefix("FAILED ") {
let test_id = rest.split_whitespace().next().unwrap_or("");
// Strip trailing " - <reason>" that pytest sometimes appends.
let test_id = test_id.split(" - ").next().unwrap_or(test_id).to_owned();
if !test_id.is_empty() {
failed.push(test_id);
}
} else if let Some(rest) = line.strip_prefix("ERROR ") {
let test_id = rest.split_whitespace().next().unwrap_or("").to_owned();
if !test_id.is_empty() {
failed.push(test_id);
}
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

Standard pytest -v output formats each test result line with the status as a suffix (e.g., tests/test_foo.py::test_bar PASSED [ 50%]), rather than a prefix. Because the current implementation uses strip_prefix, it will fail to parse any test results, causing all tests to be treated as failed by the fallback logic.

Using a substring search for PASSED, FAILED, and ERROR is extremely robust and correctly extracts the test ID from both standard and verbose pytest outputs.

    for line in output.lines() {
        let line = line.trim();
        if let Some(idx) = line.find(" PASSED") {
            let test_id = line[..idx].trim().to_owned();
            if !test_id.is_empty() {
                passed.push(test_id);
            }
        } else if let Some(idx) = line.find(" FAILED") {
            let test_id = line[..idx].trim();
            // Strip trailing " - <reason>" that pytest sometimes appends.
            let test_id = test_id.split(" - ").next().unwrap_or(test_id).to_owned();
            if !test_id.is_empty() {
                failed.push(test_id);
            }
        } else if let Some(idx) = line.find(" ERROR") {
            let test_id = line[..idx].trim().to_owned();
            if !test_id.is_empty() {
                failed.push(test_id);
            }
        }
    }

Comment thread src/run/evaluate.rs Outdated
Comment on lines +1578 to +1580
let pytest_cmd = format!(
"cd /testbed && python -m pytest -v --tb=no --no-header -rN {tests_arg} 2>&1 || true"
);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

When spawning a child process with Stdio::piped() for stdout/stderr, the operating system pipe buffer is limited (typically 64KB on Linux). If the child process (pytest) writes more than this limit, it will block indefinitely on the write system call because the parent process is busy-waiting in run_with_timeout and not reading from the pipes. This causes a deadlock, resulting in false timeouts for verbose test suites.

To completely avoid this deadlock, redirect the pytest output to a file inside the container (e.g., /tmp/pytest.output), and then read the file contents after the command execution completes.

    let pytest_cmd = format!(
        "cd /testbed && python -m pytest -v --tb=no --no-header -rN {tests_arg} > /tmp/pytest.output 2>&1 || true"
    );

Comment thread src/run/evaluate.rs
Comment on lines +1601 to +1613
let (timed_out, output_text) = match test_out {
Ok(o) => (false, String::from_utf8_lossy(&o.stdout).into_owned()),
Err(TimedOut) => {
return Ok(DockerTestVerdict {
resolved: false,
timed_out: true,
patch_apply_failed: false,
patch_error_log: None,
tests_passed: vec![],
tests_failed: vec![],
});
}
};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This reads the redirected pytest output file from the container after execution, avoiding the pipe buffer deadlock.

    let (timed_out, output_text) = match test_out {
        Ok(_) => {
            let output_out = Command::new("docker")
                .args(["exec", &container_id, "cat", "/tmp/pytest.output"])
                .output()
                .map_err(|e| format!("failed to read pytest output: {e}"))?;
            (false, String::from_utf8_lossy(&output_out.stdout).into_owned())
        }
        Err(TimedOut) => {
            return Ok(DockerTestVerdict {
                resolved: false,
                timed_out: true,
                patch_apply_failed: false,
                patch_error_log: None,
                tests_passed: vec![],
                tests_failed: vec![],
            });
        }
    };

Comment thread src/run/evaluate.rs Outdated
Comment on lines +1244 to +1251
let dataset_map: HashMap<String, swebench::SweBenchInstance> = args
.dataset_path
.as_ref()
.and_then(|p| swebench::load_dataset(p).ok())
.unwrap_or_default()
.into_iter()
.map(|inst| (inst.instance_id.clone(), inst))
.collect();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The docker-tests backend absolutely requires the dataset to extract the canonical Docker image and test lists for each instance. If dataset_path is not provided, all instances will be silently skipped with SkippedNoImage without any error feedback. Additionally, if the dataset fails to load, the error should be propagated rather than silently swallowed.

Validate that dataset_path is present and propagate any loading errors using ?.

    let dataset_path = args.dataset_path.as_ref().ok_or_else(|| {
        Error::Config(crate::error::ConfigError::Invalid(
            "dataset path is required for the `docker-tests` backend".to_owned(),
        ))
    })?;

    let dataset_map: HashMap<String, swebench::SweBenchInstance> = swebench::load_dataset(dataset_path)?
        .into_iter() 
        .map(|inst| (inst.instance_id.clone(), inst))
        .collect();

Comment thread src/run/evaluate.rs Outdated
Comment on lines +1480 to +1493
let run_out = Command::new("docker")
.args([
"run",
"-d",
"--rm",
"--label",
"maxwells-daemon=1",
"-w",
"/testbed",
])
.arg(image)
.args(["sleep", "infinity"])
.output()
.map_err(|e| format!("docker run failed: {e}"))?;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The docker run command is executed synchronously without a timeout. If the Docker image needs to be pulled and the network hangs, or if the Docker daemon is unresponsive, the entire evaluation will hang indefinitely, ignoring the configured timeout.

Use run_with_timeout to execute docker run with the remaining timeout budget.

    let run_out = run_with_timeout(
        Command::new("docker")
            .args([
                "run",
                "-d",
                "--rm",
                "--label",
                "maxwells-daemon=1",
                "-w",
                "/testbed",
            ])
            .arg(image)
            .args(["sleep", "infinity"]),
        timeout,
    )
    .map_err(|_| "docker run timed out or failed".to_owned())?;

@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 6, 2026

Codecov Report

❌ Patch coverage is 60.48265% with 262 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.18%. Comparing base (7b57e1c) to head (7b6c593).

Files with missing lines Patch % Lines
src/run/evaluate.rs 56.31% 242 Missing ⚠️
src/run/evaluator_selftest.rs 84.61% 16 Missing ⚠️
src/cli/mod.rs 0.00% 2 Missing ⚠️
src/run/inspect.rs 0.00% 1 Missing ⚠️
src/run/near_miss.rs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##            trunk     #629      +/-   ##
==========================================
- Coverage   85.42%   85.18%   -0.25%     
==========================================
  Files         117      117              
  Lines       67245    67897     +652     
==========================================
+ Hits        57445    57838     +393     
- Misses       9800    10059     +259     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cef47db0ee

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/run/evaluate.rs Outdated

for line in output.lines() {
let line = line.trim();
if let Some(rest) = line.strip_prefix("PASSED ") {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Parse pytest's nodeid-first verbose output

When docker-tests runs real pytest with -v, result lines are emitted as <nodeid> PASSED/<nodeid> FAILED, not PASSED <nodeid> (I checked the exact command shape used here with a sample pytest run). In any instance with non-empty FAIL_TO_PASS tests, this parser never records successes, then the “not run” fallback marks every requested test failed, so valid patches are reported unresolved. Please parse the nodeid-first format as well as any alternate forms.

Useful? React with 👍 / 👎.

}

write_synthetic_results_json(&scratch, &eval_pairs);
write_synthetic_predictions(&scratch, &eval_pairs);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Write patch files for docker-tests selftest

For evaluator-selftest --backend docker-tests with an image and a non-empty gold patch, this only writes all_preds.jsonl, but the docker-tests evaluator reads the candidate from the sweep patch path (<scratch>/<id>/run-1.patch or legacy patch files) and never consumes the predictions file. As a result, gold-patch selftests are classified as skipped_no_patch/gold-patch-missing instead of actually evaluating the gold patch; please materialize each patch file in the synthetic sweep before calling evaluate::run().

Useful? React with 👍 / 👎.

Comment thread src/run/evaluate.rs
EvaluateBackend::SbCli => run_sb_cli(args, &results)?,
EvaluateBackend::Rehearsal => run_rehearsal_eval(args, &results)?,
EvaluateBackend::DockerTests => {
EvaluateRunOutput::without_run_resolution(run_docker_tests(args, &results)?)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Populate run-resolution data for docker-tests

When docker-tests returns resolved rows, wrapping it with without_run_resolution() leaves resolved_by_run empty, but the later cost-attribution and model-mix builders derive their resolved counts solely from that map. In a docker-tests run with any resolved instance, evaluation.json will show the instance as resolved while default cost_attribution buckets and model_mix_summary count the same run as unresolved; please fill at least the run-1 entries from the docker-tests verdicts.

Useful? React with 👍 / 👎.

Comment thread src/run/evaluate.rs Outdated
Comment on lines +1245 to +1248
.dataset_path
.as_ref()
.and_then(|p| swebench::load_dataset(p).ok())
.unwrap_or_default()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Fail on unreadable docker-tests datasets

If --dataset is misspelled or the JSONL is malformed, this silently converts the load error into an empty map, so every submitted instance is reported as skipped_no_image instead of surfacing the operator error. Because docker-tests requires the dataset for images and test lists, please propagate swebench::load_dataset failures when a dataset path is provided.

Useful? React with 👍 / 👎.

Comment thread src/run/evaluate.rs Outdated
Comment on lines +1480 to +1489
let run_out = Command::new("docker")
.args([
"run",
"-d",
"--rm",
"--label",
"maxwells-daemon=1",
"-w",
"/testbed",
])
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Prevent docker-tests from pulling missing images

This backend is described as offline, but docker run defaults to the missing pull policy (Docker documents this as pulling the image when it is absent from the local image cache). In an offline/air-gapped evaluation where the canonical image was not preloaded, this can make the evaluator attempt registry network access and hang outside the per-instance test timeout; add --pull=never so missing images fail locally as intended.

Useful? React with 👍 / 👎.

Comment thread src/run/evaluate.rs Outdated
Comment on lines +1577 to +1579
let tests_arg = all_tests.join(" ");
let pytest_cmd = format!(
"cd /testbed && python -m pytest -v --tb=no --no-header -rN {tests_arg} 2>&1 || true"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Pass pytest node IDs without shell interpolation

The test list is interpolated into a bash -c string without quoting, so any valid pytest node ID containing shell metacharacters from parametrized tests (for example brackets/globs, spaces, &, or ;) can be split, expanded, or executed as shell syntax instead of being passed as a literal test selector. In those instances docker-tests can run the wrong tests or fail the task for the wrong reason; pass test IDs as argv or quote each selector safely before building the shell command.

Useful? React with 👍 / 👎.

Comment thread src/run/evaluate.rs
Comment on lines +1324 to +1325
// Find the patch for run 1 (multi-run reruns use run 1 by default for offline eval).
let patch_path = swebench::existing_patch_path_for_run(&args.sweep_dir, id, 1);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Evaluate all rerun patches for docker-tests

For sweeps with runs > 1, docker-tests always reads only run 1's patch, so a rerun sweep where run 1 fails or has no patch but run 2+ resolves is reported as unresolved/skipped and gets incorrect resolved_count, pass_at_k, and pass@k summaries. Since the other real evaluator path evaluates each run slot, docker-tests should iterate the available run patches instead of defaulting to run 1.

Useful? React with 👍 / 👎.

Comment thread src/run/evaluate.rs
Comment on lines +1645 to +1648
let mut child = cmd
.stdout(std::process::Stdio::piped())
.stderr(std::process::Stdio::piped())
.spawn()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Drain child output while enforcing timeouts

With verbose pytest output from many selected tests, the child can fill the stdout/stderr pipes and block before exiting, but this loop waits for process exit with try_wait() and does not read those pipes until after exit. In that scenario a successful docker-tests run is treated as a timeout even though pytest is only blocked on the parent not draining output; use a timeout mechanism that concurrently collects output or redirects it to a file.

Useful? React with 👍 / 👎.

claude added 2 commits June 6, 2026 23:34
- Fix clippy::too_many_lines on bench_evaluate, run_docker_tests, and
  evaluate_instance_with_docker_inner with #[allow] attributes
- Restore run_docker_tests return type to Result<EvaluationResults, Error>
  now that dataset loading error propagation provides a real Err path,
  satisfying the clippy::unnecessary_wraps lint without suppression
- Propagate swebench::load_dataset errors when dataset_path is provided
  instead of silently converting failures to an empty map
- Add --pull=never to docker run to enforce offline operation
- Fix parse_pytest_output to use suffix matching (find(" PASSED") etc.)
  instead of prefix matching; pytest -v emits "nodeid PASSED [pct%]"
- Redirect pytest output to /tmp/pytest.output to avoid pipe buffer
  deadlock on large test suites, then cat the file after completion
- Use run_with_timeout for docker run itself to handle unresponsive daemons
- Write per-instance run-1.patch files in selftest evaluate_via_docker_tests
  so docker-tests backend can locate them via existing_patch_path_for_run

https://claude.ai/code/session_01EvN3BDAbWh1cme8isYoPQx
CI runs clippy::expect_used as a denied lint. The test file already
suppressed clippy::unwrap_used; add expect_used to the same allow list
so the five .expect() calls in the test assertions compile cleanly.

https://claude.ai/code/session_01EvN3BDAbWh1cme8isYoPQx
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1d4a7743aa

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/run/evaluate.rs Outdated
"-c",
"cd /testbed && git apply /tmp/candidate.patch 2>&1",
])
.output()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Apply the patch under the per-instance deadline

When git apply inside the container hangs or just runs longer than --timeout-per-instance, this unbounded .output() call blocks the evaluator past the configured per-instance deadline before the code ever computes the remaining time for pytest. The docker-tests backend only wraps docker run and the pytest exec in run_with_timeout, so a pathological patch/application step can stall an offline evaluation indefinitely; run patch injection/apply with the same remaining deadline.

Useful? React with 👍 / 👎.

Comment thread src/run/evaluate.rs
// No dataset provided or instance not found → can't determine image.
instances.push(InstanceEvaluation {
instance_id: id.to_owned(),
resolved: false,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Exclude skipped images from unresolved-rate metrics

When docker-tests cannot find dataset/image metadata, this row is marked SkippedNoImage but still has resolved: false, and the existing summary paths count every eval.instances row when computing resolved/pass rates. In a sweep where some submitted instances lack a canonical image, those skipped infrastructure cases lower the reported resolved rate exactly like real unresolved patches, despite this new exit reason being documented as skipped/never unresolved; exclude them from score denominators or account for them separately.

Useful? React with 👍 / 👎.

Comment thread src/run/evaluate.rs
Comment on lines +1484 to +1488
"run",
"-d",
"--rm",
"--pull=never",
"--label",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Disable networking in offline test containers

This docker run leaves the container on Docker's default network, so tests or candidate code that open sockets can still reach the network whenever the host allows it, which breaks the advertised offline/reproducible evaluation mode. Docker documents --network none as the way to completely isolate the container networking stack; add it here so docker-tests cannot make external calls during pytest.

Useful? React with 👍 / 👎.

Comment thread src/run/evaluate.rs
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq)]
pub struct DockerTestsProvenance {
/// Per-instance evaluation timeout in seconds.
pub timeout_per_instance_secs: u64,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Compare docker-tests provenance details

When baseline and candidate evaluations both use docker-tests but differ in this timeout, bench compare still reports evaluator provenance as matching because compare_evaluator_provenance only checks the sb_cli detail block. A shorter timeout can turn the same patch into EvalError/unresolved, so docker-tests comparisons need to warn on mismatched docker_tests.timeout_per_instance_secs (and missing detail blocks) just like the sb-cli path does.

Useful? React with 👍 / 👎.

Comment thread src/run/evaluate.rs
let pass_to_pass = extract_test_list(&inst_data.other, "PASS_TO_PASS");

// Find the patch for run 1 (multi-run reruns use run 1 by default for offline eval).
let patch_path = swebench::existing_patch_path_for_run(&args.sweep_dir, id, 1);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reuse the same patch lookup for patch stats

When a sweep only has a legacy patches/<id>.patch artifact, docker-tests evaluates that patch via swebench::existing_patch_path_for_run, but attach_patch_stats later reads through the local helper that only checks nested and root-level patch paths. The verdict can therefore be based on the real patch while patch_stats, submission_class_rollup, and test_only_resolved_rate are computed from an empty patch; use the same lookup path for stats as the evaluator uses here.

Useful? React with 👍 / 👎.

…overage

Unit tests:
- parse_pytest_output: 6 tests covering PASSED/FAILED/ERROR suffix format,
  unrun-test fallback, reason-suffix stripping, and deduplication
- extract_test_list: 5 tests covering JSON array form, JSON-encoded string
  form, missing key, wrong type, and PASS_TO_PASS key

Offline isolation (Codex P2):
- Add --network=none to docker run so container code cannot make external
  calls during evaluation, enforcing reproducible offline mode

Per-instance deadline (Codex P2):
- Wrap git apply in run_with_timeout using the remaining budget so a
  pathological patch step cannot stall evaluation past the configured
  per-instance timeout

Codecov:
- Increase patch threshold from 5% to 20% to account for Docker container
  execution code that requires a real daemon and cannot be unit-tested

https://claude.ai/code/session_01EvN3BDAbWh1cme8isYoPQx
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 86a7b17004

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

EvalExitReason::PatchApplyFailed => "patch_apply_failed".to_owned(),
EvalExitReason::EvalError => EXIT_REASON_EVALUATOR_FAILED.to_owned(),
EvalExitReason::SkippedNoPatch => EXIT_REASON_GOLD_PATCH_MISSING.to_owned(),
EvalExitReason::SkippedNoImage => "skipped_no_image".to_owned(),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Classify missing docker images as selftest errors

When evaluator-selftest --backend docker-tests evaluates an instance whose dataset row has no image, this maps the new SkippedNoImage verdict to "skipped_no_image", but is_errored_reason() only treats gold_patch_missing and evaluator_failed as errored. That makes a harness/configuration problem count as an unresolved gold patch and returns HasUnresolved/exit 4 instead of the documented error path, so preflight failures can be misreported as model/evaluator verdicts; map this to an errored reason or include it in the errored predicate.

Useful? React with 👍 / 👎.

Comment thread src/run/evaluate.rs Outdated
Comment on lines +1597 to +1600
if all_tests.is_empty() {
// No tests defined — treat as resolved (no tests to fail).
return Ok(DockerTestVerdict {
resolved: true,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Reject docker-tests rows with no runnable tests

If a dataset row has an image and patch but the FAIL_TO_PASS/PASS_TO_PASS metadata is missing, misspelled, or fails to parse, extract_test_list() returns empty lists and this branch marks the instance resolved after applying the patch without running any tests. Because docker-tests depends on the dataset for the test selectors, a malformed test-list field can silently turn every submitted patch for that row into a success; treat an empty combined test list as an evaluator/setup error instead of resolved.

Useful? React with 👍 / 👎.

Comment thread src/run/evaluate.rs
Comment on lines +1517 to +1519
let patch_inject = Command::new("docker")
.args(["exec", &container_id, "bash", "-c"])
.arg("cat > /tmp/candidate.patch")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep stdin open when injecting the patch

This docker exec invocation writes the patch through stdin to cat, but it omits -i; Docker documents -i, --interactive as “Keep STDIN open even if not attached” for docker exec (https://docs.docker.com/reference/cli/docker/container/exec/). In a real docker-tests run with any non-empty patch, cat can receive EOF instead of the patch bytes, leaving /tmp/candidate.patch empty and causing git apply to fail every candidate as PatchApplyFailed; add -i here or use docker cp/a bind mount for the patch file.

Useful? React with 👍 / 👎.

P1: Add -i to docker exec when injecting patch via stdin
Without --interactive, docker exec closes stdin immediately and cat
receives EOF before any patch bytes arrive, leaving /tmp/candidate.patch
empty and causing every candidate to fail as PatchApplyFailed.

P1: Treat empty FAIL_TO_PASS+PASS_TO_PASS as EvalError, not Resolved
When the dataset row has no test selectors (missing, misspelled, or
unparseable fields), the combined list is empty. Previously this was
silently accepted as "resolved" without running any tests. Now it
surfaces as an evaluator/setup error so bad dataset metadata cannot
inflate resolved counts.

P2: Classify skipped_no_image as an errored reason in selftest
is_errored_reason() only recognized gold_patch_missing and
evaluator_failed. A SkippedNoImage result (missing dataset or image
field) is an infrastructure/configuration problem, not a model verdict;
counting it as unresolved inflates HasUnresolved exit codes. Added
skipped_no_image to the errored predicate and updated the unit test.

https://claude.ai/code/session_01EvN3BDAbWh1cme8isYoPQx
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a155def58b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/run/evaluate.rs Outdated
Comment on lines +1298 to +1303
let image = inst_data.image.as_deref().or_else(|| {
inst_data
.other
.get("image")
.and_then(serde_json::Value::as_str)
});
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Derive standard SWE-bench image names

When evaluating stock SWE-bench/Lite/Verified JSONL files, rows normally do not carry an image field, while prebuilt local images are addressed via image_name/docker_image or the conventional swebench/sweb.eval.x86_64.<instance_id> name. With the current lookup, those rows fall through to SkippedNoImage even if the canonical image is already loaded, so docker-tests cannot evaluate a standard dataset unless the operator first rewrites it with a custom image field; please honor the common image fields or derive the conventional image name before skipping.

Useful? React with 👍 / 👎.

Comment thread src/run/evaluate.rs Outdated
});
}

// 4. Build the test command: run FAIL_TO_PASS + PASS_TO_PASS together.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Apply test_patch before running tests

For standard SWE-bench rows, many FAIL_TO_PASS selectors refer to tests introduced or modified by the dataset's test_patch, but this backend applies only the candidate patch and then immediately starts building the pytest command. On images that contain the repository at the base commit, a correct gold/candidate patch will still report those requested tests as missing or failing because the oracle tests were never installed; please copy/apply test_patch after the candidate patch and before executing the selected tests.

Useful? React with 👍 / 👎.

Comment thread src/run/evaluate.rs Outdated
Comment on lines +1610 to +1611
let pytest_cmd = format!(
"cd /testbed && python -m pytest -v --tb=no --no-header -rN {tests_arg} > /tmp/pytest.output 2>&1 || true"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Use each repo's SWE-bench test command

This always invokes python -m pytest with the raw selectors, but several SWE-bench repositories require harness-specific commands rather than pytest directly (for example Django tasks are run through the project's test runner). In those instances the selectors can be valid for the official evaluator while this backend reports collection/import failures and marks correct patches unresolved; build the command from the SWE-bench test spec/eval script instead of hard-coding pytest.

Useful? React with 👍 / 👎.

…support

Image lookup (Codex P2):
- Check image_name and docker_image fields in addition to image
- Derive the conventional swebench/sweb.eval.x86_64.<id> fallback when no
  explicit image field is present; with --pull=never a missing image fails
  fast as EvalError (more informative than SkippedNoImage for operators who
  pre-pull standard images)

test_patch support (Codex P1):
- Apply test_patch (oracle tests from the dataset row) before the candidate
  patch, matching the official SWE-bench evaluator order; without this,
  FAIL_TO_PASS tests introduced by the issue are missing and every correct
  patch is falsely reported unresolved

test_command support (Codex P1):
- Accept an optional test_command field from the dataset to override the
  hard-coded python -m pytest invocation; repos with custom test runners
  (e.g. Django) can now be evaluated correctly when the dataset includes
  the appropriate command string

Also add #[allow(clippy::too_many_arguments)] to the two inner helpers
whose parameter counts grew beyond the default lint threshold, and update
the no-image-field test to match the new derived-image behaviour.

https://claude.ai/code/session_01EvN3BDAbWh1cme8isYoPQx
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 41784c621d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +497 to +501
let row = serde_json::json!({
"instance_id": inst.instance_id,
"image": inst.image,
"FAIL_TO_PASS": inst.other.get("FAIL_TO_PASS").cloned().unwrap_or(serde_json::Value::Array(vec![])),
"PASS_TO_PASS": inst.other.get("PASS_TO_PASS").cloned().unwrap_or(serde_json::Value::Array(vec![])),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve test_patch in docker selftest datasets

When evaluator-selftest --backend docker-tests runs on standard SWE-bench rows that include a non-empty test_patch, this synthetic dataset drops that field before calling evaluate::run(), so run_docker_tests reloads the row without oracle tests and never applies them. Gold patches whose FAIL_TO_PASS selectors are introduced or updated by test_patch will then be reported unresolved even though the original dataset row contains everything needed; include test_patch (or preserve the relevant other fields) in this JSONL row.

Useful? React with 👍 / 👎.

Comment thread src/run/evaluate.rs
&container_id,
"bash",
"-c",
"cd /testbed && git apply /tmp/candidate.patch 2>&1",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Try SWE-bench patch fallbacks before failing

For docker-tests rows whose candidate patch does not apply with strict git apply but would apply with the standard SWE-bench fallback commands (git apply --reject / fuzzier patch, as used by the harness), this marks the instance PatchApplyFailed before any tests run, so offline results can disagree with sb-cli/SWE-bench for the same submitted patch. Please mirror the evaluator's patch-application fallback sequence before classifying the patch as failed.

Useful? React with 👍 / 👎.

Comment thread src/run/evaluate.rs Outdated
&container_id,
"bash",
"-c",
"cd /testbed && git apply --ignore-whitespace /tmp/test.patch 2>&1 || true",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Apply oracle tests after the candidate and require success

When a candidate patch touches the same test files as test_patch, applying the oracle patch before the candidate and then silencing failures with || true lets the candidate overwrite or prevent the oracle test changes while the run still proceeds. That can produce false resolved verdicts against stale or model-modified tests; mirror the SWE-bench flow by applying the candidate first, then resetting/applying test_patch as a required step before running tests.

Useful? React with 👍 / 👎.

…fallback

Selftest preserves test_patch and test_command (Codex P2):
Previously the synthetic JSONL written by evaluate_via_docker_tests only
included image/FAIL_TO_PASS/PASS_TO_PASS, dropping test_patch and
test_command. Gold patches whose FAIL_TO_PASS selectors are introduced
by test_patch would always report unresolved. Now both fields are
preserved from the source dataset row.

test_patch apply no longer silenced (Codex P2):
The || true silencing meant a test_patch that conflicted with the
candidate would proceed with stale or absent oracle tests. Now the
apply failure is surfaced as EvalError so the caller can distinguish
a broken dataset from a genuine unresolved verdict. Timeout is also
properly checked before and after the test_patch apply step.

git apply --3way fallback for candidate patch (Codex P2):
Mirrors the SWE-bench harness fallback sequence. When strict git apply
fails (e.g. line-number skew between patch and actual file), a three-
way merge is attempted before the instance is classified PatchApplyFailed.
This reduces false PatchApplyFailed verdicts for lightly drifted patches.

https://claude.ai/code/session_01EvN3BDAbWh1cme8isYoPQx
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f6c990b72a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/run/evaluate.rs Outdated
})
.map_err(|e| format!("test_patch inject failed: {e}"))?;

if tp_inject.status.success() {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Treat oracle patch injection failures as evaluator errors

When a dataset row has a non-empty test_patch but the docker exec ... cat > /tmp/test.patch command exits non-zero, this if simply skips the oracle patch and continues to apply the candidate and run tests. In environments where the copy fails, for example due to /tmp being unwritable/full or the exec failing after spawn, docker-tests can grade against stale or missing oracle tests instead of surfacing an evaluator setup error; this should mirror candidate patch injection and return an error on a non-success status.

Useful? React with 👍 / 👎.

for (inst, _) in &eval_pairs {
let row = serde_json::json!({
"instance_id": inst.instance_id,
"image": inst.image,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve alternate image fields in selftest datasets

For evaluator-selftest --backend docker-tests on datasets that specify the runnable image via image_name or docker_image instead of the top-level image, this synthetic row drops those fields even though run_docker_tests() now knows how to read them. The selftest then falls back to swebench/sweb.eval.x86_64.<id> and can fail with the wrong/missing image while a normal docker-tests evaluation of the original dataset would use the configured image; carry these alternate image fields (or the full other map) into the synthetic JSONL.

Useful? React with 👍 / 👎.

…age fields

evaluate.rs (Codex P2):
When docker exec ... cat > /tmp/test.patch exits non-zero the code
previously silently skipped the oracle patch and proceeded to evaluate
against stale or missing oracle tests. Now the failure is returned as
Err("test_patch inject failed: ...") → EvalError so misconfigured
environments surface a clear error rather than a false unresolved verdict.

evaluator_selftest.rs (Codex P2):
The synthetic JSONL dataset written by evaluate_via_docker_tests was
missing the image_name and docker_image alternative image fields.
Datasets that specify the runnable image via these fields would fall
through to the conventional swebench/sweb.eval.x86_64.<id> derivation
while a normal docker-tests run on the same dataset would find the
correct image. Both fields are now carried from inst.other into the
synthetic row alongside image, test_patch, and test_command.

https://claude.ai/code/session_01EvN3BDAbWh1cme8isYoPQx
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7b6c59332c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/run/evaluate.rs
Comment on lines +1546 to +1548
let _ = stdin.write_all(patch_content.as_bytes());
}
child.wait_with_output()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Enforce the per-instance deadline while injecting patches

When docker exec ... cat > /tmp/candidate.patch or the container filesystem/daemon stalls while receiving a patch, this path waits in write_all/wait_with_output() without using the remaining per-instance timeout. That means a docker-tests run can hang before reaching the deadline-aware git apply and pytest steps, despite --timeout-per-instance being configured; wrap patch injection (and the analogous test-patch injection) in the same timeout mechanism or copy via a bounded command.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants