Add docker-tests offline evaluation backend#629
Conversation
Adds `--backend docker-tests` to `bench evaluate`, enabling a real resolved/unresolved verdict with zero network calls to any evaluation service. Each instance's candidate patch is applied inside its canonical Docker image; FAIL_TO_PASS and PASS_TO_PASS tests are run via pytest and the verdict is produced locally. Key changes: - `EvaluateBackend::DockerTests` variant + CLI string `"docker-tests"` - `EvalExitReason::SkippedNoImage`: instances with no known Docker image are reported as explicitly skipped, never silently unresolved (AC3) - `DockerTestsProvenance` struct recorded in `EvaluatorProvenance` with timeout_per_instance_secs, satisfying the same provenance schema contract as the sb-cli backend (AC2) - Per-instance timeout enforced via polling loop with kill (AC4) - `evaluator-selftest` accepts `"docker-tests"` backend and routes gold patches through the offline pipeline (AC5) - All downstream consumers (compare, inspect, near-miss, test-progress) handle the new `SkippedNoImage` exit reason (AC2) TDD: 16 tests added in bench_docker_tests_eval.rs covering all ACs. https://claude.ai/code/session_01EvN3BDAbWh1cme8isYoPQx
There was a problem hiding this comment.
Code Review
This pull request introduces a new offline evaluator backend, docker-tests, which runs evaluations locally inside Docker containers. It adds the DockerTests backend variant, implements the Docker container lifecycle and pytest execution logic, and integrates the backend with CLI parsing, selftests, and downstream commands. Feedback on the implementation highlights several critical issues: a parsing bug where pytest output is incorrectly assumed to have status prefixes instead of suffixes; a potential deadlock when using Stdio::piped() for large test outputs; a lack of validation and error propagation for the required dataset_path; and a missing timeout on the synchronous docker run command which could cause the evaluation to hang indefinitely.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| for line in output.lines() { | ||
| let line = line.trim(); | ||
| if let Some(rest) = line.strip_prefix("PASSED ") { | ||
| let test_id = rest.split_whitespace().next().unwrap_or("").to_owned(); | ||
| if !test_id.is_empty() { | ||
| passed.push(test_id); | ||
| } | ||
| } else if let Some(rest) = line.strip_prefix("FAILED ") { | ||
| let test_id = rest.split_whitespace().next().unwrap_or(""); | ||
| // Strip trailing " - <reason>" that pytest sometimes appends. | ||
| let test_id = test_id.split(" - ").next().unwrap_or(test_id).to_owned(); | ||
| if !test_id.is_empty() { | ||
| failed.push(test_id); | ||
| } | ||
| } else if let Some(rest) = line.strip_prefix("ERROR ") { | ||
| let test_id = rest.split_whitespace().next().unwrap_or("").to_owned(); | ||
| if !test_id.is_empty() { | ||
| failed.push(test_id); | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Standard pytest -v output formats each test result line with the status as a suffix (e.g., tests/test_foo.py::test_bar PASSED [ 50%]), rather than a prefix. Because the current implementation uses strip_prefix, it will fail to parse any test results, causing all tests to be treated as failed by the fallback logic.
Using a substring search for PASSED, FAILED, and ERROR is extremely robust and correctly extracts the test ID from both standard and verbose pytest outputs.
for line in output.lines() {
let line = line.trim();
if let Some(idx) = line.find(" PASSED") {
let test_id = line[..idx].trim().to_owned();
if !test_id.is_empty() {
passed.push(test_id);
}
} else if let Some(idx) = line.find(" FAILED") {
let test_id = line[..idx].trim();
// Strip trailing " - <reason>" that pytest sometimes appends.
let test_id = test_id.split(" - ").next().unwrap_or(test_id).to_owned();
if !test_id.is_empty() {
failed.push(test_id);
}
} else if let Some(idx) = line.find(" ERROR") {
let test_id = line[..idx].trim().to_owned();
if !test_id.is_empty() {
failed.push(test_id);
}
}
}| let pytest_cmd = format!( | ||
| "cd /testbed && python -m pytest -v --tb=no --no-header -rN {tests_arg} 2>&1 || true" | ||
| ); |
There was a problem hiding this comment.
When spawning a child process with Stdio::piped() for stdout/stderr, the operating system pipe buffer is limited (typically 64KB on Linux). If the child process (pytest) writes more than this limit, it will block indefinitely on the write system call because the parent process is busy-waiting in run_with_timeout and not reading from the pipes. This causes a deadlock, resulting in false timeouts for verbose test suites.
To completely avoid this deadlock, redirect the pytest output to a file inside the container (e.g., /tmp/pytest.output), and then read the file contents after the command execution completes.
let pytest_cmd = format!(
"cd /testbed && python -m pytest -v --tb=no --no-header -rN {tests_arg} > /tmp/pytest.output 2>&1 || true"
);| let (timed_out, output_text) = match test_out { | ||
| Ok(o) => (false, String::from_utf8_lossy(&o.stdout).into_owned()), | ||
| Err(TimedOut) => { | ||
| return Ok(DockerTestVerdict { | ||
| resolved: false, | ||
| timed_out: true, | ||
| patch_apply_failed: false, | ||
| patch_error_log: None, | ||
| tests_passed: vec![], | ||
| tests_failed: vec![], | ||
| }); | ||
| } | ||
| }; |
There was a problem hiding this comment.
This reads the redirected pytest output file from the container after execution, avoiding the pipe buffer deadlock.
let (timed_out, output_text) = match test_out {
Ok(_) => {
let output_out = Command::new("docker")
.args(["exec", &container_id, "cat", "/tmp/pytest.output"])
.output()
.map_err(|e| format!("failed to read pytest output: {e}"))?;
(false, String::from_utf8_lossy(&output_out.stdout).into_owned())
}
Err(TimedOut) => {
return Ok(DockerTestVerdict {
resolved: false,
timed_out: true,
patch_apply_failed: false,
patch_error_log: None,
tests_passed: vec![],
tests_failed: vec![],
});
}
};| let dataset_map: HashMap<String, swebench::SweBenchInstance> = args | ||
| .dataset_path | ||
| .as_ref() | ||
| .and_then(|p| swebench::load_dataset(p).ok()) | ||
| .unwrap_or_default() | ||
| .into_iter() | ||
| .map(|inst| (inst.instance_id.clone(), inst)) | ||
| .collect(); |
There was a problem hiding this comment.
The docker-tests backend absolutely requires the dataset to extract the canonical Docker image and test lists for each instance. If dataset_path is not provided, all instances will be silently skipped with SkippedNoImage without any error feedback. Additionally, if the dataset fails to load, the error should be propagated rather than silently swallowed.
Validate that dataset_path is present and propagate any loading errors using ?.
let dataset_path = args.dataset_path.as_ref().ok_or_else(|| {
Error::Config(crate::error::ConfigError::Invalid(
"dataset path is required for the `docker-tests` backend".to_owned(),
))
})?;
let dataset_map: HashMap<String, swebench::SweBenchInstance> = swebench::load_dataset(dataset_path)?
.into_iter()
.map(|inst| (inst.instance_id.clone(), inst))
.collect();| let run_out = Command::new("docker") | ||
| .args([ | ||
| "run", | ||
| "-d", | ||
| "--rm", | ||
| "--label", | ||
| "maxwells-daemon=1", | ||
| "-w", | ||
| "/testbed", | ||
| ]) | ||
| .arg(image) | ||
| .args(["sleep", "infinity"]) | ||
| .output() | ||
| .map_err(|e| format!("docker run failed: {e}"))?; |
There was a problem hiding this comment.
The docker run command is executed synchronously without a timeout. If the Docker image needs to be pulled and the network hangs, or if the Docker daemon is unresponsive, the entire evaluation will hang indefinitely, ignoring the configured timeout.
Use run_with_timeout to execute docker run with the remaining timeout budget.
let run_out = run_with_timeout(
Command::new("docker")
.args([
"run",
"-d",
"--rm",
"--label",
"maxwells-daemon=1",
"-w",
"/testbed",
])
.arg(image)
.args(["sleep", "infinity"]),
timeout,
)
.map_err(|_| "docker run timed out or failed".to_owned())?;
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## trunk #629 +/- ##
==========================================
- Coverage 85.42% 85.18% -0.25%
==========================================
Files 117 117
Lines 67245 67897 +652
==========================================
+ Hits 57445 57838 +393
- Misses 9800 10059 +259 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cef47db0ee
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| for line in output.lines() { | ||
| let line = line.trim(); | ||
| if let Some(rest) = line.strip_prefix("PASSED ") { |
There was a problem hiding this comment.
Parse pytest's nodeid-first verbose output
When docker-tests runs real pytest with -v, result lines are emitted as <nodeid> PASSED/<nodeid> FAILED, not PASSED <nodeid> (I checked the exact command shape used here with a sample pytest run). In any instance with non-empty FAIL_TO_PASS tests, this parser never records successes, then the “not run” fallback marks every requested test failed, so valid patches are reported unresolved. Please parse the nodeid-first format as well as any alternate forms.
Useful? React with 👍 / 👎.
| } | ||
|
|
||
| write_synthetic_results_json(&scratch, &eval_pairs); | ||
| write_synthetic_predictions(&scratch, &eval_pairs); |
There was a problem hiding this comment.
Write patch files for docker-tests selftest
For evaluator-selftest --backend docker-tests with an image and a non-empty gold patch, this only writes all_preds.jsonl, but the docker-tests evaluator reads the candidate from the sweep patch path (<scratch>/<id>/run-1.patch or legacy patch files) and never consumes the predictions file. As a result, gold-patch selftests are classified as skipped_no_patch/gold-patch-missing instead of actually evaluating the gold patch; please materialize each patch file in the synthetic sweep before calling evaluate::run().
Useful? React with 👍 / 👎.
| EvaluateBackend::SbCli => run_sb_cli(args, &results)?, | ||
| EvaluateBackend::Rehearsal => run_rehearsal_eval(args, &results)?, | ||
| EvaluateBackend::DockerTests => { | ||
| EvaluateRunOutput::without_run_resolution(run_docker_tests(args, &results)?) |
There was a problem hiding this comment.
Populate run-resolution data for docker-tests
When docker-tests returns resolved rows, wrapping it with without_run_resolution() leaves resolved_by_run empty, but the later cost-attribution and model-mix builders derive their resolved counts solely from that map. In a docker-tests run with any resolved instance, evaluation.json will show the instance as resolved while default cost_attribution buckets and model_mix_summary count the same run as unresolved; please fill at least the run-1 entries from the docker-tests verdicts.
Useful? React with 👍 / 👎.
| .dataset_path | ||
| .as_ref() | ||
| .and_then(|p| swebench::load_dataset(p).ok()) | ||
| .unwrap_or_default() |
There was a problem hiding this comment.
Fail on unreadable docker-tests datasets
If --dataset is misspelled or the JSONL is malformed, this silently converts the load error into an empty map, so every submitted instance is reported as skipped_no_image instead of surfacing the operator error. Because docker-tests requires the dataset for images and test lists, please propagate swebench::load_dataset failures when a dataset path is provided.
Useful? React with 👍 / 👎.
| let run_out = Command::new("docker") | ||
| .args([ | ||
| "run", | ||
| "-d", | ||
| "--rm", | ||
| "--label", | ||
| "maxwells-daemon=1", | ||
| "-w", | ||
| "/testbed", | ||
| ]) |
There was a problem hiding this comment.
Prevent docker-tests from pulling missing images
This backend is described as offline, but docker run defaults to the missing pull policy (Docker documents this as pulling the image when it is absent from the local image cache). In an offline/air-gapped evaluation where the canonical image was not preloaded, this can make the evaluator attempt registry network access and hang outside the per-instance test timeout; add --pull=never so missing images fail locally as intended.
Useful? React with 👍 / 👎.
| let tests_arg = all_tests.join(" "); | ||
| let pytest_cmd = format!( | ||
| "cd /testbed && python -m pytest -v --tb=no --no-header -rN {tests_arg} 2>&1 || true" |
There was a problem hiding this comment.
Pass pytest node IDs without shell interpolation
The test list is interpolated into a bash -c string without quoting, so any valid pytest node ID containing shell metacharacters from parametrized tests (for example brackets/globs, spaces, &, or ;) can be split, expanded, or executed as shell syntax instead of being passed as a literal test selector. In those instances docker-tests can run the wrong tests or fail the task for the wrong reason; pass test IDs as argv or quote each selector safely before building the shell command.
Useful? React with 👍 / 👎.
| // Find the patch for run 1 (multi-run reruns use run 1 by default for offline eval). | ||
| let patch_path = swebench::existing_patch_path_for_run(&args.sweep_dir, id, 1); |
There was a problem hiding this comment.
Evaluate all rerun patches for docker-tests
For sweeps with runs > 1, docker-tests always reads only run 1's patch, so a rerun sweep where run 1 fails or has no patch but run 2+ resolves is reported as unresolved/skipped and gets incorrect resolved_count, pass_at_k, and pass@k summaries. Since the other real evaluator path evaluates each run slot, docker-tests should iterate the available run patches instead of defaulting to run 1.
Useful? React with 👍 / 👎.
| let mut child = cmd | ||
| .stdout(std::process::Stdio::piped()) | ||
| .stderr(std::process::Stdio::piped()) | ||
| .spawn() |
There was a problem hiding this comment.
Drain child output while enforcing timeouts
With verbose pytest output from many selected tests, the child can fill the stdout/stderr pipes and block before exiting, but this loop waits for process exit with try_wait() and does not read those pipes until after exit. In that scenario a successful docker-tests run is treated as a timeout even though pytest is only blocked on the parent not draining output; use a timeout mechanism that concurrently collects output or redirects it to a file.
Useful? React with 👍 / 👎.
- Fix clippy::too_many_lines on bench_evaluate, run_docker_tests, and
evaluate_instance_with_docker_inner with #[allow] attributes
- Restore run_docker_tests return type to Result<EvaluationResults, Error>
now that dataset loading error propagation provides a real Err path,
satisfying the clippy::unnecessary_wraps lint without suppression
- Propagate swebench::load_dataset errors when dataset_path is provided
instead of silently converting failures to an empty map
- Add --pull=never to docker run to enforce offline operation
- Fix parse_pytest_output to use suffix matching (find(" PASSED") etc.)
instead of prefix matching; pytest -v emits "nodeid PASSED [pct%]"
- Redirect pytest output to /tmp/pytest.output to avoid pipe buffer
deadlock on large test suites, then cat the file after completion
- Use run_with_timeout for docker run itself to handle unresponsive daemons
- Write per-instance run-1.patch files in selftest evaluate_via_docker_tests
so docker-tests backend can locate them via existing_patch_path_for_run
https://claude.ai/code/session_01EvN3BDAbWh1cme8isYoPQx
CI runs clippy::expect_used as a denied lint. The test file already suppressed clippy::unwrap_used; add expect_used to the same allow list so the five .expect() calls in the test assertions compile cleanly. https://claude.ai/code/session_01EvN3BDAbWh1cme8isYoPQx
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1d4a7743aa
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| "-c", | ||
| "cd /testbed && git apply /tmp/candidate.patch 2>&1", | ||
| ]) | ||
| .output() |
There was a problem hiding this comment.
Apply the patch under the per-instance deadline
When git apply inside the container hangs or just runs longer than --timeout-per-instance, this unbounded .output() call blocks the evaluator past the configured per-instance deadline before the code ever computes the remaining time for pytest. The docker-tests backend only wraps docker run and the pytest exec in run_with_timeout, so a pathological patch/application step can stall an offline evaluation indefinitely; run patch injection/apply with the same remaining deadline.
Useful? React with 👍 / 👎.
| // No dataset provided or instance not found → can't determine image. | ||
| instances.push(InstanceEvaluation { | ||
| instance_id: id.to_owned(), | ||
| resolved: false, |
There was a problem hiding this comment.
Exclude skipped images from unresolved-rate metrics
When docker-tests cannot find dataset/image metadata, this row is marked SkippedNoImage but still has resolved: false, and the existing summary paths count every eval.instances row when computing resolved/pass rates. In a sweep where some submitted instances lack a canonical image, those skipped infrastructure cases lower the reported resolved rate exactly like real unresolved patches, despite this new exit reason being documented as skipped/never unresolved; exclude them from score denominators or account for them separately.
Useful? React with 👍 / 👎.
| "run", | ||
| "-d", | ||
| "--rm", | ||
| "--pull=never", | ||
| "--label", |
There was a problem hiding this comment.
Disable networking in offline test containers
This docker run leaves the container on Docker's default network, so tests or candidate code that open sockets can still reach the network whenever the host allows it, which breaks the advertised offline/reproducible evaluation mode. Docker documents --network none as the way to completely isolate the container networking stack; add it here so docker-tests cannot make external calls during pytest.
Useful? React with 👍 / 👎.
| #[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq)] | ||
| pub struct DockerTestsProvenance { | ||
| /// Per-instance evaluation timeout in seconds. | ||
| pub timeout_per_instance_secs: u64, |
There was a problem hiding this comment.
Compare docker-tests provenance details
When baseline and candidate evaluations both use docker-tests but differ in this timeout, bench compare still reports evaluator provenance as matching because compare_evaluator_provenance only checks the sb_cli detail block. A shorter timeout can turn the same patch into EvalError/unresolved, so docker-tests comparisons need to warn on mismatched docker_tests.timeout_per_instance_secs (and missing detail blocks) just like the sb-cli path does.
Useful? React with 👍 / 👎.
| let pass_to_pass = extract_test_list(&inst_data.other, "PASS_TO_PASS"); | ||
|
|
||
| // Find the patch for run 1 (multi-run reruns use run 1 by default for offline eval). | ||
| let patch_path = swebench::existing_patch_path_for_run(&args.sweep_dir, id, 1); |
There was a problem hiding this comment.
Reuse the same patch lookup for patch stats
When a sweep only has a legacy patches/<id>.patch artifact, docker-tests evaluates that patch via swebench::existing_patch_path_for_run, but attach_patch_stats later reads through the local helper that only checks nested and root-level patch paths. The verdict can therefore be based on the real patch while patch_stats, submission_class_rollup, and test_only_resolved_rate are computed from an empty patch; use the same lookup path for stats as the evaluator uses here.
Useful? React with 👍 / 👎.
…overage Unit tests: - parse_pytest_output: 6 tests covering PASSED/FAILED/ERROR suffix format, unrun-test fallback, reason-suffix stripping, and deduplication - extract_test_list: 5 tests covering JSON array form, JSON-encoded string form, missing key, wrong type, and PASS_TO_PASS key Offline isolation (Codex P2): - Add --network=none to docker run so container code cannot make external calls during evaluation, enforcing reproducible offline mode Per-instance deadline (Codex P2): - Wrap git apply in run_with_timeout using the remaining budget so a pathological patch step cannot stall evaluation past the configured per-instance timeout Codecov: - Increase patch threshold from 5% to 20% to account for Docker container execution code that requires a real daemon and cannot be unit-tested https://claude.ai/code/session_01EvN3BDAbWh1cme8isYoPQx
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 86a7b17004
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| EvalExitReason::PatchApplyFailed => "patch_apply_failed".to_owned(), | ||
| EvalExitReason::EvalError => EXIT_REASON_EVALUATOR_FAILED.to_owned(), | ||
| EvalExitReason::SkippedNoPatch => EXIT_REASON_GOLD_PATCH_MISSING.to_owned(), | ||
| EvalExitReason::SkippedNoImage => "skipped_no_image".to_owned(), |
There was a problem hiding this comment.
Classify missing docker images as selftest errors
When evaluator-selftest --backend docker-tests evaluates an instance whose dataset row has no image, this maps the new SkippedNoImage verdict to "skipped_no_image", but is_errored_reason() only treats gold_patch_missing and evaluator_failed as errored. That makes a harness/configuration problem count as an unresolved gold patch and returns HasUnresolved/exit 4 instead of the documented error path, so preflight failures can be misreported as model/evaluator verdicts; map this to an errored reason or include it in the errored predicate.
Useful? React with 👍 / 👎.
| if all_tests.is_empty() { | ||
| // No tests defined — treat as resolved (no tests to fail). | ||
| return Ok(DockerTestVerdict { | ||
| resolved: true, |
There was a problem hiding this comment.
Reject docker-tests rows with no runnable tests
If a dataset row has an image and patch but the FAIL_TO_PASS/PASS_TO_PASS metadata is missing, misspelled, or fails to parse, extract_test_list() returns empty lists and this branch marks the instance resolved after applying the patch without running any tests. Because docker-tests depends on the dataset for the test selectors, a malformed test-list field can silently turn every submitted patch for that row into a success; treat an empty combined test list as an evaluator/setup error instead of resolved.
Useful? React with 👍 / 👎.
| let patch_inject = Command::new("docker") | ||
| .args(["exec", &container_id, "bash", "-c"]) | ||
| .arg("cat > /tmp/candidate.patch") |
There was a problem hiding this comment.
Keep stdin open when injecting the patch
This docker exec invocation writes the patch through stdin to cat, but it omits -i; Docker documents -i, --interactive as “Keep STDIN open even if not attached” for docker exec (https://docs.docker.com/reference/cli/docker/container/exec/). In a real docker-tests run with any non-empty patch, cat can receive EOF instead of the patch bytes, leaving /tmp/candidate.patch empty and causing git apply to fail every candidate as PatchApplyFailed; add -i here or use docker cp/a bind mount for the patch file.
Useful? React with 👍 / 👎.
P1: Add -i to docker exec when injecting patch via stdin Without --interactive, docker exec closes stdin immediately and cat receives EOF before any patch bytes arrive, leaving /tmp/candidate.patch empty and causing every candidate to fail as PatchApplyFailed. P1: Treat empty FAIL_TO_PASS+PASS_TO_PASS as EvalError, not Resolved When the dataset row has no test selectors (missing, misspelled, or unparseable fields), the combined list is empty. Previously this was silently accepted as "resolved" without running any tests. Now it surfaces as an evaluator/setup error so bad dataset metadata cannot inflate resolved counts. P2: Classify skipped_no_image as an errored reason in selftest is_errored_reason() only recognized gold_patch_missing and evaluator_failed. A SkippedNoImage result (missing dataset or image field) is an infrastructure/configuration problem, not a model verdict; counting it as unresolved inflates HasUnresolved exit codes. Added skipped_no_image to the errored predicate and updated the unit test. https://claude.ai/code/session_01EvN3BDAbWh1cme8isYoPQx
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a155def58b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| let image = inst_data.image.as_deref().or_else(|| { | ||
| inst_data | ||
| .other | ||
| .get("image") | ||
| .and_then(serde_json::Value::as_str) | ||
| }); |
There was a problem hiding this comment.
Derive standard SWE-bench image names
When evaluating stock SWE-bench/Lite/Verified JSONL files, rows normally do not carry an image field, while prebuilt local images are addressed via image_name/docker_image or the conventional swebench/sweb.eval.x86_64.<instance_id> name. With the current lookup, those rows fall through to SkippedNoImage even if the canonical image is already loaded, so docker-tests cannot evaluate a standard dataset unless the operator first rewrites it with a custom image field; please honor the common image fields or derive the conventional image name before skipping.
Useful? React with 👍 / 👎.
| }); | ||
| } | ||
|
|
||
| // 4. Build the test command: run FAIL_TO_PASS + PASS_TO_PASS together. |
There was a problem hiding this comment.
Apply test_patch before running tests
For standard SWE-bench rows, many FAIL_TO_PASS selectors refer to tests introduced or modified by the dataset's test_patch, but this backend applies only the candidate patch and then immediately starts building the pytest command. On images that contain the repository at the base commit, a correct gold/candidate patch will still report those requested tests as missing or failing because the oracle tests were never installed; please copy/apply test_patch after the candidate patch and before executing the selected tests.
Useful? React with 👍 / 👎.
| let pytest_cmd = format!( | ||
| "cd /testbed && python -m pytest -v --tb=no --no-header -rN {tests_arg} > /tmp/pytest.output 2>&1 || true" |
There was a problem hiding this comment.
Use each repo's SWE-bench test command
This always invokes python -m pytest with the raw selectors, but several SWE-bench repositories require harness-specific commands rather than pytest directly (for example Django tasks are run through the project's test runner). In those instances the selectors can be valid for the official evaluator while this backend reports collection/import failures and marks correct patches unresolved; build the command from the SWE-bench test spec/eval script instead of hard-coding pytest.
Useful? React with 👍 / 👎.
…support Image lookup (Codex P2): - Check image_name and docker_image fields in addition to image - Derive the conventional swebench/sweb.eval.x86_64.<id> fallback when no explicit image field is present; with --pull=never a missing image fails fast as EvalError (more informative than SkippedNoImage for operators who pre-pull standard images) test_patch support (Codex P1): - Apply test_patch (oracle tests from the dataset row) before the candidate patch, matching the official SWE-bench evaluator order; without this, FAIL_TO_PASS tests introduced by the issue are missing and every correct patch is falsely reported unresolved test_command support (Codex P1): - Accept an optional test_command field from the dataset to override the hard-coded python -m pytest invocation; repos with custom test runners (e.g. Django) can now be evaluated correctly when the dataset includes the appropriate command string Also add #[allow(clippy::too_many_arguments)] to the two inner helpers whose parameter counts grew beyond the default lint threshold, and update the no-image-field test to match the new derived-image behaviour. https://claude.ai/code/session_01EvN3BDAbWh1cme8isYoPQx
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 41784c621d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| let row = serde_json::json!({ | ||
| "instance_id": inst.instance_id, | ||
| "image": inst.image, | ||
| "FAIL_TO_PASS": inst.other.get("FAIL_TO_PASS").cloned().unwrap_or(serde_json::Value::Array(vec![])), | ||
| "PASS_TO_PASS": inst.other.get("PASS_TO_PASS").cloned().unwrap_or(serde_json::Value::Array(vec![])), |
There was a problem hiding this comment.
Preserve test_patch in docker selftest datasets
When evaluator-selftest --backend docker-tests runs on standard SWE-bench rows that include a non-empty test_patch, this synthetic dataset drops that field before calling evaluate::run(), so run_docker_tests reloads the row without oracle tests and never applies them. Gold patches whose FAIL_TO_PASS selectors are introduced or updated by test_patch will then be reported unresolved even though the original dataset row contains everything needed; include test_patch (or preserve the relevant other fields) in this JSONL row.
Useful? React with 👍 / 👎.
| &container_id, | ||
| "bash", | ||
| "-c", | ||
| "cd /testbed && git apply /tmp/candidate.patch 2>&1", |
There was a problem hiding this comment.
Try SWE-bench patch fallbacks before failing
For docker-tests rows whose candidate patch does not apply with strict git apply but would apply with the standard SWE-bench fallback commands (git apply --reject / fuzzier patch, as used by the harness), this marks the instance PatchApplyFailed before any tests run, so offline results can disagree with sb-cli/SWE-bench for the same submitted patch. Please mirror the evaluator's patch-application fallback sequence before classifying the patch as failed.
Useful? React with 👍 / 👎.
| &container_id, | ||
| "bash", | ||
| "-c", | ||
| "cd /testbed && git apply --ignore-whitespace /tmp/test.patch 2>&1 || true", |
There was a problem hiding this comment.
Apply oracle tests after the candidate and require success
When a candidate patch touches the same test files as test_patch, applying the oracle patch before the candidate and then silencing failures with || true lets the candidate overwrite or prevent the oracle test changes while the run still proceeds. That can produce false resolved verdicts against stale or model-modified tests; mirror the SWE-bench flow by applying the candidate first, then resetting/applying test_patch as a required step before running tests.
Useful? React with 👍 / 👎.
…fallback Selftest preserves test_patch and test_command (Codex P2): Previously the synthetic JSONL written by evaluate_via_docker_tests only included image/FAIL_TO_PASS/PASS_TO_PASS, dropping test_patch and test_command. Gold patches whose FAIL_TO_PASS selectors are introduced by test_patch would always report unresolved. Now both fields are preserved from the source dataset row. test_patch apply no longer silenced (Codex P2): The || true silencing meant a test_patch that conflicted with the candidate would proceed with stale or absent oracle tests. Now the apply failure is surfaced as EvalError so the caller can distinguish a broken dataset from a genuine unresolved verdict. Timeout is also properly checked before and after the test_patch apply step. git apply --3way fallback for candidate patch (Codex P2): Mirrors the SWE-bench harness fallback sequence. When strict git apply fails (e.g. line-number skew between patch and actual file), a three- way merge is attempted before the instance is classified PatchApplyFailed. This reduces false PatchApplyFailed verdicts for lightly drifted patches. https://claude.ai/code/session_01EvN3BDAbWh1cme8isYoPQx
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f6c990b72a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| }) | ||
| .map_err(|e| format!("test_patch inject failed: {e}"))?; | ||
|
|
||
| if tp_inject.status.success() { |
There was a problem hiding this comment.
Treat oracle patch injection failures as evaluator errors
When a dataset row has a non-empty test_patch but the docker exec ... cat > /tmp/test.patch command exits non-zero, this if simply skips the oracle patch and continues to apply the candidate and run tests. In environments where the copy fails, for example due to /tmp being unwritable/full or the exec failing after spawn, docker-tests can grade against stale or missing oracle tests instead of surfacing an evaluator setup error; this should mirror candidate patch injection and return an error on a non-success status.
Useful? React with 👍 / 👎.
| for (inst, _) in &eval_pairs { | ||
| let row = serde_json::json!({ | ||
| "instance_id": inst.instance_id, | ||
| "image": inst.image, |
There was a problem hiding this comment.
Preserve alternate image fields in selftest datasets
For evaluator-selftest --backend docker-tests on datasets that specify the runnable image via image_name or docker_image instead of the top-level image, this synthetic row drops those fields even though run_docker_tests() now knows how to read them. The selftest then falls back to swebench/sweb.eval.x86_64.<id> and can fail with the wrong/missing image while a normal docker-tests evaluation of the original dataset would use the configured image; carry these alternate image fields (or the full other map) into the synthetic JSONL.
Useful? React with 👍 / 👎.
…age fields
evaluate.rs (Codex P2):
When docker exec ... cat > /tmp/test.patch exits non-zero the code
previously silently skipped the oracle patch and proceeded to evaluate
against stale or missing oracle tests. Now the failure is returned as
Err("test_patch inject failed: ...") → EvalError so misconfigured
environments surface a clear error rather than a false unresolved verdict.
evaluator_selftest.rs (Codex P2):
The synthetic JSONL dataset written by evaluate_via_docker_tests was
missing the image_name and docker_image alternative image fields.
Datasets that specify the runnable image via these fields would fall
through to the conventional swebench/sweb.eval.x86_64.<id> derivation
while a normal docker-tests run on the same dataset would find the
correct image. Both fields are now carried from inst.other into the
synthetic row alongside image, test_patch, and test_command.
https://claude.ai/code/session_01EvN3BDAbWh1cme8isYoPQx
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7b6c59332c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| let _ = stdin.write_all(patch_content.as_bytes()); | ||
| } | ||
| child.wait_with_output() |
There was a problem hiding this comment.
Enforce the per-instance deadline while injecting patches
When docker exec ... cat > /tmp/candidate.patch or the container filesystem/daemon stalls while receiving a patch, this path waits in write_all/wait_with_output() without using the remaining per-instance timeout. That means a docker-tests run can hang before reaching the deadline-aware git apply and pytest steps, despite --timeout-per-instance being configured; wrap patch injection (and the analogous test-patch injection) in the same timeout mechanism or copy via a bounded command.
Useful? React with 👍 / 👎.
Summary
This PR introduces a new
docker-testsoffline evaluation backend that applies candidate patches inside Docker containers and runs test suites locally without any network calls to external evaluation services. This complements the existingsb-clibackend and enables fully offline evaluation workflows.Key Changes
EvaluateBackend::DockerTestsvariant: Added to the backend enum alongside existingNone,SbCli, andRehearsaloptionsEvalExitReason::SkippedNoImagevariant: Instances without a canonical Docker image are explicitly marked as skipped (never silently unresolved), distinguishing them from other failure modesrun_docker_tests()and supporting functions that:git applyinside containersDockerTestsProvenancestruct to record backend-specific metadata (timeout, parallel worker count) in evaluation resultsother["image"]fieldevaluator_selftest::run()to accept"docker-tests"backend and route instances through the new offline evaluatorbench_evaluate()to parse--backend docker-testsflagImplementation Details
docker run --rmwith explicitdocker rm -fcleanup viaContainerGuardRAII pattern to prevent orphaned containers-v --tb=nooutput, handling PASSED/FAILED/ERROR prefixes and optional failure reasonsdocker exec bash -c "cat > /tmp/candidate.patch"to avoid shell escaping issuesimagefield and JSON-encoded test lists inFAIL_TO_PASS/PASS_TO_PASSfieldsTesting
All existing tests updated to include
docker_tests: Nonein provenance structures. New test filetests/bench_docker_tests_eval.rsvalidates:https://claude.ai/code/session_01EvN3BDAbWh1cme8isYoPQx