feat(m1b): SearchStrategy Protocol + BFTS-lite + Evaluator trust boundary by suzuke · Pull Request #3 · suzuke/autocrucible

suzuke · 2026-04-25T13:12:07Z

Summary

Stacked on #2 (M1a TrialLedger). Adds searchable experiment tree primitives — the differentiating feature of v1.0 vs the existing greedy/restart/beam loop.

What's new

SearchStrategy Protocol

crucible/strategy.py — SearchStrategy Protocol with 4 actions: Continue / BranchFrom(parent_id) / Restart / Stop.
3 reference impls: GreedyStrategy, RestartStrategy, BFTSLiteStrategy (best-first tree search over kept ledger nodes).
Strategies are pure (StrategyContext snapshot in, action out — no I/O).

Orchestrator wiring

_run_loop_serial calls strategy.decide() BEFORE legacy max_retries / convergence stops, so BFTS can pre-empt failure-streak halts via BranchFrom.
BranchFrom(parent_id) resolves the AttemptNode's commit via the ledger and git.reset_to_commit()s the workspace — the next iteration runs from that ancestor.
parent_id semantics fixed: only keep outcomes advance the code-parent pointer (discard/crash/violation/skip nodes still appear in the ledger but don't claim ancestry).

Trust boundary fixes

CheatResistancePolicy SSOT — POC's security/ module (and ~3,400-input ACL test corpus) lifted into main. GuardRails(workspace=...) opts into SSOT mode with symlink/hardlink/path-traversal defenses.
Docker eval inside isolation — sandbox.py:59-64 bug closed: parse_metric in Docker mode now runs the eval cmd inside the container, host parses captured stdout.
Sealed EvalResult artefact — every iteration writes platform-owned logs/run-<tag>/iter-N/eval-result.json with seal: "content-sha256:<hex>" over canonical JSON. AttemptNode carries eval_result_ref + eval_result_sha256.
EvalResult seals correct bytes — MetricParseResult returns the eval cmd's stdout/stderr_tail; orchestrator hashes those (not run cmd's), so the integrity hash actually proves the bytes that produced the metric value.

HTML reporter tree-view

_render_tree(nodes) does DFS-by-parent walk with depth indentation. Linear chains (greedy/restart) collapse to depth=0; BFTS expansions produce the visible tree.
Branch markers (↳) on non-root cards. Orphan badge for nodes whose parent_id isn't in the ledger.

Reviewer trail

Round	Verdict	Findings addressed
1	NEEDS_TWEAK	F1 EvalResult sealing source / F2 BranchFrom pre-empts max_retries / F3 parent_id code ancestry
2	APPROVE	All 3 fixed with explicit regression tests

Stats

11 commits, ~3,400 LOC, 2,296 tests passing (4 skipped: Win + tier-3 placeholders), 0 regressions.

Known limitations (M2 candidates)

evaluation.repeat > 1 aggregate path still doesn't seal each eval stdout (returns (RunResult, metric) only). Acceptable for M1b; M2 may extend.
stderr_sha256 is tail-based (last 50 lines) not full stderr — code comments document this honestly.
BFTSLiteStrategy is naive (always picks metric-best). M2 doom-loop pruning makes it smarter about not re-expanding stale branches.
Beam strategy still uses legacy _run_loop_beam (not yet routed through SearchStrategy).

Test plan

Real-agent demo gate (sanity check)

4 runs across 2 examples × 2 strategies, 3 iter each. Total: 12 iter, ~$0.77, ~12 min wall.

Example	Strategy	Best metric	Total cost	Notable
optimize-compress	greedy	1.7973 (3.5× baseline)	$0.2479	strict descending chain
optimize-compress	bfts-lite	1.7074	$0.2068	iter-2 discard → iter-3 keep `parent=n000001` ⚡
optimize-2048	greedy	8921.2 (6.6× baseline)	$0.1532	strict monotonic, all keep
optimize-2048	bfts-lite	7595.4 (5.6× baseline)	$0.1630	strict monotonic, all keep

What this validates — end-to-end M1b wiring: [strategy] max_iterations=3 reached log proves BFTSLiteStrategy.decide() is the deciding voice; sealed eval-result.json per iteration; HTML tree-view renders for both. Most importantly: in compress-bfts the iter-3 parent_id=n000001 (NOT n000002 — the discarded sibling), confirming PR 8a's code-ancestry fix is observable in real LLM runs.

What this does NOT validate — "BFTS beats greedy" superiority. 3 iter is too short to exercise meaningful branching; M2 doom-loop pruning + 30+ iter runs are what would produce statistical evidence.

🤖 Generated with Claude Code

…e impls First M1b deliverable. Pure additive — defines the interface BFTS will implement; existing greedy/restart/beam logic in orchestrator.py is NOT yet rewired to call this Protocol. Wiring lands in M1b PR 2. What's added: - src/crucible/strategy.py (~210 LOC): - StrategyAction sum type (frozen dataclasses): * Continue — extend current branch * BranchFrom(id) — start a new branch from kept node * Restart — back to baseline * Stop — end the run cleanly - StrategyContext: pure snapshot passed to decide() — ledger nodes, metric_lookup, direction, iteration_count, consecutive_failures, plateau_threshold, max_iterations, baseline_commit. No live objects; strategies remain unit-testable in isolation. - SearchStrategy Protocol (runtime_checkable): name + decide() + should_prune(). Pure functions, no I/O, no side effects. - GreedyStrategy: stops at plateau / max_iterations. - RestartStrategy: returns Restart on plateau (else stops at max_iters). - BFTSLiteStrategy: best-first tree search. * Picks max (or min for minimize) kept node by metric_lookup. * If most-recent ledger node is already extending best, returns Continue; else BranchFrom(best_id). * should_prune is a no-op stub; M2 plugs in doom-loop pruning. - make_strategy(name) factory. - tests/test_strategy.py (22 tests, ~200 LOC): - All 3 reference impls satisfy SearchStrategy Protocol (isinstance check) - make_strategy dispatches and raises on unknown name - GreedyStrategy: continue on no failures, stop at plateau / max_iters - RestartStrategy: returns Restart on plateau, Stop on max_iters - BFTSLiteStrategy: * Continue when no kept nodes (cold start) * Continue when single kept node and most-recent IS that node * BranchFrom best when most-recent is on wrong branch * Minimize objective picks smallest metric * Continue when already extending the best branch * Stop at max_iterations * should_prune always False (M1b PR 1 stub) - StrategyAction frozen + equality contract Test results: 22/22 strategy tests pass; 127 passed total when run alongside M1a tests (ledger + orchestrator-ledger + reporter + demo gate). 0 regressions. Per spec §M1b deliverable 1: SearchStrategy interface extracted ✅ Pending: PR 2 will rewire orchestrator.run_loop to call SearchStrategy instead of branching on config.search.strategy strings. Existing greedy/ restart/beam behavior preserved through the new GreedyStrategy / RestartStrategy reference impls. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Second M1b deliverable. Existing greedy / restart behavior in _run_loop_serial now flows through the new SearchStrategy Protocol; "beam" still uses the legacy _run_loop_beam path (M1b PR 3 will lift it). What changed: src/crucible/strategy.py - Renamed StrategyContext.consecutive_failures → plateau_streak. The field is "iterations since last improvement," same as Orchestrator._count_plateau_streak. The original name was misleading because consecutive_failures counts crashes/discards, not no-progress iterations. Rename clarifies the contract before any production code relies on the field. src/crucible/orchestrator.py - Import { BranchFrom, Continue as StrategyContinue, Restart as StrategyRestart, SearchStrategy, Stop as StrategyStop, StrategyContext, make_strategy }. - __init__ instantiates self.strategy = make_strategy(config.search.strategy) for non-beam runs. Falls back to None on unknown name (legacy path). Beam still uses legacy. - _run_loop_serial: replaces the old `if strategy == "restart" and ... plateau_streak >= plateau_threshold` block with strategy.decide(ctx). Action handling: Stop → log reason, break loop Restart → existing reset_to_commit + context.add_error BranchFrom → log "not yet wired" (M1b PR 3 ships expansion) Continue → fall through to next iteration - Legacy fallback path retained for the unknown-strategy case so a typo in config.yaml doesn't crash the run; behavior is identical to pre-PR. tests/test_strategy.py - All `failures=` kwargs renamed to `streak=` to match the new StrategyContext field name. Test results: 145 passed, 4 skipped (parameter-golf agent-creates + tier-3 placeholders + Windows). Includes test_orchestrator suite which exercises the rewired loop with mocked agents — no behavioral regressions in greedy/restart paths. Per spec §M1b deliverable 2: orchestrator wires through SearchStrategy ✅ Pending PR 3: implement BranchFrom git checkout + workspace reset for true BFTS expansion. Beam loop unified into Protocol path (currently still uses legacy _run_loop_beam). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Third M1b deliverable. BFTSLiteStrategy can now actually move the workspace to a non-most-recent kept node via git reset_to_commit; the next iteration runs from that commit and chains parent_id correctly. What's added: - src/crucible/orchestrator.py: - _lookup_commit_for_node(node_id): reads the ledger and returns the git sha for the AttemptNode with that id (or None for violation/ skip / unknown). - _build_strategy_context(session_count, plateau_threshold, max_iterations): builds the StrategyContext with ledger_nodes + metric_lookup so BFTSLiteStrategy gets full attempt-tree visibility. metric_lookup is built from ResultsLog records keyed by AttemptNode id (handles both linear "n000042" and beam "b1n000042" schemes). - _run_loop_serial: BranchFrom action now actually: 1. resolves parent_id → commit sha via _lookup_commit_for_node 2. self.git.reset_to_commit(target_commit) 3. logs and adds context note for the agent 4. updates _last_attempt_id_by_beam[None] so the next AttemptNode chains parent_id back to the requested ancestor 5. resets consecutive_failures + consecutive_skips counters - If commit resolution fails (e.g., parent_id refers to a violation node with no commit), logs a warning and falls through to Continue semantics — never crash the loop. - tests/test_orchestrator_ledger.py (4 new tests): - _lookup_commit_for_node returns the right sha - Violation node returns None (no commit) - Unknown id returns None - _build_strategy_context populates metric_lookup correctly with ExperimentRecord → AttemptNode id mapping Test results: 22 in test_orchestrator_ledger pass, 158 pass when run alongside test_strategy + test_ledger + test_orchestrator + test_demo_gate + test_reporter_html. 0 regressions. Per spec §M1b deliverable 3: BFTS-lite is now executable end-to-end — strategy.decide() returns BranchFrom → orchestrator resets workspace → next iteration runs from the chosen ancestor. Demo gate (M1b PR 4) will prove BFTS produces a more efficient search than greedy on optimize-compress / optimize-2048. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…dator Fourth M1b deliverable. End-to-end exercise of the BFTS-lite strategy through the real Orchestrator loop with a deterministic FakeAgent. What's added: - src/crucible/config.py: - SearchConfig.strategy comment now includes "bfts-lite" - load_config validator accepts "bfts-lite" alongside greedy/restart/beam - Updated error message lists all four valid strategies - tests/test_bfts_integration.py (5 tests, ~190 LOC): - test_orchestrator_instantiates_bfts_strategy: confirms config.search.strategy="bfts-lite" produces a BFTSLiteStrategy instance on the Orchestrator - test_greedy_produces_linear_chain: greedy + monotonic-improvement agent → 3 keep nodes in linear chain (n1 → n2 → n3) - test_bfts_produces_linear_chain_when_monotonic: same agent under BFTS → still linear, because most-recent IS always the best, so BFTS returns Continue (no Branch needed) - test_bfts_branches_after_discard: ChattyAgent produces mixed keep/discard, asserts both outcomes appear in ledger (BFTS exercises BranchFrom/Continue mix) - test_bfts_branchfrom_uses_legacy_path_for_violation_nodes: _lookup_commit_for_node returns None for unknown ids (no crash) Each test builds an isolated tmp git repo with synthetic solution.py + evaluate.py + .crucible/config.yaml; deterministic agent edits the multiplier in solution.py; evaluate.py computes sum(f(0..9)). Workaround: evaluate.py purges __pycache__ before importing solution, because tests change solution.py faster than mtime resolution allows Python to detect staleness against cached .pyc files. Production runs have full-second mtime gaps and don't need this; only synthetic tests do. Test results: test_strategy: 22/22 ✅ test_ledger: 22/22 ✅ (1 Win skip) test_orchestrator_ledger: 22/22 ✅ test_orchestrator: full pass ✅ test_demo_gate: 53/56 ✅ (3 expected skips) test_reporter_html: 14/14 ✅ test_bfts_integration: 5/5 ✅ (NEW) test_config: full pass ✅ Per spec §M1b deliverable 4: integration test demonstrates BFTS-lite runs end-to-end through orchestrator. Demo gate (M1b PR 5) will be a real-agent comparison on optimize-compress / optimize-2048 to claim "BFTS beats greedy" with empirical numbers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Fifth M1b deliverable, step a/c. Lifts the CheatResistancePolicy module from POC worktree into the production codebase, plus the entire ACL test corpus (~3,400 adversarial inputs) so future PRs that touch the ACL logic have an immediate regression net. What's added: - src/crucible/security/__init__.py: package init exposing CheatResistancePolicy, Classification, PolicyViolation - src/crucible/security/cheat_resistance_policy.py (~165 LOC): SSOT for path classification per spec §3 / §11. classify(path) returns one of editable / readonly / hidden / unlisted with priority order: 1. Workspace boundary (paths outside → unlisted) 2. Inode collision (hardlink to hidden/readonly) 3. Direct path match (after symlink resolution) 4. Default → unlisted assert_writable / assert_visible raise PolicyViolation. - tests/security/__init__.py - tests/security/test_acl_boundary.py: 36 hardcoded ACL attacks (path traversal, symlink, hardlink, glob, case mutations) - tests/security/test_acl_property.py: 9 Hypothesis properties (~1,330 generated examples). Strongest invariant: no mutation can upgrade hidden → editable. - tests/security/test_acl_mutation.py: 1,957 combinatoric variants (encoding × case × suffix × seed cartesian product) - tests/security/test_executor_escape.py: 70 LocalPythonExecutor escape attempts (smolagents 1.24.0 with additional_authorized_imports=[]) All blocked or neutered (5 disguised-payload escapes verified inert via sentinel-file checks). - pyproject.toml: dev dep additions - hypothesis>=6.150 (property tests) - smolagents>=1.24.0 (executor escape tests) Test results: 2,072 / 2,072 security tests pass (1 Win-only skipped on macOS), 0 regressions on the rest of the suite. Per spec §M1b deliverable 2 — Evaluator trust boundary part 1: CheatResistancePolicy now lives in main. PR 5b will refactor guardrails.py to read from this single source of truth. PR 5c will fix sandbox.py:59-64 (the eval-cmd-on-host break) so evaluator runs in the same isolation as run_cmd. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Fifth M1b deliverable, step b/c. GuardRails now delegates path classification to CheatResistancePolicy when constructed with a workspace, gaining symlink/hardlink/path-traversal defenses without changing the external API. What changed: src/crucible/guardrails.py - Constructor signature extended (backward-compatible): GuardRails(editable, readonly) # legacy mode GuardRails(editable, readonly, hidden, workspace=path) # SSOT mode - In SSOT mode, builds an internal CheatResistancePolicy and `check_edits()` runs every modified file through `policy.classify()`, catching: * path traversal (`../etc/passwd` → unlisted) * absolute path escape outside workspace * symlink redirection (solution.py → evaluate.py) * hardlink aliasing (caught via inode collision) * `.crucible/` always platform-protected (defense in depth) New "hidden" violation kind for hidden files (previously caught only via Claude SDK PreToolUse hooks; now also at the GuardRails layer). - In legacy mode, behavior identical to pre-M1b: pure string membership check. Existing callers without workspace= keep working. - New `add_editable(rel_path)` method updates both the legacy set AND the underlying CheatResistancePolicy in lock-step. Used by orchestrator's `allow_install` path to grant requirements.txt edits mid-run. - New `policy` property exposes the CheatResistancePolicy when SSOT mode is active, for callers needing direct classify access. src/crucible/orchestrator.py - GuardRails construction now passes `hidden=config.files.hidden` and `workspace=self.workspace`, enabling SSOT mode for production runs. - allow_install path uses `add_editable()` instead of mutating the set directly, so the policy stays in sync. Test results: 2,189 passed, 3 skipped (1 Win-only + 2 tier-3 placeholders). 0 regressions. Existing test_guardrails string-only tests unaffected; test_orchestrator.test_allow_install_installs_on_requirements_change exercises the add_editable sync path. Per spec §M1b deliverable 2 — Evaluator trust boundary part 2: GuardRails now SSOT-aware. PR 5c will fix sandbox.py:59-64 (eval cmd running on host while run_cmd is in container) so evaluator inherits the same isolation policy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Sixth M1b deliverable, step c/c. Closes the security finding the v3.2 spec review surfaced at sandbox.py:59-64: in Docker mode, run_cmd was running inside the container, but parse_metric (eval_cmd) ran on the host. That meant any metric file the agent could write on the host — or any command injection through eval_command — could bypass the isolation that protected run_cmd. What changed: src/crucible/runner.py - RunResult dataclass gains a new optional field `stdout: str = ""`. Existing callers unaffected (default empty); SandboxRunner uses it to capture Docker container stdout for metric extraction. - ExperimentRunner.execute populates stdout on RunResult (parity). src/crucible/sandbox.py - SandboxRunner.parse_metric now branches on backend: backend == "docker": runs eval_command via _docker_run (same isolation policy as run_cmd: --network none, mounts, shadow envs, resource limits). Host process parses captured stdout with the same regex as the native runner. backend != "docker": delegates to native ExperimentRunner (unchanged behavior). Returns the float metric value or None on no-match / parse failure. - _docker_run captures stdout in RunResult. tests/test_sandbox.py - Renamed test_parse_metric_always_native → test_parse_metric_native_when_backend_none (the old name asserted the now-fixed v3.2 break). - Added 4 new tests: * test_parse_metric_docker_uses_isolation: docker backend routes eval through _docker_run, NOT native; native.parse_metric NEVER called when backend=docker. * test_parse_metric_native_falls_through: backend=none keeps native delegation. * test_parse_metric_docker_returns_none_on_no_match: missing metric in stdout → None, no crash. * test_parse_metric_docker_handles_invalid_float: malformed metric value → None, no exception. - Updated existing RunResult test fixtures to use stdout= keyword. Test results: 2,261 passed, 4 skipped (Win + tier-3 placeholders). 0 regressions. Per spec §M1b deliverable 2 — Evaluator trust boundary part 3: Trust break closed. Docker mode is now end-to-end isolated: run_cmd in container → eval_cmd in container → host parses captured stdout → host writes sealed EvalResult (M1b PR 6 will add the EvalResult artefact write-out). Spec §11 INV-2 / INV-3 alignment: - Eval and run share IsolationPolicy ✅ - Host process owns metric extraction ✅ - Agent code cannot influence which file the platform parses ✅ Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…tNode Sixth M1b deliverable. After each iteration's metric eval, orchestrator now writes logs/run-<tag>/iter-N/eval-result.json — the platform-owned sealed artefact promised by spec §4 / §11. AttemptNode.eval_result_ref and eval_result_sha256 are populated from this write. What's added: src/crucible/orchestrator.py - _write_eval_result_artifact(iteration, commit_hash, run_result, metric_value, run_duration_seconds) -> tuple[rel_path, sha256]: * Builds an EvalResult dataclass populated from the iteration's run/parse_metric outcome. * eval_manifest_hash = sha256(eval_command|run_command|metric_name). M2 will extend to entry_file_hash + config_hash. * stdout_sha256 + stderr_sha256 over the captured streams. * seal field set to "content-sha256:<hex>" computed over canonical JSON (sorted keys, no whitespace) of the full payload. M2 will upgrade to HMAC-SHA256 once the secret-key path lands. * Writes the JSON to logs/run-<tag>/iter-N/eval-result.json. * Returns (relative_path_from_workspace, sha256_of_disk_bytes) so callers thread the integrity hash onto AttemptNode. * On any I/O exception, returns (None, None) and logs WARNING — never breaks the iteration loop. - run_one_iteration: calls _write_eval_result_artifact after the eval block, before _dual_log. Populates self._pending_eval_result_ref and self._pending_eval_result_sha256 (state lasts only across the next dual_log call). - _record_to_attempt_node: reads _pending_eval_result_* and threads them onto AttemptNode.eval_result_ref / eval_result_sha256. tests/test_orchestrator_ledger.py (3 new tests): - test_write_eval_result_artifact_creates_json_with_seal: verifies file lands at spec path, payload has metric/exit_code/seal in "content-sha256:" prefix, returned sha matches disk - test_write_eval_result_artifact_handles_missing_metric: metric_value=None → valid=False, seal still computed - test_write_eval_result_artifact_handles_io_failure: bad workspace path → returns (None, None) without raising Smoke verification (zero token): The two-iteration smoke now produces: logs/run-smoke/iter-1/eval-result.json logs/run-smoke/iter-2/eval-result.json And the corresponding AttemptNodes carry eval_result_ref + eval_result_sha256 populated from the actual on-disk hashes. Test results: 25 in test_orchestrator_ledger pass; full suite still green for orchestrator + bfts_integration. 0 regressions. Per spec §M1b deliverable 6 + §11 reviewer notes: - Eval result is platform-owned ✅ - Integrity hash present (M1: content sha; M2: HMAC) ✅ - Reader must always recompute on read (caller responsibility, documented in docs/LEDGER.md) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Seventh M1b deliverable. The reporter now renders the ledger as an indented tree, making BFTS branch expansions visually obvious. Linear chains (greedy/restart) still render the same as M1a — the new code collapses to depth=0 for them. What changed: src/crucible/reporter/html_tree.py - New _render_tree(nodes, best_id, metric_lookup): builds the parent → children map and walks DFS, emitting cards with progressive depth. - _render_card(node, ..., depth=0, orphan=False): added depth parameter (margin-left: depth*32px) and an "orphan" flag for nodes whose parent_id is not in the ledger. - Roots (parent_id=None) render at depth=0; descendants get "↳" branch marker + indentation. - Defensive: cycles short-circuited by visited set; orphan nodes rendered at depth=0 below the tree with an orange "orphan" badge. - CSS additions: .branch-marker (subtle gray ↳), .card-orphan (yellow tint), .orphan-badge. tests/test_reporter_html.py (4 new tests): - test_tree_view_indents_descendants: 3-level chain has 32/64px margins and ≥2 branch markers - test_tree_view_handles_branching: 2 children of same parent each at depth=1 with branch markers - test_tree_view_orphan_node: parent_id pointing to absent node renders with "orphan" badge + visible label - test_tree_view_linear_chain_unchanged: BFTS DFS produces same document order as M1a linear timeline for non-branching ledgers (regression assurance) Test results: 17 reporter tests pass (13 existing + 4 new). Per spec §M1b deliverable 3: HTML report now visualises BFTS expansion structure. The "BFTS picked n000001 over n000002" decision is no longer hidden in a flat list — viewers see the actual tree shape with branches and orphans. M3 still owns: - d3.js interactive expand/collapse - hover-to-show full diff - multi-run comparison overlays Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Reviewer feedback (M1b implementation review): the existing parent_id semantics weakened BFTS decisions and the HTML tree. After a discard / crash / violation / skip, git resets to the previous kept commit, so the next attempt actually runs FROM that kept state — but the ledger parent_id still pointed at the rejected node. Now: only "keep" outcomes advance the code-parent pointer. Discard / crash / violation / skip nodes still appear in the ledger (with their contemporary parent_id pointing to the last kept ancestor), but they do NOT become parents for subsequent attempts. The result is a tree where the trunk is kept commits and rejected siblings branch off the last good ancestor. What changed: src/crucible/orchestrator.py - _record_to_attempt_node: only updates _last_attempt_id_by_beam[beam] when record.status == "keep". Discard / crash etc. still get a ledger entry with the current parent_id, but the tracker stays put. - _ledger_log_no_commit: violation / skip never advance the tracker either (they have no commit by definition). - resume(): rebuild parent chain only from KEPT nodes — preserves code-ancestry semantics across restarts. Test updates (3 fixed, 1 renamed for clarity): - test_dual_log_records_match_in_full_run: assertions now reflect code-ancestry chain (crash → keep → discard → keep → violation produces parent_ids [None, None, n2, n2, n4]). - test_no_commit_parent_does_not_advance (renamed from test_no_commit_parent_chain_extends): keep → violation → keep produces parent_ids [None, n1, n1] — third attempt extends from the FIRST kept, not the violation. - test_resume_reads_ledger_for_iteration_max: _last_attempt_id_by_beam holds the kept node, not the trailing violation. - test_resume_reconstructs_beam_chain: per-beam tracker holds last kept per beam. Test results: 83 pass (orchestrator_ledger + bfts_integration + strategy + orchestrator), 0 regressions. Per reviewer F3 expectations: BFTS now sees a tree where parent_id == "this attempt actually ran from this code state", which is the correct ancestry signal for branch-from decisions. PR 8b will fix the EvalResult sealing source (run_result.stdout vs parse_metric stdout); PR 8c will fix the BFTS gate ordering relative to legacy max_retries / convergence stops. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Reviewer F1 fix: PR 6's EvalResult sealed run_result.stdout (the run command's output) and run_result.stderr_tail (the run command's stderr tail). But the metric value comes from a SEPARATE subprocess running the eval command, whose stdout/stderr were not exposed to the orchestrator. So the integrity hash didn't actually prove the bytes that produced metric_value. Now: the runner returns a MetricParseResult containing the eval command's stdout, stderr_tail, exit_code, timed_out alongside the parsed metric value. Orchestrator hashes those bytes for EvalResult. The integrity hash is honest: it covers what the eval actually saw. What changed: src/crucible/runner.py - New MetricParseResult dataclass: metric_value, stdout, stderr_tail, exit_code, timed_out. - ExperimentRunner.parse_metric_result(eval_cmd, metric_name): runs the eval subprocess, captures stdout AND stderr (tail), parses the metric, returns MetricParseResult. - ExperimentRunner.parse_metric: backward-compat shim that calls parse_metric_result and returns just the float. src/crucible/sandbox.py - SandboxRunner.parse_metric_result: same shape; in Docker mode, runs the eval command via _docker_run (closing the v3.2 break) and returns MetricParseResult with the container's captured stdout. Native mode delegates to ExperimentRunner.parse_metric_result. - SandboxRunner.parse_metric: backward-compat shim. src/crucible/orchestrator.py - run_one_iteration: prefers parse_metric_result when the runner has it (real ExperimentRunner / SandboxRunner). Falls back to parse_metric for unknown / mocked runners that haven't been upgraded — preserves backward compat. - _write_eval_result_artifact signature changed: takes seal_stdout, seal_stderr_tail, seal_exit_code, seal_timed_out (eval bytes) instead of run_result. Hashes seal_stdout for stdout_sha256 and seal_stderr_tail for stderr_sha256. The reviewer noted "stderr is just the tail" — docstring now flags this honestly; M2 may capture full stderr if needed. Tests: - tests/test_orchestrator.py: bulk-updated 14 mock sites: `patch.object(orch.runner, "parse_metric")` → `parse_metric_result`, return values wrapped in MetricParseResult(metric_value=N, stdout="", stderr_tail=""). side_effect lists rewritten to MetricParseResult elements (covers all max-iterations + convergence tests). - tests/test_sandbox.py: 2 native-fallthrough tests now mock parse_metric_result. - tests/test_orchestrator_ledger.py: 3 _write_eval_result_artifact tests updated for new signature, plus new test_write_eval_result_artifact_seals_eval_stdout_not_run_stdout (regression for reviewer F1 explicitly). Test results: 179 passed, 3 skipped (tier-3 placeholders + Win). 0 regressions. Per spec §11 / reviewer F1: - The eval cmd's stdout is the source of metric_value - That same stdout is hashed into stdout_sha256 - The seal proves the bytes that produced the metric - v3.2 trust-boundary claim is now genuinely closed Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… (F2) Reviewer F2 fix: previously strategy.decide() was called AFTER the legacy consecutive_failures / consecutive_skips / convergence stop checks. With default max_retries=3, three failed BFTS expansions in a row would halt the run before the strategy got a chance to BranchFrom an alternative kept node. That made BFTS-lite fragile for the demo gate. Now: SearchStrategy is consulted FIRST. If it returns BranchFrom, the orchestrator applies the git reset, updates the parent chain, resets the failure counters, and `continue`s to the next iteration — bypassing the legacy stops entirely. Greedy / Restart preserve old behavior because their decide() never returns BranchFrom; the legacy stops still fire as a safety net. What changed: src/crucible/orchestrator.py - _run_loop_serial: 1. After dual_log + consecutive_skips bookkeeping, build StrategyContext + call self.strategy.decide() FIRST. 2. If action is BranchFrom: do the git reset + context update + counter reset + `continue` (skip legacy stops). 3. Otherwise, run the legacy max_retries / consecutive_skips / convergence stops as before. 4. After the legacy block, handle the remaining strategy actions (Stop / Restart / Continue) using the previously-built context. - Removed the duplicate BranchFrom branch from the lower action- handling block; it's now exclusively the upper pre-emption path. tests/test_bfts_integration.py - New test_branch_from_preempts_max_retries: stub strategy that returns Continue then BranchFrom on subsequent decides; agent produces regressions on iters 2-4 then improvements on iter 5+. With max_retries=3, pre-PR8c would halt at 4 nodes; post-PR8c we see ≥5 nodes because BranchFrom diverts before the stop. - The stub strategy isolates the gate-ordering behavior from BFTSLite's own metric-best heuristic (which would Continue when most-recent is already a child of the only kept node). Test results: 2,296 passed, 4 skipped (Win + tier-3 placeholders). 0 regressions across the broader suite. Per spec §M1b deliverable + reviewer F2: - BFTS now actually has a runway to recover from failure streaks - Greedy/Restart unchanged: their max_retries safety net still works - Demo gate (M1b PR 9) can now feasibly produce a BFTS-vs-greedy comparison without arbitrary 3-failure halts Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Sanity-check write-up for M1b: 4 real-agent runs on optimize-compress and optimize-2048 (greedy + bfts-lite each), 3 iter / 12 total, ~$0.77. Validates: - SearchStrategy seam live (BFTSLiteStrategy.decide() observable in CLI) - parent_id = code ancestry (compress-bfts iter-3 chains back to n000001 after iter-2 discard, NOT to n000002) - Sealed EvalResult artefacts per iteration with content-sha256 - HTML postmortem renders tree-view for BFTS Does NOT claim "BFTS beats greedy" — 3 iter is too short to exercise branching meaningfully. M2 with 30+ iter + doom-loop pruning is what would produce metric-superiority data. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Address 3 of 4 round-2 reviewer follow-ups (#1, #2, #3). #4 (CLI auth base class) deferred to PR 16c when there's a second adapter to motivate the abstraction. #1 stderr auth-phrase scan (codex_cli.py parse_output): If codex bails before --json takes effect (auth check fails before JSONL stream opens), the auth phrase is in stderr only. parse_output now scans `raw.stderr_tail` as a fallback after stdout — without this, the trial would land in error_type=UNKNOWN instead of AUTH. Tests: test_codex_auth_error_detected_in_stderr_when_no_jsonl, test_codex_stderr_unrelated_text_does_NOT_trigger_auth. #2 reproducibility flags (build_argv): Added --ephemeral, --ignore-user-config, --ignore-rules. Without --ignore-user-config, a user's ~/.codex/config.toml could silently override `model` and break the cli_version-as-reproducibility- snapshot guarantee (cli_version snapshots the binary version, not the model). Without --ignore-rules, user execpolicy .rules files could re-enable codex tool calls outside the workspace-write sandbox (§INV-3). Without --ephemeral, codex persists session files across runs. Test: test_codex_argv_includes_reproducibility_flags. #3 _FORBIDDEN_FLAGS calibration: Verified actual codex 0.124.0 flag names via `codex exec --help`. Removed two non-existent flags (`--bypass-approvals` was a fabrication; `--dangerously-skip-permissions` is a Claude CLI flag, not codex). Added the real `--dangerously-bypass-approvals-and-sandbox` (skips both approvals AND sandbox — explicit §INV-3 violation if ever passed). Updated meta-test to match. Stats: +111/-11 across 2 files (`git diff --stat HEAD~1 HEAD | tail -1`). Test suite: 60 passed (codex 23 + cli_subscription 37) in 0.15s. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Address 2 of 3 round-1 reviewer follow-ups (#1, #2). #3 (token cost reporting / API-vs-OAuth framing) is forward-pointing for a future docs/feature PR. #1 docstring/code mismatch (gemini_cli.py module docstring): Module docstring claimed `--include-directories` was passed to scope the blast radius. build_argv never emits it. Reality: the scope-limiting is the subprocess `cwd` (set to scratch by the backend's run_subprocess in base.py) — gemini operates only on cwd by default, no flag needed. Updated docstring to reflect actual mechanism. #2 verified auth-failure phrase: Captured real gemini 0.39.1 auth-failure stderr via HOME=tmp + empty GEMINI_API_KEY (saved as `tests/fixtures/gemini_auth_failure.stderr.txt`). The actual phrase emitted is "Please set an Auth method in your <path>/.gemini/ settings.json or specify one of the following environment variables before running: GEMINI_API_KEY, ...". Added "Please set an Auth method" as the first entry in _AUTH_FAILURE_PHRASES with inline comment marking it as VERIFIED (vs the rest which remain EXTRAPOLATED for forward-compat). New regression test test_gemini_auth_error_real_fixture_from_spike exercises the real fixture against the parser → raises GeminiCLIAuthError. Notable: gemini 0.39.1 exits 0 on auth failure (not 1) — the typed-exception pattern is what catches it, not exit-code heuristics. Stderr-fallback path (PR 16a R2 #1) is the one that fires here. Stats: +53/-15 across 3 files (gemini_cli.py +24/-15, new fixture +1, test_gemini_cli_adapter.py +28). Test results: 24 passed in test_gemini_cli_adapter.py (23 prior + 1 new fixture regression). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

suzuke and others added 13 commits April 25, 2026 18:54

This was referenced Apr 25, 2026

feat(m2): doom-loop pruning for BFTSLiteStrategy #4

Open

feat(m3): interactive d3 tree-view reporter #9

Open

suzuke mentioned this pull request May 13, 2026

docs: rescue v1.0-design-final.md into repo #20

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(m1b): SearchStrategy Protocol + BFTS-lite + Evaluator trust boundary#3

feat(m1b): SearchStrategy Protocol + BFTS-lite + Evaluator trust boundary#3
suzuke wants to merge 13 commits into
feat/m1a-trial-ledgerfrom
feat/m1b-search-strategy

suzuke commented Apr 25, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

suzuke commented Apr 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What's new

SearchStrategy Protocol

Orchestrator wiring

Trust boundary fixes

HTML reporter tree-view

Reviewer trail

Stats

Known limitations (M2 candidates)

Test plan

Real-agent demo gate (sanity check)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

suzuke commented Apr 25, 2026 •

edited

Loading