Skip to content

feat(m1b): SearchStrategy Protocol + BFTS-lite + Evaluator trust boundary#3

Open
suzuke wants to merge 13 commits into
feat/m1a-trial-ledgerfrom
feat/m1b-search-strategy
Open

feat(m1b): SearchStrategy Protocol + BFTS-lite + Evaluator trust boundary#3
suzuke wants to merge 13 commits into
feat/m1a-trial-ledgerfrom
feat/m1b-search-strategy

Conversation

@suzuke
Copy link
Copy Markdown
Owner

@suzuke suzuke commented Apr 25, 2026

Summary

Stacked on #2 (M1a TrialLedger). Adds searchable experiment tree primitives — the differentiating feature of v1.0 vs the existing greedy/restart/beam loop.

What's new

SearchStrategy Protocol

  • crucible/strategy.pySearchStrategy Protocol with 4 actions: Continue / BranchFrom(parent_id) / Restart / Stop.
  • 3 reference impls: GreedyStrategy, RestartStrategy, BFTSLiteStrategy (best-first tree search over kept ledger nodes).
  • Strategies are pure (StrategyContext snapshot in, action out — no I/O).

Orchestrator wiring

  • _run_loop_serial calls strategy.decide() BEFORE legacy max_retries / convergence stops, so BFTS can pre-empt failure-streak halts via BranchFrom.
  • BranchFrom(parent_id) resolves the AttemptNode's commit via the ledger and git.reset_to_commit()s the workspace — the next iteration runs from that ancestor.
  • parent_id semantics fixed: only keep outcomes advance the code-parent pointer (discard/crash/violation/skip nodes still appear in the ledger but don't claim ancestry).

Trust boundary fixes

  • CheatResistancePolicy SSOT — POC's security/ module (and ~3,400-input ACL test corpus) lifted into main. GuardRails(workspace=...) opts into SSOT mode with symlink/hardlink/path-traversal defenses.
  • Docker eval inside isolationsandbox.py:59-64 bug closed: parse_metric in Docker mode now runs the eval cmd inside the container, host parses captured stdout.
  • Sealed EvalResult artefact — every iteration writes platform-owned logs/run-<tag>/iter-N/eval-result.json with seal: "content-sha256:<hex>" over canonical JSON. AttemptNode carries eval_result_ref + eval_result_sha256.
  • EvalResult seals correct bytesMetricParseResult returns the eval cmd's stdout/stderr_tail; orchestrator hashes those (not run cmd's), so the integrity hash actually proves the bytes that produced the metric value.

HTML reporter tree-view

  • _render_tree(nodes) does DFS-by-parent walk with depth indentation. Linear chains (greedy/restart) collapse to depth=0; BFTS expansions produce the visible tree.
  • Branch markers (↳) on non-root cards. Orphan badge for nodes whose parent_id isn't in the ledger.

Reviewer trail

Round Verdict Findings addressed
1 NEEDS_TWEAK F1 EvalResult sealing source / F2 BranchFrom pre-empts max_retries / F3 parent_id code ancestry
2 APPROVE All 3 fixed with explicit regression tests

Stats

  • 11 commits, ~3,400 LOC, 2,296 tests passing (4 skipped: Win + tier-3 placeholders), 0 regressions.

Known limitations (M2 candidates)

  • evaluation.repeat > 1 aggregate path still doesn't seal each eval stdout (returns (RunResult, metric) only). Acceptable for M1b; M2 may extend.
  • stderr_sha256 is tail-based (last 50 lines) not full stderr — code comments document this honestly.
  • BFTSLiteStrategy is naive (always picks metric-best). M2 doom-loop pruning makes it smarter about not re-expanding stale branches.
  • Beam strategy still uses legacy _run_loop_beam (not yet routed through SearchStrategy).

Test plan

  • SearchStrategy Protocol unit tests (22 tests covering all 4 actions × 3 strategies)
  • BFTS integration through real Orchestrator (5 tests with FakeAgent)
  • BranchFrom pre-empts max_retries (regression for reviewer F2)
  • parent_id is code ancestry, not sequence (4 tests for reviewer F3)
  • EvalResult seals eval cmd bytes (regression for reviewer F1)
  • Docker parse_metric runs in isolation (4 tests)
  • HTML tree-view: indentation / branching / orphans / linear-chain unchanged
  • CheatResistancePolicy SSOT integration in GuardRails
  • 2,072 ACL adversarial inputs (POC corpus) all blocked
  • Real-agent demo gate — see docs/M1B-DEMO-GATE.md

Real-agent demo gate (sanity check)

4 runs across 2 examples × 2 strategies, 3 iter each. Total: 12 iter, ~$0.77, ~12 min wall.

Example Strategy Best metric Total cost Notable
optimize-compress greedy 1.7973 (3.5× baseline) $0.2479 strict descending chain
optimize-compress bfts-lite 1.7074 $0.2068 iter-2 discard → iter-3 keep parent=n000001
optimize-2048 greedy 8921.2 (6.6× baseline) $0.1532 strict monotonic, all keep
optimize-2048 bfts-lite 7595.4 (5.6× baseline) $0.1630 strict monotonic, all keep

What this validates — end-to-end M1b wiring: [strategy] max_iterations=3 reached log proves BFTSLiteStrategy.decide() is the deciding voice; sealed eval-result.json per iteration; HTML tree-view renders for both. Most importantly: in compress-bfts the iter-3 parent_id=n000001 (NOT n000002 — the discarded sibling), confirming PR 8a's code-ancestry fix is observable in real LLM runs.

What this does NOT validate — "BFTS beats greedy" superiority. 3 iter is too short to exercise meaningful branching; M2 doom-loop pruning + 30+ iter runs are what would produce statistical evidence.

🤖 Generated with Claude Code

suzuke and others added 13 commits April 25, 2026 18:54
…e impls

First M1b deliverable. Pure additive — defines the interface BFTS will
implement; existing greedy/restart/beam logic in orchestrator.py is NOT
yet rewired to call this Protocol. Wiring lands in M1b PR 2.

What's added:
- src/crucible/strategy.py (~210 LOC):
  - StrategyAction sum type (frozen dataclasses):
    * Continue       — extend current branch
    * BranchFrom(id) — start a new branch from kept node
    * Restart        — back to baseline
    * Stop           — end the run cleanly
  - StrategyContext: pure snapshot passed to decide() — ledger nodes,
    metric_lookup, direction, iteration_count, consecutive_failures,
    plateau_threshold, max_iterations, baseline_commit. No live objects;
    strategies remain unit-testable in isolation.
  - SearchStrategy Protocol (runtime_checkable): name + decide() +
    should_prune(). Pure functions, no I/O, no side effects.
  - GreedyStrategy: stops at plateau / max_iterations.
  - RestartStrategy: returns Restart on plateau (else stops at max_iters).
  - BFTSLiteStrategy: best-first tree search.
    * Picks max (or min for minimize) kept node by metric_lookup.
    * If most-recent ledger node is already extending best, returns
      Continue; else BranchFrom(best_id).
    * should_prune is a no-op stub; M2 plugs in doom-loop pruning.
  - make_strategy(name) factory.

- tests/test_strategy.py (22 tests, ~200 LOC):
  - All 3 reference impls satisfy SearchStrategy Protocol (isinstance check)
  - make_strategy dispatches and raises on unknown name
  - GreedyStrategy: continue on no failures, stop at plateau / max_iters
  - RestartStrategy: returns Restart on plateau, Stop on max_iters
  - BFTSLiteStrategy:
    * Continue when no kept nodes (cold start)
    * Continue when single kept node and most-recent IS that node
    * BranchFrom best when most-recent is on wrong branch
    * Minimize objective picks smallest metric
    * Continue when already extending the best branch
    * Stop at max_iterations
    * should_prune always False (M1b PR 1 stub)
  - StrategyAction frozen + equality contract

Test results: 22/22 strategy tests pass; 127 passed total when run
alongside M1a tests (ledger + orchestrator-ledger + reporter + demo gate).
0 regressions.

Per spec §M1b deliverable 1: SearchStrategy interface extracted ✅
Pending: PR 2 will rewire orchestrator.run_loop to call SearchStrategy
instead of branching on config.search.strategy strings. Existing greedy/
restart/beam behavior preserved through the new GreedyStrategy /
RestartStrategy reference impls.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Second M1b deliverable. Existing greedy / restart behavior in
_run_loop_serial now flows through the new SearchStrategy Protocol;
"beam" still uses the legacy _run_loop_beam path (M1b PR 3 will lift it).

What changed:

src/crucible/strategy.py
  - Renamed StrategyContext.consecutive_failures → plateau_streak. The
    field is "iterations since last improvement," same as
    Orchestrator._count_plateau_streak. The original name was misleading
    because consecutive_failures counts crashes/discards, not no-progress
    iterations. Rename clarifies the contract before any production code
    relies on the field.

src/crucible/orchestrator.py
  - Import { BranchFrom, Continue as StrategyContinue, Restart as
    StrategyRestart, SearchStrategy, Stop as StrategyStop, StrategyContext,
    make_strategy }.
  - __init__ instantiates self.strategy = make_strategy(config.search.strategy)
    for non-beam runs. Falls back to None on unknown name (legacy path).
    Beam still uses legacy.
  - _run_loop_serial: replaces the old `if strategy == "restart" and ...
    plateau_streak >= plateau_threshold` block with strategy.decide(ctx).
    Action handling:
      Stop          → log reason, break loop
      Restart       → existing reset_to_commit + context.add_error
      BranchFrom    → log "not yet wired" (M1b PR 3 ships expansion)
      Continue     → fall through to next iteration
  - Legacy fallback path retained for the unknown-strategy case so a typo
    in config.yaml doesn't crash the run; behavior is identical to pre-PR.

tests/test_strategy.py
  - All `failures=` kwargs renamed to `streak=` to match the new
    StrategyContext field name.

Test results: 145 passed, 4 skipped (parameter-golf agent-creates +
tier-3 placeholders + Windows). Includes test_orchestrator suite which
exercises the rewired loop with mocked agents — no behavioral
regressions in greedy/restart paths.

Per spec §M1b deliverable 2: orchestrator wires through SearchStrategy ✅
Pending PR 3: implement BranchFrom git checkout + workspace reset for
true BFTS expansion. Beam loop unified into Protocol path (currently
still uses legacy _run_loop_beam).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Third M1b deliverable. BFTSLiteStrategy can now actually move the
workspace to a non-most-recent kept node via git reset_to_commit; the
next iteration runs from that commit and chains parent_id correctly.

What's added:
- src/crucible/orchestrator.py:
  - _lookup_commit_for_node(node_id): reads the ledger and returns the
    git sha for the AttemptNode with that id (or None for violation/
    skip / unknown).
  - _build_strategy_context(session_count, plateau_threshold,
    max_iterations): builds the StrategyContext with ledger_nodes +
    metric_lookup so BFTSLiteStrategy gets full attempt-tree visibility.
    metric_lookup is built from ResultsLog records keyed by AttemptNode
    id (handles both linear "n000042" and beam "b1n000042" schemes).
  - _run_loop_serial: BranchFrom action now actually:
    1. resolves parent_id → commit sha via _lookup_commit_for_node
    2. self.git.reset_to_commit(target_commit)
    3. logs and adds context note for the agent
    4. updates _last_attempt_id_by_beam[None] so the next AttemptNode
       chains parent_id back to the requested ancestor
    5. resets consecutive_failures + consecutive_skips counters
  - If commit resolution fails (e.g., parent_id refers to a violation
    node with no commit), logs a warning and falls through to Continue
    semantics — never crash the loop.

- tests/test_orchestrator_ledger.py (4 new tests):
  - _lookup_commit_for_node returns the right sha
  - Violation node returns None (no commit)
  - Unknown id returns None
  - _build_strategy_context populates metric_lookup correctly with
    ExperimentRecord → AttemptNode id mapping

Test results: 22 in test_orchestrator_ledger pass, 158 pass when run
alongside test_strategy + test_ledger + test_orchestrator + test_demo_gate
+ test_reporter_html. 0 regressions.

Per spec §M1b deliverable 3: BFTS-lite is now executable end-to-end —
strategy.decide() returns BranchFrom → orchestrator resets workspace →
next iteration runs from the chosen ancestor. Demo gate (M1b PR 4) will
prove BFTS produces a more efficient search than greedy on
optimize-compress / optimize-2048.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dator

Fourth M1b deliverable. End-to-end exercise of the BFTS-lite strategy
through the real Orchestrator loop with a deterministic FakeAgent.

What's added:
- src/crucible/config.py:
  - SearchConfig.strategy comment now includes "bfts-lite"
  - load_config validator accepts "bfts-lite" alongside greedy/restart/beam
  - Updated error message lists all four valid strategies

- tests/test_bfts_integration.py (5 tests, ~190 LOC):
  - test_orchestrator_instantiates_bfts_strategy: confirms
    config.search.strategy="bfts-lite" produces a BFTSLiteStrategy
    instance on the Orchestrator
  - test_greedy_produces_linear_chain: greedy + monotonic-improvement
    agent → 3 keep nodes in linear chain (n1 → n2 → n3)
  - test_bfts_produces_linear_chain_when_monotonic: same agent under
    BFTS → still linear, because most-recent IS always the best, so
    BFTS returns Continue (no Branch needed)
  - test_bfts_branches_after_discard: ChattyAgent produces mixed
    keep/discard, asserts both outcomes appear in ledger (BFTS exercises
    BranchFrom/Continue mix)
  - test_bfts_branchfrom_uses_legacy_path_for_violation_nodes:
    _lookup_commit_for_node returns None for unknown ids (no crash)

Each test builds an isolated tmp git repo with synthetic solution.py +
evaluate.py + .crucible/config.yaml; deterministic agent edits the
multiplier in solution.py; evaluate.py computes sum(f(0..9)).

Workaround: evaluate.py purges __pycache__ before importing solution,
because tests change solution.py faster than mtime resolution allows
Python to detect staleness against cached .pyc files. Production
runs have full-second mtime gaps and don't need this; only synthetic
tests do.

Test results:
  test_strategy:                22/22  ✅
  test_ledger:                  22/22  ✅ (1 Win skip)
  test_orchestrator_ledger:     22/22  ✅
  test_orchestrator:            full pass ✅
  test_demo_gate:               53/56  ✅ (3 expected skips)
  test_reporter_html:           14/14  ✅
  test_bfts_integration:         5/5  ✅ (NEW)
  test_config:                  full pass ✅

Per spec §M1b deliverable 4: integration test demonstrates BFTS-lite
runs end-to-end through orchestrator. Demo gate (M1b PR 5) will be a
real-agent comparison on optimize-compress / optimize-2048 to claim
"BFTS beats greedy" with empirical numbers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fifth M1b deliverable, step a/c. Lifts the CheatResistancePolicy module
from POC worktree into the production codebase, plus the entire ACL
test corpus (~3,400 adversarial inputs) so future PRs that touch the
ACL logic have an immediate regression net.

What's added:
- src/crucible/security/__init__.py: package init exposing
  CheatResistancePolicy, Classification, PolicyViolation
- src/crucible/security/cheat_resistance_policy.py (~165 LOC): SSOT
  for path classification per spec §3 / §11. classify(path) returns
  one of editable / readonly / hidden / unlisted with priority order:
    1. Workspace boundary (paths outside → unlisted)
    2. Inode collision (hardlink to hidden/readonly)
    3. Direct path match (after symlink resolution)
    4. Default → unlisted
  assert_writable / assert_visible raise PolicyViolation.

- tests/security/__init__.py
- tests/security/test_acl_boundary.py: 36 hardcoded ACL attacks
  (path traversal, symlink, hardlink, glob, case mutations)
- tests/security/test_acl_property.py: 9 Hypothesis properties
  (~1,330 generated examples). Strongest invariant: no mutation can
  upgrade hidden → editable.
- tests/security/test_acl_mutation.py: 1,957 combinatoric variants
  (encoding × case × suffix × seed cartesian product)
- tests/security/test_executor_escape.py: 70 LocalPythonExecutor
  escape attempts (smolagents 1.24.0 with additional_authorized_imports=[])
  All blocked or neutered (5 disguised-payload escapes verified
  inert via sentinel-file checks).

- pyproject.toml: dev dep additions
  - hypothesis>=6.150 (property tests)
  - smolagents>=1.24.0 (executor escape tests)

Test results: 2,072 / 2,072 security tests pass (1 Win-only skipped on
macOS), 0 regressions on the rest of the suite.

Per spec §M1b deliverable 2 — Evaluator trust boundary part 1:
CheatResistancePolicy now lives in main. PR 5b will refactor
guardrails.py to read from this single source of truth.
PR 5c will fix sandbox.py:59-64 (the eval-cmd-on-host break) so
evaluator runs in the same isolation as run_cmd.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fifth M1b deliverable, step b/c. GuardRails now delegates path
classification to CheatResistancePolicy when constructed with a
workspace, gaining symlink/hardlink/path-traversal defenses without
changing the external API.

What changed:

src/crucible/guardrails.py
  - Constructor signature extended (backward-compatible):
      GuardRails(editable, readonly)                         # legacy mode
      GuardRails(editable, readonly, hidden, workspace=path) # SSOT mode
  - In SSOT mode, builds an internal CheatResistancePolicy and
    `check_edits()` runs every modified file through `policy.classify()`,
    catching:
      * path traversal (`../etc/passwd` → unlisted)
      * absolute path escape outside workspace
      * symlink redirection (solution.py → evaluate.py)
      * hardlink aliasing (caught via inode collision)
      * `.crucible/` always platform-protected (defense in depth)
    New "hidden" violation kind for hidden files (previously caught only
    via Claude SDK PreToolUse hooks; now also at the GuardRails layer).
  - In legacy mode, behavior identical to pre-M1b: pure string membership
    check. Existing callers without workspace= keep working.
  - New `add_editable(rel_path)` method updates both the legacy set AND
    the underlying CheatResistancePolicy in lock-step. Used by
    orchestrator's `allow_install` path to grant requirements.txt edits
    mid-run.
  - New `policy` property exposes the CheatResistancePolicy when SSOT
    mode is active, for callers needing direct classify access.

src/crucible/orchestrator.py
  - GuardRails construction now passes `hidden=config.files.hidden` and
    `workspace=self.workspace`, enabling SSOT mode for production runs.
  - allow_install path uses `add_editable()` instead of mutating the
    set directly, so the policy stays in sync.

Test results: 2,189 passed, 3 skipped (1 Win-only + 2 tier-3 placeholders).
0 regressions. Existing test_guardrails string-only tests unaffected;
test_orchestrator.test_allow_install_installs_on_requirements_change
exercises the add_editable sync path.

Per spec §M1b deliverable 2 — Evaluator trust boundary part 2:
GuardRails now SSOT-aware. PR 5c will fix sandbox.py:59-64 (eval cmd
running on host while run_cmd is in container) so evaluator inherits
the same isolation policy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sixth M1b deliverable, step c/c. Closes the security finding the v3.2
spec review surfaced at sandbox.py:59-64: in Docker mode, run_cmd was
running inside the container, but parse_metric (eval_cmd) ran on the
host. That meant any metric file the agent could write on the host —
or any command injection through eval_command — could bypass the
isolation that protected run_cmd.

What changed:

src/crucible/runner.py
  - RunResult dataclass gains a new optional field `stdout: str = ""`.
    Existing callers unaffected (default empty); SandboxRunner uses it
    to capture Docker container stdout for metric extraction.
  - ExperimentRunner.execute populates stdout on RunResult (parity).

src/crucible/sandbox.py
  - SandboxRunner.parse_metric now branches on backend:
      backend == "docker": runs eval_command via _docker_run (same
        isolation policy as run_cmd: --network none, mounts, shadow
        envs, resource limits). Host process parses captured stdout
        with the same regex as the native runner.
      backend != "docker": delegates to native ExperimentRunner
        (unchanged behavior).
    Returns the float metric value or None on no-match / parse failure.
  - _docker_run captures stdout in RunResult.

tests/test_sandbox.py
  - Renamed test_parse_metric_always_native → test_parse_metric_native_when_backend_none
    (the old name asserted the now-fixed v3.2 break).
  - Added 4 new tests:
    * test_parse_metric_docker_uses_isolation: docker backend routes
      eval through _docker_run, NOT native; native.parse_metric NEVER
      called when backend=docker.
    * test_parse_metric_native_falls_through: backend=none keeps
      native delegation.
    * test_parse_metric_docker_returns_none_on_no_match: missing
      metric in stdout → None, no crash.
    * test_parse_metric_docker_handles_invalid_float: malformed
      metric value → None, no exception.
  - Updated existing RunResult test fixtures to use stdout= keyword.

Test results: 2,261 passed, 4 skipped (Win + tier-3 placeholders).
0 regressions.

Per spec §M1b deliverable 2 — Evaluator trust boundary part 3:
Trust break closed. Docker mode is now end-to-end isolated:
run_cmd in container → eval_cmd in container → host parses captured
stdout → host writes sealed EvalResult (M1b PR 6 will add the
EvalResult artefact write-out).

Spec §11 INV-2 / INV-3 alignment:
- Eval and run share IsolationPolicy ✅
- Host process owns metric extraction ✅
- Agent code cannot influence which file the platform parses ✅

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tNode

Sixth M1b deliverable. After each iteration's metric eval, orchestrator
now writes logs/run-<tag>/iter-N/eval-result.json — the platform-owned
sealed artefact promised by spec §4 / §11. AttemptNode.eval_result_ref
and eval_result_sha256 are populated from this write.

What's added:

src/crucible/orchestrator.py
  - _write_eval_result_artifact(iteration, commit_hash, run_result,
    metric_value, run_duration_seconds) -> tuple[rel_path, sha256]:
    * Builds an EvalResult dataclass populated from the iteration's
      run/parse_metric outcome.
    * eval_manifest_hash = sha256(eval_command|run_command|metric_name).
      M2 will extend to entry_file_hash + config_hash.
    * stdout_sha256 + stderr_sha256 over the captured streams.
    * seal field set to "content-sha256:<hex>" computed over canonical
      JSON (sorted keys, no whitespace) of the full payload. M2 will
      upgrade to HMAC-SHA256 once the secret-key path lands.
    * Writes the JSON to logs/run-<tag>/iter-N/eval-result.json.
    * Returns (relative_path_from_workspace, sha256_of_disk_bytes) so
      callers thread the integrity hash onto AttemptNode.
    * On any I/O exception, returns (None, None) and logs WARNING —
      never breaks the iteration loop.

  - run_one_iteration: calls _write_eval_result_artifact after the
    eval block, before _dual_log. Populates self._pending_eval_result_ref
    and self._pending_eval_result_sha256 (state lasts only across the
    next dual_log call).

  - _record_to_attempt_node: reads _pending_eval_result_* and threads
    them onto AttemptNode.eval_result_ref / eval_result_sha256.

tests/test_orchestrator_ledger.py (3 new tests):
  - test_write_eval_result_artifact_creates_json_with_seal: verifies
    file lands at spec path, payload has metric/exit_code/seal in
    "content-sha256:" prefix, returned sha matches disk
  - test_write_eval_result_artifact_handles_missing_metric:
    metric_value=None → valid=False, seal still computed
  - test_write_eval_result_artifact_handles_io_failure: bad workspace
    path → returns (None, None) without raising

Smoke verification (zero token):
The two-iteration smoke now produces:
  logs/run-smoke/iter-1/eval-result.json
  logs/run-smoke/iter-2/eval-result.json
And the corresponding AttemptNodes carry eval_result_ref +
eval_result_sha256 populated from the actual on-disk hashes.

Test results: 25 in test_orchestrator_ledger pass; full suite still
green for orchestrator + bfts_integration. 0 regressions.

Per spec §M1b deliverable 6 + §11 reviewer notes:
- Eval result is platform-owned ✅
- Integrity hash present (M1: content sha; M2: HMAC) ✅
- Reader must always recompute on read (caller responsibility,
  documented in docs/LEDGER.md)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Seventh M1b deliverable. The reporter now renders the ledger as an
indented tree, making BFTS branch expansions visually obvious. Linear
chains (greedy/restart) still render the same as M1a — the new code
collapses to depth=0 for them.

What changed:

src/crucible/reporter/html_tree.py
  - New _render_tree(nodes, best_id, metric_lookup): builds the
    parent → children map and walks DFS, emitting cards with
    progressive depth.
  - _render_card(node, ..., depth=0, orphan=False): added depth
    parameter (margin-left: depth*32px) and an "orphan" flag for
    nodes whose parent_id is not in the ledger.
  - Roots (parent_id=None) render at depth=0; descendants get
    "↳" branch marker + indentation.
  - Defensive: cycles short-circuited by visited set; orphan nodes
    rendered at depth=0 below the tree with an orange "orphan" badge.
  - CSS additions: .branch-marker (subtle gray ↳),
    .card-orphan (yellow tint), .orphan-badge.

tests/test_reporter_html.py (4 new tests):
  - test_tree_view_indents_descendants: 3-level chain has 32/64px
    margins and ≥2 branch markers
  - test_tree_view_handles_branching: 2 children of same parent
    each at depth=1 with branch markers
  - test_tree_view_orphan_node: parent_id pointing to absent node
    renders with "orphan" badge + visible label
  - test_tree_view_linear_chain_unchanged: BFTS DFS produces same
    document order as M1a linear timeline for non-branching
    ledgers (regression assurance)

Test results: 17 reporter tests pass (13 existing + 4 new).

Per spec §M1b deliverable 3:
HTML report now visualises BFTS expansion structure. The "BFTS picked
n000001 over n000002" decision is no longer hidden in a flat list —
viewers see the actual tree shape with branches and orphans.

M3 still owns:
- d3.js interactive expand/collapse
- hover-to-show full diff
- multi-run comparison overlays

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reviewer feedback (M1b implementation review): the existing parent_id
semantics weakened BFTS decisions and the HTML tree. After a discard /
crash / violation / skip, git resets to the previous kept commit, so
the next attempt actually runs FROM that kept state — but the ledger
parent_id still pointed at the rejected node.

Now: only "keep" outcomes advance the code-parent pointer. Discard /
crash / violation / skip nodes still appear in the ledger (with their
contemporary parent_id pointing to the last kept ancestor), but they
do NOT become parents for subsequent attempts. The result is a tree
where the trunk is kept commits and rejected siblings branch off the
last good ancestor.

What changed:

src/crucible/orchestrator.py
  - _record_to_attempt_node: only updates _last_attempt_id_by_beam[beam]
    when record.status == "keep". Discard / crash etc. still get a
    ledger entry with the current parent_id, but the tracker stays put.
  - _ledger_log_no_commit: violation / skip never advance the tracker
    either (they have no commit by definition).
  - resume(): rebuild parent chain only from KEPT nodes — preserves
    code-ancestry semantics across restarts.

Test updates (3 fixed, 1 renamed for clarity):
  - test_dual_log_records_match_in_full_run: assertions now reflect
    code-ancestry chain (crash → keep → discard → keep → violation
    produces parent_ids [None, None, n2, n2, n4]).
  - test_no_commit_parent_does_not_advance (renamed from
    test_no_commit_parent_chain_extends): keep → violation → keep
    produces parent_ids [None, n1, n1] — third attempt extends from
    the FIRST kept, not the violation.
  - test_resume_reads_ledger_for_iteration_max: _last_attempt_id_by_beam
    holds the kept node, not the trailing violation.
  - test_resume_reconstructs_beam_chain: per-beam tracker holds last
    kept per beam.

Test results: 83 pass (orchestrator_ledger + bfts_integration +
strategy + orchestrator), 0 regressions.

Per reviewer F3 expectations: BFTS now sees a tree where parent_id
== "this attempt actually ran from this code state", which is the
correct ancestry signal for branch-from decisions.

PR 8b will fix the EvalResult sealing source (run_result.stdout vs
parse_metric stdout); PR 8c will fix the BFTS gate ordering relative
to legacy max_retries / convergence stops.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reviewer F1 fix: PR 6's EvalResult sealed run_result.stdout (the run
command's output) and run_result.stderr_tail (the run command's
stderr tail). But the metric value comes from a SEPARATE subprocess
running the eval command, whose stdout/stderr were not exposed to
the orchestrator. So the integrity hash didn't actually prove the
bytes that produced metric_value.

Now: the runner returns a MetricParseResult containing the eval
command's stdout, stderr_tail, exit_code, timed_out alongside the
parsed metric value. Orchestrator hashes those bytes for EvalResult.
The integrity hash is honest: it covers what the eval actually saw.

What changed:

src/crucible/runner.py
  - New MetricParseResult dataclass: metric_value, stdout,
    stderr_tail, exit_code, timed_out.
  - ExperimentRunner.parse_metric_result(eval_cmd, metric_name): runs
    the eval subprocess, captures stdout AND stderr (tail), parses
    the metric, returns MetricParseResult.
  - ExperimentRunner.parse_metric: backward-compat shim that calls
    parse_metric_result and returns just the float.

src/crucible/sandbox.py
  - SandboxRunner.parse_metric_result: same shape; in Docker mode,
    runs the eval command via _docker_run (closing the v3.2 break)
    and returns MetricParseResult with the container's captured
    stdout. Native mode delegates to ExperimentRunner.parse_metric_result.
  - SandboxRunner.parse_metric: backward-compat shim.

src/crucible/orchestrator.py
  - run_one_iteration: prefers parse_metric_result when the runner
    has it (real ExperimentRunner / SandboxRunner). Falls back to
    parse_metric for unknown / mocked runners that haven't been
    upgraded — preserves backward compat.
  - _write_eval_result_artifact signature changed: takes seal_stdout,
    seal_stderr_tail, seal_exit_code, seal_timed_out (eval bytes)
    instead of run_result. Hashes seal_stdout for stdout_sha256 and
    seal_stderr_tail for stderr_sha256. The reviewer noted "stderr
    is just the tail" — docstring now flags this honestly; M2 may
    capture full stderr if needed.

Tests:
  - tests/test_orchestrator.py: bulk-updated 14 mock sites:
    `patch.object(orch.runner, "parse_metric")` → `parse_metric_result`,
    return values wrapped in MetricParseResult(metric_value=N, stdout="",
    stderr_tail=""). side_effect lists rewritten to MetricParseResult
    elements (covers all max-iterations + convergence tests).
  - tests/test_sandbox.py: 2 native-fallthrough tests now mock
    parse_metric_result.
  - tests/test_orchestrator_ledger.py: 3 _write_eval_result_artifact
    tests updated for new signature, plus new
    test_write_eval_result_artifact_seals_eval_stdout_not_run_stdout
    (regression for reviewer F1 explicitly).

Test results: 179 passed, 3 skipped (tier-3 placeholders + Win).
0 regressions.

Per spec §11 / reviewer F1:
- The eval cmd's stdout is the source of metric_value
- That same stdout is hashed into stdout_sha256
- The seal proves the bytes that produced the metric
- v3.2 trust-boundary claim is now genuinely closed

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… (F2)

Reviewer F2 fix: previously strategy.decide() was called AFTER the legacy
consecutive_failures / consecutive_skips / convergence stop checks. With
default max_retries=3, three failed BFTS expansions in a row would halt
the run before the strategy got a chance to BranchFrom an alternative
kept node. That made BFTS-lite fragile for the demo gate.

Now: SearchStrategy is consulted FIRST. If it returns BranchFrom, the
orchestrator applies the git reset, updates the parent chain, resets
the failure counters, and `continue`s to the next iteration — bypassing
the legacy stops entirely. Greedy / Restart preserve old behavior because
their decide() never returns BranchFrom; the legacy stops still fire as
a safety net.

What changed:

src/crucible/orchestrator.py
  - _run_loop_serial:
    1. After dual_log + consecutive_skips bookkeeping, build
       StrategyContext + call self.strategy.decide() FIRST.
    2. If action is BranchFrom: do the git reset + context update +
       counter reset + `continue` (skip legacy stops).
    3. Otherwise, run the legacy max_retries / consecutive_skips /
       convergence stops as before.
    4. After the legacy block, handle the remaining strategy actions
       (Stop / Restart / Continue) using the previously-built context.
  - Removed the duplicate BranchFrom branch from the lower action-
    handling block; it's now exclusively the upper pre-emption path.

tests/test_bfts_integration.py
  - New test_branch_from_preempts_max_retries: stub strategy that
    returns Continue then BranchFrom on subsequent decides; agent
    produces regressions on iters 2-4 then improvements on iter 5+.
    With max_retries=3, pre-PR8c would halt at 4 nodes; post-PR8c
    we see ≥5 nodes because BranchFrom diverts before the stop.
  - The stub strategy isolates the gate-ordering behavior from
    BFTSLite's own metric-best heuristic (which would Continue when
    most-recent is already a child of the only kept node).

Test results: 2,296 passed, 4 skipped (Win + tier-3 placeholders).
0 regressions across the broader suite.

Per spec §M1b deliverable + reviewer F2:
- BFTS now actually has a runway to recover from failure streaks
- Greedy/Restart unchanged: their max_retries safety net still works
- Demo gate (M1b PR 9) can now feasibly produce a BFTS-vs-greedy
  comparison without arbitrary 3-failure halts

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sanity-check write-up for M1b: 4 real-agent runs on optimize-compress
and optimize-2048 (greedy + bfts-lite each), 3 iter / 12 total, ~$0.77.

Validates:
- SearchStrategy seam live (BFTSLiteStrategy.decide() observable in CLI)
- parent_id = code ancestry (compress-bfts iter-3 chains back to n000001
  after iter-2 discard, NOT to n000002)
- Sealed EvalResult artefacts per iteration with content-sha256
- HTML postmortem renders tree-view for BFTS

Does NOT claim "BFTS beats greedy" — 3 iter is too short to exercise
branching meaningfully. M2 with 30+ iter + doom-loop pruning is what
would produce metric-superiority data.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
suzuke added a commit that referenced this pull request Apr 26, 2026
Address 3 of 4 round-2 reviewer follow-ups (#1, #2, #3). #4 (CLI auth
base class) deferred to PR 16c when there's a second adapter to
motivate the abstraction.

#1 stderr auth-phrase scan (codex_cli.py parse_output):
  If codex bails before --json takes effect (auth check fails before
  JSONL stream opens), the auth phrase is in stderr only. parse_output
  now scans `raw.stderr_tail` as a fallback after stdout — without
  this, the trial would land in error_type=UNKNOWN instead of AUTH.
  Tests: test_codex_auth_error_detected_in_stderr_when_no_jsonl,
  test_codex_stderr_unrelated_text_does_NOT_trigger_auth.

#2 reproducibility flags (build_argv):
  Added --ephemeral, --ignore-user-config, --ignore-rules. Without
  --ignore-user-config, a user's ~/.codex/config.toml could silently
  override `model` and break the cli_version-as-reproducibility-
  snapshot guarantee (cli_version snapshots the binary version, not
  the model). Without --ignore-rules, user execpolicy .rules files
  could re-enable codex tool calls outside the workspace-write
  sandbox (§INV-3). Without --ephemeral, codex persists session files
  across runs.
  Test: test_codex_argv_includes_reproducibility_flags.

#3 _FORBIDDEN_FLAGS calibration:
  Verified actual codex 0.124.0 flag names via `codex exec --help`.
  Removed two non-existent flags (`--bypass-approvals` was a
  fabrication; `--dangerously-skip-permissions` is a Claude CLI flag,
  not codex). Added the real `--dangerously-bypass-approvals-and-sandbox`
  (skips both approvals AND sandbox — explicit §INV-3 violation if
  ever passed). Updated meta-test to match.

Stats: +111/-11 across 2 files (`git diff --stat HEAD~1 HEAD | tail -1`).
Test suite: 60 passed (codex 23 + cli_subscription 37) in 0.15s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
suzuke added a commit that referenced this pull request Apr 26, 2026
Address 2 of 3 round-1 reviewer follow-ups (#1, #2). #3 (token cost
reporting / API-vs-OAuth framing) is forward-pointing for a future
docs/feature PR.

#1 docstring/code mismatch (gemini_cli.py module docstring):
  Module docstring claimed `--include-directories` was passed to
  scope the blast radius. build_argv never emits it. Reality: the
  scope-limiting is the subprocess `cwd` (set to scratch by the
  backend's run_subprocess in base.py) — gemini operates only on
  cwd by default, no flag needed. Updated docstring to reflect actual
  mechanism.

#2 verified auth-failure phrase:
  Captured real gemini 0.39.1 auth-failure stderr via HOME=tmp +
  empty GEMINI_API_KEY (saved as
  `tests/fixtures/gemini_auth_failure.stderr.txt`). The actual phrase
  emitted is "Please set an Auth method in your <path>/.gemini/
  settings.json or specify one of the following environment
  variables before running: GEMINI_API_KEY, ...". Added "Please set
  an Auth method" as the first entry in _AUTH_FAILURE_PHRASES with
  inline comment marking it as VERIFIED (vs the rest which remain
  EXTRAPOLATED for forward-compat).

  New regression test test_gemini_auth_error_real_fixture_from_spike
  exercises the real fixture against the parser → raises
  GeminiCLIAuthError. Notable: gemini 0.39.1 exits 0 on auth failure
  (not 1) — the typed-exception pattern is what catches it, not
  exit-code heuristics. Stderr-fallback path (PR 16a R2 #1) is the
  one that fires here.

Stats: +53/-15 across 3 files (gemini_cli.py +24/-15, new fixture +1,
test_gemini_cli_adapter.py +28).
Test results: 24 passed in test_gemini_cli_adapter.py (23 prior + 1
new fixture regression).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant