feat(m1b): SearchStrategy Protocol + BFTS-lite + Evaluator trust boundary#3
Open
suzuke wants to merge 13 commits into
Open
feat(m1b): SearchStrategy Protocol + BFTS-lite + Evaluator trust boundary#3suzuke wants to merge 13 commits into
suzuke wants to merge 13 commits into
Conversation
…e impls
First M1b deliverable. Pure additive — defines the interface BFTS will
implement; existing greedy/restart/beam logic in orchestrator.py is NOT
yet rewired to call this Protocol. Wiring lands in M1b PR 2.
What's added:
- src/crucible/strategy.py (~210 LOC):
- StrategyAction sum type (frozen dataclasses):
* Continue — extend current branch
* BranchFrom(id) — start a new branch from kept node
* Restart — back to baseline
* Stop — end the run cleanly
- StrategyContext: pure snapshot passed to decide() — ledger nodes,
metric_lookup, direction, iteration_count, consecutive_failures,
plateau_threshold, max_iterations, baseline_commit. No live objects;
strategies remain unit-testable in isolation.
- SearchStrategy Protocol (runtime_checkable): name + decide() +
should_prune(). Pure functions, no I/O, no side effects.
- GreedyStrategy: stops at plateau / max_iterations.
- RestartStrategy: returns Restart on plateau (else stops at max_iters).
- BFTSLiteStrategy: best-first tree search.
* Picks max (or min for minimize) kept node by metric_lookup.
* If most-recent ledger node is already extending best, returns
Continue; else BranchFrom(best_id).
* should_prune is a no-op stub; M2 plugs in doom-loop pruning.
- make_strategy(name) factory.
- tests/test_strategy.py (22 tests, ~200 LOC):
- All 3 reference impls satisfy SearchStrategy Protocol (isinstance check)
- make_strategy dispatches and raises on unknown name
- GreedyStrategy: continue on no failures, stop at plateau / max_iters
- RestartStrategy: returns Restart on plateau, Stop on max_iters
- BFTSLiteStrategy:
* Continue when no kept nodes (cold start)
* Continue when single kept node and most-recent IS that node
* BranchFrom best when most-recent is on wrong branch
* Minimize objective picks smallest metric
* Continue when already extending the best branch
* Stop at max_iterations
* should_prune always False (M1b PR 1 stub)
- StrategyAction frozen + equality contract
Test results: 22/22 strategy tests pass; 127 passed total when run
alongside M1a tests (ledger + orchestrator-ledger + reporter + demo gate).
0 regressions.
Per spec §M1b deliverable 1: SearchStrategy interface extracted ✅
Pending: PR 2 will rewire orchestrator.run_loop to call SearchStrategy
instead of branching on config.search.strategy strings. Existing greedy/
restart/beam behavior preserved through the new GreedyStrategy /
RestartStrategy reference impls.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Second M1b deliverable. Existing greedy / restart behavior in
_run_loop_serial now flows through the new SearchStrategy Protocol;
"beam" still uses the legacy _run_loop_beam path (M1b PR 3 will lift it).
What changed:
src/crucible/strategy.py
- Renamed StrategyContext.consecutive_failures → plateau_streak. The
field is "iterations since last improvement," same as
Orchestrator._count_plateau_streak. The original name was misleading
because consecutive_failures counts crashes/discards, not no-progress
iterations. Rename clarifies the contract before any production code
relies on the field.
src/crucible/orchestrator.py
- Import { BranchFrom, Continue as StrategyContinue, Restart as
StrategyRestart, SearchStrategy, Stop as StrategyStop, StrategyContext,
make_strategy }.
- __init__ instantiates self.strategy = make_strategy(config.search.strategy)
for non-beam runs. Falls back to None on unknown name (legacy path).
Beam still uses legacy.
- _run_loop_serial: replaces the old `if strategy == "restart" and ...
plateau_streak >= plateau_threshold` block with strategy.decide(ctx).
Action handling:
Stop → log reason, break loop
Restart → existing reset_to_commit + context.add_error
BranchFrom → log "not yet wired" (M1b PR 3 ships expansion)
Continue → fall through to next iteration
- Legacy fallback path retained for the unknown-strategy case so a typo
in config.yaml doesn't crash the run; behavior is identical to pre-PR.
tests/test_strategy.py
- All `failures=` kwargs renamed to `streak=` to match the new
StrategyContext field name.
Test results: 145 passed, 4 skipped (parameter-golf agent-creates +
tier-3 placeholders + Windows). Includes test_orchestrator suite which
exercises the rewired loop with mocked agents — no behavioral
regressions in greedy/restart paths.
Per spec §M1b deliverable 2: orchestrator wires through SearchStrategy ✅
Pending PR 3: implement BranchFrom git checkout + workspace reset for
true BFTS expansion. Beam loop unified into Protocol path (currently
still uses legacy _run_loop_beam).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Third M1b deliverable. BFTSLiteStrategy can now actually move the
workspace to a non-most-recent kept node via git reset_to_commit; the
next iteration runs from that commit and chains parent_id correctly.
What's added:
- src/crucible/orchestrator.py:
- _lookup_commit_for_node(node_id): reads the ledger and returns the
git sha for the AttemptNode with that id (or None for violation/
skip / unknown).
- _build_strategy_context(session_count, plateau_threshold,
max_iterations): builds the StrategyContext with ledger_nodes +
metric_lookup so BFTSLiteStrategy gets full attempt-tree visibility.
metric_lookup is built from ResultsLog records keyed by AttemptNode
id (handles both linear "n000042" and beam "b1n000042" schemes).
- _run_loop_serial: BranchFrom action now actually:
1. resolves parent_id → commit sha via _lookup_commit_for_node
2. self.git.reset_to_commit(target_commit)
3. logs and adds context note for the agent
4. updates _last_attempt_id_by_beam[None] so the next AttemptNode
chains parent_id back to the requested ancestor
5. resets consecutive_failures + consecutive_skips counters
- If commit resolution fails (e.g., parent_id refers to a violation
node with no commit), logs a warning and falls through to Continue
semantics — never crash the loop.
- tests/test_orchestrator_ledger.py (4 new tests):
- _lookup_commit_for_node returns the right sha
- Violation node returns None (no commit)
- Unknown id returns None
- _build_strategy_context populates metric_lookup correctly with
ExperimentRecord → AttemptNode id mapping
Test results: 22 in test_orchestrator_ledger pass, 158 pass when run
alongside test_strategy + test_ledger + test_orchestrator + test_demo_gate
+ test_reporter_html. 0 regressions.
Per spec §M1b deliverable 3: BFTS-lite is now executable end-to-end —
strategy.decide() returns BranchFrom → orchestrator resets workspace →
next iteration runs from the chosen ancestor. Demo gate (M1b PR 4) will
prove BFTS produces a more efficient search than greedy on
optimize-compress / optimize-2048.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dator
Fourth M1b deliverable. End-to-end exercise of the BFTS-lite strategy
through the real Orchestrator loop with a deterministic FakeAgent.
What's added:
- src/crucible/config.py:
- SearchConfig.strategy comment now includes "bfts-lite"
- load_config validator accepts "bfts-lite" alongside greedy/restart/beam
- Updated error message lists all four valid strategies
- tests/test_bfts_integration.py (5 tests, ~190 LOC):
- test_orchestrator_instantiates_bfts_strategy: confirms
config.search.strategy="bfts-lite" produces a BFTSLiteStrategy
instance on the Orchestrator
- test_greedy_produces_linear_chain: greedy + monotonic-improvement
agent → 3 keep nodes in linear chain (n1 → n2 → n3)
- test_bfts_produces_linear_chain_when_monotonic: same agent under
BFTS → still linear, because most-recent IS always the best, so
BFTS returns Continue (no Branch needed)
- test_bfts_branches_after_discard: ChattyAgent produces mixed
keep/discard, asserts both outcomes appear in ledger (BFTS exercises
BranchFrom/Continue mix)
- test_bfts_branchfrom_uses_legacy_path_for_violation_nodes:
_lookup_commit_for_node returns None for unknown ids (no crash)
Each test builds an isolated tmp git repo with synthetic solution.py +
evaluate.py + .crucible/config.yaml; deterministic agent edits the
multiplier in solution.py; evaluate.py computes sum(f(0..9)).
Workaround: evaluate.py purges __pycache__ before importing solution,
because tests change solution.py faster than mtime resolution allows
Python to detect staleness against cached .pyc files. Production
runs have full-second mtime gaps and don't need this; only synthetic
tests do.
Test results:
test_strategy: 22/22 ✅
test_ledger: 22/22 ✅ (1 Win skip)
test_orchestrator_ledger: 22/22 ✅
test_orchestrator: full pass ✅
test_demo_gate: 53/56 ✅ (3 expected skips)
test_reporter_html: 14/14 ✅
test_bfts_integration: 5/5 ✅ (NEW)
test_config: full pass ✅
Per spec §M1b deliverable 4: integration test demonstrates BFTS-lite
runs end-to-end through orchestrator. Demo gate (M1b PR 5) will be a
real-agent comparison on optimize-compress / optimize-2048 to claim
"BFTS beats greedy" with empirical numbers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fifth M1b deliverable, step a/c. Lifts the CheatResistancePolicy module
from POC worktree into the production codebase, plus the entire ACL
test corpus (~3,400 adversarial inputs) so future PRs that touch the
ACL logic have an immediate regression net.
What's added:
- src/crucible/security/__init__.py: package init exposing
CheatResistancePolicy, Classification, PolicyViolation
- src/crucible/security/cheat_resistance_policy.py (~165 LOC): SSOT
for path classification per spec §3 / §11. classify(path) returns
one of editable / readonly / hidden / unlisted with priority order:
1. Workspace boundary (paths outside → unlisted)
2. Inode collision (hardlink to hidden/readonly)
3. Direct path match (after symlink resolution)
4. Default → unlisted
assert_writable / assert_visible raise PolicyViolation.
- tests/security/__init__.py
- tests/security/test_acl_boundary.py: 36 hardcoded ACL attacks
(path traversal, symlink, hardlink, glob, case mutations)
- tests/security/test_acl_property.py: 9 Hypothesis properties
(~1,330 generated examples). Strongest invariant: no mutation can
upgrade hidden → editable.
- tests/security/test_acl_mutation.py: 1,957 combinatoric variants
(encoding × case × suffix × seed cartesian product)
- tests/security/test_executor_escape.py: 70 LocalPythonExecutor
escape attempts (smolagents 1.24.0 with additional_authorized_imports=[])
All blocked or neutered (5 disguised-payload escapes verified
inert via sentinel-file checks).
- pyproject.toml: dev dep additions
- hypothesis>=6.150 (property tests)
- smolagents>=1.24.0 (executor escape tests)
Test results: 2,072 / 2,072 security tests pass (1 Win-only skipped on
macOS), 0 regressions on the rest of the suite.
Per spec §M1b deliverable 2 — Evaluator trust boundary part 1:
CheatResistancePolicy now lives in main. PR 5b will refactor
guardrails.py to read from this single source of truth.
PR 5c will fix sandbox.py:59-64 (the eval-cmd-on-host break) so
evaluator runs in the same isolation as run_cmd.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fifth M1b deliverable, step b/c. GuardRails now delegates path
classification to CheatResistancePolicy when constructed with a
workspace, gaining symlink/hardlink/path-traversal defenses without
changing the external API.
What changed:
src/crucible/guardrails.py
- Constructor signature extended (backward-compatible):
GuardRails(editable, readonly) # legacy mode
GuardRails(editable, readonly, hidden, workspace=path) # SSOT mode
- In SSOT mode, builds an internal CheatResistancePolicy and
`check_edits()` runs every modified file through `policy.classify()`,
catching:
* path traversal (`../etc/passwd` → unlisted)
* absolute path escape outside workspace
* symlink redirection (solution.py → evaluate.py)
* hardlink aliasing (caught via inode collision)
* `.crucible/` always platform-protected (defense in depth)
New "hidden" violation kind for hidden files (previously caught only
via Claude SDK PreToolUse hooks; now also at the GuardRails layer).
- In legacy mode, behavior identical to pre-M1b: pure string membership
check. Existing callers without workspace= keep working.
- New `add_editable(rel_path)` method updates both the legacy set AND
the underlying CheatResistancePolicy in lock-step. Used by
orchestrator's `allow_install` path to grant requirements.txt edits
mid-run.
- New `policy` property exposes the CheatResistancePolicy when SSOT
mode is active, for callers needing direct classify access.
src/crucible/orchestrator.py
- GuardRails construction now passes `hidden=config.files.hidden` and
`workspace=self.workspace`, enabling SSOT mode for production runs.
- allow_install path uses `add_editable()` instead of mutating the
set directly, so the policy stays in sync.
Test results: 2,189 passed, 3 skipped (1 Win-only + 2 tier-3 placeholders).
0 regressions. Existing test_guardrails string-only tests unaffected;
test_orchestrator.test_allow_install_installs_on_requirements_change
exercises the add_editable sync path.
Per spec §M1b deliverable 2 — Evaluator trust boundary part 2:
GuardRails now SSOT-aware. PR 5c will fix sandbox.py:59-64 (eval cmd
running on host while run_cmd is in container) so evaluator inherits
the same isolation policy.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sixth M1b deliverable, step c/c. Closes the security finding the v3.2
spec review surfaced at sandbox.py:59-64: in Docker mode, run_cmd was
running inside the container, but parse_metric (eval_cmd) ran on the
host. That meant any metric file the agent could write on the host —
or any command injection through eval_command — could bypass the
isolation that protected run_cmd.
What changed:
src/crucible/runner.py
- RunResult dataclass gains a new optional field `stdout: str = ""`.
Existing callers unaffected (default empty); SandboxRunner uses it
to capture Docker container stdout for metric extraction.
- ExperimentRunner.execute populates stdout on RunResult (parity).
src/crucible/sandbox.py
- SandboxRunner.parse_metric now branches on backend:
backend == "docker": runs eval_command via _docker_run (same
isolation policy as run_cmd: --network none, mounts, shadow
envs, resource limits). Host process parses captured stdout
with the same regex as the native runner.
backend != "docker": delegates to native ExperimentRunner
(unchanged behavior).
Returns the float metric value or None on no-match / parse failure.
- _docker_run captures stdout in RunResult.
tests/test_sandbox.py
- Renamed test_parse_metric_always_native → test_parse_metric_native_when_backend_none
(the old name asserted the now-fixed v3.2 break).
- Added 4 new tests:
* test_parse_metric_docker_uses_isolation: docker backend routes
eval through _docker_run, NOT native; native.parse_metric NEVER
called when backend=docker.
* test_parse_metric_native_falls_through: backend=none keeps
native delegation.
* test_parse_metric_docker_returns_none_on_no_match: missing
metric in stdout → None, no crash.
* test_parse_metric_docker_handles_invalid_float: malformed
metric value → None, no exception.
- Updated existing RunResult test fixtures to use stdout= keyword.
Test results: 2,261 passed, 4 skipped (Win + tier-3 placeholders).
0 regressions.
Per spec §M1b deliverable 2 — Evaluator trust boundary part 3:
Trust break closed. Docker mode is now end-to-end isolated:
run_cmd in container → eval_cmd in container → host parses captured
stdout → host writes sealed EvalResult (M1b PR 6 will add the
EvalResult artefact write-out).
Spec §11 INV-2 / INV-3 alignment:
- Eval and run share IsolationPolicy ✅
- Host process owns metric extraction ✅
- Agent code cannot influence which file the platform parses ✅
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tNode
Sixth M1b deliverable. After each iteration's metric eval, orchestrator
now writes logs/run-<tag>/iter-N/eval-result.json — the platform-owned
sealed artefact promised by spec §4 / §11. AttemptNode.eval_result_ref
and eval_result_sha256 are populated from this write.
What's added:
src/crucible/orchestrator.py
- _write_eval_result_artifact(iteration, commit_hash, run_result,
metric_value, run_duration_seconds) -> tuple[rel_path, sha256]:
* Builds an EvalResult dataclass populated from the iteration's
run/parse_metric outcome.
* eval_manifest_hash = sha256(eval_command|run_command|metric_name).
M2 will extend to entry_file_hash + config_hash.
* stdout_sha256 + stderr_sha256 over the captured streams.
* seal field set to "content-sha256:<hex>" computed over canonical
JSON (sorted keys, no whitespace) of the full payload. M2 will
upgrade to HMAC-SHA256 once the secret-key path lands.
* Writes the JSON to logs/run-<tag>/iter-N/eval-result.json.
* Returns (relative_path_from_workspace, sha256_of_disk_bytes) so
callers thread the integrity hash onto AttemptNode.
* On any I/O exception, returns (None, None) and logs WARNING —
never breaks the iteration loop.
- run_one_iteration: calls _write_eval_result_artifact after the
eval block, before _dual_log. Populates self._pending_eval_result_ref
and self._pending_eval_result_sha256 (state lasts only across the
next dual_log call).
- _record_to_attempt_node: reads _pending_eval_result_* and threads
them onto AttemptNode.eval_result_ref / eval_result_sha256.
tests/test_orchestrator_ledger.py (3 new tests):
- test_write_eval_result_artifact_creates_json_with_seal: verifies
file lands at spec path, payload has metric/exit_code/seal in
"content-sha256:" prefix, returned sha matches disk
- test_write_eval_result_artifact_handles_missing_metric:
metric_value=None → valid=False, seal still computed
- test_write_eval_result_artifact_handles_io_failure: bad workspace
path → returns (None, None) without raising
Smoke verification (zero token):
The two-iteration smoke now produces:
logs/run-smoke/iter-1/eval-result.json
logs/run-smoke/iter-2/eval-result.json
And the corresponding AttemptNodes carry eval_result_ref +
eval_result_sha256 populated from the actual on-disk hashes.
Test results: 25 in test_orchestrator_ledger pass; full suite still
green for orchestrator + bfts_integration. 0 regressions.
Per spec §M1b deliverable 6 + §11 reviewer notes:
- Eval result is platform-owned ✅
- Integrity hash present (M1: content sha; M2: HMAC) ✅
- Reader must always recompute on read (caller responsibility,
documented in docs/LEDGER.md)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Seventh M1b deliverable. The reporter now renders the ledger as an
indented tree, making BFTS branch expansions visually obvious. Linear
chains (greedy/restart) still render the same as M1a — the new code
collapses to depth=0 for them.
What changed:
src/crucible/reporter/html_tree.py
- New _render_tree(nodes, best_id, metric_lookup): builds the
parent → children map and walks DFS, emitting cards with
progressive depth.
- _render_card(node, ..., depth=0, orphan=False): added depth
parameter (margin-left: depth*32px) and an "orphan" flag for
nodes whose parent_id is not in the ledger.
- Roots (parent_id=None) render at depth=0; descendants get
"↳" branch marker + indentation.
- Defensive: cycles short-circuited by visited set; orphan nodes
rendered at depth=0 below the tree with an orange "orphan" badge.
- CSS additions: .branch-marker (subtle gray ↳),
.card-orphan (yellow tint), .orphan-badge.
tests/test_reporter_html.py (4 new tests):
- test_tree_view_indents_descendants: 3-level chain has 32/64px
margins and ≥2 branch markers
- test_tree_view_handles_branching: 2 children of same parent
each at depth=1 with branch markers
- test_tree_view_orphan_node: parent_id pointing to absent node
renders with "orphan" badge + visible label
- test_tree_view_linear_chain_unchanged: BFTS DFS produces same
document order as M1a linear timeline for non-branching
ledgers (regression assurance)
Test results: 17 reporter tests pass (13 existing + 4 new).
Per spec §M1b deliverable 3:
HTML report now visualises BFTS expansion structure. The "BFTS picked
n000001 over n000002" decision is no longer hidden in a flat list —
viewers see the actual tree shape with branches and orphans.
M3 still owns:
- d3.js interactive expand/collapse
- hover-to-show full diff
- multi-run comparison overlays
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reviewer feedback (M1b implementation review): the existing parent_id
semantics weakened BFTS decisions and the HTML tree. After a discard /
crash / violation / skip, git resets to the previous kept commit, so
the next attempt actually runs FROM that kept state — but the ledger
parent_id still pointed at the rejected node.
Now: only "keep" outcomes advance the code-parent pointer. Discard /
crash / violation / skip nodes still appear in the ledger (with their
contemporary parent_id pointing to the last kept ancestor), but they
do NOT become parents for subsequent attempts. The result is a tree
where the trunk is kept commits and rejected siblings branch off the
last good ancestor.
What changed:
src/crucible/orchestrator.py
- _record_to_attempt_node: only updates _last_attempt_id_by_beam[beam]
when record.status == "keep". Discard / crash etc. still get a
ledger entry with the current parent_id, but the tracker stays put.
- _ledger_log_no_commit: violation / skip never advance the tracker
either (they have no commit by definition).
- resume(): rebuild parent chain only from KEPT nodes — preserves
code-ancestry semantics across restarts.
Test updates (3 fixed, 1 renamed for clarity):
- test_dual_log_records_match_in_full_run: assertions now reflect
code-ancestry chain (crash → keep → discard → keep → violation
produces parent_ids [None, None, n2, n2, n4]).
- test_no_commit_parent_does_not_advance (renamed from
test_no_commit_parent_chain_extends): keep → violation → keep
produces parent_ids [None, n1, n1] — third attempt extends from
the FIRST kept, not the violation.
- test_resume_reads_ledger_for_iteration_max: _last_attempt_id_by_beam
holds the kept node, not the trailing violation.
- test_resume_reconstructs_beam_chain: per-beam tracker holds last
kept per beam.
Test results: 83 pass (orchestrator_ledger + bfts_integration +
strategy + orchestrator), 0 regressions.
Per reviewer F3 expectations: BFTS now sees a tree where parent_id
== "this attempt actually ran from this code state", which is the
correct ancestry signal for branch-from decisions.
PR 8b will fix the EvalResult sealing source (run_result.stdout vs
parse_metric stdout); PR 8c will fix the BFTS gate ordering relative
to legacy max_retries / convergence stops.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reviewer F1 fix: PR 6's EvalResult sealed run_result.stdout (the run
command's output) and run_result.stderr_tail (the run command's
stderr tail). But the metric value comes from a SEPARATE subprocess
running the eval command, whose stdout/stderr were not exposed to
the orchestrator. So the integrity hash didn't actually prove the
bytes that produced metric_value.
Now: the runner returns a MetricParseResult containing the eval
command's stdout, stderr_tail, exit_code, timed_out alongside the
parsed metric value. Orchestrator hashes those bytes for EvalResult.
The integrity hash is honest: it covers what the eval actually saw.
What changed:
src/crucible/runner.py
- New MetricParseResult dataclass: metric_value, stdout,
stderr_tail, exit_code, timed_out.
- ExperimentRunner.parse_metric_result(eval_cmd, metric_name): runs
the eval subprocess, captures stdout AND stderr (tail), parses
the metric, returns MetricParseResult.
- ExperimentRunner.parse_metric: backward-compat shim that calls
parse_metric_result and returns just the float.
src/crucible/sandbox.py
- SandboxRunner.parse_metric_result: same shape; in Docker mode,
runs the eval command via _docker_run (closing the v3.2 break)
and returns MetricParseResult with the container's captured
stdout. Native mode delegates to ExperimentRunner.parse_metric_result.
- SandboxRunner.parse_metric: backward-compat shim.
src/crucible/orchestrator.py
- run_one_iteration: prefers parse_metric_result when the runner
has it (real ExperimentRunner / SandboxRunner). Falls back to
parse_metric for unknown / mocked runners that haven't been
upgraded — preserves backward compat.
- _write_eval_result_artifact signature changed: takes seal_stdout,
seal_stderr_tail, seal_exit_code, seal_timed_out (eval bytes)
instead of run_result. Hashes seal_stdout for stdout_sha256 and
seal_stderr_tail for stderr_sha256. The reviewer noted "stderr
is just the tail" — docstring now flags this honestly; M2 may
capture full stderr if needed.
Tests:
- tests/test_orchestrator.py: bulk-updated 14 mock sites:
`patch.object(orch.runner, "parse_metric")` → `parse_metric_result`,
return values wrapped in MetricParseResult(metric_value=N, stdout="",
stderr_tail=""). side_effect lists rewritten to MetricParseResult
elements (covers all max-iterations + convergence tests).
- tests/test_sandbox.py: 2 native-fallthrough tests now mock
parse_metric_result.
- tests/test_orchestrator_ledger.py: 3 _write_eval_result_artifact
tests updated for new signature, plus new
test_write_eval_result_artifact_seals_eval_stdout_not_run_stdout
(regression for reviewer F1 explicitly).
Test results: 179 passed, 3 skipped (tier-3 placeholders + Win).
0 regressions.
Per spec §11 / reviewer F1:
- The eval cmd's stdout is the source of metric_value
- That same stdout is hashed into stdout_sha256
- The seal proves the bytes that produced the metric
- v3.2 trust-boundary claim is now genuinely closed
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… (F2)
Reviewer F2 fix: previously strategy.decide() was called AFTER the legacy
consecutive_failures / consecutive_skips / convergence stop checks. With
default max_retries=3, three failed BFTS expansions in a row would halt
the run before the strategy got a chance to BranchFrom an alternative
kept node. That made BFTS-lite fragile for the demo gate.
Now: SearchStrategy is consulted FIRST. If it returns BranchFrom, the
orchestrator applies the git reset, updates the parent chain, resets
the failure counters, and `continue`s to the next iteration — bypassing
the legacy stops entirely. Greedy / Restart preserve old behavior because
their decide() never returns BranchFrom; the legacy stops still fire as
a safety net.
What changed:
src/crucible/orchestrator.py
- _run_loop_serial:
1. After dual_log + consecutive_skips bookkeeping, build
StrategyContext + call self.strategy.decide() FIRST.
2. If action is BranchFrom: do the git reset + context update +
counter reset + `continue` (skip legacy stops).
3. Otherwise, run the legacy max_retries / consecutive_skips /
convergence stops as before.
4. After the legacy block, handle the remaining strategy actions
(Stop / Restart / Continue) using the previously-built context.
- Removed the duplicate BranchFrom branch from the lower action-
handling block; it's now exclusively the upper pre-emption path.
tests/test_bfts_integration.py
- New test_branch_from_preempts_max_retries: stub strategy that
returns Continue then BranchFrom on subsequent decides; agent
produces regressions on iters 2-4 then improvements on iter 5+.
With max_retries=3, pre-PR8c would halt at 4 nodes; post-PR8c
we see ≥5 nodes because BranchFrom diverts before the stop.
- The stub strategy isolates the gate-ordering behavior from
BFTSLite's own metric-best heuristic (which would Continue when
most-recent is already a child of the only kept node).
Test results: 2,296 passed, 4 skipped (Win + tier-3 placeholders).
0 regressions across the broader suite.
Per spec §M1b deliverable + reviewer F2:
- BFTS now actually has a runway to recover from failure streaks
- Greedy/Restart unchanged: their max_retries safety net still works
- Demo gate (M1b PR 9) can now feasibly produce a BFTS-vs-greedy
comparison without arbitrary 3-failure halts
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sanity-check write-up for M1b: 4 real-agent runs on optimize-compress and optimize-2048 (greedy + bfts-lite each), 3 iter / 12 total, ~$0.77. Validates: - SearchStrategy seam live (BFTSLiteStrategy.decide() observable in CLI) - parent_id = code ancestry (compress-bfts iter-3 chains back to n000001 after iter-2 discard, NOT to n000002) - Sealed EvalResult artefacts per iteration with content-sha256 - HTML postmortem renders tree-view for BFTS Does NOT claim "BFTS beats greedy" — 3 iter is too short to exercise branching meaningfully. M2 with 30+ iter + doom-loop pruning is what would produce metric-superiority data. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced Apr 25, 2026
suzuke
added a commit
that referenced
this pull request
Apr 26, 2026
Address 3 of 4 round-2 reviewer follow-ups (#1, #2, #3). #4 (CLI auth base class) deferred to PR 16c when there's a second adapter to motivate the abstraction. #1 stderr auth-phrase scan (codex_cli.py parse_output): If codex bails before --json takes effect (auth check fails before JSONL stream opens), the auth phrase is in stderr only. parse_output now scans `raw.stderr_tail` as a fallback after stdout — without this, the trial would land in error_type=UNKNOWN instead of AUTH. Tests: test_codex_auth_error_detected_in_stderr_when_no_jsonl, test_codex_stderr_unrelated_text_does_NOT_trigger_auth. #2 reproducibility flags (build_argv): Added --ephemeral, --ignore-user-config, --ignore-rules. Without --ignore-user-config, a user's ~/.codex/config.toml could silently override `model` and break the cli_version-as-reproducibility- snapshot guarantee (cli_version snapshots the binary version, not the model). Without --ignore-rules, user execpolicy .rules files could re-enable codex tool calls outside the workspace-write sandbox (§INV-3). Without --ephemeral, codex persists session files across runs. Test: test_codex_argv_includes_reproducibility_flags. #3 _FORBIDDEN_FLAGS calibration: Verified actual codex 0.124.0 flag names via `codex exec --help`. Removed two non-existent flags (`--bypass-approvals` was a fabrication; `--dangerously-skip-permissions` is a Claude CLI flag, not codex). Added the real `--dangerously-bypass-approvals-and-sandbox` (skips both approvals AND sandbox — explicit §INV-3 violation if ever passed). Updated meta-test to match. Stats: +111/-11 across 2 files (`git diff --stat HEAD~1 HEAD | tail -1`). Test suite: 60 passed (codex 23 + cli_subscription 37) in 0.15s. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
suzuke
added a commit
that referenced
this pull request
Apr 26, 2026
Address 2 of 3 round-1 reviewer follow-ups (#1, #2). #3 (token cost reporting / API-vs-OAuth framing) is forward-pointing for a future docs/feature PR. #1 docstring/code mismatch (gemini_cli.py module docstring): Module docstring claimed `--include-directories` was passed to scope the blast radius. build_argv never emits it. Reality: the scope-limiting is the subprocess `cwd` (set to scratch by the backend's run_subprocess in base.py) — gemini operates only on cwd by default, no flag needed. Updated docstring to reflect actual mechanism. #2 verified auth-failure phrase: Captured real gemini 0.39.1 auth-failure stderr via HOME=tmp + empty GEMINI_API_KEY (saved as `tests/fixtures/gemini_auth_failure.stderr.txt`). The actual phrase emitted is "Please set an Auth method in your <path>/.gemini/ settings.json or specify one of the following environment variables before running: GEMINI_API_KEY, ...". Added "Please set an Auth method" as the first entry in _AUTH_FAILURE_PHRASES with inline comment marking it as VERIFIED (vs the rest which remain EXTRAPOLATED for forward-compat). New regression test test_gemini_auth_error_real_fixture_from_spike exercises the real fixture against the parser → raises GeminiCLIAuthError. Notable: gemini 0.39.1 exits 0 on auth failure (not 1) — the typed-exception pattern is what catches it, not exit-code heuristics. Stderr-fallback path (PR 16a R2 #1) is the one that fires here. Stats: +53/-15 across 3 files (gemini_cli.py +24/-15, new fixture +1, test_gemini_cli_adapter.py +28). Test results: 24 passed in test_gemini_cli_adapter.py (23 prior + 1 new fixture regression). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Stacked on #2 (M1a TrialLedger). Adds searchable experiment tree primitives — the differentiating feature of v1.0 vs the existing greedy/restart/beam loop.
What's new
SearchStrategy Protocol
crucible/strategy.py—SearchStrategyProtocol with 4 actions:Continue/BranchFrom(parent_id)/Restart/Stop.GreedyStrategy,RestartStrategy,BFTSLiteStrategy(best-first tree search over kept ledger nodes).Orchestrator wiring
_run_loop_serialcallsstrategy.decide()BEFORE legacymax_retries/ convergence stops, so BFTS can pre-empt failure-streak halts viaBranchFrom.BranchFrom(parent_id)resolves the AttemptNode's commit via the ledger andgit.reset_to_commit()s the workspace — the next iteration runs from that ancestor.parent_idsemantics fixed: onlykeepoutcomes advance the code-parent pointer (discard/crash/violation/skip nodes still appear in the ledger but don't claim ancestry).Trust boundary fixes
security/module (and ~3,400-input ACL test corpus) lifted into main.GuardRails(workspace=...)opts into SSOT mode with symlink/hardlink/path-traversal defenses.sandbox.py:59-64bug closed:parse_metricin Docker mode now runs the eval cmd inside the container, host parses captured stdout.logs/run-<tag>/iter-N/eval-result.jsonwithseal: "content-sha256:<hex>"over canonical JSON. AttemptNode carrieseval_result_ref+eval_result_sha256.MetricParseResultreturns the eval cmd's stdout/stderr_tail; orchestrator hashes those (not run cmd's), so the integrity hash actually proves the bytes that produced the metric value.HTML reporter tree-view
_render_tree(nodes)does DFS-by-parent walk with depth indentation. Linear chains (greedy/restart) collapse to depth=0; BFTS expansions produce the visible tree.Reviewer trail
Stats
Known limitations (M2 candidates)
evaluation.repeat > 1aggregate path still doesn't seal each eval stdout (returns(RunResult, metric)only). Acceptable for M1b; M2 may extend.stderr_sha256is tail-based (last 50 lines) not full stderr — code comments document this honestly._run_loop_beam(not yet routed through SearchStrategy).Test plan
docs/M1B-DEMO-GATE.mdReal-agent demo gate (sanity check)
4 runs across 2 examples × 2 strategies, 3 iter each. Total: 12 iter, ~$0.77, ~12 min wall.
parent=n000001⚡What this validates — end-to-end M1b wiring:
[strategy] max_iterations=3 reachedlog provesBFTSLiteStrategy.decide()is the deciding voice; sealedeval-result.jsonper iteration; HTML tree-view renders for both. Most importantly: in compress-bfts the iter-3parent_id=n000001(NOTn000002— the discarded sibling), confirming PR 8a's code-ancestry fix is observable in real LLM runs.What this does NOT validate — "BFTS beats greedy" superiority. 3 iter is too short to exercise meaningful branching; M2 doom-loop pruning + 30+ iter runs are what would produce statistical evidence.
🤖 Generated with Claude Code