feat(m3): ledger query helpers + reporter banners + strategy decision sidecar #11
Open
suzuke wants to merge 1 commit into
… sidecar (M3 PR 17)
Per spec §1: StateStore "extend" + SearchStrategy "polish" + Reporter
follow-up to PR 15. Three focused, additive improvements landing as a
single coherent PR.
**StateStore extend — Ledger query helpers** (`ledger.py`):
- `kept_path(node_id, *, include_self=True)` — walk parent chain,
filter to outcome=="keep". `include_self=False` drops the queried
node from the result. Reviewer round 1 Q2 pin: don't bake
heuristics; let callers control via the kwarg.
- `descendants_of(node_id)` — DFS-by-parent + sort children by id.
Iteration order matches `_render_tree` in `html_tree.py`. Includes
cycle defense via visited set.
- `find_by_outcome(outcome)` — exact string match on outcome.
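The three helpers above can be sketched roughly as follows. This is a minimal illustration, not the code in `ledger.py`: the `Node` dataclass, the `nodes` dict argument, and the root-first ordering of `kept_path` are assumptions made so the example is self-contained, and the sketch also assumes the `outcome == "keep"` filter applies to the queried node itself.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for AttemptNode; only the fields the helpers need."""
    id: str
    parent_id: Optional[str]
    outcome: str


def kept_path(nodes: Dict[str, Node], node_id: str, *, include_self: bool = True) -> List[Node]:
    """Walk the parent chain from node_id upward, keeping outcome == "keep"."""
    path: List[Node] = []
    seen = set()
    current = nodes.get(node_id)
    while current is not None and current.id not in seen:
        seen.add(current.id)
        is_self = current.id == node_id
        if current.outcome == "keep" and (include_self or not is_self):
            path.append(current)
        # An orphan (parent missing from the ledger) ends the walk here.
        current = nodes.get(current.parent_id) if current.parent_id else None
    path.reverse()  # root-first ordering is an assumption of this sketch
    return path


def descendants_of(nodes: Dict[str, Node], node_id: str) -> List[Node]:
    """Pre-order DFS over children, siblings sorted by id, guarded by a visited set."""
    children: Dict[Optional[str], List[Node]] = {}
    for node in nodes.values():
        children.setdefault(node.parent_id, []).append(node)

    def sorted_children(parent_id: str) -> List[Node]:
        # Reverse order so that popping from the stack yields ascending ids.
        return sorted(children.get(parent_id, []), key=lambda n: n.id, reverse=True)

    result: List[Node] = []
    visited = {node_id}
    stack = sorted_children(node_id)
    while stack:  # iterative, so a very long chain cannot hit the recursion limit
        node = stack.pop()
        if node.id in visited:
            continue
        visited.add(node.id)
        result.append(node)
        stack.extend(sorted_children(node.id))
    return result


def find_by_outcome(nodes: Dict[str, Node], outcome: str) -> List[Node]:
    """Exact string match on outcome; unknown outcomes simply return []."""
    return [node for node in nodes.values() if node.outcome == outcome]
```

Keeping `include_self` as a keyword-only argument mirrors the Q2 pin: the helper does not guess whether the caller wants the queried node included.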
**SearchStrategy polish — Decision sidecar log** (`strategy_decisions.py`):
- Reviewer round 1 Q4 pin: SIDECAR file (`logs/run-<tag>/strategy-decisions.jsonl`),
NOT a new ledger event type. Why not ledger:
(a) ledger is safety-critical (POSIX flock, single-writer, schema-
versioned, seal HMAC); a schema bump invalidates parallel append
safety per spec §4.2.
(b) mixing safety-critical data with debug data violates separation
of concerns.
Why not in-memory: lost on crash exactly when most needed.
- `StrategyDecision` dataclass: timestamp / iteration / kept_candidates
/ pruned_candidates / chosen_action / rationale / extras.
- `append()` writes one JSON line per decide() call; `load_all()` reads
for the postmortem reporter.
- Orchestrator hook: `_log_strategy_decision()` called immediately after
`strategy.decide(sctx)` in `_run_loop_serial`. Best-effort — any
failure is logged at DEBUG and never raises (so a logging hiccup
never blocks the run loop).
- CLI: `crucible postmortem --tag X --strategy-decisions` flag prints
the recorded sequence.
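A sidecar of this shape might look like the sketch below. The field names come from the list above; the field types, the free-function signatures of `append` and `load_all`, and the `log_decision_best_effort` wrapper are assumptions for illustration rather than the contents of `strategy_decisions.py`.

```python
import json
import logging
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Any, Dict, List

log = logging.getLogger(__name__)


@dataclass
class StrategyDecision:
    # Field names follow the PR text; the types are assumed.
    timestamp: float
    iteration: int
    kept_candidates: List[str]
    pruned_candidates: List[str]
    chosen_action: str
    rationale: str
    extras: Dict[str, Any] = field(default_factory=dict)


def append(path: Path, decision: StrategyDecision) -> None:
    """Write one JSON line per decision."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(decision)) + "\n")


def load_all(path: Path) -> List[StrategyDecision]:
    """Read every well-formed line; a missing file yields an empty list."""
    if not path.exists():
        return []
    decisions: List[StrategyDecision] = []
    for line in path.read_text(encoding="utf-8").splitlines():
        try:
            decisions.append(StrategyDecision(**json.loads(line)))
        except (ValueError, TypeError):
            continue  # skip malformed or incomplete lines instead of failing
    return decisions


def log_decision_best_effort(path: Path, decision: StrategyDecision) -> None:
    """Hypothetical wrapper: log failures at DEBUG and never raise."""
    try:
        append(path, decision)
    except Exception:
        log.debug("strategy decision logging failed", exc_info=True)
```

Catching every exception in the wrapper is deliberate here: the sidecar is debug data, so losing an entry is preferable to interrupting the run loop.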
**Reporter polish — Truth-in-labeling banners** (`reporter/_banners.py`):
- Reviewer round 1 Q3 pin: SINGLE source of truth for banner copy so
static + interactive renderers stay in sync. Spec §INV-1 wording rules
apply: say "no bypass observed in N adversarial trials", never "secure",
and only make that claim when N is actually known.
- `UNSANDBOXED_HEADING` / `UNSANDBOXED_BODY`: shown when AttemptNode
metadata has `isolation == "cli_subscription_unsandboxed"` (M2 PR 16
CLI-subscription runs).
- `STALE_COMPLIANCE_HEADING` / `STALE_COMPLIANCE_BODY`: shown when
`compliance_report_path is None` AND the run is unsandboxed (i.e.
`experimental.allow_stale_compliance` was used).
- Body copy uses "diagnostic only" / "not a containment claim" wording
per §INV-1 — never "secure" / "isolated" / "sandbox".
- Both `html_tree.py` (static) and `interactive.py` (d3) import the
same `render_banners_html` helper. Tests assert both paths render
the banner when isolation is set.
**Schema additions** (`ledger.py:AttemptNode`):
- `isolation: Optional[str] = None` — truth-in-labeling tag (parallel
to spec §11.2 Q5's `isolation=local_unsafe`).
- `compliance_report_path: Optional[str] = None` — audit trail to the
JSONL evidence file.
- Both default None for backward compat with M1a/M1b/M2 ledgers.
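To illustrate why `None` defaults keep older ledgers readable, here is a small sketch. The trimmed `AttemptNode` (only `id` and `outcome` besides the two new fields) and the `node_from_record` loader are hypothetical; the real dataclass has more fields and its own deserialization path.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional


@dataclass
class AttemptNode:
    # Trimmed, hypothetical version: only enough fields to show the defaults.
    id: str
    outcome: str
    isolation: Optional[str] = None
    compliance_report_path: Optional[str] = None


def node_from_record(record: Dict[str, Any]) -> AttemptNode:
    """Build a node from a ledger record, ignoring keys the dataclass does not know."""
    known = {f.name for f in fields(AttemptNode)}
    return AttemptNode(**{k: v for k, v in record.items() if k in known})


# A record written before this PR has neither new key and still loads.
old_node = node_from_record({"id": "n1", "outcome": "keep"})
new_node = node_from_record(
    {"id": "n2", "outcome": "keep", "isolation": "cli_subscription_unsandboxed"}
)
```

Here `old_node.isolation` and `old_node.compliance_report_path` both stay `None`, which is the state the banner predicates treat as a normal run.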
**Orchestrator wiring** (`orchestrator.py`):
- `_extract_backend_metadata(agent_result)` helper reads the dict from
`AgentResult.backend_metadata` (M2 PR 13/16 propagation channel).
- Both AttemptNode construction sites copy cli_binary_path, cli_version,
cli_argv, env_allowlist, isolation, compliance_report_path from
metadata onto the persisted node.
- `_log_strategy_decision(sctx, action)` records each decision; calls
`should_prune` reflectively to populate `pruned_candidates`.
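The wiring described above could be approximated as below. The six metadata keys are the ones named in this section; the function bodies, the `METADATA_KEYS` tuple, the public (non-underscore) names, and the assumption that `should_prune` takes a single candidate argument are illustrative guesses, not the code in `orchestrator.py`.

```python
from typing import Any, Dict, List

# Keys copied from backend metadata onto the persisted node, per the PR text.
METADATA_KEYS = (
    "cli_binary_path",
    "cli_version",
    "cli_argv",
    "env_allowlist",
    "isolation",
    "compliance_report_path",
)


def extract_backend_metadata(agent_result: Any) -> Dict[str, Any]:
    """Return the known metadata keys, using None for anything absent."""
    meta = getattr(agent_result, "backend_metadata", None)
    if not isinstance(meta, dict):
        return {key: None for key in METADATA_KEYS}
    return {key: meta.get(key) for key in METADATA_KEYS}


def pruned_candidates(strategy: Any, candidates: List[str]) -> List[str]:
    """Call strategy.should_prune reflectively; strategies without it prune nothing."""
    should_prune = getattr(strategy, "should_prune", None)
    if not callable(should_prune):
        return []
    pruned = []
    for candidate in candidates:
        try:
            # Assumed signature: should_prune(candidate) -> bool.
            if should_prune(candidate):
                pruned.append(candidate)
        except Exception:
            continue  # best-effort: a failing predicate must not break logging
    return pruned
```

Routing both AttemptNode construction sites through one extraction helper means a newly added metadata key only has to be listed once.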
**Tests** — 29 new in `test_m3_polish.py`:
- kept_path: linear chain, discard filtering, include_self toggle, root
kept/discard, orphan termination
- descendants_of: DFS order, leaf, large chain (cycle defense canary)
- find_by_outcome: exact match, unknown returns []
- Banner predicates + render: no metadata = no output, unsandboxed +
stale combinations, HTML escape, §INV-1 wording absence
- Sidecar: round-trip, missing file = empty, malformed line tolerance
- Property assertion: N decide() calls → N sidecar entries (Q6 pin)
- AttemptNode: new fields default None, can be set
- Reporter integration: static + interactive both render banner when
isolation set; both DON'T render for normal runs
**Reviewer Q1-Q6 trace**:
- Q1 scope: 3-bullet, no Evaluator (deferred to PR 17a if a real need
surfaces) ✓
- Q2 helpers: include_self kwarg, no baked heuristics ✓
- Q3 banner SSOT: one module, both renderers import ✓
- Q4 sidecar: separate file, not ledger ✓
- Q5 CLI: --strategy-decisions flag on postmortem (no new top-level command) ✓
- Q6 property assertion: 1 decide → 1 sidecar entry test added ✓
**Stats**:
- 8 files changed, +517 / -3 LOC
- 29 new tests; full suite 2761 passed + 1 pre-existing failure
(`test_create_agent_unknown_raises` regex/case mismatch, exists at
PR 16 baseline; NOT a PR 17 regression) + 4 skipped
- 0 regressions from PR 17
**Note on dedup pass**: reviewer round 1 suggested deduplicating inline
walks in `compare.py` / `html_tree.py` after the helpers landed.
Inspection shows their walks (DFS-from-None-root with depth/orphan
handling, metric-aware best-of-run) don't match the new helper
signatures (which work on a specific node_id). Future helpers can
absorb those if a refactor opportunity arises.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
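Summary
Stacked on #10 (M3 PR 16 SubscriptionCLIBackend). Polish PR. Three focused, additive improvements landing as a single coherent PR per spec §1 M3 module table:
- StateStore: ledger query helpers (`ledger.py`)
- Reporter: banner SSOT (`reporter/_banners.py`)
- SearchStrategy: decision sidecar (`strategy_decisions.py`), surfaced via `crucible postmortem --strategy-decisions`

What's new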
**StateStore: Ledger query helpers** (`ledger.py`):
- `kept_path` walks the parent chain and filters to `outcome=="keep"`. `include_self=False` drops the queried node from the result regardless of its outcome. Reviewer Q2 pin: no baked heuristics; the caller controls semantics via the kwarg.
- `descendants_of` returns all transitive children, DFS-by-parent with siblings sorted by id. Iteration order matches `_render_tree` for consistency. Cycle defense via visited set.
- `find_by_outcome` is an exact-string filter; it takes a plain `str` for forward-compat with custom outcomes.

**Reporter: Banner SSOT** (`reporter/_banners.py`, NEW):
Single source of truth for warning banner copy. Reviewer Q3 pin: both `html_tree.py` (static) and `interactive.py` (d3) import the same `render_banners_html()` helper. §INV-1 wording compliance: uses "degraded ACL" / "diagnostic only" / "do NOT constitute a containment claim", and deliberately avoids "secure" / "isolated" / "no bypass observed in N trials" (no N to ground it).
Two banner triggers:
- `isolation == "cli_subscription_unsandboxed"` → unsandboxed banner
- `compliance_report_path is None` AND isolation set → stale-compliance banner

**SearchStrategy: Decision sidecar** (`strategy_decisions.py`, NEW):
Reviewer Q4 pin: SIDECAR file (`logs/run-<tag>/strategy-decisions.jsonl`), NOT a new ledger event type. Why not ledger:
- The ledger is safety-critical (POSIX flock, single-writer, schema-versioned, seal HMAC); a schema bump invalidates parallel append safety per spec §4.2.
- Mixing safety-critical data with debug data violates separation of concerns.

Why not in-memory: lost on crash exactly when most needed.
Orchestrator hook: `_log_strategy_decision(sctx, action)` runs immediately after `strategy.decide(sctx)` in `_run_loop_serial`. Best-effort: any failure is logged at DEBUG and never raises, so a logging hiccup never blocks the run loop.
CLI: `crucible postmortem --tag X --strategy-decisions` prints the recorded sequence.

**AttemptNode schema additions** (`ledger.py`):
- `isolation: Optional[str] = None`
- `compliance_report_path: Optional[str] = None`

Both default None for backward compat with M1a/M1b/M2 ledgers. Populated from `AgentResult.backend_metadata` via the new `_extract_backend_metadata()` helper in `orchestrator.py`, used at both AttemptNode construction sites.

Reviewer trail
`_extract_backend_metadata` helper reused at both AttemptNode construction sites (avoids drift). Deferred dedup pass for `compare.py` / `html_tree.py` inline walks confirmed as the correct judgment: the patterns don't match the new helper shapes.

Stats
- Commit `5365098`
- 29 new tests in `test_m3_polish.py`
- Full suite: 2761 passed + 1 pre-existing failure (`test_create_agent_unknown_raises`, exists at PR 16 baseline) + 4 skipped. 0 regressions from PR 17.

Test plan
**Q2 ledger helpers**
- `kept_path` linear chain
- `kept_path` filters out discards
- `kept_path(include_self=False)` drops self
- `kept_path` root kept / root discard
- `kept_path` orphan node terminates cleanly (no infinite loop)
- `descendants_of` DFS order (matches `_render_tree`)
- `descendants_of` leaf returns []
- `descendants_of` cycle defense (large chain canary)
- `find_by_outcome` exact match + unknown returns []

**Q3 banner SSOT**
- `isolation` set first

**Q4 sidecar**

**Q6 property assertion + integration**

**AttemptNode schema**
- `isolation` defaults to None
- `compliance_report_path` defaults to None

Known limitations / non-blockers
- Dedup pass deferred: reviewer round 1 suggested deduplicating the inline walks in `compare.py` / `html_tree.py` after helpers landed. Inspection showed those walks (DFS-from-None-root with depth/orphan handling) don't match the new helper signatures (specific `node_id` queries). Forcing a dedup would create abstractions-for-abstraction's-sake. Defer until a concrete caller needs both shapes.
- The `extras` dict is free-form: adapter-specific details go here. No schema enforcement; the reader uses `.get()` defaults.

🤖 Generated with Claude Code