feat(m3): ledger query helpers + reporter banners + strategy decision sidecar by suzuke · Pull Request #11 · suzuke/autocrucible

suzuke · 2026-04-26T02:22:47Z

Summary

Stacked on #10 (M3 PR 16 SubscriptionCLIBackend). Polish PR. Three focused, additive improvements landing as a single coherent PR per spec §1 M3 module table:

StateStore extend — Ledger query helpers
Reporter polish — Truth-in-labeling banners (single source of truth)
SearchStrategy polish — Decision sidecar log + crucible postmortem --strategy-decisions

What's new

StateStore — Ledger query helpers (`ledger.py`)

TrialLedger.kept_path(node_id, *, include_self=True) -> list[AttemptNode]
TrialLedger.descendants_of(node_id) -> list[AttemptNode]
TrialLedger.find_by_outcome(outcome: str) -> list[AttemptNode]

kept_path walks parent chain, filters to outcome=="keep". include_self=False drops the queried node from the result regardless of its outcome. Reviewer Q2 pin: no baked heuristics — caller controls semantics via the kwarg.
descendants_of returns all transitive children, DFS-by-parent + siblings sorted by id. Iteration order matches _render_tree for consistency. Cycle defense via visited set.
find_by_outcome exact-string filter, plain str for forward-compat with custom outcomes.

Reporter — Banner SSOT (`reporter/_banners.py` NEW)

Single source of truth for warning banner copy. Reviewer Q3 pin: both html_tree.py (static) and interactive.py (d3) import the same render_banners_html() helper. §INV-1 wording compliance — uses "degraded ACL" / "diagnostic only" / "do NOT constitute a containment claim", deliberately avoids "secure" / "isolated" / "no bypass observed in N trials" (no N to ground it).

Two banner triggers:

isolation == "cli_subscription_unsandboxed" → unsandboxed banner
compliance_report_path is None AND isolation set → stale-compliance banner

SearchStrategy — Decision sidecar (`strategy_decisions.py` NEW)

Reviewer Q4 pin: SIDECAR file (logs/run-<tag>/strategy-decisions.jsonl), NOT a new ledger event type. Why not ledger:

Ledger is safety-critical (POSIX flock, single-writer, schema-versioned, seal HMAC); a schema bump invalidates parallel append safety per spec §4.2.
Mixing safety-critical data with debug data violates separation of concerns.

Why not in-memory: lost on crash exactly when most needed.

@dataclass
class StrategyDecision:
    timestamp: str       # ISO-8601 UTC
    iteration: int
    kept_candidates: list[str]
    pruned_candidates: list[str]
    chosen_action: str   # e.g. "Continue", "BranchFrom", "Stop"
    rationale: str
    extras: dict[str, Any]  # adapter-specific

Orchestrator hook: _log_strategy_decision(sctx, action) runs immediately after strategy.decide(sctx) in _run_loop_serial. Best-effort — any failure is logged at DEBUG and never raises, so a logging hiccup never blocks the run loop.

CLI: crucible postmortem --tag X --strategy-decisions prints the recorded sequence.

AttemptNode schema additions (`ledger.py`)

isolation: Optional[str] = None          # truth-in-labeling tag
compliance_report_path: Optional[str] = None  # audit trail

Both default None for backward compat with M1a/M1b/M2 ledgers. Populated from AgentResult.backend_metadata via the new _extract_backend_metadata() helper in orchestrator.py, used at both AttemptNode construction sites.

Reviewer trail

Round	Verdict	Findings
1 (design)	ACCEPT	4 steers (Q2 helper signature, Q3 banner SSOT, Q4 sidecar not ledger, Q6 property assertion + edge cases) — all blocking-lite, none REJECT-level
2 (impl)	VERIFIED	All 4 steers landed correctly. Numbers accurate (third round in a row). Reviewer flagged bonus quality: proactive XSS test on banner content + `_extract_backend_metadata` helper reused at both AttemptNode construction sites (avoids drift). Deferred dedup pass for `compare.py`/`html_tree.py` inline walks confirmed correct judgment — patterns don't match new helper shapes.

Stats

1 commit (5365098)
8 files changed, +882 / -3 LOC
29 new tests in test_m3_polish.py
Full suite: 2761 passed + 1 pre-existing failure (test_create_agent_unknown_raises — exists at PR 16 baseline) + 4 skipped. 0 regressions from PR 17.

Test plan

Q2 ledger helpers

kept_path linear chain
kept_path filters out discards
kept_path(include_self=False) drops self
kept_path root kept / root discard
kept_path orphan node terminates cleanly (no infinite loop)
descendants_of DFS order (matches _render_tree)
descendants_of leaf returns []
descendants_of cycle defense (large chain canary)
find_by_outcome exact match + unknown returns []

Q3 banner SSOT

No metadata = no output
Unsandboxed banner renders correctly
Stale-compliance banner gated on isolation set first
Both renderers import same module (structural assertion)
§INV-1 wording compliance (no "secure" / "no bypass observed in N")
HTML escape on body text (defensive)

Q4 sidecar

Round-trip: append then load yields equal records
Missing file → empty list (reporter calls on every run dir)
Malformed lines tolerated (mirror ledger reader)
Sidecar filename matches documented constant

Q6 property assertion + integration

N decide() calls → exactly N sidecar entries (Q6 counter assertion)
Static reporter renders banner when isolation set
Interactive reporter renders banner when isolation set
Neither renders banner for normal (non-CLI) runs

AttemptNode schema

isolation defaults to None
compliance_report_path defaults to None
Both can be set to specific tag values

Known limitations / non-blockers

Inline-walk dedup deferred — reviewer round 1 suggested deduplicating walks in compare.py / html_tree.py after helpers landed. Inspection showed those walks (DFS-from-None-root with depth/orphan handling) don't match the new helper signatures (specific node_id queries). Forcing a dedup would create abstractions-for-abstraction's-sake. Defer until a concrete caller needs both shapes.
Strategy decision logging is best-effort — failures are logged at DEBUG and swallowed. A logging hiccup never blocks the run loop.
extras dict is free-form — adapter-specific details go here. No schema enforcement; reader uses .get() defaults.

🤖 Generated with Claude Code

… sidecar (M3 PR 17) Per spec §1: StateStore "extend" + SearchStrategy "polish" + Reporter follow-up to PR 15. Three focused, additive improvements landing as a single coherent PR. **StateStore extend — Ledger query helpers** (`ledger.py`): - `kept_path(node_id, *, include_self=True)` — walk parent chain, filter to outcome=="keep". `include_self=False` drops the queried node from the result. Reviewer round 1 Q2 pin: don't bake heuristics; let callers control via the kwarg. - `descendants_of(node_id)` — DFS-by-parent + sort children by id. Iteration order matches `_render_tree` in `html_tree.py`. Includes cycle defense via visited set. - `find_by_outcome(outcome)` — exact string match on outcome. **SearchStrategy polish — Decision sidecar log** (`strategy_decisions.py`): - Reviewer round 1 Q4 pin: SIDECAR file (`logs/run-<tag>/strategy-decisions.jsonl`), NOT a new ledger event type. Why not ledger: (a) ledger is safety-critical (POSIX flock, single-writer, schema- versioned, seal HMAC); a schema bump invalidates parallel append safety per spec §4.2. (b) mixing safety-critical data with debug data violates separation of concerns. Why not in-memory: lost on crash exactly when most needed. - `StrategyDecision` dataclass: timestamp / iteration / kept_candidates / pruned_candidates / chosen_action / rationale / extras. - `append()` writes one JSON line per decide() call; `load_all()` reads for the postmortem reporter. - Orchestrator hook: `_log_strategy_decision()` called immediately after `strategy.decide(sctx)` in `_run_loop_serial`. Best-effort — any failure is logged at DEBUG and never raises (so a logging hiccup never blocks the run loop). - CLI: `crucible postmortem --tag X --strategy-decisions` flag prints the recorded sequence. **Reporter polish — Truth-in-labeling banners** (`reporter/_banners.py`): - Reviewer round 1 Q3 pin: SINGLE source of truth for banner copy so static + interactive renderers stay in sync. Spec §INV-1 wording rules apply ("no bypass observed in N adversarial trials" not "secure"; we don't say it without N). - `UNSANDBOXED_HEADING` / `UNSANDBOXED_BODY`: shown when AttemptNode metadata has `isolation == "cli_subscription_unsandboxed"` (M2 PR 16 CLI-subscription runs). - `STALE_COMPLIANCE_HEADING` / `STALE_COMPLIANCE_BODY`: shown when `compliance_report_path is None` AND the run is unsandboxed (i.e. `experimental.allow_stale_compliance` was used). - Body copy uses "diagnostic only" / "not a containment claim" wording per §INV-1 — never "secure" / "isolated" / "sandbox". - Both `html_tree.py` (static) and `interactive.py` (d3) import the same `render_banners_html` helper. Tests assert both paths render the banner when isolation is set. **Schema additions** (`ledger.py:AttemptNode`): - `isolation: Optional[str] = None` — truth-in-labeling tag (parallel to spec §11.2 Q5's `isolation=local_unsafe`). - `compliance_report_path: Optional[str] = None` — audit trail to the JSONL evidence file. - Both default None for backward compat with M1a/M1b/M2 ledgers. **Orchestrator wiring** (`orchestrator.py`): - `_extract_backend_metadata(agent_result)` helper reads the dict from `AgentResult.backend_metadata` (M2 PR 13/16 propagation channel). - Both AttemptNode construction sites copy cli_binary_path, cli_version, cli_argv, env_allowlist, isolation, compliance_report_path from metadata onto the persisted node. - `_log_strategy_decision(sctx, action)` records each decision; calls `should_prune` reflectively to populate `pruned_candidates`. **Tests** — 29 new in `test_m3_polish.py`: - kept_path: linear chain, discard filtering, include_self toggle, root kept/discard, orphan termination - descendants_of: DFS order, leaf, large chain (cycle defense canary) - find_by_outcome: exact match, unknown returns [] - Banner predicates + render: no metadata = no output, unsandboxed + stale combinations, HTML escape, §INV-1 wording absence - Sidecar: round-trip, missing file = empty, malformed line tolerance - Property assertion: N decide() calls → N sidecar entries (Q6 pin) - AttemptNode: new fields default None, can be set - Reporter integration: static + interactive both render banner when isolation set; both DON'T render for normal runs **Reviewer Q1-Q6 trace**: - Q1 scope: 3-bullet, no Evaluator (deferred to PR 17a if a real need surfaces) ✓ - Q2 helpers: include_self kwarg, no baked heuristics ✓ - Q3 banner SSOT: one module, both renderers import ✓ - Q4 sidecar: separate file, not ledger ✓ - Q5 CLI: --strategy-decisions flag on postmortem (no new top-level command) ✓ - Q6 property assertion: 1 decide → 1 sidecar entry test added ✓ **Stats**: - 8 files changed, +517 / -3 LOC - 29 new tests; full suite 2761 passed + 1 pre-existing failure (`test_create_agent_unknown_raises` regex/case mismatch, exists at PR 16 baseline; NOT a PR 17 regression) + 4 skipped - 0 regressions from PR 17 **Note on dedup pass**: reviewer round 1 suggested deduplicating inline walks in `compare.py` / `html_tree.py` after the helpers landed. Inspection shows their walks (DFS-from-None-root with depth/orphan handling, metric-aware best-of-run) don't match the new helper signatures (which work on a specific node_id). Future helpers can absorb those if a refactor opportunity arises. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

suzuke mentioned this pull request Apr 26, 2026

docs(m3): wording audit per spec §INV-1 + new CLI-SUBSCRIPTION-BACKEND.md #12

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(m3): ledger query helpers + reporter banners + strategy decision sidecar#11

feat(m3): ledger query helpers + reporter banners + strategy decision sidecar#11
suzuke wants to merge 1 commit into
feat/m3-cli-subscriptionfrom
feat/m3-modules-polish

suzuke commented Apr 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

suzuke commented Apr 26, 2026

Summary

What's new

StateStore — Ledger query helpers (ledger.py)

Reporter — Banner SSOT (reporter/_banners.py NEW)

SearchStrategy — Decision sidecar (strategy_decisions.py NEW)

AttemptNode schema additions (ledger.py)

Reviewer trail

Stats

Test plan

Q2 ledger helpers

Q3 banner SSOT

Q4 sidecar

Q6 property assertion + integration

AttemptNode schema

Known limitations / non-blockers

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

StateStore — Ledger query helpers (`ledger.py`)

Reporter — Banner SSOT (`reporter/_banners.py` NEW)

SearchStrategy — Decision sidecar (`strategy_decisions.py` NEW)

AttemptNode schema additions (`ledger.py`)