Skip to content

feat(m3): ledger query helpers + reporter banners + strategy decision sidecar#11

Open
suzuke wants to merge 1 commit into
feat/m3-cli-subscriptionfrom
feat/m3-modules-polish
Open

feat(m3): ledger query helpers + reporter banners + strategy decision sidecar#11
suzuke wants to merge 1 commit into
feat/m3-cli-subscriptionfrom
feat/m3-modules-polish

Conversation

@suzuke
Copy link
Copy Markdown
Owner

@suzuke suzuke commented Apr 26, 2026

Summary

Stacked on #10 (M3 PR 16 SubscriptionCLIBackend). Polish PR. Three focused, additive improvements landing as a single coherent PR per spec §1 M3 module table:

  • StateStore extend — Ledger query helpers
  • Reporter polish — Truth-in-labeling banners (single source of truth)
  • SearchStrategy polish — Decision sidecar log + crucible postmortem --strategy-decisions

What's new

StateStore — Ledger query helpers (ledger.py)

TrialLedger.kept_path(node_id, *, include_self=True) -> list[AttemptNode]
TrialLedger.descendants_of(node_id) -> list[AttemptNode]
TrialLedger.find_by_outcome(outcome: str) -> list[AttemptNode]
  • kept_path walks parent chain, filters to outcome=="keep". include_self=False drops the queried node from the result regardless of its outcome. Reviewer Q2 pin: no baked heuristics — caller controls semantics via the kwarg.
  • descendants_of returns all transitive children, DFS-by-parent + siblings sorted by id. Iteration order matches _render_tree for consistency. Cycle defense via visited set.
  • find_by_outcome exact-string filter, plain str for forward-compat with custom outcomes.

Reporter — Banner SSOT (reporter/_banners.py NEW)

Single source of truth for warning banner copy. Reviewer Q3 pin: both html_tree.py (static) and interactive.py (d3) import the same render_banners_html() helper. §INV-1 wording compliance — uses "degraded ACL" / "diagnostic only" / "do NOT constitute a containment claim", deliberately avoids "secure" / "isolated" / "no bypass observed in N trials" (no N to ground it).

Two banner triggers:

  • isolation == "cli_subscription_unsandboxed" → unsandboxed banner
  • compliance_report_path is None AND isolation set → stale-compliance banner

SearchStrategy — Decision sidecar (strategy_decisions.py NEW)

Reviewer Q4 pin: SIDECAR file (logs/run-<tag>/strategy-decisions.jsonl), NOT a new ledger event type. Why not ledger:

  • Ledger is safety-critical (POSIX flock, single-writer, schema-versioned, seal HMAC); a schema bump invalidates parallel append safety per spec §4.2.
  • Mixing safety-critical data with debug data violates separation of concerns.

Why not in-memory: lost on crash exactly when most needed.

@dataclass
class StrategyDecision:
    timestamp: str       # ISO-8601 UTC
    iteration: int
    kept_candidates: list[str]
    pruned_candidates: list[str]
    chosen_action: str   # e.g. "Continue", "BranchFrom", "Stop"
    rationale: str
    extras: dict[str, Any]  # adapter-specific

Orchestrator hook: _log_strategy_decision(sctx, action) runs immediately after strategy.decide(sctx) in _run_loop_serial. Best-effort — any failure is logged at DEBUG and never raises, so a logging hiccup never blocks the run loop.

CLI: crucible postmortem --tag X --strategy-decisions prints the recorded sequence.

AttemptNode schema additions (ledger.py)

isolation: Optional[str] = None          # truth-in-labeling tag
compliance_report_path: Optional[str] = None  # audit trail

Both default None for backward compat with M1a/M1b/M2 ledgers. Populated from AgentResult.backend_metadata via the new _extract_backend_metadata() helper in orchestrator.py, used at both AttemptNode construction sites.

Reviewer trail

Round Verdict Findings
1 (design) ACCEPT 4 steers (Q2 helper signature, Q3 banner SSOT, Q4 sidecar not ledger, Q6 property assertion + edge cases) — all blocking-lite, none REJECT-level
2 (impl) VERIFIED All 4 steers landed correctly. Numbers accurate (third round in a row). Reviewer flagged bonus quality: proactive XSS test on banner content + _extract_backend_metadata helper reused at both AttemptNode construction sites (avoids drift). Deferred dedup pass for compare.py/html_tree.py inline walks confirmed correct judgment — patterns don't match new helper shapes.

Stats

  • 1 commit (5365098)
  • 8 files changed, +882 / -3 LOC
  • 29 new tests in test_m3_polish.py
  • Full suite: 2761 passed + 1 pre-existing failure (test_create_agent_unknown_raises — exists at PR 16 baseline) + 4 skipped. 0 regressions from PR 17.

Test plan

Q2 ledger helpers

  • kept_path linear chain
  • kept_path filters out discards
  • kept_path(include_self=False) drops self
  • kept_path root kept / root discard
  • kept_path orphan node terminates cleanly (no infinite loop)
  • descendants_of DFS order (matches _render_tree)
  • descendants_of leaf returns []
  • descendants_of cycle defense (large chain canary)
  • find_by_outcome exact match + unknown returns []

Q3 banner SSOT

  • No metadata = no output
  • Unsandboxed banner renders correctly
  • Stale-compliance banner gated on isolation set first
  • Both renderers import same module (structural assertion)
  • §INV-1 wording compliance (no "secure" / "no bypass observed in N")
  • HTML escape on body text (defensive)

Q4 sidecar

  • Round-trip: append then load yields equal records
  • Missing file → empty list (reporter calls on every run dir)
  • Malformed lines tolerated (mirror ledger reader)
  • Sidecar filename matches documented constant

Q6 property assertion + integration

  • N decide() calls → exactly N sidecar entries (Q6 counter assertion)
  • Static reporter renders banner when isolation set
  • Interactive reporter renders banner when isolation set
  • Neither renders banner for normal (non-CLI) runs

AttemptNode schema

  • isolation defaults to None
  • compliance_report_path defaults to None
  • Both can be set to specific tag values

Known limitations / non-blockers

  • Inline-walk dedup deferred — reviewer round 1 suggested deduplicating walks in compare.py / html_tree.py after helpers landed. Inspection showed those walks (DFS-from-None-root with depth/orphan handling) don't match the new helper signatures (specific node_id queries). Forcing a dedup would create abstractions-for-abstraction's-sake. Defer until a concrete caller needs both shapes.
  • Strategy decision logging is best-effort — failures are logged at DEBUG and swallowed. A logging hiccup never blocks the run loop.
  • extras dict is free-form — adapter-specific details go here. No schema enforcement; reader uses .get() defaults.

🤖 Generated with Claude Code

… sidecar (M3 PR 17)

Per spec §1: StateStore "extend" + SearchStrategy "polish" + Reporter
follow-up to PR 15. Three focused, additive improvements landing as a
single coherent PR.

**StateStore extend — Ledger query helpers** (`ledger.py`):
- `kept_path(node_id, *, include_self=True)` — walk parent chain,
  filter to outcome=="keep". `include_self=False` drops the queried
  node from the result. Reviewer round 1 Q2 pin: don't bake
  heuristics; let callers control via the kwarg.
- `descendants_of(node_id)` — DFS-by-parent + sort children by id.
  Iteration order matches `_render_tree` in `html_tree.py`. Includes
  cycle defense via visited set.
- `find_by_outcome(outcome)` — exact string match on outcome.

**SearchStrategy polish — Decision sidecar log** (`strategy_decisions.py`):
- Reviewer round 1 Q4 pin: SIDECAR file (`logs/run-<tag>/strategy-decisions.jsonl`),
  NOT a new ledger event type. Why not ledger:
  (a) ledger is safety-critical (POSIX flock, single-writer, schema-
      versioned, seal HMAC); a schema bump invalidates parallel append
      safety per spec §4.2.
  (b) mixing safety-critical data with debug data violates separation
      of concerns.
  Why not in-memory: lost on crash exactly when most needed.
- `StrategyDecision` dataclass: timestamp / iteration / kept_candidates
  / pruned_candidates / chosen_action / rationale / extras.
- `append()` writes one JSON line per decide() call; `load_all()` reads
  for the postmortem reporter.
- Orchestrator hook: `_log_strategy_decision()` called immediately after
  `strategy.decide(sctx)` in `_run_loop_serial`. Best-effort — any
  failure is logged at DEBUG and never raises (so a logging hiccup
  never blocks the run loop).
- CLI: `crucible postmortem --tag X --strategy-decisions` flag prints
  the recorded sequence.

**Reporter polish — Truth-in-labeling banners** (`reporter/_banners.py`):
- Reviewer round 1 Q3 pin: SINGLE source of truth for banner copy so
  static + interactive renderers stay in sync. Spec §INV-1 wording rules
  apply ("no bypass observed in N adversarial trials" not "secure"; we
  don't say it without N).
- `UNSANDBOXED_HEADING` / `UNSANDBOXED_BODY`: shown when AttemptNode
  metadata has `isolation == "cli_subscription_unsandboxed"` (M2 PR 16
  CLI-subscription runs).
- `STALE_COMPLIANCE_HEADING` / `STALE_COMPLIANCE_BODY`: shown when
  `compliance_report_path is None` AND the run is unsandboxed (i.e.
  `experimental.allow_stale_compliance` was used).
- Body copy uses "diagnostic only" / "not a containment claim" wording
  per §INV-1 — never "secure" / "isolated" / "sandbox".
- Both `html_tree.py` (static) and `interactive.py` (d3) import the
  same `render_banners_html` helper. Tests assert both paths render
  the banner when isolation is set.

**Schema additions** (`ledger.py:AttemptNode`):
- `isolation: Optional[str] = None` — truth-in-labeling tag (parallel
  to spec §11.2 Q5's `isolation=local_unsafe`).
- `compliance_report_path: Optional[str] = None` — audit trail to the
  JSONL evidence file.
- Both default None for backward compat with M1a/M1b/M2 ledgers.

**Orchestrator wiring** (`orchestrator.py`):
- `_extract_backend_metadata(agent_result)` helper reads the dict from
  `AgentResult.backend_metadata` (M2 PR 13/16 propagation channel).
- Both AttemptNode construction sites copy cli_binary_path, cli_version,
  cli_argv, env_allowlist, isolation, compliance_report_path from
  metadata onto the persisted node.
- `_log_strategy_decision(sctx, action)` records each decision; calls
  `should_prune` reflectively to populate `pruned_candidates`.

**Tests** — 29 new in `test_m3_polish.py`:
- kept_path: linear chain, discard filtering, include_self toggle, root
  kept/discard, orphan termination
- descendants_of: DFS order, leaf, large chain (cycle defense canary)
- find_by_outcome: exact match, unknown returns []
- Banner predicates + render: no metadata = no output, unsandboxed +
  stale combinations, HTML escape, §INV-1 wording absence
- Sidecar: round-trip, missing file = empty, malformed line tolerance
- Property assertion: N decide() calls → N sidecar entries (Q6 pin)
- AttemptNode: new fields default None, can be set
- Reporter integration: static + interactive both render banner when
  isolation set; both DON'T render for normal runs

**Reviewer Q1-Q6 trace**:
- Q1 scope: 3-bullet, no Evaluator (deferred to PR 17a if a real need
  surfaces) ✓
- Q2 helpers: include_self kwarg, no baked heuristics ✓
- Q3 banner SSOT: one module, both renderers import ✓
- Q4 sidecar: separate file, not ledger ✓
- Q5 CLI: --strategy-decisions flag on postmortem (no new top-level command) ✓
- Q6 property assertion: 1 decide → 1 sidecar entry test added ✓

**Stats**:
- 8 files changed, +517 / -3 LOC
- 29 new tests; full suite 2761 passed + 1 pre-existing failure
  (`test_create_agent_unknown_raises` regex/case mismatch, exists at
  PR 16 baseline; NOT a PR 17 regression) + 4 skipped
- 0 regressions from PR 17

**Note on dedup pass**: reviewer round 1 suggested deduplicating inline
walks in `compare.py` / `html_tree.py` after the helpers landed.
Inspection shows their walks (DFS-from-None-root with depth/orphan
handling, metric-aware best-of-run) don't match the new helper
signatures (which work on a specific node_id). Future helpers can
absorb those if a refactor opportunity arises.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant