
55: Add durable run ledger and checkpoints for Rondo-style runs#90

Open
sandsower wants to merge 8 commits into main from issue-55-run-ledger-checkpoints

Conversation

@sandsower
Owner

@sandsower sandsower commented May 10, 2026

Summary

Adds the P0 durable run-ledger/checkpoint contract for Beislið repo-aware runs.

  • Defines run-ledger-v1 under external Beislið state:
    ${BEISLID_STATE_DIR:-~/.local/state/beislid}/runs/<flow>/<repo_hash>/<run_id>/
  • Documents stable run IDs, run.json, events.jsonl, checkpoints, ticket snapshots, plans, final reports, and artifact directories.
  • Wires kickoff and ship-it protocols for ledger initialization, step/phase checkpoints, side-effect checkpoints, final reports, interruptions, and resume prompts.
  • Adds predictable gate/review artifact paths.
  • Extends ship-it/kickoff smoke helpers and verifiers for ledger files, artifacts, resume hints, memory evidence, and redaction coverage.
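The ledger path convention above can be sketched as a small helper. This is illustrative only: the function name and the repo-hash derivation are assumptions, not the contract's actual implementation, which only requires the hash to be stable per repository.

```python
import hashlib
import os
from pathlib import Path

def ledger_dir(flow: str, repo_root: str, run_id: str) -> Path:
    """Compose a run-ledger-v1 directory: <state>/runs/<flow>/<repo_hash>/<run_id>/."""
    # BEISLID_STATE_DIR overrides the default external state location.
    state_dir = Path(os.environ.get("BEISLID_STATE_DIR") or Path.home() / ".local/state/beislid")
    # Hypothetical repo-hash derivation; the contract only asks for per-repo stability.
    repo_hash = hashlib.sha256(str(Path(repo_root).resolve()).encode()).hexdigest()[:12]
    return state_dir / "runs" / flow / repo_hash / run_id
```

A run then keeps `run.json`, `events.jsonl`, and `checkpoints/` under the directory this returns.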

Verification

  • python3 scripts/validate_skills.py
  • python3 -m py_compile tests/agent-smoke/scenarios/ship-it/evidence_helper.py tests/agent-smoke/scenarios/ship-it/verify.py tests/agent-smoke/scenarios/kickoff/verify.py
  • python3 tests/agent-smoke/scenarios/ship-it/evidence_helper.py --self-test
  • python3 tests/agent-smoke/scenarios/ship-it/verify.py --self-test
  • python3 tests/agent-smoke/scenarios/kickoff/verify.py --self-test
  • git diff --check
  • python3 tests/agent-smoke/run.py gate ship-it --hosts claude,codex --timeout 900 --changed-only

Review

  • Independent review findings fixed.
  • Final fresh-eyes pass: no blocking findings.
  • Cross-host smoke passed on Claude and Codex.

Notes

  • P0 chooses external state storage. Repo-local .beislid/runs is reserved for a future explicit opt-in.
  • Domain capture is not configured for this repo.

Summary by CodeRabbit

  • New Features

    • Persistent run-ledger with resume/inspect/start/abort prompts, checkpointed resume hints, verbose checkpoint/run stamps, and final-report writeback.
    • Gate/review artifact paths and transcript-safe checkpointing with best-effort persistence and redact-on-write behavior.
  • Documentation

    • Expanded docs and templates covering run-ledger layout, resume semantics, checkpointing, artifact conventions, and session-memory requirements.
  • Tests

    • Updated smoke tests and verifiers to check run-ledger presence, checkpoint/resume shapes, artifact lists, and redaction rules.

@coderabbitai
Contributor

coderabbitai Bot commented May 10, 2026

📝 Walkthrough

Walkthrough

This PR implements a durable run ledger system for repo-aware orchestrators in Beislid, adding persistent local evidence storage with structured checkpoints, artifact management, and resume capability across kickoff and ship-it workflows.


Changes

Run Ledger Feature

  • Specification & Configuration (.beislid/output-templates.md, .beislid/workflow-md-format.md, docs/configuration.md, docs/workflows.md): Defines run-ledger-v1 under ${BEISLID_STATE_DIR:-~/.local/state/beislid}/runs/<flow>/<repo-hash>/<run-id>/, required artifacts (run.json, events.jsonl, checkpoint JSONs, optional ticket.json/implementation-plan.md/final-report.md), artifact dirs, transcript-safe checkpoint constraints, best-effort persistence, resume semantics (scan incomplete runs, prompt the user, resume only to the next resume_hint boundary), and prohibits repo-local .beislid/runs inference.
  • Templates, kickoff & ship-it (.beislid/kickoff-templates.md, .beislid/ship-it-templates.md, .beislid/output-templates.md): Adds incomplete-run prompts (resume/inspect/new/abort), verbose checkpoint-saved stamps with resume_hint, a combined run-ledger + verbose transcript persistence spec, and extends structured session memory to include the ledger and artifact-path-bearing gate/review summaries.
  • Kickoff integration (skills/kickoff/SKILL.md, skills/kickoff/step-1-ticket.md, skills/kickoff/step-8-ticket-update.md): Requires ledger init/resume after the probe cache; persists ticket.json and per-action pending/result checkpoints with resume_hint; writes implementation-plan.md and final-report.md; emits ledger events/checkpoints after aux loads and step exits; expands required outputs and run-end completion semantics.
  • Kickoff smoke test (tests/agent-smoke/scenarios/kickoff/scenario.json, tests/agent-smoke/scenarios/kickoff/verify.py): The scenario prompt requires a kickoff run ledger with run.json, events.jsonl, ticket.json, implementation plan, final report, and checkpoints under $BEISLID_STATE_DIR/runs/kickoff/<repo_hash>/<run_id>/. Adds verify_run_ledger() to validate ledger identity, required artifacts, checkpoint presence with resume_hint, and step-boundary markers; the self-test constructs a ledger fixture and provides state_dir.
  • Ship-it integration (skills/ship-it/SKILL.md, skills/ship-it/phase-1-detect.md, skills/ship-it/phase-2-gates.md, skills/ship-it/phase-3-review.md, skills/ship-it/phase-4-ship.md, .beislid/ship-it-templates.md): Initializes/resumes the ledger after config/cache; appends ledger events at phase boundaries, gates, reviews, and side effects; creates per-phase checkpoints and artifact dirs (artifacts/gates/..., artifacts/reviews/...); requires final-report.md and marks run.json completed before cache writeback; enforces transcript safety for persisted content; extends the session-memory brief with enriched gate/review evidence and ledger paths in the PR body.
  • Ship-it helpers & tests (tests/agent-smoke/scenarios/ship-it/evidence_helper.py, tests/agent-smoke/scenarios/ship-it/verify.py, tests/agent-smoke/scenarios/ship-it/scenario.json): The evidence helper adds ledger utilities (ledger_paths(), append_event(), write_checkpoint(), maybe_write_artifact()), initializes ledger/checkpoints, appends events, writes interruption/completion checkpoints, broadens recursive redaction for Authorization/GitHub-token-like strings, and adds an interrupt CLI subcommand. The verifier extracts memory from the transcript or evidence.json, validates incomplete-resume and completed-ledger shapes, checks artifact path correspondence, and its self-tests produce both incomplete and completed ledger fixtures.
  • Phase 2 gate docs (skills/ship-it/phase-2-gates.md): Documents per-attempt gate artifact checkpoint dirs, raw vs. summary persistence rules, an autofix confirmation step, strict non-autofix failure prompting, and richer gate tracking/checkpoint expectations for the final report.
  • Supporting wording & minor docs (.beislid/doctor-templates.md, .beislid/probe-semantics.md, skills/*): Replaces “P0” phrasing with “current implementation/currently runnable”, narrows doctor gate warning triggers, adjusts probe/lifecycle-action wording, and makes other small textual clarifications.

Sequence Diagram(s)

sequenceDiagram
  participant Orchestrator
  participant StateDir
  participant User
  Orchestrator->>StateDir: scan ${BEISLID_STATE_DIR}/runs/<flow>/*
  StateDir-->>Orchestrator: list run-ledger-v1 candidates
  Orchestrator->>User: prompt (resume / inspect / new / abort)
  User-->>Orchestrator: chosen action
  Orchestrator->>StateDir: open selected run-ledger-v1 (run.json/events.jsonl/checkpoints)
  Orchestrator->>StateDir: append events/checkpoints during run
  Orchestrator->>StateDir: write final-report & mark run.json completed
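The scan step at the top of the diagram amounts to filtering run.json files by completion status. A minimal sketch, assuming run.json carries a status field as the completed-run marker (the real contract's field names may differ):

```python
import json
from pathlib import Path

def incomplete_runs(state_dir: Path, flow: str) -> list[Path]:
    """List run-ledger-v1 directories whose run.json is not marked completed."""
    candidates = []
    for run_json in state_dir.glob(f"runs/{flow}/*/*/run.json"):
        try:
            payload = json.loads(run_json.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # unreadable ledgers are skipped, not treated as resumable
        if payload.get("status") != "completed":
            candidates.append(run_json.parent)
    return candidates
```

The orchestrator would present each returned directory in the resume/inspect/new/abort prompt.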

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes


Possibly related issues


Possibly related PRs

  • sandsower/beislid#89: Both PRs edit similar templates/docs (run-ledger/checkpoint protocol vs gate result envelope), so they are related.
  • sandsower/beislid#88: Related wording and gate handling/documentation updates.

Poem

🐰 A ledger hops into the state directory,
Checkpoints tucked neat, resume hints for when we tarry,
Events append like footprints in the sand,
Artifacts rest in folders, tidy and planned—
Now runs can pause and hop right back to carry on!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 0.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)

  • Description Check — ✅ Passed: check skipped; CodeRabbit’s high-level summary is enabled.
  • Title check — ✅ Passed: the title clearly and concisely describes the main change, adding a durable run ledger and checkpoints for Rondo-style runs, which aligns with the documented PR objectives.
  • Linked Issues check — ✅ Passed: check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check — ✅ Passed: check skipped because no linked issues were found for this pull request.


…heckpoints

# Conflicts:
#	.beislid/output-templates.md
#	.beislid/ship-it-templates.md
#	skills/ship-it/phase-2-gates.md

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 9

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.beislid/kickoff-templates.md:
- Around line 21-23: The three unlabeled fenced code blocks in
kickoff-templates.md (the blocks containing "💭 Run ledger: `<run_id>`...", the
"⚠️ Found incomplete kickoff run `<run_id>`..." block, and the "✓ Run
checkpoint: `<checkpoint>`..." block) should be annotated with a fence language
to satisfy markdownlint MD040; update each triple-backtick fence to include a
language label such as text (e.g., ```text) so the blocks are labeled
consistently.

In @.beislid/output-templates.md:
- Around line 89-92: The artifact path components (<scope>, <gate-name>,
<review-kind>) are unconstrained and can enable path traversal or invalid paths;
update the contract and any code that writes under artifacts/gates/... and
artifacts/reviews/... to require and validate a safe-slug normalization (e.g.,
lowercase alphanumerics, hyphens/underscores only) and explicitly reject path
separators or empty segments and characters like "../" or "/" before any
filesystem write; document the rule in the contract text and ensure callers (the
code that composes "artifacts/gates/<scope>/<gate-name>/<attempt>/" and
"artifacts/reviews/<review-kind>.md") must normalize inputs and fail validation
rather than write unsafe paths.

In `@docs/configuration.md`:
- Line 262: The placeholder in docs/configuration.md currently uses dash-style
tokens "<repo-hash>/<run-id>" which conflicts with the ledger contract's
snake_case tokens "<repo_hash>/<run_id>"; update the documentation string
"${BEISLID_STATE_DIR:-~/.local/state/beislid}/runs/<flow>/<repo-hash>/<run-id>/"
to use the canonical "<repo_hash>/<run_id>" form so it matches the ledger
contract and other docs (search for occurrences of "<repo-hash>" and "<run-id>"
and replace them with "<repo_hash>" and "<run_id>" respectively).

In `@skills/kickoff/SKILL.md`:
- Line 37: Update the kickoff flow to initialize or resume a run-ledger-v1
directory immediately after cache setup and before Step 1 by scanning for
incomplete ledgers for the repo hash and offering
resume/inspect/start-new/abort; on resume reuse run_id, append a resume event to
events.jsonl, and continue only from the next safe step boundary named by
resume_hint while always requiring ticket update approval. When starting-new
write run.json and the first events.jsonl entry including run_id, branch, repo
hash, workflow hash, cache state, and ledger path, and print the ≤240-char
orientation prose from kickoff-templates.md (include run id/path in verbose mode
or if ledger init warned). Add a required durable checkpoint named run-start
immediately after kickoff (before Step 1) and ensure an additional checkpoint is
written immediately after any irreversible side effect (e.g., ticket
update/post) so resume state is fully defined.

In `@skills/ship-it/phase-4-ship.md`:
- Around line 41-42: Update the PR body guidance so it prohibits publishing
host-local filesystem paths: modify the "Body: terse record-facing summary..."
text to remove the allowance of "local run-ledger artifact paths" and instead
instruct authors to reference repository-facing paths, sanitized artifact IDs,
or attach artifacts to the PR; add a note to redact or replace any host-specific
info (e.g., home-directory usernames) with placeholders or repo-relative paths
and preserve existing advice about carried warnings (like AI-generated
translation notices).

In `@tests/agent-smoke/scenarios/kickoff/verify.py`:
- Around line 74-80: The current check aggregates all checkpoint files into one
big string (checkpoint_text) and searches for "resume_hint" globally; instead
iterate over each file in checkpoints (the variable checkpoints from the run_dir
/ "checkpoints" glob), read each file (using path.read_text(encoding="utf-8",
errors="replace")), and for every path ensure the substring "resume_hint" is
present; if any file is missing it, call fail(errors, f"...") referencing that
specific path so the failure points to the offending checkpoint rather than
masking missing fields across files.
- Around line 52-57: The code currently picks the last path from ledgers
(created in tests/agent-smoke/scenarios/kickoff/verify.py) using lexicographic
order; change it to select the most recently modified ledger by using filesystem
mtime (or an updated_at field) when choosing run_json so run_json = max(ledgers,
key=lambda p: p.stat().st_mtime) (keep the existing empty-check/fail behavior),
then continue using run_dir = run_json.parent; this ensures the latest ledger is
chosen reliably instead of relying on sorted(...)[-1].

In `@tests/agent-smoke/scenarios/ship-it/evidence_helper.py`:
- Around line 305-318: Read run_json into run_payload first and use that to make
finalize idempotent: before calling write_checkpoint/append_event and mutating
run_payload, check run_payload.get("status") == "completed" and whether
run_payload.get("last_checkpoint") or run_payload.get("checkpoints", []) already
contains the same final checkpoint path (compare against the would-be
final_checkpoint value or its string form); if so, skip re-writing the
final-report checkpoint, skip append_event("run-finalized") and avoid pushing
the duplicate checkpoint into run_payload, otherwise proceed to call
write_checkpoint, append_event and update/write_json; reference
final_checkpoint, append_event, write_checkpoint, run_payload, run_json and
write_json to locate the changes.

In `@tests/agent-smoke/scenarios/ship-it/verify.py`:
- Around line 65-79: The code currently returns {} as soon as
json.loads(match.group(1)) fails, which prevents the subsequent evidence.json
fallback from running; change the logic in the block that handles match (around
the json.loads(match.group(1)) call) so that on JSONDecodeError it does not
return immediately but instead continues to the later transcript marker check
and evidence.json reading (i.e., remove or replace the early return in the
except block and allow the function to proceed to the
evidence_path/evidence.json logic); keep the existing behavior of returning a
dict when parsed successfully and still validate that evidence.get("memory") is
a dict before returning it.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: e2f3201e-4e75-4168-b10a-b3cb0a21358b

📥 Commits

Reviewing files that changed from the base of the PR and between 3b33974 and 2452946.

📒 Files selected for processing (19)
  • .beislid/kickoff-templates.md
  • .beislid/output-templates.md
  • .beislid/ship-it-templates.md
  • .beislid/workflow-md-format.md
  • docs/configuration.md
  • docs/workflows.md
  • skills/kickoff/SKILL.md
  • skills/kickoff/step-1-ticket.md
  • skills/kickoff/step-8-ticket-update.md
  • skills/ship-it/SKILL.md
  • skills/ship-it/phase-1-detect.md
  • skills/ship-it/phase-2-gates.md
  • skills/ship-it/phase-3-review.md
  • skills/ship-it/phase-4-ship.md
  • tests/agent-smoke/scenarios/kickoff/scenario.json
  • tests/agent-smoke/scenarios/kickoff/verify.py
  • tests/agent-smoke/scenarios/ship-it/evidence_helper.py
  • tests/agent-smoke/scenarios/ship-it/scenario.json
  • tests/agent-smoke/scenarios/ship-it/verify.py

Comment on lines +21 to +23
```
💭 Run ledger: `<run_id>` at `<ledger_path>`.
```

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add fence languages to fix markdownlint MD040 warnings.

Lines 21, 27, and 62 use unlabeled fenced blocks. Add a language (e.g., text) to keep docs lint-clean.

Proposed fix

-```
+```text
 💭 Run ledger: `<run_id>` at `<ledger_path>`.

-```
+```text
 ⚠️ Found incomplete kickoff run `<run_id>`, last checkpoint `<checkpoint>`, resume hint: `<resume_hint>`.
 What now? (a) resume this run, (b) inspect the ledger path, (c) start a new run, (d) abort.

-```
+```text
 ✓ Run checkpoint: `<checkpoint>` saved; resume hint: <next-safe-action>.


Also applies to: 27-30, 62-64

<details>
<summary>🧰 Tools</summary>

<details>
<summary>🪛 markdownlint-cli2 (0.22.1)</summary>

[warning] 21-21: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

</details>

</details>



Comment on lines +89 to +92
Gate attempts use predictable paths under `artifacts/gates/<scope>/<gate-name>/<attempt>/`. Store raw stdout/stderr as `raw.log` when safe, or `summary.txt` when raw output should stay out of the ledger. The gate checkpoint/envelope references that path and records gate name, scope, cwd, command label, status, duration, safe summary, retry/fix decisions, and accepted risk if any.

Review outputs use predictable paths under `artifacts/reviews/<review-kind>.md` (`review.md`, `fresh-eyes.md`, or `combined-review.md`). Store the human-readable finding summary, addressed/accepted-risk state, and reduced-coverage notes; do not store hidden reviewer reasoning.


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Constrain artifact path components to safe slugs before filesystem writes.

<scope> / <gate-name> / <review-kind> are currently unconstrained in the contract. If implementations map these directly, config-controlled values can cause path traversal or invalid paths. Define a required normalization/validation rule in the contract (e.g., slug + separator rejection).

Suggested contract hardening text
-Gate attempts use predictable paths under `artifacts/gates/<scope>/<gate-name>/<attempt>/`.
+Gate attempts use predictable paths under `artifacts/gates/<scope_slug>/<gate_slug>/<attempt>/`.
+`<scope_slug>` and `<gate_slug>` MUST be filesystem-safe slugs (e.g., `[a-z0-9._-]+`), with path separators, `..`, drive prefixes, and absolute paths rejected.
...
-Review outputs use predictable paths under `artifacts/reviews/<review-kind>.md`
+Review outputs use predictable paths under `artifacts/reviews/<review_kind_slug>.md`
+with the same slug/validation rule.
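A minimal sketch of the slug validation this comment proposes. The allowed character set follows the suggested `[a-z0-9._-]+` rule, but the exact set remains an assumption the contract would need to pin down; the helper name is hypothetical:

```python
import re

# First character must be alphanumeric so "." and ".." can never match.
_SLUG_RE = re.compile(r"^[a-z0-9][a-z0-9._-]*$")

def safe_slug(component: str) -> str:
    """Validate one artifact path component; reject rather than silently normalize."""
    if not _SLUG_RE.fullmatch(component) or ".." in component:
        raise ValueError(f"unsafe artifact path component: {component!r}")
    return component
```

Callers composing `artifacts/gates/<scope>/<gate-name>/<attempt>/` would pass each component through this before any filesystem write, so traversal attempts fail validation instead of reaching the ledger.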

Comment thread: docs/configuration.md
Comment thread: skills/kickoff/SKILL.md (outdated)
- Hash matches → load capability entries; cache state `fresh`. Per-cap freshness uses `probe_cache.ttl_hours`, default 24.

Print orientation prose from `kickoff-templates.md` (≤240 chars).
Initialize or resume a `run-ledger-v1` directory per `output-templates.md` immediately after cache setup and before Step 1. Use flow `kickoff`; first scan for incomplete ledgers for this repo hash and offer resume / inspect / start-new / abort. On resume, reuse `run_id`, append a resume event, and continue only from the next safe step boundary named by `resume_hint`; never skip ticket update approval. On start-new, write `run.json` and the first `events.jsonl` entry with `run_id`, branch, repo hash, workflow hash, cache state, and ledger path. Print orientation prose from `kickoff-templates.md` (≤240 chars), including the run id/path in verbose mode or when ledger initialization needed a warning.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add a durable checkpoint before Step 1 and immediately after irreversible side effects.

Right now start-new only requires run.json plus the first event, and the global step rule only checkpoints on step exit. That leaves resume state undefined if kickoff is interrupted before Step 1 completes or right after a side effect like a ticket update/post. For a P0 resume contract, the top-level protocol should require a run-start checkpoint with resume_hint, and another checkpoint immediately after any irreversible side effect.

Also applies to: 69-69
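A sketch of the run-start checkpoint this comment asks for. The file name, field names, and resume_hint value are illustrative, not the contract's wording:

```python
import json
import time
from pathlib import Path

def write_run_start_checkpoint(run_dir: Path) -> Path:
    """Write a durable run-start checkpoint before Step 1 begins."""
    payload = {
        "checkpoint": "run-start",
        "created_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        # If interrupted here, resume re-enters at the first step boundary.
        "resume_hint": "step-1-ticket",
    }
    path = run_dir / "checkpoints" / "run-start.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path
```

The same pattern would repeat immediately after each irreversible side effect (ticket update/post), with resume_hint naming the next safe boundary.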


Comment on lines +41 to 42
- Body: terse record-facing summary of changes, why, verification, reviewer warnings, accepted risks, reduced review coverage if any, and local run-ledger artifact paths when useful for maintainers in the same environment.
- Include carried warnings such as AI-generated translation notices.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t publish local ledger filesystem paths in PR bodies.

Line 41 allows local run-ledger paths in PR descriptions. That can leak host-local identifiers (e.g., home-directory usernames) into remote PR metadata. Keep local paths in ledger/memory artifacts and keep PR body repository-facing.

Proposed fix
-- Body: terse record-facing summary of changes, why, verification, reviewer warnings, accepted risks, reduced review coverage if any, and local run-ledger artifact paths when useful for maintainers in the same environment.
+- Body: terse record-facing summary of changes, why, verification, reviewer warnings, accepted risks, and reduced review coverage if any.
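One way to keep PR bodies repository-facing, as this comment suggests, is to rewrite paths before publishing. A sketch under stated assumptions — the helper name and the fallback order (repo-relative first, then home-collapsed, then unchanged) are choices of this example, not the skill's defined behavior:

```python
from pathlib import Path

def repo_facing(path: str, repo_root: str) -> str:
    """Rewrite a host-local artifact path for publication in a PR body."""
    p = Path(path)
    # Prefer a repo-relative path; else collapse the home dir to "~"; else leave as-is.
    for root, prefix in ((Path(repo_root), ""), (Path.home(), "~/")):
        try:
            return prefix + str(p.relative_to(root))
        except ValueError:
            continue
    return path
```

This keeps home-directory usernames out of remote PR metadata while the full local paths stay in the ledger and session memory.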

Comment on lines +52 to +57
    ledgers = sorted(state_dir.glob("runs/kickoff/*/*/run.json"))
    if not ledgers:
        fail(errors, f"no kickoff run ledger found under {state_dir}/runs/kickoff")
        return ""
    run_json = ledgers[-1]
    run_dir = run_json.parent

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Select the latest ledger by updated_at/mtime, not lexicographic filename order.

sorted(...)[-1] is path-order based and can pick the wrong run when multiple ledgers exist. That makes smoke verification flaky and can validate stale artifacts.

Suggested fix
-    ledgers = sorted(state_dir.glob("runs/kickoff/*/*/run.json"))
+    ledgers = list(state_dir.glob("runs/kickoff/*/*/run.json"))
     if not ledgers:
         fail(errors, f"no kickoff run ledger found under {state_dir}/runs/kickoff")
         return ""
-    run_json = ledgers[-1]
+    def ledger_rank(path: Path) -> tuple[str, float]:
+        try:
+            payload = json.loads(path.read_text(encoding="utf-8"))
+            return (str(payload.get("updated_at", "")), path.stat().st_mtime)
+        except Exception:
+            return ("", path.stat().st_mtime)
+    run_json = max(ledgers, key=ledger_rank)

Comment on lines +74 to +80
    checkpoints = list((run_dir / "checkpoints").glob("*.json")) if (run_dir / "checkpoints").exists() else []
    if len(checkpoints) < 2:
        fail(errors, f"kickoff run ledger should have step checkpoints: {run_dir / 'checkpoints'}")
    else:
        checkpoint_text = "\n".join(path.read_text(encoding="utf-8", errors="replace") for path in checkpoints)
        if "resume_hint" not in checkpoint_text:
            fail(errors, "kickoff run ledger checkpoints should include resume_hint for resume semantics")

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate resume_hint in each checkpoint, not via one merged text search.

Current logic only checks whether "resume_hint" appears anywhere across all checkpoint text, so one good file can mask missing fields in others.

Suggested fix
-    else:
-        checkpoint_text = "\n".join(path.read_text(encoding="utf-8", errors="replace") for path in checkpoints)
-        if "resume_hint" not in checkpoint_text:
-            fail(errors, "kickoff run ledger checkpoints should include resume_hint for resume semantics")
+    else:
+        for checkpoint in checkpoints:
+            try:
+                payload = json.loads(checkpoint.read_text(encoding="utf-8"))
+            except json.JSONDecodeError as exc:
+                fail(errors, f"checkpoint is not valid JSON: {checkpoint}: {exc}")
+                continue
+            if not payload.get("resume_hint"):
+                fail(errors, f"checkpoint missing resume_hint: {checkpoint}")

Comment on lines +305 to +318
    final_checkpoint = write_checkpoint(transcript, "final-report", {"status": "completed", "summary": safe_summary, "resume_hint": "run complete", "final_report": str(final_report)})
    append_event(transcript, "run-finalized", safe_summary, checkpoint=str(final_checkpoint), final_report=str(final_report))
    run_payload = read_json(run_json)
    run_payload["status"] = "completed"
    run_payload["updated_at"] = now()
    run_payload["final_report"] = str(final_report)
    run_payload["resume_hint"] = "run complete"
    run_payload.pop("blocking_reason", None)
    run_payload.setdefault("interruptions", [])
    if run_payload.get("interrupted_at"):
        run_payload["interruptions"].append({"interrupted_at": run_payload.pop("interrupted_at"), "checkpoint": str(transcript.parent / "checkpoints" / "interrupted.json")})
    run_payload.setdefault("checkpoints", []).append(str(final_checkpoint))
    run_payload["last_checkpoint"] = str(final_checkpoint)
    write_json(run_json, run_payload)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Make finalize idempotent before appending ledger history.

command_finalize() rewrites the same final-report checkpoint, appends another run-finalized event, and pushes the same checkpoint path into run.json every time it is retried. The self-test already calls command_finalize(...) twice, so this currently mutates ledger history on a retry instead of treating finalize as a no-op once the run is completed.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/agent-smoke/scenarios/ship-it/evidence_helper.py` around lines 305 -
318, Read run_json into run_payload first and use that to make finalize
idempotent: before calling write_checkpoint/append_event and mutating
run_payload, check run_payload.get("status") == "completed" and whether
run_payload.get("last_checkpoint") or run_payload.get("checkpoints", []) already
contains the same final checkpoint path (compare against the would-be
final_checkpoint value or its string form); if so, skip re-writing the
final-report checkpoint, skip append_event("run-finalized") and avoid pushing
the duplicate checkpoint into run_payload, otherwise proceed to call
write_checkpoint, append_event and update/write_json; reference
final_checkpoint, append_event, write_checkpoint, run_payload, run_json and
write_json to locate the changes.
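The idempotency guard the prompt asks for could be sketched like this (the `finalize_is_noop` helper name is an assumption; inside `command_finalize` the equivalent check would run before `write_checkpoint`/`append_event` so a retry skips re-writing ledger history):

```python
import json
from pathlib import Path


def finalize_is_noop(run_json: Path, final_checkpoint: Path) -> bool:
    # A retried finalize should be a no-op once run.json already records
    # completion with this same checkpoint path.
    if not run_json.exists():
        return False
    payload = json.loads(run_json.read_text(encoding="utf-8"))
    return (
        payload.get("status") == "completed"
        and str(final_checkpoint) in payload.get("checkpoints", [])
    )
```

Checking both `status` and the checkpoint list guards against a partially written run.json counting as completed.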

Comment on lines +65 to +79
if match:
try:
return json.loads(match.group(1))
except json.JSONDecodeError:
return {}
if transcript and re.search(r"kind:\s*ship-it-session-memory-v1", text, re.IGNORECASE):
evidence_path = transcript.parent / "evidence.json"
if evidence_path.exists():
try:
evidence = json.loads(evidence_path.read_text(encoding="utf-8"))
except json.JSONDecodeError:
return {}
memory = evidence.get("memory")
return memory if isinstance(memory, dict) else {}
return {}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve the evidence.json fallback when transcript memory JSON is malformed.

If the transcript has the ship-it-session-memory-v1 marker but the fenced JSON is invalid, Lines 65-69 return {} immediately and never consult evidence.json. That breaks the new fallback path and will fail otherwise-valid smoke runs whose transcript marker is present but not parseable.

Suggested fix
     if match:
         try:
             return json.loads(match.group(1))
         except json.JSONDecodeError:
-            return {}
+            pass
     if transcript and re.search(r"kind:\s*ship-it-session-memory-v1", text, re.IGNORECASE):
         evidence_path = transcript.parent / "evidence.json"
         if evidence_path.exists():
             try:
                 evidence = json.loads(evidence_path.read_text(encoding="utf-8"))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/agent-smoke/scenarios/ship-it/verify.py` around lines 65 - 79, The code
currently returns {} as soon as json.loads(match.group(1)) fails, which prevents
the subsequent evidence.json fallback from running; change the logic in the
block that handles match (around the json.loads(match.group(1)) call) so that on
JSONDecodeError it does not return immediately but instead continues to the
later transcript marker check and evidence.json reading (i.e., remove or replace
the early return in the except block and allow the function to proceed to the
evidence_path/evidence.json logic); keep the existing behavior of returning a
dict when parsed successfully and still validate that evidence.get("memory") is
a dict before returning it.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.beislid/ship-it-templates.md:
- Around line 282-284: Add explicit language identifiers to the three fenced
code blocks that currently use plain triple backticks: change their opening
fences to ```text. Specifically update the block containing
"${BEISLID_STATE_DIR:-~/.local/state/beislid}/runs/ship-it/<repo_hash>/<run_id>/",
the block with the two warning lines beginning "⚠️ Run ledger could not be
written to `<path>`..." and "⚠️ Verbose transcript could not be written to
`<path>`...", and the block with the two info lines beginning "💭 Run ledger
saved: `<path>`" and "💭 Verbose transcript saved: `<path>`" so markdownlint
MD040 is satisfied.

In @.beislid/workflow-md-format.md:
- Line 209: Change the capitalized "The" at the start of the sentence in the
lifecycle-actions paragraph to lowercase "the" so the sentence reads "...Under
these events, the current implementation supports `type: artifact` only...";
update the text in the same paragraph that contains "spec owns the
`spec_approved` event; `blueprint` owns the `blueprint_approved` event..." to
use the lowercase "the".

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 55dd70e4-1b68-4b40-942d-a83ee6c41cb3

📥 Commits

Reviewing files that changed from the base of the PR and between 2452946 and 66af348.

📒 Files selected for processing (13)
  • .beislid/doctor-templates.md
  • .beislid/output-templates.md
  • .beislid/probe-semantics.md
  • .beislid/ship-it-templates.md
  • .beislid/workflow-md-format.md
  • docs/configuration.md
  • skills/blueprint/SKILL.md
  • skills/doctor/SKILL.md
  • skills/heard-chef/phase-3-push.md
  • skills/kickoff/step-1-ticket.md
  • skills/setup/SKILL.md
  • skills/ship-it/phase-2-gates.md
  • skills/spec/SKILL.md
✅ Files skipped from review due to trivial changes (7)
  • skills/setup/SKILL.md
  • skills/heard-chef/phase-3-push.md
  • .beislid/probe-semantics.md
  • docs/configuration.md
  • skills/blueprint/SKILL.md
  • skills/spec/SKILL.md
  • skills/doctor/SKILL.md
🚧 Files skipped from review as they are similar to previous changes (3)
  • skills/kickoff/step-1-ticket.md
  • .beislid/output-templates.md
  • skills/ship-it/phase-2-gates.md

Comment on lines 282 to 284
```
${BEISLID_STATE_DIR:-~/.local/state/beislid}/runs/ship-it/<repo_hash>/<timestamp>/transcript.md
${BEISLID_STATE_DIR:-~/.local/state/beislid}/runs/ship-it/<repo_hash>/<run_id>/
```

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add language tags to fenced code blocks (MD040).

These three fences are missing language identifiers and currently trip markdownlint.

Suggested patch

-```
+```text
 ${BEISLID_STATE_DIR:-~/.local/state/beislid}/runs/ship-it/<repo_hash>/<run_id>/

-```
+```text
 ⚠️ Run ledger could not be written to `<path>`: . Continuing without persisted ledger.
 ⚠️ Verbose transcript could not be written to `<path>`: . Continuing without persisted transcript.

-```
+```text
 💭 Run ledger saved: `<path>`
 💭 Verbose transcript saved: `<path>`


Also applies to: 304-307, 311-314

🧰 Tools

🪛 markdownlint-cli2 (0.22.1)

[warning] 282-282: Fenced code blocks should have a language specified (MD040, fenced-code-language)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.beislid/ship-it-templates.md around lines 282 - 284, add explicit language
identifiers to the three fenced code blocks that currently use plain triple
backticks: change their opening fences to ```text. Specifically update the block
containing
"${BEISLID_STATE_DIR:-~/.local/state/beislid}/runs/ship-it/<repo_hash>/<run_id>/",
the block with the two warning lines beginning "⚠️ Run ledger could not be
written to `<path>`..." and "⚠️ Verbose transcript could not be written to
`<path>`...", and the block with the two info lines beginning "💭 Run ledger
saved: `<path>`" and "💭 Verbose transcript saved: `<path>`" so markdownlint
MD040 is satisfied.


`spec` owns the `spec_approved` event; `blueprint` owns the `blueprint_approved` event. Kickoff only passes context in and records returned artifact status/path. Under these events, P0 supports `type: artifact` only; CLI, MCP, and other providers are reserved and skipped. Artifact actions write the approved spec/design Markdown to a repo file. `approval: prompt` asks before writing; `approval: auto` creates a missing target without another prompt; omitted approval defaults to `prompt`. Existing targets always prompt for overwrite / choose another path / skip. Skip and write failures do not block routing to downstream skills.
`spec` owns the `spec_approved` event; `blueprint` owns the `blueprint_approved` event. Kickoff only passes context in and records returned artifact status/path. Under these events, The current implementation supports `type: artifact` only; CLI, MCP, and other providers are reserved and skipped. Artifact actions write the approved spec/design Markdown to a repo file. `approval: prompt` asks before writing; `approval: auto` creates a missing target without another prompt; omitted approval defaults to `prompt`. Existing targets always prompt for overwrite / choose another path / skip. Skip and write failures do not block routing to downstream skills.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix capitalization typo in lifecycle-actions paragraph.

Line 209 has “Under these events, The current implementation...”; “the” should be lowercase for sentence flow.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.beislid/workflow-md-format.md at line 209, Change the capitalized "The" at
the start of the sentence in the lifecycle-actions paragraph to lowercase "the"
so the sentence reads "...Under these events, the current implementation
supports `type: artifact` only..."; update the text in the same paragraph that
contains "spec owns the `spec_approved` event; `blueprint` owns the
`blueprint_approved` event..." to use the lowercase "the".

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@skills/ship-it/phase-2-gates.md`:
- Line 42: Replace the ambiguous phrase with a concrete storage spec: state that
the user's decision to proceed without a passing gate must be recorded as a new
field in the checkpoint JSON (add "proceed_without_passing" to the checkpoint
object), and describe its schema minimally (user id, timestamp, reason,
assessed_risk, and related_artifact_path) so implementers know to persist
decision alongside existing checkpoint data rather than as a separate file or
inside the gate envelope.
- Line 39: Clarify the artifact path template
artifacts/gates/<scope>/<gate-name>/<attempt>/ by specifying that for gates
defined at the top-level (under the gates key rather than within scopes) the
<scope> placeholder should be the literal "top-level" (so
artifacts/gates/top-level/<gate-name>/<attempt>/), and update the docs where the
artifacts path and the gates vs scopes distinction are described (references:
the artifacts/gates/<scope>/<gate-name>/<attempt>/ template, the gates key, and
scopes key).
- Line 39: Clarify selection between raw.log and summary.txt by stating that
raw.log should be used when the run's stdout/stderr is safe for inclusion in
transcripts (no sensitive PII, secrets, or unacceptable data), and summary.txt
should be used when the output contains sensitive or unsafe content—in that case
store a redacted/sanitized summary or structured excerpt suitable for sharing;
keep the same artifact path pattern
artifacts/gates/<scope>/<gate-name>/<attempt>/ and ensure the
envelope/checkpoint references which file was stored and why (i.e., include a
small metadata flag like "transcript_safe": true/false).
- Line 31: The document references three missing docs (workflow-md-format.md,
output-templates.md, ship-it-templates.md); either create those files with the
expected schemas/content or inline/remove the references: add
workflow-md-format.md containing the rich gate schema (attributes stage, kind,
command, execution, output, failure and rules for pre-pr sensor detection), add
output-templates.md with the Gate result envelope format and pytest/generic
parsing guidance, and add ship-it-templates.md with all prompt text, skip
messages, and templates used across Phase 2a–2d and walkthrough flows;
alternatively, replace each reference in phase-2-gates.md with the full required
schema/text so implementers have the definitions in-place.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: b78ea7b4-b932-45be-8ecb-a99f8c641422

📥 Commits

Reviewing files that changed from the base of the PR and between 66af348 and b24c409.

📒 Files selected for processing (6)
  • skills/kickoff/SKILL.md
  • skills/kickoff/step-1-ticket.md
  • skills/ship-it/SKILL.md
  • skills/ship-it/phase-1-detect.md
  • skills/ship-it/phase-2-gates.md
  • skills/ship-it/phase-3-review.md
✅ Files skipped from review due to trivial changes (2)
  • skills/ship-it/phase-1-detect.md
  • skills/kickoff/SKILL.md
🚧 Files skipped from review as they are similar to previous changes (3)
  • skills/kickoff/step-1-ticket.md
  • skills/ship-it/phase-3-review.md
  • skills/ship-it/SKILL.md

- neither: print `no gates configured — skipping` and continue.

Normalize per `workflow-md-format.md`: flat `name`+`command` means pre-pr computational sensor. P0 ship-it executes legacy gates plus rich gates where `stage` is absent/`pre-pr`, `kind` is absent/`sensor`, `command` exists, and `execution` is absent/`computational`. Record other stages as `skipped-by-stage`; record pre-pr non-computational/non-sensor as `skipped-by-execution`. Treat rich `output`/`failure` as prompt context.
Normalize per `workflow-md-format.md`: flat `name`+`command` means pre-pr computational sensor. Ship-it currently executes legacy gates plus rich gates where `stage` is absent/`pre-pr`, `kind` is absent/`sensor`, `command` exists, and `execution` is absent/`computational`. Record other stages as `skipped-by-stage`; record pre-pr non-computational/non-sensor as `skipped-by-execution`. Treat rich `output`/`failure` as prompt context.
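Under the normalization rules above, gate selection reduces to a small classifier. A hedged sketch (the `gate_disposition` name is illustrative, and treating a command-less rich gate as `skipped-by-execution` is an assumption — the text only defines the executable case):

```python
def gate_disposition(gate: dict) -> str:
    # Defaults mirror the normalization rule: absent stage/kind/execution
    # mean pre-pr / sensor / computational.
    if gate.get("stage", "pre-pr") != "pre-pr":
        return "skipped-by-stage"
    if gate.get("kind", "sensor") != "sensor":
        return "skipped-by-execution"
    if gate.get("execution", "computational") != "computational":
        return "skipped-by-execution"
    if not gate.get("command"):
        # Assumption: a gate without a command cannot be executed.
        return "skipped-by-execution"
    return "execute"
```

A legacy flat `name`+`command` gate takes every default and classifies as executable.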

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Verify workflow-md-format.md exists and documents gate schema

# Check if the file exists
if fd -t f '^workflow-md-format\.md$' | grep -q .; then
  echo "✓ workflow-md-format.md found"
  # Check for key gate-related terms
  fd -t f '^workflow-md-format\.md$' --exec rg -n -C3 'stage|kind|execution|output|failure' {} \; | head -40
else
  echo "✗ workflow-md-format.md not found"
fi

Repository: sandsower/beislid

Length of output: 95


🏁 Script executed:

# Search for all referenced files
echo "=== Checking for referenced files ==="
fd -t f '^(workflow-md-format|output-templates|ship-it-templates)' --type f

echo ""
echo "=== Checking for files in skills/ship-it directory ==="
fd -t f '\.md$' skills/ship-it

echo ""
echo "=== Checking for files with 'format' or 'template' in name ==="
fd -t f '(format|template)' --type f --extension md | head -20

Repository: sandsower/beislid

Length of output: 372


🏁 Script executed:

# Read the phase-2-gates.md file to see all references and context
cat skills/ship-it/phase-2-gates.md

Repository: sandsower/beislid

Length of output: 6979


Create missing documentation files or remove references.

The file references three documentation files that do not exist in the repository:

  1. workflow-md-format.md — referenced at line 31 for rich gate schema definition (with stage, kind, command, execution, output, failure attributes)
  2. output-templates.md — referenced at line 42 for Gate result envelope structure and pytest/generic parsing guidelines
  3. ship-it-templates.md — referenced multiple times (lines 13, 35, 37, 50, 56, and later sections) for prompt text, skip messages, and template content for Phase 2a, 2b, 2c, 2d, and walkthrough flows

Implementers cannot complete Phase 2 execution without these references. Either create these documentation files with the required schema and content, or remove/inline the references and clarify the expectations directly in this file.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@skills/ship-it/phase-2-gates.md` at line 31, The document references three
missing docs (workflow-md-format.md, output-templates.md, ship-it-templates.md);
either create those files with the expected schemas/content or inline/remove the
references: add workflow-md-format.md containing the rich gate schema
(attributes stage, kind, command, execution, output, failure and rules for
pre-pr sensor detection), add output-templates.md with the Gate result envelope
format and pytest/generic parsing guidance, and add ship-it-templates.md with
all prompt text, skip messages, and templates used across Phase 2a–2d and
walkthrough flows; alternatively, replace each reference in phase-2-gates.md
with the full required schema/text so implementers have the definitions
in-place.

1. If `fast_ship_eligible=true`, batch only gates with `parallel_safe: true`, no `autofix`, and `mutates` not true. Run concurrently when supported; otherwise run sequentially and record `parallel_unavailable`.
2. Run non-batched gates once in configured order. Normal mode treats every selected gate as non-batched.
3. For each run, capture duration and parse stdout/stderr into the shared Gate result envelope from `output-templates.md` (pytest parser when pytest-like, otherwise generic). Store raw logs by path when possible, else a safe summary.
3. For each run, capture duration and parse stdout/stderr into the shared Gate result envelope from `output-templates.md` (pytest when pytest-like, else generic). If the ledger is available, create `artifacts/gates/<scope>/<gate-name>/<attempt>/`; store safe raw output as `raw.log` or `summary.txt`, and reference it from the envelope/checkpoint.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clarify the artifact path structure for top-level gates.

The path template artifacts/gates/<scope>/<gate-name>/<attempt>/ includes a <scope> placeholder, but top-level gates (from the gates key rather than scopes) don't belong to a scope. Please specify what value to use for <scope> in the top-level case (e.g., top-level, root, or a different path structure).

📝 Suggested clarification
-3. For each run, capture duration and parse stdout/stderr into the shared Gate result envelope from `output-templates.md` (pytest when pytest-like, else generic). If the ledger is available, create `artifacts/gates/<scope>/<gate-name>/<attempt>/`; store safe raw output as `raw.log` or `summary.txt`, and reference it from the envelope/checkpoint.
+3. For each run, capture duration and parse stdout/stderr into the shared Gate result envelope from `output-templates.md` (pytest when pytest-like, else generic). If the ledger is available, create `artifacts/gates/<scope>/<gate-name>/<attempt>/` (use `top-level` for `<scope>` when running top-level gates); store safe raw output as `raw.log` or `summary.txt`, and reference it from the envelope/checkpoint.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@skills/ship-it/phase-2-gates.md` at line 39, Clarify the artifact path
template artifacts/gates/<scope>/<gate-name>/<attempt>/ by specifying that for
gates defined at the top-level (under the gates key rather than within scopes)
the <scope> placeholder should be the literal "top-level" (so
artifacts/gates/top-level/<gate-name>/<attempt>/), and update the docs where the
artifacts path and the gates vs scopes distinction are described (references:
the artifacts/gates/<scope>/<gate-name>/<attempt>/ template, the gates key, and
scopes key).

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clarify the criteria for choosing between raw.log and summary.txt.

The documentation states "store safe raw output as raw.log or summary.txt" but doesn't specify the selection criteria on this line. The AI summary mentions "transcript-safety" as the determining factor. Please add explicit criteria here to prevent implementation ambiguity.

📝 Suggested clarification
-3. For each run, capture duration and parse stdout/stderr into the shared Gate result envelope from `output-templates.md` (pytest when pytest-like, else generic). If the ledger is available, create `artifacts/gates/<scope>/<gate-name>/<attempt>/`; store safe raw output as `raw.log` or `summary.txt`, and reference it from the envelope/checkpoint.
+3. For each run, capture duration and parse stdout/stderr into the shared Gate result envelope from `output-templates.md` (pytest when pytest-like, else generic). If the ledger is available, create `artifacts/gates/<scope>/<gate-name>/<attempt>/`; store transcript-safe raw output as `raw.log`, otherwise store redacted output as `summary.txt`, and reference the artifact path from the envelope/checkpoint.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@skills/ship-it/phase-2-gates.md` at line 39, Clarify selection between
raw.log and summary.txt by stating that raw.log should be used when the run's
stdout/stderr is safe for inclusion in transcripts (no sensitive PII, secrets,
or unacceptable data), and summary.txt should be used when the output contains
sensitive or unsafe content—in that case store a redacted/sanitized summary or
structured excerpt suitable for sharing; keep the same artifact path pattern
artifacts/gates/<scope>/<gate-name>/<attempt>/ and ensure the
envelope/checkpoint references which file was stored and why (i.e., include a
small metadata flag like "transcript_safe": true/false).
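A sketch of that selection logic under the prompt's assumptions (helper name, the `summary` parameter, and the returned metadata keys are all illustrative, not an implemented API):

```python
from pathlib import Path


def store_gate_output(attempt_dir: Path, raw_output: str,
                      transcript_safe: bool, summary: str = "") -> dict:
    # Transcript-safe output keeps the full log as raw.log; otherwise only
    # a redacted summary is persisted, and the metadata flag records why.
    attempt_dir.mkdir(parents=True, exist_ok=True)
    if transcript_safe:
        path = attempt_dir / "raw.log"
        path.write_text(raw_output, encoding="utf-8")
    else:
        path = attempt_dir / "summary.txt"
        path.write_text(summary, encoding="utf-8")
    return {"artifact": str(path), "transcript_safe": transcript_safe}
```

The returned dict is what the envelope/checkpoint would reference so readers can see which file was stored and why.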

4. Autofix only when the envelope is `status: fail` and `environment_failure: false`: show envelope summary, run only that autofix, show diff, and ask before committing.
5. For `status: error`, `environment_failure: true`, or no `autofix`, prompt from the envelope (`summary`, failures, flags, action, raw-log reference) plus configured `failure.*` / `output.parser` context. Do not guess. In parallel, wait for siblings and surface all failure envelopes together.
6. After a fix, re-run the applicable gate before advancing. If the user explicitly proceeds without a passing gate, record that decision/risk.
6. After a fix, re-run the applicable gate before advancing. If the user explicitly proceeds without a passing gate, record that decision/risk beside the artifact path or unavailable reason.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clarify where and how to record the user's proceed decision.

The phrase "record that decision/risk beside the artifact path" is ambiguous about the storage mechanism. Please specify whether this should be:

  • A separate file in the artifact directory (e.g., decision.json)
  • A field in the checkpoint JSON
  • An entry in the gate envelope
  • Another mechanism

Clear specification is important for this ledger/persistence feature to ensure consistent implementation.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@skills/ship-it/phase-2-gates.md` at line 42, Replace the ambiguous phrase
with a concrete storage spec: state that the user's decision to proceed without
a passing gate must be recorded as a new field in the checkpoint JSON (add
"proceed_without_passing" to the checkpoint object), and describe its schema
minimally (user id, timestamp, reason, assessed_risk, and related_artifact_path)
so implementers know to persist decision alongside existing checkpoint data
rather than as a separate file or inside the gate envelope.
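For illustration, a checkpoint payload carrying the suggested field might look like this (every key inside `proceed_without_passing` is an assumption taken from the prompt's minimal schema, not an implemented format):

```python
import json

# Hypothetical checkpoint payload with the suggested decision record
# persisted alongside the existing checkpoint data.
checkpoint = {
    "step": "gate:pytest",
    "resume_hint": "re-run gate pytest before advancing",
    "proceed_without_passing": {
        "user": "example-user",
        "timestamp": "2026-05-10T12:00:00Z",
        "reason": "flaky integration test, tracked separately",
        "assessed_risk": "low",
        "related_artifact_path": "artifacts/gates/top-level/pytest/2/summary.txt",
    },
}
print(json.dumps(checkpoint, indent=2))
```

Keeping the decision inside the checkpoint (rather than a separate file) means resume logic already reading checkpoints sees it for free.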
