Cross-machine run ledger#24

Merged
KadenMc merged 11 commits into main from feat/cross-machine-ledger
May 26, 2026

@KadenMc KadenMc commented May 26, 2026

Summary

  • New canonical artifact: .aexp/ledger/<job_id>.json — sanitized projections of terminal-state signac runs, committed to git so every machine sees the same view after git pull. Auto-populated by a hook in aexp.runs.mark_status on every terminal-status transition (complete/failed/abandoned/stopped). Per-machine debris (absolute paths from tracker_log/events.jsonl, wandb offline-run dirs, user artifacts) stays in the gitignored .runs/workspace/.
  • New CLI verbs: aexp ledger promote <id> (manual one-shot) and aexp ledger backfill (migration tool — walks the local run store and promotes everything terminal). A transitional aexp runs-export-index is also included but ships with a deprecation warning.
  • New validator surface: aexp validate --strict-runs={error|warn|off}. Default error preserves 0.5 behavior. warn downgrades broken/empty-batch citations to warnings (exit 0); off skips existence checks entirely. Structural-shape checks always emit at error regardless.
  • New finding codes: finding.absent_run_citation (warning), finding.absent_batch_runs (warning), finding.no_run_store (warning, replaces a previously-silent tolerance branch).
  • New installed.json::machine_label field + aexp install --machine-label <name> flag. Default: short hostname. Sticky across re-installs. Used to tag registered_machine on ledger entries.
  • aexp install manages a .gitignore block via # agentic-experiments:begin/:end markers. Sets .aexp/* + !.aexp/runs-index/ + !.aexp/ledger/ so per-machine state stays ignored while the cross-machine shared subdirs are committable. Warns when a legacy .aexp/ pattern lives outside the managed block.

Internally: aexp.runs.mark_status becomes the single hook point for all four terminal transitions (lifecycle's complete/failed now route through it with set_ended_at=False to dodge a Windows file-lock race with stop_queued). aexp.linking.link_to_experiment re-promotes when the job is already terminal, so the ledger stays in sync after a retroactive re-link.

What the user-facing change actually fixes

Today, on a laptop authoring against cluster-side runs, aexp validate reports finding.broken_run_citation (error) for every citation whose run exists on the cluster but not on the laptop. Agents reading the red output cannot tell "broken citation" from "ledger lives elsewhere" and will try to "fix" things that aren't broken. After this PR, once each machine has run aexp ledger backfill && git push and the other machines have pulled, the validator resolves every cited job with 0 errors and 0 warnings.

Test plan

  • poetry install to refresh poetry env's metadata to 0.6.0 (the test_package_version_matches_pyproject test currently fails because of stale 0.2.1 metadata in the env — known pre-existing condition per CLAUDE.local.md).
  • poetry run pytest — expect 521 passed, 5 skipped (platform-specific), 2 deselected (test_cli.py::test_package_version_matches_pyproject + test_queue.py::test_stop_queued_kills_running_subprocess_via_sigterm, both known-flaky pre-existing).
  • poetry run ruff check src/ tests/ — clean.
  • On a real electricrag install: pip install -e <agentic-experiments> + aexp install --force + aexp ledger backfill on the cluster, git push, git pull on the laptop, aexp validate should resolve the 8 currently-failing F004/F005 citations to 0 errors + 0 warnings.

Commit ladder

Reviewable commit-by-commit, ordered by dependency:

  1. feat(runs): unify terminal-status writes through mark_status() — prerequisite for the ledger hook
  2. feat(validate): add --strict-runs={error|warn|off} flag
  3. feat(install): add machine_label to installed.json + .gitignore management
  4. feat(runs-index): aexp runs-export-index verb + three-state validator
  5. feat(ledger): aexp.ledger module + promote/backfill verbs + auto-hook
  6. feat(validate): read from ledger when present; deprecate runs-index
  7. fix(ledger): tracker projection handles signac synced collections
  8. test(ledger): comprehensive coverage for aexp.ledger module
  9. test+docs: end-to-end cross-machine smoke + slash-command code list update
  10. docs: queue.md cross-machine workflow + 0.6.0 release notes
  11. docs+code: scrub plan-internal naming from user-facing surfaces

Migration (for current consumers, 0.5 → 0.6)

  1. pip install -e <agentic-experiments> in each env.
  2. aexp install --force in each consumer (writes the new gitignore block + materializes machine_label). If a gitignore_migration_warning is emitted, delete the legacy .aexp/ line from .gitignore manually — the new block uses .aexp/* so the existing parent-exclude needs to come out.
  3. aexp ledger backfill on each machine that has terminal-state runs.
  4. Commit .gitignore and .aexp/ledger/, push, then pull on every other consumer.

See docs/queue.md § cross-machine for the full runbook.

KadenMc added 11 commits May 25, 2026 22:24
Route the lifecycle context manager's failed/complete status writes
through mark_status() so all four terminal transitions (complete,
failed, abandoned, stopped) flow through one function. This is the
prerequisite for the ledger-promotion hook (Phase 2 of the
cross-machine-ledger plan) — there's now exactly one place to add it.

mark_status() gains a `set_ended_at: bool = True` keyword. Default
preserves the historical behavior (status write + ended_at setdefault)
so existing callers in aexp.queue work unchanged. run_lifecycle passes
False because the lifecycle's except/else branches already manage
ended_at themselves; routing through mark_status with set_ended_at=True
would add an extra doc-load inside the Windows file-lock race window
during stop_queued shutdown and cause test_stop_queued_kills_running_*
to fail under contention.

The lifecycle's `_should_write_terminal_status` gate and outer
try/PermissionError tolerance are preserved. Behavior is observationally
identical to pre-commit on the happy path; the only difference is the
status write goes through one function instead of inline _safe_doc_set.

Adds TERMINAL_STATUSES module constant in runs.py for reuse.
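
A minimal sketch of the `set_ended_at` keyword described above, assuming the job document behaves like a mutable mapping. The injectable `now` parameter and the choice to stamp `ended_at` only for terminal statuses are assumptions made for illustration; the real `aexp.runs.mark_status` may differ.

```python
import time
from typing import Any, Callable, MutableMapping

# The four terminal transitions named in this commit message.
TERMINAL_STATUSES = frozenset({"complete", "failed", "abandoned", "stopped"})


def mark_status(
    doc: MutableMapping[str, Any],
    status: str,
    *,
    set_ended_at: bool = True,
    now: Callable[[], float] = time.time,  # hypothetical: injectable clock for tests
) -> None:
    """Write the status and, by default, stamp ended_at without clobbering it."""
    doc["status"] = status
    # Assumption: ended_at is only meaningful for terminal statuses.
    if set_ended_at and status in TERMINAL_STATUSES:
        # setdefault keeps an ended_at that an earlier writer already recorded.
        doc.setdefault("ended_at", now())
```

A caller that manages `ended_at` itself, as run_lifecycle is described as doing, would pass `set_ended_at=False` and get the status write alone.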

Related: docs/reference/process/aexp_friction_cross_machine_run_ledger.md
in the electricrag repo, v4 plan section 2.0.
Phase 1A of the cross-machine-ledger plan: a manual escape hatch for
the validator's "this citation is broken" / "this run lives on another
machine" conflation. Before this commit, `aexp validate` had no way to
distinguish the two cases — every finding citing a cluster-registered
run looked broken on the laptop.

New StrictRuns Literal in aexp.validate. validate_repo and
_check_finding_citations both take a strict_runs kwarg (default "error"
preserves pre-0.6 behavior). When "warn", finding.broken_run_citation
and finding.empty_batch are constructed with severity="warning" so the
validator exits 0. When "off", existence checks are skipped entirely
but structural-shape checks (32-hex id format, mapping shape, missing
experiment_id) still emit errors — those reflect malformed citations,
not ledger gaps.

CLI surface: `aexp validate --strict-runs=warn`. Invalid values exit 2
with a clear message.
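
The severity rules above can be sketched as follows. This is an illustration only: `check_citation`, the `"structural"` label, and the tuple return shape are hypothetical, and only the 32-hex format check stands in for the structural-shape checks.

```python
import re
from typing import List, Optional, Set, Tuple

VALID_STRICT_RUNS = ("error", "warn", "off")
HEX32 = re.compile(r"[0-9a-f]{32}")


def existence_severity(strict_runs: str) -> Optional[str]:
    """Map the flag value to a severity; None means 'skip existence checks'."""
    if strict_runs not in VALID_STRICT_RUNS:
        raise ValueError(f"--strict-runs must be one of {VALID_STRICT_RUNS}")
    return {"error": "error", "warn": "warning", "off": None}[strict_runs]


def check_citation(
    job_id: str, known_ids: Set[str], strict_runs: str = "error"
) -> List[Tuple[str, str]]:
    """Return (kind, severity) pairs for one cited job id."""
    severity = existence_severity(strict_runs)
    # Structural problems are always errors, whatever the flag says.
    if not HEX32.fullmatch(job_id):
        return [("structural", "error")]
    # Existence is only checked when the flag has not switched it off.
    if severity is not None and job_id not in known_ids:
        return [("finding.broken_run_citation", severity)]
    return []
```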

Tests in tests/test_validate_cross_machine.py — 10 cases parametrized
over the three severity values for both citation types, plus an
end-to-end CLI invalid-value test.

Docs: new subsection in docs/queue.md § cross-machine-sync-workflow
explaining when to reach for --strict-runs.

Related: docs/reference/process/aexp_friction_cross_machine_run_ledger.md
in the electricrag repo, v4 plan section Phase 1A.
…ement

Phase 1B prerequisite of the cross-machine-ledger plan. Two
mechanically-independent additions that need to land together so the
later runs-export-index + ledger work has somewhere to write:

1) machine_label field on .aexp/installed.json. Used by
   `aexp runs-export-index` (next commit) and Phase 2's ledger entries
   to tag which install registered a run. Default: short hostname
   (socket.gethostname().split(".")[0]). Override: `aexp install
   --machine-label <name>`, or edit installed.json directly (the file
   is per-machine, gitignored, so direct edits are safe and sticky).
   Sticky across re-installs that don't pass --machine-label. New
   read_machine_label() helper in aexp.utils.paths returns the value
   with hostname fallback for older markers.

2) `.gitignore` block-merge managed by `aexp install`. Writes a
   begin/end-markered block setting `.aexp/*` + `!.aexp/runs-index/`
   + `!.aexp/ledger/` so per-machine files stay ignored while the
   cross-machine shared subdirs are committable. Uses `#`-prefixed
   markers (HTML-comment style wouldn't be recognized as gitignore
   comments). New block_merge_gitignore() helper structurally mirrors
   block_merge_markdown().

   The classic `.aexp/` rule (excluding the whole directory) makes
   git's `!.aexp/runs-index/` re-include rule a no-op — git can't
   re-include a child of an excluded parent. Consumers that already
   have such a rule outside our block get a clear migration warning
   ("gitignore_migration_warning" action) telling them to delete the
   legacy line. We don't auto-delete user-authored content.
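
A rough sketch of the block-merge and migration warning described in item 2, operating on the `.gitignore` text as a string. The marker spelling follows the `# agentic-experiments:begin/:end` form from the PR summary; the function signature, the return shape, and the set of patterns treated as "legacy" are assumptions, not the actual `block_merge_gitignore()` helper.

```python
from typing import List, Tuple

BEGIN = "# agentic-experiments:begin"
END = "# agentic-experiments:end"
BLOCK_BODY = [".aexp/*", "!.aexp/runs-index/", "!.aexp/ledger/"]
# Assumption: these spellings all exclude the whole directory.
LEGACY_PATTERNS = {".aexp", ".aexp/", "/.aexp", "/.aexp/"}


def block_merge_gitignore(existing: str) -> Tuple[str, List[str]]:
    """Return (new_text, legacy_lines_found_outside_the_managed_block)."""
    lines = existing.splitlines()
    block = [BEGIN, *BLOCK_BODY, END]
    has_block = BEGIN in lines and END in lines and lines.index(BEGIN) < lines.index(END)
    if has_block:
        start, stop = lines.index(BEGIN), lines.index(END)
        outside = lines[:start] + lines[stop + 1:]
        # Replace only the managed span; user content around it is untouched.
        merged = lines[:start] + block + lines[stop + 1:]
    else:
        outside = lines
        spacer = [""] if lines and lines[-1] != "" else []
        merged = lines + spacer + block
    # Never auto-delete user-authored rules; only report them.
    warnings = [line for line in outside if line.strip() in LEGACY_PATTERNS]
    return "\n".join(merged) + "\n", warnings
```

Running the function on its own output returns the same text, which is the idempotency property the tests in this commit are described as covering.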

CLI: `aexp install --machine-label <name>` flag plumbs through to
install_scaffold. _print_actions surfaces the new merged_gitignore
and gitignore_migration_warning kinds. 10 new tests in test_install.py
cover the gitignore block-merge, idempotency, user-content
preservation, migration warning, and the machine_label seed/override/
sticky/helper behaviors.
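
The machine_label precedence from item 1 (explicit flag, then the sticky value in installed.json, then short hostname) can be sketched like this. `resolve_machine_label` and its parameters are hypothetical names; only the `socket.gethostname().split(".")[0]` default comes from the commit message.

```python
import socket
from typing import Any, Mapping, Optional


def resolve_machine_label(
    cli_value: Optional[str],
    marker: Mapping[str, Any],
    hostname: Optional[str] = None,  # hypothetical: injectable for tests
) -> str:
    """Pick the label: CLI flag wins, then the stored value, then short hostname."""
    if cli_value:
        return cli_value
    stored = marker.get("machine_label")
    if stored:
        # Sticky: a re-install without --machine-label keeps the stored label.
        return str(stored)
    host = hostname if hostname is not None else socket.gethostname()
    # Short hostname: drop the domain part of a fully qualified name.
    return host.split(".")[0]
```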

Related: docs/reference/process/aexp_friction_cross_machine_run_ledger.md
in the electricrag repo, v4 plan section Phase 1B (steps 1, 2).
Phase 1B of the cross-machine-ledger plan. Each machine periodically
dumps a JSON list of its terminal-state runs to
`.aexp/runs-index/<machine_label>.json` and commits it. Other machines
pull and the validator unions these indexes to gain a three-state
vocabulary for finding citations:

- here: citation resolves in local .runs/workspace/ — clean
- elsewhere: citation in known_elsewhere (union of all index files) but
  not local — finding.absent_run_citation (warning, regardless of
  --strict-runs; the citation is good, the data just lives on a
  different ledger)
- broken: in neither — finding.broken_run_citation (severity per
  --strict-runs)

Same three-state treatment for batch citations:
finding.absent_batch_runs warning when the selector matches only
elsewhere-indexed runs.
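
As an illustration of the three-state vocabulary for single-run citations, here is a small sketch. `classify_citation` and its return shape are hypothetical; the caller is assumed to have already turned `--strict-runs` into a `broken_severity` of "error" or "warning".

```python
from typing import Optional, Set, Tuple


def classify_citation(
    job_id: str,
    local_ids: Set[str],
    known_elsewhere: Set[str],
    broken_severity: str = "error",
) -> Optional[Tuple[str, str]]:
    """Return None when clean, else a (finding_code, severity) pair."""
    if job_id in local_ids:
        # here: resolves in the local run store
        return None
    if job_id in known_elsewhere:
        # elsewhere: the citation is good, the data lives on another machine
        return ("finding.absent_run_citation", "warning")
    # broken: neither local nor listed in any committed index
    return ("finding.broken_run_citation", broken_severity)
```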

Replaces the silent tolerance branch at validate.py:341-342 with an
explicit finding.no_run_store warning, emitted once per validate run
when neither the store nor any index exists. The condition was
previously invisible and easy to miss on a fresh checkout.

New module aexp.runs_index:
- IndexEntry / IndexFile typed dicts (schema_version 1)
- build_index, export_index, load_all_indexes, collect_known_elsewhere
- Stable byte-output: entries sorted by job_id so repeat exports
  produce identical content (excepting exported_at timestamp)
- Skip non-terminal jobs (queued/running/created stay machine-local)
- Tolerant of malformed index files at load time — corrupt index
  shouldn't break the validator
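
The byte-stability property (sort entries by job_id, skip non-terminal jobs) might look roughly like the sketch below. The entry and file field names here are placeholders chosen for illustration and are not the actual IndexEntry / IndexFile schema.

```python
import json
from typing import Any, Dict, Iterable, List

TERMINAL_STATUSES = frozenset({"complete", "failed", "abandoned", "stopped"})


def build_index(
    jobs: Iterable[Dict[str, Any]], machine_label: str, exported_at: str
) -> Dict[str, Any]:
    """Collect terminal jobs into a deterministic, sorted index body."""
    entries: List[Dict[str, Any]] = [
        {"job_id": job["job_id"], "status": job["status"]}
        for job in jobs
        if job["status"] in TERMINAL_STATUSES  # queued/running stay machine-local
    ]
    entries.sort(key=lambda entry: entry["job_id"])
    return {
        "schema_version": 1,
        "machine_label": machine_label,  # placeholder field name
        "exported_at": exported_at,
        "entries": entries,
    }


def dump_index(index: Dict[str, Any]) -> str:
    """Serialize with sorted keys so repeated exports diff cleanly in git."""
    return json.dumps(index, indent=2, sort_keys=True) + "\n"
```

With the timestamp held fixed, two exports over the same jobs in any iteration order serialize to the same bytes, which keeps committed index files from churning.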

CLI: `aexp runs-export-index [--as <label>] [--out <path>]`. Default
output path is `<repo>/.aexp/runs-index/<machine_label>.json`. `--as`
overrides both filename and body field.

21 tests in test_validate_cross_machine.py covering: three-state
behavior on missing/elsewhere/local jobs and batches,
finding.no_run_store warning emission, runs_index module behaviors
(skip non-terminal, default path, override label, malformed file
tolerance, ledger_machine tagging, byte-stable output), CLI smoke.

Phase 2's universal ledger (next commit) supersedes this mechanism;
this module stays through one minor-version deprecation window after
Phase 2 ships. Index files double as the migration tool for
`aexp ledger backfill`.

Related: docs/reference/process/aexp_friction_cross_machine_run_ledger.md
in the electricrag repo, v4 plan section Phase 1B (steps 3-8).
Phase 2 of the cross-machine-ledger plan (sections 2.1-2.3 of the v4
plan). Introduces the universal cross-machine ledger as a sanitized
projection of terminal-state signac jobs, stored at
`.aexp/ledger/<job_id>.json` and committed to git. Every machine that
pulls sees the same ledger, so finding citations resolve consistently
everywhere — dissolving the "elsewhere vs broken" distinction that
Phase 1B's three-state vocabulary worked around.

New module `aexp.ledger`:
- LedgerEntry typed dict (schema_version=1)
- project_to_ledger_entry — allowlist-based sanitization (statepoint,
  run_link, status, ended_at, wallclock_s, tracker pointers, code
  commit, registered_machine, promoted_at). Explicitly excludes
  per-machine debris: absolute paths in tracker_log/events.jsonl,
  wandb offline run dirs, user-written artifacts. Tracker block has
  pointers (backend, run_id, url, group, project) but NOT the events.
- promote_to_ledger — idempotent atomic write. Re-promotion overwrites,
  safe under retroactive `aexp link` corrections.
- backfill_ledger — walks the local run store and promotes every
  terminal-state job not yet in `.aexp/ledger/`. Returns (promoted,
  skipped). One-job failures don't kill the whole backfill.
- ledger_path / load_ledger_entry / list_ledger_job_ids helpers for
  Commit 6's validator switch.
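
To make the allowlist idea concrete, here is a simplified sketch of projection plus an atomic, idempotent write. The signatures are assumptions (the real functions take a signac job rather than plain dicts), and the atomic write uses a temp file plus `os.replace`, which may not be how `aexp` implements it.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Dict, Mapping

TERMINAL_STATUSES = frozenset({"complete", "failed", "abandoned", "stopped"})
DOC_ALLOWLIST = ("run_link", "ended_at", "wallclock_s")
TRACKER_POINTERS = ("backend", "run_id", "url", "group", "project")


def project_to_ledger_entry(
    job_id: str,
    statepoint: Mapping[str, Any],
    doc: Mapping[str, Any],
    machine_label: str,
    promoted_at: str,
) -> Dict[str, Any]:
    """Copy only allowlisted fields; anything else in doc is dropped."""
    status = doc.get("status")
    if status not in TERMINAL_STATUSES:
        raise ValueError(f"job {job_id} is not terminal: {status!r}")
    entry: Dict[str, Any] = {
        "schema_version": 1,
        "job_id": job_id,
        "statepoint": dict(statepoint),
        "status": status,
        "registered_machine": machine_label,
        "promoted_at": promoted_at,
    }
    for key in DOC_ALLOWLIST:
        if key in doc:
            entry[key] = doc[key]
    raw_tracker = doc.get("tracker") or {}
    # Keep pointers only; config blobs and event logs can embed absolute paths.
    tracker = {k: raw_tracker[k] for k in TRACKER_POINTERS if k in raw_tracker}
    if tracker:
        entry["tracker"] = tracker
    return entry


def promote_to_ledger(ledger_dir: Path, entry: Mapping[str, Any]) -> Path:
    """Write <job_id>.json via temp file + rename so readers never see a partial file."""
    ledger_dir.mkdir(parents=True, exist_ok=True)
    target = ledger_dir / f"{entry['job_id']}.json"
    fd, tmp_name = tempfile.mkstemp(dir=ledger_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(entry, handle, indent=2, sort_keys=True)
        handle.write("\n")
    os.replace(tmp_name, target)  # overwrites on re-promotion
    return target
```

An allowlist fails closed: a new per-machine field added to the job document later stays out of the committed ledger unless someone opts it in.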

Auto-promote hook in `aexp.runs.mark_status`: on every terminal
transition (complete/failed/abandoned/stopped), call promote_to_ledger.
Lazy-imported to break the ledger -> runs circular dependency. Wrapped
in try/except + stderr log; promotion failure never crashes the
lifecycle. backfill_ledger recovers any missed jobs.
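
The failure-tolerant hook can be sketched as below. To keep the example self-contained, the promotion function is passed in as a `promote` callable (a hypothetical parameter); in the real module it is described as a lazy import inside `mark_status`.

```python
import sys
from typing import Any, Callable, MutableMapping

TERMINAL_STATUSES = frozenset({"complete", "failed", "abandoned", "stopped"})


def mark_status(
    doc: MutableMapping[str, Any],
    status: str,
    promote: Callable[[MutableMapping[str, Any]], None],
) -> None:
    """Write the status first, then attempt best-effort ledger promotion."""
    doc["status"] = status
    if status not in TERMINAL_STATUSES:
        return
    try:
        promote(doc)
    except Exception as exc:
        # Promotion must never crash the lifecycle; a later backfill
        # can pick up any job that was missed here.
        print(f"ledger promotion failed: {exc!r}", file=sys.stderr)
```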

`aexp.linking.link_to_experiment` integration: after write_run_link,
if the job is terminal, re-promote so the ledger entry's run_link
field reflects the new linkage. Without this the ledger projection
would lag job.doc until the next backfill.

CLI verbs via new `ledger_app` sub-typer:
- `aexp ledger promote <job_id> [--machine-label X]`
- `aexp ledger backfill [--machine-label X] [--overwrite]`

End-to-end smoke verified: create_run + mark_status('complete') auto-
writes a ledger entry with the expected sanitized shape (status,
statepoint, run_link, code_commit, registered_machine, etc.). All
existing tests still pass.

Related: docs/reference/process/aexp_friction_cross_machine_run_ledger.md
in the electricrag repo, v4 plan sections 2.1-2.3.
Phase 2 sections 2.4-2.5 of the cross-machine-ledger plan. The
validator now treats `.aexp/ledger/<job_id>.json` files as a
source-of-truth for "this run exists":

- known_job_ids = (ledger_ids | local_signac_ids). A run in the ledger
  on the laptop counts as "here" even when there's no local workspace
  for it. After every machine has run `aexp ledger backfill && git
  push && git pull`, the laptop validator resolves all cited jobs
  cleanly with no warnings.

- known_elsewhere (Phase 1B index union) is now scoped to ONLY jobs
  absent from the ledger. Entries in both ledger and index are
  ledger-wins; the index becomes a transitional bridge for runs not
  yet backfilled.

- has_authority tracks whether ANY source (ledger, local store, or
  index) is available. The broken_run_citation error fires when a
  citation matches none of them but at least one is present (so the
  validator can speak with authority); finding.no_run_store warning
  fires when nothing is available.
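
The resolution order described in the three bullets above can be sketched as follows. `validate_citations` is a hypothetical helper: each source is modeled as an optional set (None meaning "this source does not exist on disk"), and severities are left out for brevity.

```python
from typing import Iterable, List, Optional, Set, Tuple


def validate_citations(
    cited: Iterable[str],
    ledger_ids: Optional[Set[str]],
    local_ids: Optional[Set[str]],
    index_ids: Optional[Set[str]],
) -> List[Tuple[str, Optional[str]]]:
    """Return (finding_code, job_id) pairs; job_id is None for repo-level findings."""
    has_authority = any(
        source is not None for source in (ledger_ids, local_ids, index_ids)
    )
    if not has_authority:
        # Nothing to check against: say so once instead of flagging every citation.
        return [("finding.no_run_store", None)]
    ledger = ledger_ids or set()
    known_job_ids = ledger | (local_ids or set())
    # The index only speaks for jobs the ledger does not already cover.
    known_elsewhere = (index_ids or set()) - ledger
    findings: List[Tuple[str, Optional[str]]] = []
    for job_id in cited:
        if job_id in known_job_ids:
            continue
        if job_id in known_elsewhere:
            findings.append(("finding.absent_run_citation", job_id))
        else:
            findings.append(("finding.broken_run_citation", job_id))
    return findings
```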

`aexp runs-export-index` prints a deprecation warning pointing at
`aexp ledger backfill`. The mechanism stays for one minor-version
window so consumers mid-upgrade can keep using it.

7 new tests in test_validate_cross_machine.py cover:
- citation to job in ledger → clean
- citation to job in BOTH ledger and index → ledger wins, no warning
- mark_status hook auto-promotes terminal jobs
- backfill_ledger promotes only terminal jobs
- backfill idempotency
- aexp link re-promotes terminal jobs (ledger.run_link updates)
- finding.no_run_store fires only when no authority source exists
- promotion sanitization drops abs paths + tracker config blob

Related: docs/reference/process/aexp_friction_cross_machine_run_ledger.md
in the electricrag repo, v4 plan sections 2.4-2.5.
signac wraps nested dict values in synced_collections types that
aren't strict `dict` subclasses, so `isinstance(raw, dict)` was False
and the tracker block fell through to {}. Use duck-typed `dict(raw)`
conversion instead, which both accepts Mapping-shaped wrappers and
strips the in-memory synced overhead.
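
The failure mode is easy to reproduce without signac. In this sketch `SyncedDictStandIn` is a hypothetical stand-in for a synced-collection wrapper (a Mapping that is not a `dict` subclass), and the two helper functions are illustrative rather than the actual projection code.

```python
from collections.abc import Mapping
from typing import Any, Dict, Iterator


class SyncedDictStandIn(Mapping):
    """Mapping-shaped wrapper that deliberately does not subclass dict."""

    def __init__(self, data: Dict[str, Any]) -> None:
        self._data = dict(data)

    def __getitem__(self, key: str) -> Any:
        return self._data[key]

    def __iter__(self) -> Iterator[str]:
        return iter(self._data)

    def __len__(self) -> int:
        return len(self._data)


def tracker_block_strict(raw: Any) -> Dict[str, Any]:
    """Old behavior: the isinstance check rejects Mapping wrappers."""
    return dict(raw) if isinstance(raw, dict) else {}


def tracker_block_duck_typed(raw: Any) -> Dict[str, Any]:
    """New behavior: accept anything dict() can convert, else fall back to {}."""
    try:
        return dict(raw)
    except (TypeError, ValueError):
        return {}
```

The duck-typed version also returns a plain `dict`, so nothing downstream holds on to the wrapper object.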

Should have been in commit e79735e; surfaced by the sanitization test
in commit d68358b.
Phase 2.6 of the cross-machine-ledger plan. Adds tests/test_ledger.py
with 21 focused cases covering:

Projection shape:
- required fields present (schema_version, job_id, statepoint, status,
  registered_machine, promoted_at)
- statepoint values round-trip
- run_link preserved when present
- wallclock_s / ended_at carried through when set
- optional fields absent when source has them missing
- non-terminal status raises ValueError
- code_commit / code_dirty lifted from statepoint to top-level

Sanitization (per-machine debris must not leak):
- tracker.config and tracker.init_kwargs dropped (they embed abs paths)
- tracker pointers (backend/run_id/url/group/project) preserved
- allowlist-based: arbitrary doc fields don't bleed into the projection
- Explicit assertion that the repo's absolute path never appears
  anywhere in the JSON

Idempotency:
- two consecutive promotes produce byte-identical entries modulo promoted_at
- atomic_write smoke check

Backfill:
- only terminal jobs promoted, non-terminal skipped
- idempotent (skips already-present)
- --overwrite re-promotes
- no run store → returns (0, 0) gracefully
- --machine-label overrides registered_machine

Defensive paths:
- load_ledger_entry returns None on missing/malformed
- list_ledger_job_ids returns empty set when dir doesn't exist
- mark_status's hook tolerates promote_to_ledger raising — status
  still gets written, error goes to stderr

Integration tests for the ledger-aware validator + auto-promote hook
live in test_validate_cross_machine.py (added in commits 4 + 6); this
file is the deep module-level coverage.

Related: docs/reference/process/aexp_friction_cross_machine_run_ledger.md
in the electricrag repo, v4 plan section Phase 2.6.
Phase 2.7 of the cross-machine-ledger plan. Three changes:

1. docs/queue.md § cross-machine: rewritten end-to-end. Documents
   the three-source-of-truth model (ledger > local store > index),
   the steady-state workflow (auto-promote on mark_status), the
   --strict-runs escape hatch for the 0.5→0.6 transition window,
   and the migration runbook including the gitignore-migration
   warning handling. New machine_label subsection explains the HPC
   hostname-noise mitigation.

2. pyproject.toml: bump to 0.6.0 (semver minor — new public surface:
   ledger module, ledger/runs-export-index CLI verbs, --strict-runs
   flag, machine_label install marker field).

3. CHANGELOG.md: new [0.6.0] entry under Keep-a-Changelog format
   covering Added / Changed / Fixed / Migration sections. Cites
   the friction doc in the electricrag repo for the design
   rationale.

Note: the in-poetry-env test_package_version_matches_pyproject test
will fail until you re-run `poetry install` (poetry env's installed
metadata is stale at 0.2.1 per existing CLAUDE.local.md guidance —
that's a known pre-existing condition). Deselect or refresh the env;
either is fine.

Related: docs/reference/process/aexp_friction_cross_machine_run_ledger.md
in the electricrag repo, v4 plan section Phase 2.7.
…pdate

End-to-end smoke test (tests/test_cross_machine_e2e.py): simulates the
exact TL;DR scenario from the friction doc — two consumer repos
sharing a bare remote, cluster registers + auto-promotes jobs, pushes,
laptop pulls, validator resolves the citation cleanly. Catches the
whole-stack failure modes that the unit tests can't (gitignore
block-merge actually flowing through, real git push/pull cycles,
install-on-cluster-then-install-on-laptop interaction).

Two cases:

- test_e2e_cluster_backfill_then_laptop_validate_is_clean: happy path.
  Cluster install + register + mark_status (auto-promote fires) +
  commit + push. Laptop pull + install + cite. Validator: 0 errors,
  0 warnings.

- test_e2e_strict_runs_warn_unblocks_pre_ledger_state: escape-hatch
  flow. Bogus citation on a fresh laptop checkout; --strict-runs=warn
  lets the validator exit 0. Defensive about list_kb_artifacts not
  always picking up minimally-formatted findings: asserts
  warn-vs-error consistency rather than a specific count.

Slash command (src/aexp/slash_commands/aexp-validate.md): updated
the code list with the three new codes (finding.absent_run_citation,
finding.absent_batch_runs, finding.no_run_store) and their fix
suggestions. New "Cross-machine escape hatch" section explains
`--strict-runs=warn` and points at `aexp ledger backfill` as the
long-term fix.
Replaces every "Phase 1A / 1B / 2 / 2.0 / 2.5 / v4 plan" reference in
user-visible content with descriptive language naming the actual
mechanism. Plan numbering was only meaningful relative to the design
doc; a user reading the CHANGELOG, --help text, slash command, or
module docstring shouldn't need that doc to understand what each piece
does.

Touched user-facing:
- CHANGELOG.md [0.6.0] section — "Cross-machine ledger (Phase 2)" ->
  "Cross-machine run ledger"; "Phase 1B index file" -> "per-machine
  index file"; "Phase 1B, transitional" -> "transitional"; removed
  the trailing pointer to the friction doc in another repo.
- docs/queue.md § cross-machine: same treatment for the
  three-source-of-truth list, the steady-state workflow heading, and
  the migration heading.
- CLI help: --machine-label and --strict-runs help strings; the
  runs-export-index docstring + deprecation message; ledger sub-app
  + promote + backfill docstrings.
- aexp.ledger / aexp.runs_index module docstrings.
- Code comments in install.py, linking.py, runs.py, validate.py,
  utils/paths.py.
- Test-file docstrings + section comments.

Behavior unchanged; this is pure naming. Test suite still passes
(139 in the cross-machine + install + ledger + e2e subset checked
post-cleanup).

Branch also renamed locally from feat/cross-machine-ledger-v4 to
feat/cross-machine-ledger to match.
@KadenMc KadenMc merged commit 75c4eda into main May 26, 2026
6 checks passed
@KadenMc KadenMc deleted the feat/cross-machine-ledger branch May 26, 2026 15:18