Cross-machine run ledger #24
Merged
Route the lifecycle context manager's failed/complete status writes through mark_status() so all four terminal transitions (complete, failed, abandoned, stopped) flow through one function. This is the prerequisite for the ledger-promotion hook (Phase 2 of the cross-machine-ledger plan): there is now exactly one place to add it.

mark_status() gains a `set_ended_at: bool = True` keyword. The default preserves the historical behavior (status write + ended_at setdefault) so existing callers in aexp.queue work unchanged. run_lifecycle passes False because the lifecycle's except/else branches already manage ended_at themselves; routing through mark_status with set_ended_at=True would add an extra doc-load inside the Windows file-lock race window during stop_queued shutdown and cause test_stop_queued_kills_running_* to fail under contention.

The lifecycle's `_should_write_terminal_status` gate and outer try/PermissionError tolerance are preserved. Behavior is observationally identical to pre-commit on the happy path; the only difference is that the status write goes through one function instead of an inline _safe_doc_set.

Adds a TERMINAL_STATUSES module constant in runs.py for reuse.

Related: docs/reference/process/aexp_friction_cross_machine_run_ledger.md in the electricrag repo, v4 plan section 2.0.
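A minimal sketch of the `set_ended_at` switch described in this commit, for orientation only. It uses a plain dict as a stand-in for the job document; the real `mark_status()` signature, and whether `ended_at` is only stamped for terminal statuses, are assumptions here.

```python
from datetime import datetime, timezone

# The four terminal statuses named in the commit message.
TERMINAL_STATUSES = frozenset({"complete", "failed", "abandoned", "stopped"})


def mark_status(doc: dict, status: str, *, set_ended_at: bool = True) -> None:
    """Write the status; optionally fill ended_at without overwriting it.

    `doc` is a hypothetical stand-in for the job document.
    """
    doc["status"] = status
    if set_ended_at and status in TERMINAL_STATUSES:
        # setdefault keeps an ended_at that an earlier writer already recorded.
        doc.setdefault("ended_at", datetime.now(timezone.utc).isoformat())


# Queue-style caller: default behavior, ended_at is filled in.
queued_doc = {}
mark_status(queued_doc, "abandoned")

# Lifecycle-style caller: manages ended_at itself, so it opts out.
lifecycle_doc = {"ended_at": "2024-01-01T00:00:00+00:00"}
mark_status(lifecycle_doc, "complete", set_ended_at=False)
```

With the default, existing callers see no change; the opt-out only skips the `ended_at` handling, the status write itself is identical.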
Phase 1A of the cross-machine-ledger plan: a manual escape hatch for the validator's "this citation is broken" / "this run lives on another machine" conflation. Before this commit, `aexp validate` had no way to distinguish the two cases: every finding citing a cluster-registered run looked broken on the laptop.

New StrictRuns Literal in aexp.validate. validate_repo and _check_finding_citations both take a strict_runs kwarg (default "error" preserves pre-0.6 behavior). When "warn", finding.broken_run_citation and finding.empty_batch are constructed with severity="warning" so the validator exits 0. When "off", existence checks are skipped entirely but structural-shape checks (32-hex id format, mapping shape, missing experiment_id) still emit errors; those reflect malformed citations, not ledger gaps.

CLI surface: `aexp validate --strict-runs=warn`. Invalid values exit 2 with a clear message.

Tests in tests/test_validate_cross_machine.py: 10 cases parametrized over the three severity values for both citation types, plus an end-to-end CLI invalid-value test.

Docs: new subsection in docs/queue.md § cross-machine-sync-workflow explaining when to reach for --strict-runs.

Related: docs/reference/process/aexp_friction_cross_machine_run_ledger.md in the electricrag repo, v4 plan section Phase 1A.
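The three `--strict-runs` values can be read as a mapping from flag value to the severity of existence findings. The sketch below illustrates that mapping; `existence_severity` and `exit_code` are hypothetical helper names, and the exit code of 1 for errors is an assumption (the commit only states that warnings exit 0 and invalid flag values exit 2).

```python
from typing import List, Literal, Optional

StrictRuns = Literal["error", "warn", "off"]


def existence_severity(strict_runs: str) -> Optional[str]:
    """Severity for broken-citation / empty-batch findings.

    Returns None when existence checks are skipped entirely ("off").
    Structural-shape checks are not covered here; per the commit message
    they always emit errors.
    """
    if strict_runs == "error":
        return "error"
    if strict_runs == "warn":
        return "warning"
    if strict_runs == "off":
        return None
    raise ValueError(f"invalid --strict-runs value: {strict_runs!r}")


def exit_code(severities: List[str]) -> int:
    """Assumed convention: non-zero only when at least one error was emitted."""
    return 1 if "error" in severities else 0


for mode in ("error", "warn", "off"):
    print(mode, "->", existence_severity(mode))
```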
…ement
Phase 1B prerequisite of the cross-machine-ledger plan. Two
mechanically-independent additions that need to land together so the
later runs-export-index + ledger work has somewhere to write:
1) machine_label field on .aexp/installed.json. Used by
`aexp runs-export-index` (next commit) and Phase 2's ledger entries
to tag which install registered a run. Default: short hostname
(socket.gethostname().split(".")[0]). Override: `aexp install
--machine-label <name>`, or edit installed.json directly (the file
is per-machine, gitignored, so direct edits are safe and sticky).
Sticky across re-installs that don't pass --machine-label. New
read_machine_label() helper in aexp.utils.paths returns the value
with hostname fallback for older markers.
2) `.gitignore` block-merge managed by `aexp install`. Writes a
begin/end-markered block setting `.aexp/*` + `!.aexp/runs-index/`
+ `!.aexp/ledger/` so per-machine files stay ignored while the
cross-machine shared subdirs are committable. Uses `#`-prefixed
markers (HTML-comment style wouldn't be recognized as gitignore
comments). New block_merge_gitignore() helper structurally mirrors
block_merge_markdown().
The classic `.aexp/` rule (excluding the whole directory) makes
git's `!.aexp/runs-index/` re-include rule a no-op — git can't
re-include a child of an excluded parent. Consumers that already
have such a rule outside our block get a clear migration warning
("gitignore_migration_warning" action) telling them to delete the
legacy line. We don't auto-delete user-authored content.
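A rough sketch of the block-merge idea follows. The marker strings and the three patterns come from this PR; the function signature, the return shape, and the set of spellings treated as a "legacy" rule are assumptions made for illustration, not the actual `block_merge_gitignore()` helper.

```python
from typing import List, Tuple

BEGIN = "# agentic-experiments:begin"
END = "# agentic-experiments:end"
BLOCK_LINES = [".aexp/*", "!.aexp/runs-index/", "!.aexp/ledger/"]
# Assumed spellings of the whole-directory exclude that defeats re-includes.
LEGACY_PATTERNS = {".aexp/", ".aexp", "/.aexp/", "/.aexp"}


def block_merge_gitignore(text: str) -> Tuple[str, bool]:
    """Return (merged_text, legacy_rule_found_outside_block)."""
    block: List[str] = [BEGIN, *BLOCK_LINES, END]
    lines = text.splitlines()
    has_block = (
        BEGIN in lines and END in lines and lines.index(BEGIN) < lines.index(END)
    )
    if has_block:
        start, end = lines.index(BEGIN), lines.index(END)
        outside = lines[:start] + lines[end + 1:]
        # Replace only the managed block; user content around it is untouched.
        merged = lines[:start] + block + lines[end + 1:]
    else:
        outside = lines
        spacer = [""] if lines and lines[-1] != "" else []
        merged = lines + spacer + block
    legacy_found = any(line.strip() in LEGACY_PATTERNS for line in outside)
    return "\n".join(merged) + "\n", legacy_found


merged, warn = block_merge_gitignore("node_modules/\n.aexp/\n")
print(merged)
print("migration warning needed:", warn)
```

The legacy line is reported rather than removed, matching the stated policy of not auto-deleting user-authored content.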
CLI: `aexp install --machine-label <name>` flag plumbs through to
install_scaffold. _print_actions surfaces the new merged_gitignore
and gitignore_migration_warning kinds. 10 new tests in test_install.py
cover the gitignore block-merge, idempotency, user-content
preservation, migration warning, and the machine_label seed/override/
sticky/helper behaviors.
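The seed/override/sticky rules for `machine_label` can be sketched as below. The short-hostname default is taken from the commit message; the helper names and the idea of passing the parsed `installed.json` content as a dict are illustrative assumptions, not the actual `read_machine_label()` API.

```python
import socket
from typing import Mapping, Optional


def default_machine_label() -> str:
    """Short hostname, as described for the default label."""
    return socket.gethostname().split(".")[0]


def machine_label_from_marker(marker: Optional[Mapping]) -> str:
    """Read machine_label from parsed marker data, falling back to the hostname.

    Older markers without the field (or a missing marker) get the default.
    """
    label = marker.get("machine_label") if isinstance(marker, Mapping) else None
    if isinstance(label, str) and label:
        return label
    return default_machine_label()


def label_for_install(marker: Optional[Mapping], override: Optional[str]) -> str:
    """Explicit override wins; otherwise the existing label sticks."""
    return override or machine_label_from_marker(marker)


print(label_for_install({"machine_label": "hpc-login"}, None))   # sticky value
print(label_for_install({"machine_label": "hpc-login"}, "gpu"))  # override
print(label_for_install({}, None))                               # hostname fallback
```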
Related: docs/reference/process/aexp_friction_cross_machine_run_ledger.md
in the electricrag repo, v4 plan section Phase 1B (steps 1, 2).
Phase 1B of the cross-machine-ledger plan. Each machine periodically dumps a JSON list of its terminal-state runs to `.aexp/runs-index/<machine_label>.json` and commits it. Other machines pull and the validator unions these indexes to gain a three-state vocabulary for finding citations:
- here: citation resolves in local .runs/workspace/ (clean)
- elsewhere: citation in known_elsewhere (union of all index files) but not local: finding.absent_run_citation (warning, regardless of --strict-runs; the citation is good, the data just lives on a different ledger)
- broken: in neither: finding.broken_run_citation (severity per --strict-runs)

Same three-state treatment for batch citations: finding.absent_batch_runs warning when the selector matches only elsewhere-indexed runs.

Replaces the silent tolerance branch at validate.py:341-342 with an explicit finding.no_run_store warning emitted once per validate run when neither store nor any index exists. This was invisible before and easy to miss on a fresh checkout.

New module aexp.runs_index:
- IndexEntry / IndexFile typed dicts (schema_version 1)
- build_index, export_index, load_all_indexes, collect_known_elsewhere
- Stable byte-output: entries sorted by job_id so repeat exports produce identical content (excepting exported_at timestamp)
- Skip non-terminal jobs (queued/running/created stay machine-local)
- Tolerant of malformed index files at load time; a corrupt index shouldn't break the validator

CLI: `aexp runs-export-index [--as <label>] [--out <path>]`. Default output path is `<repo>/.aexp/runs-index/<machine_label>.json`. `--as` overrides both filename and body field.

21 tests in test_validate_cross_machine.py covering: three-state behavior on missing/elsewhere/local jobs and batches, finding.no_run_store warning emission, runs_index module behaviors (skip non-terminal, default path, override label, malformed file tolerance, ledger_machine tagging, byte-stable output), CLI smoke.

Phase 2's universal ledger (next commit) supersedes this mechanism; this module stays through one minor-version deprecation window after Phase 2 ships. Index files double as the migration tool for `aexp ledger backfill`.

Related: docs/reference/process/aexp_friction_cross_machine_run_ledger.md in the electricrag repo, v4 plan section Phase 1B (steps 3-8).
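The three-state vocabulary reduces to two set-membership checks. The sketch below is illustrative: the `entries` key inside an index file and the function signatures are assumptions (only the `collect_known_elsewhere` name and the per-entry `job_id` appear in the commit message).

```python
from typing import Iterable, Set


def collect_known_elsewhere(index_files: Iterable[object]) -> Set[str]:
    """Union the job ids of all index files, skipping malformed content."""
    known: Set[str] = set()
    for index in index_files:
        entries = index.get("entries") if isinstance(index, dict) else None
        if not isinstance(entries, list):
            continue  # a corrupt index should not break the validator
        for entry in entries:
            if isinstance(entry, dict) and isinstance(entry.get("job_id"), str):
                known.add(entry["job_id"])
    return known


def classify_citation(job_id: str, local_ids: Set[str], known_elsewhere: Set[str]) -> str:
    """Return 'here', 'elsewhere', or 'broken' for a cited job id."""
    if job_id in local_ids:
        return "here"
    if job_id in known_elsewhere:
        return "elsewhere"
    return "broken"


indexes = [
    {"schema_version": 1, "entries": [{"job_id": "b" * 32}]},
    "not-a-dict",                      # malformed file content, tolerated
    {"schema_version": 1, "entries": None},
]
elsewhere = collect_known_elsewhere(indexes)
for jid in ("a" * 32, "b" * 32, "c" * 32):
    print(jid[:4], classify_citation(jid, {"a" * 32}, elsewhere))
```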
Phase 2 of the cross-machine-ledger plan (sections 2.1-2.3 of the v4
plan). Introduces the universal cross-machine ledger as a sanitized
projection of terminal-state signac jobs, stored at
`.aexp/ledger/<job_id>.json` and committed to git. Every machine that
pulls sees the same ledger, so finding citations resolve consistently
everywhere — dissolving the "elsewhere vs broken" distinction that
Phase 1B's three-state vocabulary worked around.
New module `aexp.ledger`:
- LedgerEntry typed dict (schema_version=1)
- project_to_ledger_entry — allowlist-based sanitization (statepoint,
run_link, status, ended_at, wallclock_s, tracker pointers, code
commit, registered_machine, promoted_at). Explicitly excludes
per-machine debris: absolute paths in tracker_log/events.jsonl,
wandb offline run dirs, user-written artifacts. Tracker block has
pointers (backend, run_id, url, group, project) but NOT the events.
- promote_to_ledger — idempotent atomic write. Re-promotion overwrites,
safe under retroactive `aexp link` corrections.
- backfill_ledger — walks the local run store and promotes every
terminal-state job not yet in `.aexp/ledger/`. Returns (promoted,
skipped). One-job failures don't kill the whole backfill.
- ledger_path / load_ledger_entry / list_ledger_job_ids helpers for
Commit 6's validator switch.
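Allowlist-based projection means the entry is built up from named fields rather than by deleting known-bad ones, so unknown document fields can never leak. A hedged sketch, using plain dicts in place of signac jobs; the signature and the exact allowlists are assumptions based on the field names listed above.

```python
from collections.abc import Mapping
from datetime import datetime, timezone

TERMINAL_STATUSES = frozenset({"complete", "failed", "abandoned", "stopped"})
DOC_ALLOWLIST = ("run_link", "status", "ended_at", "wallclock_s")
TRACKER_POINTERS = ("backend", "run_id", "url", "group", "project")


def project_to_ledger_entry(job_id: str, statepoint: Mapping, doc: Mapping,
                            machine_label: str) -> dict:
    """Build a sanitized ledger entry from a terminal-state job (sketch)."""
    if doc.get("status") not in TERMINAL_STATUSES:
        raise ValueError(f"job {job_id} is not in a terminal state")
    entry = {
        "schema_version": 1,
        "job_id": job_id,
        "statepoint": dict(statepoint),
        "registered_machine": machine_label,
        "promoted_at": datetime.now(timezone.utc).isoformat(),
    }
    # Copy only allowlisted document fields; everything else is dropped.
    for key in DOC_ALLOWLIST:
        if key in doc:
            entry[key] = doc[key]
    # Assumed: code provenance is copied from the statepoint to the top level.
    for key in ("code_commit", "code_dirty"):
        if key in statepoint:
            entry[key] = statepoint[key]
    # Tracker block keeps pointers only, never config or event payloads.
    tracker = doc.get("tracker")
    if isinstance(tracker, Mapping):
        entry["tracker"] = {k: tracker[k] for k in TRACKER_POINTERS if k in tracker}
    return entry


example = project_to_ledger_entry(
    "a" * 32,
    {"lr": 0.1, "code_commit": "abc123"},
    {"status": "complete", "wallclock_s": 12.5, "scratch_dir": "/home/user/tmp",
     "tracker": {"backend": "wandb", "run_id": "r1", "config": {"dir": "/abs/path"}}},
    "cluster",
)
print(sorted(example))
```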
Auto-promote hook in `aexp.runs.mark_status`: on every terminal
transition (complete/failed/abandoned/stopped), call promote_to_ledger.
Lazy-imported to break the ledger -> runs circular dependency. Wrapped
in try/except + stderr log; promotion failure never crashes the
lifecycle. backfill_ledger recovers any missed jobs.
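The failure-isolation property of the hook (the status write lands even if promotion raises) can be sketched like this. The `promote` callable is injected here purely for illustration; per the commit message the real hook lazy-imports `promote_to_ledger` instead, and the function name below is hypothetical.

```python
import sys
from typing import Callable

TERMINAL_STATUSES = frozenset({"complete", "failed", "abandoned", "stopped"})


def mark_status_with_hook(doc: dict, status: str,
                          promote: Callable[[dict], None]) -> None:
    """Write the status, then attempt promotion without letting it crash."""
    doc["status"] = status
    if status not in TERMINAL_STATUSES:
        return  # non-terminal runs stay machine-local
    try:
        promote(doc)
    except Exception as exc:  # deliberately broad: promotion must never be fatal
        print(f"ledger promotion failed: {exc}", file=sys.stderr)


def failing_promote(doc: dict) -> None:
    raise OSError("disk full")


job_doc = {}
mark_status_with_hook(job_doc, "failed", failing_promote)
print(job_doc)
```

A job whose promotion failed this way is simply absent from the ledger until a later backfill picks it up.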
`aexp.linking.link_to_experiment` integration: after write_run_link,
if the job is terminal, re-promote so the ledger entry's run_link
field reflects the new linkage. Without this the ledger projection
would lag job.doc until the next backfill.
CLI verbs via new `ledger_app` sub-typer:
- `aexp ledger promote <job_id> [--machine-label X]`
- `aexp ledger backfill [--machine-label X] [--overwrite]`
End-to-end smoke verified: create_run + mark_status('complete') auto-
writes a ledger entry with the expected sanitized shape (status,
statepoint, run_link, code_commit, registered_machine, etc.). All
existing tests still pass.
Related: docs/reference/process/aexp_friction_cross_machine_run_ledger.md
in the electricrag repo, v4 plan sections 2.1-2.3.
Phase 2 sections 2.4-2.5 of the cross-machine-ledger plan. The validator now treats `.aexp/ledger/<job_id>.json` files as a source-of-truth for "this run exists":
- known_job_ids = (ledger_ids | local_signac_ids). A run in the ledger on the laptop counts as "here" even when there's no local workspace for it. After every machine has run `aexp ledger backfill && git push && git pull`, the laptop validator resolves all cited jobs cleanly with no warnings.
- known_elsewhere (Phase 1B index union) is now scoped to ONLY jobs absent from the ledger. Entries in both ledger and index are ledger-wins; the index becomes a transitional bridge for runs not yet backfilled.
- has_authority tracks whether ANY source (ledger, local store, or index) is available. The broken_run_citation error fires when a citation matches none of them but at least one is present (so the validator can speak with authority); finding.no_run_store warning fires when nothing is available.

`aexp runs-export-index` prints a deprecation warning pointing at `aexp ledger backfill`. The mechanism stays for one minor-version window so consumers mid-upgrade can keep using it.

7 new tests in test_validate_cross_machine.py cover:
- citation to job in ledger → clean
- citation to job in BOTH ledger and index → ledger wins, no warning
- mark_status hook auto-promotes terminal jobs
- backfill_ledger promotes only terminal jobs
- backfill idempotency
- aexp link re-promotes terminal jobs (ledger.run_link updates)
- finding.no_run_store fires only when no authority source exists
- promotion sanitization drops abs paths + tracker config blob

Related: docs/reference/process/aexp_friction_cross_machine_run_ledger.md in the electricrag repo, v4 plan sections 2.4-2.5.
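The precedence rules in this commit (ledger and local store first, index only for jobs the ledger lacks, and an authority check before declaring anything broken) can be sketched with plain sets. The function name is hypothetical, and modelling a missing source as `None` is an assumption; in the validator the no-run-store warning is emitted once per run rather than per citation.

```python
from typing import Optional, Set


def resolve_citation(job_id: str,
                     ledger_ids: Optional[Set[str]],
                     local_ids: Optional[Set[str]],
                     index_ids: Optional[Set[str]]) -> str:
    """Classify a cited job id; None means that source does not exist at all."""
    has_authority = any(src is not None for src in (ledger_ids, local_ids, index_ids))
    ledger = ledger_ids or set()
    known_job_ids = ledger | (local_ids or set())
    # Ledger wins: the index only speaks for jobs the ledger does not have.
    known_elsewhere = (index_ids or set()) - ledger
    if job_id in known_job_ids:
        return "here"
    if job_id in known_elsewhere:
        return "elsewhere"
    return "broken" if has_authority else "no_run_store"


print(resolve_citation("j1", {"j1"}, None, {"j1"}))   # in ledger and index
print(resolve_citation("j2", set(), None, {"j2"}))    # index only
print(resolve_citation("j3", set(), set(), None))     # nowhere, but sources exist
print(resolve_citation("j4", None, None, None))       # no sources at all
```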
signac wraps nested dict values in synced_collections types that
aren't strict `dict` subclasses, so `isinstance(raw, dict)` was False
and the tracker block fell through to {}. Use duck-typed `dict(raw)`
conversion instead, which both accepts Mapping-shaped wrappers and
strips the in-memory synced overhead.
Should have been in commit e79735e; surfaced by the sanitization test
in commit d68358b.
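The failure mode is easy to reproduce without signac: any `Mapping` implementation that is not a `dict` subclass fails the `isinstance(raw, dict)` check but converts fine with `dict(raw)`. In the sketch below, `SyncedDictLike` is a hypothetical stand-in for the wrapper type, and both helper functions are illustrative rather than the project's code.

```python
from collections.abc import Mapping


class SyncedDictLike(Mapping):
    """Stand-in for a mapping wrapper that does not subclass dict."""

    def __init__(self, data):
        self._data = dict(data)

    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


def tracker_block_strict(raw) -> dict:
    """Pre-fix behavior: wrappers fall through to an empty dict."""
    return dict(raw) if isinstance(raw, dict) else {}


def tracker_block_duck(raw) -> dict:
    """Post-fix behavior: accept anything dict() can convert."""
    try:
        return dict(raw)
    except (TypeError, ValueError):
        return {}  # None, numbers, plain strings, and similar


wrapped = SyncedDictLike({"backend": "wandb", "run_id": "r1"})
print(tracker_block_strict(wrapped))
print(tracker_block_duck(wrapped))
```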
Phase 2.6 of the cross-machine-ledger plan. Adds tests/test_ledger.py with 21 focused cases covering:

Projection shape:
- required fields present (schema_version, job_id, statepoint, status, registered_machine, promoted_at)
- statepoint values round-trip
- run_link preserved when present
- wallclock_s / ended_at carried through when set
- optional fields absent when source has them missing
- non-terminal status raises ValueError
- code_commit / code_dirty lifted from statepoint to top-level

Sanitization (per-machine debris must not leak):
- tracker.config and tracker.init_kwargs dropped (they embed abs paths)
- tracker pointers (backend/run_id/url/group/project) preserved
- allowlist-based: arbitrary doc fields don't bleed into the projection
- Explicit assertion that the repo's absolute path never appears anywhere in the JSON

Idempotency:
- two consecutive promotes produce byte-identical entries modulo promoted_at
- atomic_write smoke check

Backfill:
- only terminal jobs promoted, non-terminal skipped
- idempotent (skips already-present)
- --overwrite re-promotes
- no run store → returns (0, 0) gracefully
- --machine-label overrides registered_machine

Defensive paths:
- load_ledger_entry returns None on missing/malformed
- list_ledger_job_ids returns empty set when dir doesn't exist
- mark_status's hook tolerates promote_to_ledger raising: status still gets written, error goes to stderr

Integration tests for the ledger-aware validator + auto-promote hook live in test_validate_cross_machine.py (added in commits 4 + 6); this file is the deep module-level coverage.

Related: docs/reference/process/aexp_friction_cross_machine_run_ledger.md in the electricrag repo, v4 plan section Phase 2.6.
Phase 2.7 of the cross-machine-ledger plan. Three changes:
1. docs/queue.md § cross-machine: rewritten end-to-end. Documents the three-source-of-truth model (ledger > local store > index), the steady-state workflow (auto-promote on mark_status), the --strict-runs escape hatch for the 0.5→0.6 transition window, and the migration runbook including the gitignore-migration warning handling. New machine_label subsection explains the HPC hostname-noise mitigation.
2. pyproject.toml: bump to 0.6.0 (semver minor; new public surface: ledger module, ledger/runs-export-index CLI verbs, --strict-runs flag, machine_label install marker field).
3. CHANGELOG.md: new [0.6.0] entry under Keep-a-Changelog format covering Added / Changed / Fixed / Migration sections. Cites the friction doc in the electricrag repo for the design rationale.

Note: the in-poetry-env test_package_version_matches_pyproject test will fail until you re-run `poetry install` (poetry env's installed metadata is stale at 0.2.1 per existing CLAUDE.local.md guidance; that's a known pre-existing condition). Deselect or refresh the env; either is fine.

Related: docs/reference/process/aexp_friction_cross_machine_run_ledger.md in the electricrag repo, v4 plan section Phase 2.7.
…pdate
End-to-end smoke test (tests/test_cross_machine_e2e.py): simulates the exact TL;DR scenario from the friction doc: two consumer repos sharing a bare remote, cluster registers + auto-promotes jobs, pushes, laptop pulls, validator resolves the citation cleanly. Catches the whole-stack failure modes that the unit tests can't (gitignore block-merge actually flowing through, real git push/pull cycles, install-on-cluster-then-install-on-laptop interaction).

Two cases:
- test_e2e_cluster_backfill_then_laptop_validate_is_clean: happy path. Cluster install + register + mark_status (auto-promote fires) + commit + push. Laptop pull + install + cite. Validator: 0 errors, 0 warnings.
- test_e2e_strict_runs_warn_unblocks_pre_ledger_state: escape-hatch flow. Bogus citation in a fresh laptop, --strict-runs=warn lets the validator exit 0. Defensive about list_kb_artifacts not always picking up minimally-formatted findings: asserts warn-vs-error consistency rather than a specific count.

Slash command (src/aexp/slash_commands/aexp-validate.md): updated the code list with the three new codes (finding.absent_run_citation, finding.absent_batch_runs, finding.no_run_store) and their fix suggestions. New "Cross-machine escape hatch" section explains `--strict-runs=warn` and points at `aexp ledger backfill` as the long-term fix.
Replaces every "Phase 1A / 1B / 2 / 2.0 / 2.5 / v4 plan" reference in user-visible content with descriptive language naming the actual mechanism. Plan numbering was only meaningful relative to the design doc; a user reading the CHANGELOG, --help text, slash command, or module docstring shouldn't need that doc to understand what each piece does.

Touched user-facing:
- CHANGELOG.md [0.6.0] section: "Cross-machine ledger (Phase 2)" -> "Cross-machine run ledger"; "Phase 1B index file" -> "per-machine index file"; "Phase 1B, transitional" -> "transitional"; removed the trailing pointer to the friction doc in another repo.
- docs/queue.md § cross-machine: same treatment for the three source-of-truth list, the steady-state workflow heading, and the migration heading.
- CLI help: --machine-label and --strict-runs help strings; the runs-export-index docstring + deprecation message; ledger sub-app + promote + backfill docstrings.
- aexp.ledger / aexp.runs_index module docstrings.
- Code comments in install.py, linking.py, runs.py, validate.py, utils/paths.py.
- Test-file docstrings + section comments.

Behavior unchanged; this is pure naming. Test suite still passes (139 in the cross-machine + install + ledger + e2e subset checked post-cleanup).

Branch also renamed locally from feat/cross-machine-ledger-v4 to feat/cross-machine-ledger to match.
Summary
- `.aexp/ledger/<job_id>.json`: sanitized projections of terminal-state signac runs, committed to git so every machine sees the same view after `git pull`. Auto-populated by a hook in `aexp.runs.mark_status` on every terminal-status transition (complete/failed/abandoned/stopped). Per-machine debris (absolute paths from `tracker_log/events.jsonl`, wandb offline-run dirs, user artifacts) stays in the gitignored `.runs/workspace/`.
- `aexp ledger promote <id>` (manual one-shot) and `aexp ledger backfill` (migration tool: walks the local run store and promotes everything terminal). A transitional `aexp runs-export-index` is also included but ships with a deprecation warning.
- `aexp validate --strict-runs={error|warn|off}`. Default `error` preserves 0.5 behavior. `warn` downgrades broken/empty-batch citations to warnings (exit 0); `off` skips existence checks entirely. Structural-shape checks always emit at error regardless.
- `finding.absent_run_citation` (warning), `finding.absent_batch_runs` (warning), `finding.no_run_store` (warning, replaces a previously-silent tolerance branch).
- `installed.json::machine_label` field + `aexp install --machine-label <name>` flag. Default: short hostname. Sticky across re-installs. Used to tag `registered_machine` on ledger entries.
- `aexp install` manages a `.gitignore` block via `# agentic-experiments:begin` / `:end` markers. Sets `.aexp/*` + `!.aexp/runs-index/` + `!.aexp/ledger/` so per-machine state stays ignored while the cross-machine shared subdirs are committable. Warns when a legacy `.aexp/` pattern lives outside the managed block.

Internally:

- `aexp.runs.mark_status` becomes the single hook point for all four terminal transitions (lifecycle's `complete`/`failed` now route through it with `set_ended_at=False` to dodge a Windows file-lock race with `stop_queued`).
- `aexp.linking.link_to_experiment` re-promotes on terminal so the ledger stays in sync after a retroactive re-link.

What the user-facing change actually fixes

Today on a laptop authoring against cluster-side runs, `aexp validate` reports `finding.broken_run_citation` (error) for every citation the cluster has but the laptop can't see. Agents reading the red output cannot tell "broken citation" from "ledger lives elsewhere" and will try to "fix" things that aren't broken. After this PR, once each machine has run `aexp ledger backfill && git push`, the validator resolves every cited job universally with 0 errors and 0 warnings.

Test plan

- `poetry install` to refresh poetry env's metadata to 0.6.0 (the `test_package_version_matches_pyproject` test currently fails because of stale 0.2.1 metadata in the env; known pre-existing condition per `CLAUDE.local.md`).
- `poetry run pytest`: expect 521 passed, 5 skipped (platform-specific), 2 deselected (`test_cli.py::test_package_version_matches_pyproject` + `test_queue.py::test_stop_queued_kills_running_subprocess_via_sigterm`, both known-flaky pre-existing).
- `poetry run ruff check src/ tests/`: clean.
- `electricrag` install: `pip install -e <agentic-experiments>` + `aexp install --force` + `aexp ledger backfill` on the cluster, `git push`, `git pull` on the laptop, `aexp validate` should resolve the 8 currently-failing F004/F005 citations to 0 errors + 0 warnings.

Commit ladder

Reviewable commit-by-commit, ordered by dependency:

1. `feat(runs): unify terminal-status writes through mark_status()` (prerequisite for the ledger hook)
2. `feat(validate): add --strict-runs={error|warn|off} flag`
3. `feat(install): add machine_label to installed.json + .gitignore management`
4. `feat(runs-index): aexp runs-export-index verb + three-state validator`
5. `feat(ledger): aexp.ledger module + promote/backfill verbs + auto-hook`
6. `feat(validate): read from ledger when present; deprecate runs-index`
7. `fix(ledger): tracker projection handles signac synced collections`
8. `test(ledger): comprehensive coverage for aexp.ledger module`
9. `test+docs: end-to-end cross-machine smoke + slash-command code list update`
10. `docs: queue.md cross-machine workflow + 0.6.0 release notes`
11. `docs+code: scrub plan-internal naming from user-facing surfaces`

Migration (for current consumers, 0.5 → 0.6)

1. `pip install -e <agentic-experiments>` in each env.
2. `aexp install --force` in each consumer (writes the new gitignore block + materializes `machine_label`). If a `gitignore_migration_warning` is emitted, delete the legacy `.aexp/` line from `.gitignore` manually: the new block uses `.aexp/*`, so the existing parent-exclude needs to come out.
3. `aexp ledger backfill` on each machine that has terminal-state runs.
4. Commit `.gitignore` and `.aexp/ledger/`, push, then pull on every other consumer.

See `docs/queue.md` § cross-machine for the full runbook.