Skip to content

fix: batch citations resolve against the universal ledger#25

Merged
KadenMc merged 1 commit into
mainfrom
fix/validator-batch-ledger-resolution
May 26, 2026
Merged

fix: batch citations resolve against the universal ledger#25
KadenMc merged 1 commit into
mainfrom
fix/validator-batch-ledger-resolution

Conversation

@KadenMc
Copy link
Copy Markdown
Owner

@KadenMc KadenMc commented May 26, 2026

Summary

In 0.6.0 the job-id citation check was correctly taught to consult both the local signac project AND .aexp/ledger/<id>.json, but the batch-citation check still went through aexp.linking.list_batches(), which walks the local signac project only. The result: a finding citing {type: batch, experiment_id: E005, selector: {condition: X}} spuriously emitted finding.empty_batch whenever the matching terminal runs lived only in the ledger (e.g., backfilled from another machine).

The fix is purely additive: build an (experiment_id, condition) -> [job_ids] lookup from ledger entries and count those matches in the batch-resolution branch before falling through to the elsewhere/empty paths.

Surfaced by

The electricrag laptop validate during the 0.6 migration. After the cluster backfilled 20 ledger entries (including the 3 E005 batch members F005 cites), the laptop validator still emitted 3 finding.empty_batch errors — exactly because none of the cited E005 jobs had local workspaces; they existed only in the just-pulled ledger.

No behavioral change for existing cases

  • Local-only batch matches still resolve via list_batches.
  • Elsewhere (per-machine index) matches still emit finding.absent_batch_runs warnings.
  • Empty/broken batches with no matches anywhere still emit finding.empty_batch errors (or warnings under --strict-runs=warn).

Only new: ledger-only batch matches now correctly count as "here."

Regression test

tests/test_validate_cross_machine.py::test_validator_batch_citation_resolves_via_ledger_only writes a ledger entry with no corresponding local workspace and asserts a batch citation pointing at that entry's (experiment_id, condition) resolves cleanly.

The test's registered_machine field is documented inline as informational only — the validator does not consult it. This makes the Phase 2 design choice explicit: the ledger is universal; machine identity is provenance, not a query filter. (If we ever want machine-aware filtering for batch citations, it would be a selector extension — {type: batch, ..., selector: {condition: X, registered_machine: cluster}} — handled in _selector_matches. Separate feature, separate discussion.)

Version

0.6.0 → 0.6.1 (patch). PyPI history goes 0.5.0 → 0.6.1 since 0.6.0 was tagged in source but never published.

Test plan

  • poetry run pytest tests/test_validate_cross_machine.py tests/test_validate.py tests/test_ledger.py — 69 passed locally
  • poetry run ruff check src/ tests/ — clean
  • After merge: on the consumer (electricrag laptop), aexp validate drops from 4 errors → 1 error (just the remaining d84ce97... never-ran job citation, which is a data issue, not a validator bug).

When 0.6.0 shipped the cross-machine ledger, the job-id citation check
was correctly taught to consult both the local signac project AND
`.aexp/ledger/<id>.json` (via `known_job_ids = ledger_ids | local_ids`).
But the batch-citation check still went through
`aexp.linking.list_batches()`, which walks the local signac project
only. The result: a finding citing a batch selector like
`{type: batch, experiment_id: E005, selector: {condition: X}}` would
spuriously emit `finding.empty_batch` whenever the matching terminal
runs lived only in the ledger (e.g., backfilled from another machine).

This was the exact symptom surfaced when migrating electricrag from
aexp 0.2.1 -> 0.6.0: after the cluster backfilled 20 ledger entries
including 3 E005 batch members, the laptop validator continued to
emit `finding.empty_batch` for all 3 F005 batch citations because
none of the cited E005 jobs had local workspaces — they existed only
in the just-pulled ledger.

The fix is purely additive in `_check_finding_citations`:

1. Build a `(experiment_id, condition) -> [job_ids]` lookup from
   ledger entries (one pass through `list_ledger_job_ids` +
   `load_ledger_entry`).
2. In the batch resolution branch, count ledger matches before
   falling through to the elsewhere/empty paths.

No behavioral change for any existing case:

- Local-only batch matches still resolve as before via list_batches.
- Elsewhere (per-machine index) matches still emit
  `finding.absent_batch_runs` warnings.
- Empty/broken batches with no matches anywhere still emit
  `finding.empty_batch` errors (or warnings under --strict-runs=warn).

Only new: ledger-only batch matches now correctly count as "here."

The new regression test
`test_validator_batch_citation_resolves_via_ledger_only` writes a
ledger entry with NO corresponding local workspace and asserts a
batch citation pointing at that entry's (experiment_id, condition)
resolves cleanly. The test's `registered_machine` field is
informational only (commented inline) — the validator does not
consult it during resolution. This makes the Phase 2 design choice
explicit: the ledger is universal; machine identity is provenance,
not a query filter.

Version: 0.6.0 -> 0.6.1 (patch). PyPI history goes 0.5.0 -> 0.6.1
since 0.6.0 was tagged in source but never published.
@KadenMc KadenMc merged commit c27e2f6 into main May 26, 2026
6 checks passed
@KadenMc KadenMc deleted the fix/validator-batch-ledger-resolution branch May 26, 2026 18:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant