diff --git a/CHANGELOG.md b/CHANGELOG.md index 8abf777..cbcc913 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -4,6 +4,26 @@ All notable changes to Elume are documented here. Format loosely follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/). Versions follow semantic versioning once `0.1.0` ships; pre-alpha releases may break anything. +## [Unreleased] + +### Added + +- Track 026 adds content-addressed provider snapshots: + `strategy_content_digest(...)`, + `serialize_strategies_content_addressed(...)`, + `deserialize_strategies_content_addressed(...)`, and + `snapshot_bytes(..., mode="content_addressed")`. +- `InMemoryProvider.snapshot(mode="content_addressed")` now emits a stable + manifest/root-hash artifact, with optional manifest-only output via + `include_objects=False`. + +### Unchanged + +- `snapshot_bytes(strategies)` remains backward-compatible in default + `mode="full"`. +- `InMemoryProvider.snapshot()` remains a full deterministic provider dump by + default. + ## [0.3.0] — 2026-05-05 — Cognitive substrate recategorization + linoss-dynamics dependency edge v0.3.0 reframes Elume from "agentic memory engine" to its honest description: diff --git a/README.md b/README.md index f8a8e94..ba4737d 100644 --- a/README.md +++ b/README.md @@ -53,8 +53,9 @@ Things that did not exist anywhere before this project: - **The deterministic envelope** (`elume.envelope`, v0.1) — a canonical pre-image (BLAKE2b-256) over operation inputs, RNG state, result, and provider snapshot, giving every cognitive op a byte-identical replay - contract. Five reference operations registered today (belief embed, - basin recall, thought competition, evolution step, self-model step). + contract. Six reference operations registered today (belief embed, + basin recall, thought competition, evolution step, self-model step, + curiosity score). 
- **The platform-tagged float-hash policy** — `platform_fingerprint()` folded into the canonical pre-image so cross-platform replay drift surfaces as a hash mismatch by construction, not silent agreement on @@ -102,6 +103,7 @@ These components are integrated into a shared memory pipeline for agentic learni - **No framework lock-in** — no FastAPI, Graphiti, or agent runtime in the core. Adapters live in consumers. - **Cross-platform float-hash policy** — `platform_fingerprint()` is folded into the canonical hash pre-image. Cross-platform drift is a visible mismatch, not silent corruption. - **Curiosity-driven hyperevolution** — the optional curiosity homing signal biases memory retrieval toward entropy-reducing directions, turning uniform-random search into goal-directed exploration. +- **Content-addressed replay artifacts** — provider snapshots can stay full dumps by default or emit a manifest/root-hash form for larger replay stores. ## MemEvolve cartridge @@ -140,13 +142,13 @@ BibTeX entries for all upstream academic citations are in [`CITATIONS.bib`](./CI Elume is an open-source integration project under active development. -Twenty-five tracks landed: kernel bootstrap, core data models, LinOSS solver + timing, Hopfield network, basin field engine, attractor basin core, embedder protocol, provider contracts, the evolution engine, the self-modeling network engine, immutable cognitive record types, immutable mental-model domain records, immutable metacognitive control records, prior hierarchy records, mental-model subnetworks, the cognitive event protocol, cognitive-event embedders, immutable thought-level records, immutable neuronal-packet records, deterministic thought competition, prior-gated cognition, the MemEvolve cartridge, curiosity homing device, and hyperevolution wiring. Track `007` was retired after source review showed it was framed against the wrong dionysus3 concept. 
**1177 tests passing, ruff clean.** +Twenty-six tracks landed: kernel bootstrap, core data models, LinOSS solver + timing, Hopfield network, basin field engine, attractor basin core, embedder protocol, provider contracts, the evolution engine, the self-modeling network engine, immutable cognitive record types, immutable mental-model domain records, immutable metacognitive control records, prior hierarchy records, mental-model subnetworks, the cognitive event protocol, cognitive-event embedders, immutable thought-level records, immutable neuronal-packet records, deterministic thought competition, prior-gated cognition, the MemEvolve cartridge, curiosity homing device, hyperevolution wiring, and content-addressed provider snapshots. Track `007` was retired after source review showed it was framed against the wrong dionysus3 concept. **1194 tests passing, ruff clean.** Phase 2 is complete through the prior gate: `Track 011` shipped `elume.network`, `Tracks 014`, `016`, `018`, `021`, and `022` landed the minimal cognition gate from `MentalModel` through `LinOSSEncoder`, `Tracks 012`, `013`, and `019` landed immutable thought and packet records plus deterministic EFE competition, and `Tracks 015`, `017`, and `020` landed metacognitive control, generic priors, and prior-gated cognition. See [`conductor/tracks.md`](./conductor/tracks.md). -Phase 3 is complete: the MemEvolve cartridge (`elume.adapters.memevolve`), curiosity homing (`elume.cognition.curiosity`), and hyperevolution wiring now connect Elume's deterministic substrate to MemEvolve's outer evolutionary loop. +Phase 3 is complete: the MemEvolve cartridge (`elume.adapters.memevolve`), curiosity homing (`elume.cognition.curiosity`), and hyperevolution wiring now connect Elume's deterministic substrate to MemEvolve's outer evolutionary loop. Phase 4 has started with content-addressed provider snapshots for larger replay artifacts. -Archon-style deterministic-harness adoption is complete for v0.1.0. 
The kernel has injected RNGs, frozen trajectory metadata, provider snapshots, and an `elume.envelope` v0 operation registry covering belief embedding, evolution step, thought competition, self-model stepping, Hopfield recall, and (v0.2.0) curiosity scoring. Cross-platform float-hash policy is documented in `docs/archon-readiness/21-float-hash-policy.md`. +Archon-style deterministic-harness adoption is complete for v0.1.0. The kernel has injected RNGs, frozen trajectory metadata, provider snapshots, content-addressed snapshot manifests, and an `elume.envelope` v0 operation registry covering belief embedding, evolution step, thought competition, self-model stepping, Hopfield recall, and (v0.2.0) curiosity scoring. Cross-platform float-hash policy is documented in `docs/archon-readiness/21-float-hash-policy.md`. ## Install diff --git a/conductor/tracks.md b/conductor/tracks.md index d91771a..83cd2a1 100644 --- a/conductor/tracks.md +++ b/conductor/tracks.md @@ -54,6 +54,10 @@ The fleets are **parallelizable** where dependencies allow — Fleet A and Fleet - [x] **Track 024: Curiosity Homing Device — `elume.cognition.curiosity`** — [./tracks/024-curiosity-homing/](./tracks/024-curiosity-homing/) - [x] **Track 025: Hyperevolution Wiring** — [./tracks/025-hyperevolution-wiring/](./tracks/025-hyperevolution-wiring/) +### Phase 4 — Replay Artifact Scale + +- [x] **Track 026: Content-addressed provider snapshots** — [./tracks/026-content-addressed-snapshots/](./tracks/026-content-addressed-snapshots/) + ### Downstream (dionysus3-side — not in this repo) - Dionysus3 consumes Elume via `pip install -e ../elume`. The dionysus3 adapter layer (routing, classification, Graphiti persistence, event bus, PSM/SMT enrichment) stays in dionysus3 and gets its own tracks in that repo's conductor. 
@@ -87,6 +91,7 @@ The fleets are **parallelizable** where dependencies allow — Fleet A and Fleet | 023 | B | MemEvolve Cartridge | done | A2/A3/A4 | bingreeky/MemEvolve `dionysus_memory_provider.py`, `entity_extractor.py` (port, HTTP/HMAC stripped) | `elume/adapters/memevolve/` | **84 tests passing**; 2 seeded runs → byte-equal `MemoryResponse` | 002, 004, 006, 008, 009, 010 | | 024 | B | Curiosity Homing Device | done | A1 | dionysus3 `mosaeic_self_discovery.py:300-443`, `arousal_system_service.py:44-143` (port, FastAPI/Pydantic stripped) | `elume/cognition/curiosity.py`, `elume/envelope/ops/curiosity_score.py` | **42 tests passing**; entropy ordering, prior threshold tests, envelope replay, prior-gating integration | 017, 019, 020 | | 025 | B | Hyperevolution Wiring | done | A5 | — (net new) | `elume/adapters/memevolve/provider.py` (extend) | **5 integration tests passing**; curiosity=False and curiosity=True both deterministic; curiosity changes retrieval order and preserves fixture outcome metric | 023, 024 | +| 026 | B | Content-addressed provider snapshots | done | — | — (net new) | `elume/envelope/snapshot.py`, `elume/providers/in_memory.py` | **17 new tests passing**; full snapshot compatibility plus content-addressed manifest/root/store tests | 010, envelope v0 | ## Phase 2 Stage 0 preflight findings diff --git a/conductor/tracks/026-content-addressed-snapshots/plan.md b/conductor/tracks/026-content-addressed-snapshots/plan.md new file mode 100644 index 0000000..b6a078e --- /dev/null +++ b/conductor/tracks/026-content-addressed-snapshots/plan.md @@ -0,0 +1,52 @@ +# Track 026 Plan — Content-Addressed Provider Snapshots + +**Status:** Complete. Focused verification: +`.venv/bin/pytest tests/unit/envelope/test_snapshot.py tests/contract/test_provider_contract.py -q` +reported `63 passed`. Full verification: +`.venv/bin/pytest -q` reported `1194 passed`; ruff was clean. 
+ +## Implementation steps + +### Spec and compatibility guard + +- [x] Add Track 026 Conductor spec and plan. +- [x] Add regression tests proving `snapshot_bytes(strategies)` default output + is unchanged. + +### Content-addressed snapshot helpers + +- [x] Add `strategy_content_digest(...)`. +- [x] Add `serialize_strategies_content_addressed(...)` with: + - sorted `{name, digest}` manifest entries; + - `root_hash` over the stable manifest; + - optional embedded object payloads keyed by digest. +- [x] Add `deserialize_strategies_content_addressed(...)` supporting embedded + objects and caller-provided object stores. +- [x] Reject missing or tampered objects during restore. + +### Provider surface + +- [x] Extend `MemoryProvider.snapshot(...)` documentation/signature with an + optional mode. +- [x] Extend `InMemoryProvider.snapshot(...)` to support + `mode="content_addressed"` while keeping `mode="full"` as default. + +### Verification + +- [x] Run focused tests: + +```bash +.venv/bin/pytest tests/unit/envelope/test_snapshot.py tests/contract/test_provider_contract.py -q +``` + +- [x] Run lint: + +```bash +.venv/bin/ruff check src tests reference_service/src +``` + +- [x] Run full suite: + +```bash +.venv/bin/pytest -q +``` diff --git a/conductor/tracks/026-content-addressed-snapshots/spec.md b/conductor/tracks/026-content-addressed-snapshots/spec.md new file mode 100644 index 0000000..ca91c3a --- /dev/null +++ b/conductor/tracks/026-content-addressed-snapshots/spec.md @@ -0,0 +1,55 @@ +# Track 026: Content-Addressed Provider Snapshots + +## Objective + +Add a compact, content-addressed provider snapshot mode for replay artifacts +without changing the existing full snapshot behavior. The current full dump is +correct and deterministic; this track gives envelope consumers a Merkle-style +manifest they can store separately from the strategy objects when populations +become large. 
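As a rough illustration of the objective above, here is a minimal standalone sketch of a content-addressed manifest with per-object digests, a root hash, and a restore step that re-hashes every object. It assumes plain dicts in place of `Strategy`, and the helper names (`digest`, `build_manifest`, `restore`) are hypothetical. It borrows BLAKE2b-256 over canonical JSON and the schema tag from the patch, but it is not the `elume.envelope.snapshot` implementation.

```python
import hashlib
import json

# Schema tag copied from the patch; every other name here is hypothetical.
SCHEMA = "elume.provider_snapshot/v1"


def digest(obj):
    """BLAKE2b-256 over compact, key-sorted JSON."""
    raw = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return hashlib.blake2b(raw.encode("utf-8"), digest_size=32).hexdigest()


def build_manifest(strategies, include_objects=True):
    """Return name-sorted {name, digest} entries plus a root hash over them."""
    ordered = sorted(strategies, key=lambda s: s["name"])
    entries = [{"digest": digest(s), "name": s["name"]} for s in ordered]
    payload = {"entries": entries, "schema_version": SCHEMA}
    payload["root_hash"] = digest({"entries": entries, "schema_version": SCHEMA})
    if include_objects:
        # Self-contained artifact: objects are keyed by their own digest.
        payload["objects"] = {e["digest"]: s for e, s in zip(entries, ordered)}
    return payload


def restore(payload, object_store=None):
    """Rebuild the population, re-hashing every object before accepting it."""
    expected_root = digest({"entries": payload["entries"], "schema_version": SCHEMA})
    if payload["root_hash"] != expected_root:
        raise ValueError("root_hash mismatch")
    # Embedded objects first, then anything the caller supplies externally.
    objects = {**payload.get("objects", {}), **(object_store or {})}
    restored = []
    for entry in payload["entries"]:
        if entry["digest"] not in objects:
            raise KeyError(f"missing object for digest {entry['digest']}")
        obj = objects[entry["digest"]]
        if digest(obj) != entry["digest"]:
            raise ValueError(f"digest mismatch for {entry['name']}")
        restored.append(obj)
    return restored


population = [
    {"name": "zulu", "genotype": {"z": 1}},
    {"name": "alpha", "genotype": {"a": 1}},
]
full = build_manifest(population)
manifest_only = build_manifest(population, include_objects=False)
store = dict(full["objects"])  # stands in for an external object store
print([s["name"] for s in restore(manifest_only, object_store=store)])
# prints ['alpha', 'zulu']
```

Sorting by name before hashing makes the root hash independent of insertion order. Keying objects by their own digest lets a manifest-only artifact be checked against any store the caller supplies.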
+ +## Background + +The Archon envelope v0 deliberately shipped full provider snapshots because +that was the simplest deterministic replay surface. The post-phase plan left +one scale question open: whether provider snapshots should stay full dumps or +gain a Merkle-root/content-addressed reference mode for large artifacts. + +## In Scope + +- `src/elume/envelope/snapshot.py` + - Keep `serialize_strategies(...)`, `deserialize_strategies(...)`, and + default `snapshot_bytes(...)` behavior backward-compatible. + - Add a content-addressed strategy population snapshot with stable leaf + digests and a stable population root hash. + - Support embedded objects for self-contained artifacts. + - Support manifest-only snapshots that can be restored with an external + object store. + - Validate object digests during restore. +- `src/elume/providers/in_memory.py` + - Expose the new mode through `snapshot(mode="content_addressed")`. +- `src/elume/providers/contracts.py` + - Document the optional snapshot mode without requiring existing callers to + change. +- Unit/contract tests for determinism, digest sensitivity, embedded restore, + external-store restore, and tamper detection. + +## Out of Scope + +- Disk-backed or remote content-addressed storage. +- Chunking non-strategy provider state such as basins or trajectories. +- Changing the envelope hash pre-image. +- Making content-addressed snapshots the default. + +## Acceptance + +- Existing full snapshot tests continue to pass unchanged. +- Content-addressed snapshots are insertion-order stable. +- The same strategy has the same object digest across populations. +- Root hash changes when strategy content changes. +- Manifest-only snapshots omit object payloads and can restore with a provided + object store. +- Tampered embedded or external objects fail restore. +- `snapshot_bytes(strategies)` remains byte-for-byte compatible in default mode. +- `ruff check src tests reference_service/src` is clean. 
+- Focused snapshot/provider tests pass. diff --git a/docs/archon-readiness/00-summary.md b/docs/archon-readiness/00-summary.md index 97846a7..fd49fa6 100644 --- a/docs/archon-readiness/00-summary.md +++ b/docs/archon-readiness/00-summary.md @@ -8,13 +8,13 @@ landed, the envelope registry covers the five candidate operations from report `19` plus curiosity scoring, the cross-platform float-hash policy is documented in [`21-float-hash-policy.md`](./21-float-hash-policy.md) and folded into the -canonical hash pre-image, and the current suite reports `1177 passed`. +canonical hash pre-image, Track `026` adds content-addressed provider snapshot +manifests, and the current suite reports `1194 passed`. **2026-05-05 reconciliation:** The blocker list below is historical audit context. Hopfield RNG injection, timestamp injection, trajectory metadata -freezing, provider snapshots, float-hash policy, and envelope replay coverage -have all landed. Remaining design work is limited to future provider-snapshot -granularity if artifact size forces a Merkle/content-addressed variant. +freezing, provider snapshots, float-hash policy, envelope replay coverage, and +content-addressed provider snapshot manifests have all landed. ## Verdict diff --git a/docs/plans/archon-adoption-phase-1.md b/docs/plans/archon-adoption-phase-1.md index b9c5a48..1e1095c 100644 --- a/docs/plans/archon-adoption-phase-1.md +++ b/docs/plans/archon-adoption-phase-1.md @@ -6,13 +6,13 @@ **Post-plan update:** Tracks `017`, `020`, `023`, `024`, and `025` have landed, six reference envelope operations are registered, the reference demo is -runnable, `EnvelopeOutput` records the live platform fingerprint, and the -current verification baseline is `1177 passed`. +runnable, `EnvelopeOutput` records the live platform fingerprint, Track `026` +adds content-addressed provider snapshot manifests, and the current +verification baseline is `1194 passed`. **2026-05-05 reconciliation:** This plan is now historical. 
The only remaining -design item from the original open list is provider-snapshot granularity -(full dump vs. Merkle/content-addressed reference) if artifact size becomes a -problem. +design item from the original open list, provider-snapshot granularity, is now +resolved at v0 by Track `026` with an optional content-addressed manifest mode. ## 1. Objective @@ -48,7 +48,7 @@ All 20 lanes are confined to their audit-designated write scope per the fleet-ow ## 3. Scope (out) - **Resolved:** `Track 017`, `Track 020`, the formal envelope operations, and cross-platform float-hash policy have landed. -- **Still out of scope:** provider snapshot granularity beyond the current full deterministic dump. A Merkle/content-addressed snapshot can be added later if artifact size becomes a real constraint. +- **Resolved:** provider snapshot granularity now has an optional content-addressed manifest mode in Track `026`; the full deterministic dump remains the default. - No changes to `pyproject.toml`, top-level `src/elume/__init__.py`, or `conductor/tracks.md` beyond what the lead pod merges at phase close. ## 4. Verification gates @@ -71,4 +71,4 @@ After all 20 lanes merge into the integration branch: ## 6. Post-phase next steps (require user approval) - Resolved: Tracks `017`, `020`, `023`, `024`, and `025`; formal envelope ops; platform-tagged float-hash policy; `EnvelopeOutput.platform_fingerprint`. -- Optional future work: decide whether `provider_snapshot` should stay a full deterministic dump or gain a Merkle-root/content-addressed mode for large artifacts. +- Resolved: Track `026` keeps full provider snapshots as default and adds a Merkle-style content-addressed mode for large artifacts. diff --git a/docs/plans/phase-2-handoff.md b/docs/plans/phase-2-handoff.md index 5ebf5f8..b8bc407 100644 --- a/docs/plans/phase-2-handoff.md +++ b/docs/plans/phase-2-handoff.md @@ -6,8 +6,8 @@ This document is the resume point for Phase 2 work if session context is cleared. 
**Post-handoff update:** Stage 4 and Phase 3 are complete. Tracks `015`, -`017`, `020`, `023`, `024`, and `025` have landed, and the current suite -reports `1177 passed` with ruff clean. +`017`, `020`, `023`, `024`, `025`, and `026` have landed, and the current +suite reports `1194 passed` with ruff clean. ## Current Status @@ -29,7 +29,7 @@ reports `1177 passed` with ruff clean. - `Track 012` shipped immutable thought-level records in `src/elume/models/thought.py`. - `Track 013` shipped immutable packet records plus pure intrinsic-value computation in `src/elume/models/neural.py`. - `Track 019` shipped deterministic EFE competition in `src/elume/cognition/competition.py`. - - Full suite status at updated handoff: **`1177 passed`** + - Full suite status at updated handoff: **`1194 passed`** - Lint status at updated handoff: **`ruff check src tests reference_service/src` clean** - **Stage 4 complete.** @@ -42,6 +42,9 @@ reports `1177 passed` with ruff clean. - `Track 024` shipped curiosity homing in `src/elume/cognition/curiosity.py` and `src/elume/envelope/ops/curiosity_score.py`. - `Track 025` shipped hyperevolution wiring inside `src/elume/adapters/memevolve/provider.py`. +- **Phase 4 replay artifact scale started.** + - `Track 026` shipped content-addressed provider snapshots in `src/elume/envelope/snapshot.py` and `src/elume/providers/in_memory.py`. + ## Key Documents - Original proposal: [phase-2-proposal.md](/Volumes/Asylum/dev/elume/docs/plans/phase-2-proposal.md) diff --git a/src/elume/envelope/snapshot.py b/src/elume/envelope/snapshot.py index 1256889..2119371 100644 --- a/src/elume/envelope/snapshot.py +++ b/src/elume/envelope/snapshot.py @@ -17,12 +17,15 @@ from __future__ import annotations +import hashlib import json from collections.abc import Iterable, Mapping from typing import Any from elume.models.strategy import Strategy +SNAPSHOT_SCHEMA_VERSION = "elume.provider_snapshot/v1" + # Canonical entry fields for a serialized strategy. 
Order irrelevant for the # output (we sort keys), but fixing the set guards against drift. _ENTRY_FIELDS: tuple[str, ...] = ( @@ -49,6 +52,21 @@ def _sorted_genotype(genotype: Mapping[str, Any]) -> dict[str, Any]: return sorted_items +def _canonical_bytes(obj: Any) -> bytes: + """Return compact canonical JSON bytes for snapshot payloads.""" + return json.dumps( + obj, + sort_keys=True, + separators=(",", ":"), + ensure_ascii=True, + ).encode("utf-8") + + +def _digest_payload(obj: Any) -> str: + """Return a BLAKE2b-256 digest for a canonical JSON payload.""" + return hashlib.blake2b(_canonical_bytes(obj), digest_size=32).hexdigest() + + def _strategy_entry(strategy: Strategy) -> dict[str, Any]: """Serialize a single ``Strategy`` to a canonical, key-sorted dict.""" entry = { @@ -61,6 +79,38 @@ def _strategy_entry(strategy: Strategy) -> dict[str, Any]: return {key: entry[key] for key in sorted(entry.keys())} +def _strategy_object(strategy: Strategy) -> dict[str, Any]: + """Serialize full strategy content for content-addressed storage.""" + entry = strategy.to_dict() + entry["genotype"] = _sorted_genotype(strategy.genotype) + return {key: entry[key] for key in sorted(entry.keys())} + + +def _normalize_strategy_object(data: Mapping[str, Any]) -> dict[str, Any]: + """Canonicalize caller-supplied strategy object data before hashing.""" + return _strategy_object(Strategy.from_dict(data)) + + +def strategy_content_digest(strategy: Strategy) -> str: + """Return the content address for a single strategy object. + + The digest covers the full canonical ``Strategy`` representation, + including description, genotype, fitness, creation timestamp, and lineage. 
+ """ + return _digest_payload(_strategy_object(strategy)) + + +def _root_hash(entries: Iterable[Mapping[str, Any]]) -> str: + """Return the population root hash for sorted manifest entries.""" + manifest = { + "entries": list(entries), + "kind": "strategy_population", + "mode": "content_addressed", + "schema_version": SNAPSHOT_SCHEMA_VERSION, + } + return _digest_payload(manifest) + + def serialize_strategies(strategies: Iterable[Strategy]) -> dict[str, Any]: """Serialize a population of strategies to a canonical dict. @@ -85,6 +135,42 @@ def serialize_strategies(strategies: Iterable[Strategy]) -> dict[str, Any]: return {"strategies": entries} +def serialize_strategies_content_addressed( + strategies: Iterable[Strategy], + *, + include_objects: bool = True, +) -> dict[str, Any]: + """Serialize strategies as a content-addressed population manifest. + + The returned payload contains sorted manifest entries of the form + ``{"name": <name>, "digest": <digest>}`` plus a + population ``root_hash`` over those entries. When ``include_objects`` is + true, the payload is self-contained and includes an ``objects`` mapping + from digest to the canonical full strategy object. When false, callers can + store the manifest separately and provide the objects during restore. 
+ """ + entries: list[dict[str, str]] = [] + objects: dict[str, dict[str, Any]] = {} + + for strategy in sorted(strategies, key=lambda s: s.name): + obj = _strategy_object(strategy) + digest = _digest_payload(obj) + entries.append({"digest": digest, "name": strategy.name}) + if include_objects: + objects[digest] = obj + + payload: dict[str, Any] = { + "entries": entries, + "kind": "strategy_population", + "mode": "content_addressed", + "root_hash": _root_hash(entries), + "schema_version": SNAPSHOT_SCHEMA_VERSION, + } + if include_objects: + payload["objects"] = {key: objects[key] for key in sorted(objects)} + return {key: payload[key] for key in sorted(payload.keys())} + + def deserialize_strategies(data: Mapping[str, Any]) -> list[Strategy]: """Inverse of :func:`serialize_strategies`. @@ -127,10 +213,72 @@ def deserialize_strategies(data: Mapping[str, Any]) -> list[Strategy]: return result -def snapshot_bytes(strategies: Iterable[Strategy]) -> bytes: +def deserialize_strategies_content_addressed( + data: Mapping[str, Any], + *, + object_store: Mapping[str, Mapping[str, Any]] | None = None, +) -> list[Strategy]: + """Restore strategies from a content-addressed snapshot payload. + + Embedded ``objects`` are used when present. Manifest-only snapshots can be + restored by passing an external ``object_store`` mapping digest to strategy + object. Every object is re-hashed before restore; mismatches raise + ``ValueError`` instead of silently accepting corrupted replay state. 
+ """ + if data.get("mode") != "content_addressed": + raise ValueError("expected content_addressed snapshot mode") + if data.get("kind") != "strategy_population": + raise ValueError("expected strategy_population snapshot kind") + if "entries" not in data: + raise KeyError("content-addressed snapshot missing 'entries'") + + entries = list(data["entries"]) + expected_root = _root_hash(entries) + if data.get("root_hash") != expected_root: + raise ValueError("content-addressed snapshot root_hash mismatch") + + embedded = data.get("objects", {}) + objects: dict[str, Mapping[str, Any]] = {} + if isinstance(embedded, Mapping): + objects.update(embedded) + if object_store is not None: + objects.update(object_store) + + result: list[Strategy] = [] + for entry in entries: + digest = entry["digest"] + if digest not in objects: + raise KeyError(f"missing object for digest {digest}") + + obj = _normalize_strategy_object(objects[digest]) + actual_digest = _digest_payload(obj) + if actual_digest != digest: + raise ValueError( + "content-addressed snapshot digest mismatch for " + f"{entry['name']}" + ) + + strategy = Strategy.from_dict(obj) + if strategy.name != entry["name"]: + raise ValueError( + "content-addressed snapshot manifest name mismatch for " + f"{entry['name']}" + ) + result.append(strategy) + + return result + + +def snapshot_bytes( + strategies: Iterable[Strategy], + *, + mode: str = "full", + include_objects: bool = True, +) -> bytes: """Return a deterministic, byte-stable snapshot of a strategy population. - Equivalent to JSON-encoding :func:`serialize_strategies` with:: + In default ``mode="full"``, this is equivalent to JSON-encoding + :func:`serialize_strategies` with:: json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=True) @@ -140,18 +288,24 @@ def snapshot_bytes(strategies: Iterable[Strategy]) -> bytes: Two calls on populations with identical strategy contents produce identical bytes regardless of insertion order. 
""" - payload = serialize_strategies(strategies) - text = json.dumps( - payload, - sort_keys=True, - separators=(",", ":"), - ensure_ascii=True, - ) - return text.encode("utf-8") + if mode == "full": + payload = serialize_strategies(strategies) + elif mode == "content_addressed": + payload = serialize_strategies_content_addressed( + strategies, + include_objects=include_objects, + ) + else: + raise ValueError(f"unknown snapshot mode: {mode}") + return _canonical_bytes(payload) __all__ = [ + "SNAPSHOT_SCHEMA_VERSION", + "strategy_content_digest", "serialize_strategies", + "serialize_strategies_content_addressed", "deserialize_strategies", + "deserialize_strategies_content_addressed", "snapshot_bytes", ] diff --git a/src/elume/providers/contracts.py b/src/elume/providers/contracts.py index ef04b29..e636553 100644 --- a/src/elume/providers/contracts.py +++ b/src/elume/providers/contracts.py @@ -40,10 +40,12 @@ from __future__ import annotations -from typing import Protocol, runtime_checkable +from typing import Literal, Protocol, runtime_checkable from elume.models.strategy import Strategy +SnapshotMode = Literal["full", "content_addressed"] + @runtime_checkable class MemoryProvider(Protocol): @@ -105,10 +107,19 @@ def delete_strategy(self, name: str) -> bool: """ ... - def snapshot(self) -> bytes: + def snapshot( + self, + *, + mode: SnapshotMode = "full", + include_objects: bool = True, + ) -> bytes: """Return a deterministic, canonical byte serialization of the provider state. - Must be stable under repeated calls for unchanged state; must - sort strategies by name. + ``mode="full"`` is the default compatibility mode and must be stable + under repeated calls for unchanged state; it must sort strategies by + name. ``mode="content_addressed"`` may return a manifest with a stable + root hash and object digests. ``include_objects=False`` is valid only + for content-addressed snapshots and allows callers to store the object + payloads separately. """ ... 
diff --git a/src/elume/providers/in_memory.py b/src/elume/providers/in_memory.py index a23d45a..4a0db3e 100644 --- a/src/elume/providers/in_memory.py +++ b/src/elume/providers/in_memory.py @@ -29,7 +29,9 @@ class to prove the contract is implementable and internally import json +from elume.envelope.snapshot import snapshot_bytes from elume.models.strategy import Strategy +from elume.providers.contracts import SnapshotMode class InMemoryProvider: @@ -71,18 +73,38 @@ def delete_strategy(self, name: str) -> bool: """Remove a strategy by name. Returns True if it existed, False otherwise.""" return self._store.pop(name, None) is not None - def snapshot(self) -> bytes: + def snapshot( + self, + *, + mode: SnapshotMode = "full", + include_objects: bool = True, + ) -> bytes: """Return a deterministic, canonical byte serialization of provider state. - Strategies are emitted sorted by canonical ``name``. The encoding - uses canonical JSON (``sort_keys=True``, no whitespace, ASCII) so - the output is stable across Python versions and platforms. + ``mode="full"`` preserves the original flat provider dump. Strategies + are emitted sorted by canonical ``name``. The encoding uses canonical + JSON (``sort_keys=True``, no whitespace, ASCII) so the output is stable + across Python versions and platforms. + + ``mode="content_addressed"`` returns the Track 026 manifest/root-hash + format from ``elume.envelope.snapshot``. ``include_objects=False`` + emits a manifest-only artifact for callers that store objects + separately. The genotype mapping is wrapped defensively through ``json.dumps(dict(s.genotype), sort_keys=True, ...)`` so a ``MappingProxyType`` or any other ``Mapping`` subclass serializes identically to a plain dict with the same contents. 
         """
+        if mode == "content_addressed":
+            return snapshot_bytes(
+                self._store.values(),
+                mode=mode,
+                include_objects=include_objects,
+            )
+        if mode != "full":
+            raise ValueError(f"unknown snapshot mode: {mode}")
+
         payload = [
             {
                 "name": s.name,
diff --git a/tests/contract/test_provider_contract.py b/tests/contract/test_provider_contract.py
index 24f0bcb..66c1b94 100644
--- a/tests/contract/test_provider_contract.py
+++ b/tests/contract/test_provider_contract.py
@@ -17,6 +17,7 @@
 
 from __future__ import annotations
 
+import json
 from dataclasses import FrozenInstanceError, is_dataclass
 from typing import cast
 
@@ -294,6 +295,44 @@ def test_snapshot_changes_after_save(self, provider: MemoryProvider) -> None:
         after = provider.snapshot()
         assert before != after
 
+    def test_content_addressed_snapshot_mode_is_deterministic(
+        self, provider: MemoryProvider
+    ) -> None:
+        provider.save_strategy(Strategy(name="zulu", genotype={"z": 1}))
+        provider.save_strategy(Strategy(name="alpha", genotype={"a": 1}))
+
+        first = provider.snapshot(mode="content_addressed")
+        second = provider.snapshot(mode="content_addressed")
+        parsed = json.loads(first.decode("utf-8"))
+
+        assert first == second
+        assert parsed["mode"] == "content_addressed"
+        assert [entry["name"] for entry in parsed["entries"]] == [
+            "alpha",
+            "zulu",
+        ]
+        assert "root_hash" in parsed
+        assert "objects" in parsed
+
+    def test_content_addressed_manifest_only_mode_omits_objects(
+        self, provider: MemoryProvider
+    ) -> None:
+        provider.save_strategy(Strategy(name="alpha"))
+
+        parsed = json.loads(
+            provider.snapshot(
+                mode="content_addressed",
+                include_objects=False,
+            ).decode("utf-8")
+        )
+
+        assert parsed["mode"] == "content_addressed"
+        assert "objects" not in parsed
+
+    def test_unknown_snapshot_mode_raises(self, provider: MemoryProvider) -> None:
+        with pytest.raises(ValueError, match="snapshot mode"):
+            provider.snapshot(mode="unknown")
+
+
 class TestReturnedStrategyImmutability:
     """Strategies returned by the provider are frozen dataclasses.
diff --git a/tests/unit/envelope/test_snapshot.py b/tests/unit/envelope/test_snapshot.py
index e6c10fc..0990866 100644
--- a/tests/unit/envelope/test_snapshot.py
+++ b/tests/unit/envelope/test_snapshot.py
@@ -8,8 +8,11 @@
 
 from elume.envelope.snapshot import (
     deserialize_strategies,
+    deserialize_strategies_content_addressed,
     serialize_strategies,
+    serialize_strategies_content_addressed,
     snapshot_bytes,
+    strategy_content_digest,
 )
 from elume.models.strategy import Strategy
 
@@ -154,6 +157,14 @@ def test_entry_missing_name_raises(self) -> None:
 
 
 class TestSnapshotBytesFormat:
+    def test_default_snapshot_bytes_shape_is_backward_compatible(self) -> None:
+        data = snapshot_bytes([Strategy(name="x", genotype={"a": 1})])
+        assert (
+            data
+            == b'{"strategies":[{"created_at":0.0,"genotype":{"a":1},'
+            b'"name":"x","parent_name":null}]}'
+        )
+
     def test_output_is_bytes(self) -> None:
         assert isinstance(snapshot_bytes(_make_population()), bytes)
 
@@ -167,3 +178,125 @@ def test_output_has_no_whitespace_padding(self) -> None:
         data = snapshot_bytes([Strategy(name="x", genotype={"a": 1})])
         assert b", " not in data
         assert b": " not in data
+
+
+class TestContentAddressedSnapshots:
+    def test_strategy_digest_is_stable_for_same_content(self) -> None:
+        first = Strategy(name="x", genotype={"a": 1, "b": 2})
+        second = Strategy(name="x", genotype={"b": 2, "a": 1})
+
+        assert strategy_content_digest(first) == strategy_content_digest(second)
+
+    def test_strategy_digest_changes_when_content_changes(self) -> None:
+        before = Strategy(name="x", genotype={"a": 1})
+        after = Strategy(name="x", genotype={"a": 2})
+
+        assert strategy_content_digest(before) != strategy_content_digest(after)
+
+    def test_manifest_entries_sorted_by_name(self) -> None:
+        result = serialize_strategies_content_addressed(_make_population())
+        names = [entry["name"] for entry in result["entries"]]
+        assert names == sorted(names)
+
+    def test_root_hash_is_stable_across_insertion_order(self) -> None:
+        population = _make_population()
+        shuffled = [population[2], population[0], population[1]]
+
+        first = serialize_strategies_content_addressed(population)
+        second = serialize_strategies_content_addressed(shuffled)
+
+        assert first["root_hash"] == second["root_hash"]
+        assert first["entries"] == second["entries"]
+
+    def test_root_hash_changes_when_strategy_content_changes(self) -> None:
+        before = serialize_strategies_content_addressed(
+            [Strategy(name="x", genotype={"a": 1})]
+        )
+        after = serialize_strategies_content_addressed(
+            [Strategy(name="x", genotype={"a": 2})]
+        )
+
+        assert before["root_hash"] != after["root_hash"]
+
+    def test_same_strategy_has_same_digest_across_populations(self) -> None:
+        shared = Strategy(name="shared", genotype={"temperature": 0.7})
+        first = serialize_strategies_content_addressed([shared])
+        second = serialize_strategies_content_addressed(
+            [Strategy(name="other"), shared]
+        )
+
+        shared_digest = first["entries"][0]["digest"]
+        second_digests = {
+            entry["name"]: entry["digest"] for entry in second["entries"]
+        }
+
+        assert second_digests["shared"] == shared_digest
+
+    def test_embedded_objects_round_trip(self) -> None:
+        population = _make_population()
+        snapshot = serialize_strategies_content_addressed(population)
+        rebuilt = deserialize_strategies_content_addressed(snapshot)
+
+        original = {_strategy_field_tuple(s) for s in population}
+        restored = {_strategy_field_tuple(s) for s in rebuilt}
+        assert original == restored
+
+    def test_manifest_only_omits_objects(self) -> None:
+        snapshot = serialize_strategies_content_addressed(
+            _make_population(),
+            include_objects=False,
+        )
+
+        assert "objects" not in snapshot
+
+    def test_manifest_only_restores_with_external_store(self) -> None:
+        embedded = serialize_strategies_content_addressed(_make_population())
+        manifest = serialize_strategies_content_addressed(
+            _make_population(),
+            include_objects=False,
+        )
+
+        rebuilt = deserialize_strategies_content_addressed(
+            manifest,
+            object_store=embedded["objects"],
+        )
+
+        original = {_strategy_field_tuple(s) for s in _make_population()}
+        restored = {_strategy_field_tuple(s) for s in rebuilt}
+        assert original == restored
+
+    def test_manifest_only_without_store_raises(self) -> None:
+        manifest = serialize_strategies_content_addressed(
+            _make_population(),
+            include_objects=False,
+        )
+
+        with pytest.raises(KeyError, match="object"):
+            deserialize_strategies_content_addressed(manifest)
+
+    def test_tampered_embedded_object_raises(self) -> None:
+        snapshot = serialize_strategies_content_addressed(_make_population())
+        digest = snapshot["entries"][0]["digest"]
+        snapshot["objects"][digest]["genotype"] = {"tampered": True}
+
+        with pytest.raises(ValueError, match="digest"):
+            deserialize_strategies_content_addressed(snapshot)
+
+    def test_snapshot_bytes_supports_content_addressed_mode(self) -> None:
+        data = snapshot_bytes(_make_population(), mode="content_addressed")
+        parsed = json.loads(data.decode("utf-8"))
+
+        assert parsed["mode"] == "content_addressed"
+        assert "root_hash" in parsed
+        assert "objects" in parsed
+
+    def test_snapshot_bytes_manifest_only_omits_objects(self) -> None:
+        data = snapshot_bytes(
+            _make_population(),
+            mode="content_addressed",
+            include_objects=False,
+        )
+        parsed = json.loads(data.decode("utf-8"))
+
+        assert parsed["mode"] == "content_addressed"
+        assert "objects" not in parsed