feat: hook versioning, per-row provenance, and M2M deploy (#145)#146
feat: hook versioning, per-row provenance, and M2M deploy (#145)#146rorybyrne wants to merge 11 commits into
Conversation
WIP checkpoint. Foundational + US1 (bundled deploy) + US2 (per-row provenance) production layer for hook versioning. Production type-checks clean; new unit tests pass. Existing test fixtures not yet migrated (suite red); Alembic migration and US3/US4/US5 endpoints pending. - Split HookDefinition -> HookIdentity (name+feature); runtime+source_ref move to the immutable, integer-versioned HookRelease entity. - New validation-domain registry: Hook/HookRelease/HookRun; HookRegistry port + Postgres adapter (row-locked gap-free versions, idempotent-on-digest releases, live-pointer advance/rollback, resolve_live snapshot, append-only hook_runs). - Tables hooks/hook_releases/hook_runs; features.* gain run_id; conventions PK srn->id; records convention_srn->convention_id. - ConventionSRN -> ConventionId across deposition/record/ingest/curation/ validation; conventions reference hooks by name; transactional bundled POST /conventions deploy fan-out. - Provenance: resolve-at-run-start snapshot; one hook_run per hook per batch; run_id threaded through ingest and deposition feature-insert paths; runners take (identity, release). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Drop hook_run_ids pass-through from ValidationCompleted, DepositionApproved,
RecordDraft, RecordPublished, HookBatchCompleted, IngestBatchPublished — it was
tramp data riding through curation/record handlers that don't use it.
Instead reconstruct {hook_name: run_id} at feature-insert time from the keys the
consumer already holds, via a feature-domain HookRunReader port (Postgres
adapter: one indexed hook_runs ⋈ hook_releases join per batch / per deposition,
not per row). hook_runs are still written at execution time; provenance is
unchanged. Keeps provenance ownership in the validation tables and avoids a new
feature→validation domain-code edge.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The HookRunReader port I'd put in the feature domain was wrong: it inverted ownership (feature dictating a validation-aggregate query) and laundered a real feature->validation dependency through an infra adapter that read validation's tables. hook_runs/hook_releases are validation-domain data, so the read capability belongs there: run_ids_for_batch / run_ids_for_deposition now live on the HookRegistry port + Postgres adapter (SQL) + HookRegistryService. The feature insert handlers call the validation service directly — an honest, declared service-to-service cross-domain read. Deleted the feature-domain reader port, its adapter, and the DI binding. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… types (#145) Reconcile the committed #145 spine to the post-spec design-revisions doc, and tighten loose string types into nominal value objects. Identity & conventions - ConventionId (slug@version) -> ConventionSlug: a frozen RootModel[str], bare unversioned slug. Conventions are now mutable; deploy is a declarative upsert by slug (ON CONFLICT DO UPDATE) with no caller version and no 409 path. - DeployConvention command DTOs renamed to DeployConvention{Schema,Hook,Release}; the shared HookDeploySpec is removed (release input is the command DTO; the internal deploy unit is deposition-domain HookDeploy). Typed names (no bare str across boundaries) - HookName promoted to a frozen RootModel[str]; .root only at PG/k8s string boundaries. HookResult.hook_name is now HookName. - New FeatureName value object for the read/feature surface: expected_features is list[FeatureName]; /data/{schema}/{feature} and the feature store/service speak features, not hooks. HookName -> FeatureName is converted once at the hook->feature boundary (validate_deposition, publish_batch). Provenance (design-revisions §6) - HookRun slimmed to a pure execution record (drop ingest/deposition/batch context + XOR check; finished_at/duration_s/oom_retries required). Add HookRunStatus.from_hook_status; delete the duplicated _run_status helpers. - run_id is carried via a run.json written into each hook output dir and read by the feature-insert handlers (read_run_ref/write_run_ref on the storage ports; RunRef value object). Delete run_ids_for_batch/run_ids_for_deposition and drop the HookRegistryService dependency from the insert handlers. hook_runs table slimmed accordingly; records keep a slug convention ref. All 1126 unit tests pass; ruff + ty(osa) clean. Alembic migration and integration-test reconciliation follow.
…ion slug (#145) Create hooks / hook_releases / hook_runs (slim, no execution-context columns); swap conventions PK from the opaque srn to the slug `id`; rename convention_srn -> convention_id on depositions / ingest_runs / records with their indexes. Pre-launch: forward-only, no data backfill. Dynamic feature tables gain `run_id` via the build_feature_table helper, not a static migration. Excludes a pre-existing, unrelated depositions.owner_id NOT NULL drift.
…ation tests (#145) Two production bugs surfaced by the integration suite once feature tables gained a NOT NULL run_id FK and conventions stored hook names as bare strings: - schema_feature_reader._hook_names: conventions.hooks is now a JSON list of plain hook-name strings, not {"name": ...} dicts — read the element directly. - DataCatalogService.resolve_table: compare the manifest's str resource name against feature_name.root (a FeatureName), not the FeatureName object, so /data/{schema}/{feature} resolves instead of 404ing. Integration tests reconciled to the new API: ConventionSlug, mutable upsert-by- slug convention repo, bundled deploy(), and a seed_hook_run conftest helper that builds the hooks->releases->runs provenance chain so feature rows satisfy the run_id FK. 1126 unit + 120 integration tests pass; ruff + ty(osa) clean.
…#145) Pre-launch with no deployed DB, so collapse all 16 incremental migrations into one `initial_schema` migration generated from tables.py. The schema is now born in its final shape — no convention_srn->convention_id rename churn, no PK swaps, and the stale depositions.owner_id NOT NULL drift is gone. Also fixes a latent index/runtime mismatch the squash surfaced: the uq_records_source unique index used .as_string() (a redundant CAST) which the bulk-publish ON CONFLICT ((source->>'type'),(source->>'id')) could not match. Switched to .astext so the index expression matches the runtime exactly. Verified: `alembic upgrade head` applies clean, `alembic check` reports zero drift (migration == tables.py), and all 120 integration + 1126 unit tests pass.
Completes US3–US5 of the hook-versioning feature on top of the existing
registry spine (deploy + provenance).
US3 — incremental release: CreateRelease mints vN+1 (201) or returns the
existing release on a duplicate digest (200, idempotent), advancing the live
pointer without touching the referencing convention; ListHooks/ListReleases/
GetRelease serve the catalog, history, and detail reads. New hooks router
(POST releases, GET catalog/history/detail) wired in ValidationProvider.
US4 — rollback: SetLive + PUT /hooks/{name}/live repoints the live pointer to
a prior release; history and already-produced rows' provenance are unchanged.
US5 — scoped machine credentials: optional second JWT issuer (extra_issuer
config) with iss-routed verification (RS256/EdDSA); Principal gains scopes +
has_scope; new RequiresScope gate (scope OR ADMIN) wired into the command/query
auth metaclasses, startup validation, and identity resolution. Deploy now
requires conventions:write; release/live require hooks:write (ADMIN still
accepted). The primary HS256 path is byte-identical when no second issuer is
configured.
Tests: handler/gate/token unit tests + DB-free hook contract tests + error-map
unit tests. 1155 unit + 7 contract + 120 integration pass; ruff + ty(osa) clean.
|
|
@greptile |
Greptile SummaryThis PR delivers three tightly coupled features on top of the existing DDD/CQRS skeleton: a versioned hook registry (identity + immutable releases + live pointer), per-row provenance via
Confidence Score: 5/5All three previously flagged blocking issues are resolved; the remaining findings are cosmetic response-field staleness and an edge case in M2M provenance when a token omits its The two prior concurrency problems (TOCTOU server/osa/util/di/fastapi.py (M2M identity resolution — missing sub guard) and server/osa/domain/validation/command/create_release.py (is_live staleness)
|
| Filename | Overview |
|---|---|
| server/migrations/versions/c6d9f4c0c3ab_initial_schema.py | Consolidated initial schema migration: creates all tables including hooks/hook_releases/hook_runs with the deferred FK fk_hooks_live_release_id, resolving the previously flagged missing-FK issue. |
| server/osa/infrastructure/persistence/repository/hook_registry.py | Postgres adapter for the hook registry: row-locked version assignment, digest idempotency, live-pointer advance, and resolve_live join — concurrency semantics match the design spec. |
| server/osa/domain/auth/service/token.py | Adds iss-routed JWT validation: reads iss without verification for routing only, then verifies against the matched issuer's key; primary HS256 path is byte-identical when no extra issuer is configured. |
| server/osa/util/di/fastapi.py | Adds M2M identity resolution: parses scopes from extra-issuer tokens and derives a stable synthetic UserId via uuid5(issuer+sub); a missing sub claim silently yields an issuer-level UUID shared across all clients from that issuer. |
| server/osa/domain/validation/command/create_release.py | Implements CreateReleaseHandler; created flag is now correctly determined inside the registry row lock (ReleaseOutcome), fixing the prior TOCTOU. The is_live response field is derived from a separate get_hook call, which is cosmetically stale. |
| server/osa/domain/ingest/handler/run_hooks.py | Adds per-batch provenance: resolves live releases once (snapshot), pre-allocates run IDs, writes run.json to each hook's output dir, records append-only hook_run rows; oom_retries now propagated from HookResult. |
| server/osa/domain/shared/command.py | Adds RequiresScope gate enforcement: checks principal.has_scope OR principal.has_role(ADMIN), consistent with query.py. |
| server/osa/domain/deposition/service/convention.py | Replaces create_convention with deploy: fans out to schema, hook registry, and convention upsert in one call; schema reuse is idempotent on id@version; hook identity upsert rejects contract changes. |
| server/osa/domain/validation/service/validation.py | Validation service now resolves live releases, records hook_run rows with oom_retries from HookResult, and writes run.json; oom_retries on the exception-catch path correctly stays 0. |
| server/osa/application/api/v1/routes/hooks.py | New hooks REST router: POST /{name}/releases, PUT /{name}/live, GET catalog/list/detail; sets 201/200 from outcome.created; thin DTO coercion only. |
Sequence Diagram
sequenceDiagram
participant Client
participant API
participant TokenService
participant ResolveIdentity
participant Handler
participant ConventionService
participant HookRegistry
participant DB
Client->>API: POST /conventions (DeployConvention)
API->>TokenService: validate_access_token(token)
Note over TokenService: Read iss (unverified) for routing
alt "iss == extra_issuer"
TokenService-->>TokenService: verify RS256 signature
else primary path
TokenService-->>TokenService: verify HS256
end
TokenService-->>API: payload
API->>ResolveIdentity: build Principal (scopes or roles)
ResolveIdentity-->>Handler: Principal
Handler->>Handler: RequiresScope(conventions:write) OR ADMIN
Handler->>ConventionService: deploy(slug, schema, hooks)
ConventionService->>DB: create/reuse schema
loop each hook
ConventionService->>HookRegistry: upsert_identity(name, feature)
HookRegistry->>DB: SELECT FOR UPDATE hooks row
ConventionService->>HookRegistry: create_release(name, runtime)
HookRegistry->>DB: INSERT hook_releases + UPDATE live_release_id
end
ConventionService->>DB: pg_insert(conventions).on_conflict_do_update
ConventionService->>DB: outbox ConventionRegistered
API-->>Client: 201 ConventionCreated
Client->>API: POST /hooks/name/releases
API->>Handler: CreateReleaseHandler
Handler->>HookRegistry: "create_release -> ReleaseOutcome(created)"
Note over HookRegistry: Digest idempotency inside row lock
API-->>Client: 201 new or 200 duplicate digest
Reviews (2): Last reviewed commit: "feat: surface OOM retry count into hook-..." | Re-trigger Greptile
| "status", sa.String(length=32), server_default=sa.text("'pending'"), nullable=False | ||
| ), | ||
| sa.Column( | ||
| "ingestion_finished", sa.Boolean(), server_default=sa.text("false"), nullable=False | ||
| ), | ||
| sa.Column("batches_ingested", sa.Integer(), server_default=sa.text("0"), nullable=False), | ||
| sa.Column("batches_completed", sa.Integer(), server_default=sa.text("0"), nullable=False), | ||
| sa.Column("published_count", sa.Integer(), server_default=sa.text("0"), nullable=False), |
There was a problem hiding this comment.
Missing FK:
hooks.live_release_id → hook_releases.id
tables.py has an explicit comment: # FK added in migration (deferrable), but this migration never adds it. The hooks table is created with live_release_id UUID nullable and the hook_releases table is created right afterwards with a back-FK to hooks.name, but no op.create_foreign_key(...) call from hooks.live_release_id to hook_releases.id appears anywhere in the migration. The database therefore has no referential-integrity guard on the live pointer — a buggy or direct-SQL write could set live_release_id to a non-existent UUID and the adapter's _to_hook / resolve_live would silently return a Hook with an orphaned pointer. The fix is an op.create_foreign_key after both tables exist, using use_alter=True or a deferred constraint to handle the circular dependency.
|
|
||
| async def run(self, cmd: CreateRelease) -> ReleaseCreated: | ||
| built_by = str(self.principal.user_id) if self.principal.user_id else None | ||
|
|
||
| # Whether this is a brand-new version (201) or an idempotent no-op (200) | ||
| # for a digest already present. Read before minting; the adapter's row | ||
| # lock still guarantees gap-free versions regardless of this hint. | ||
| seen_digests = {r.runtime.digest for r in await self.service.list_releases(cmd.name)} | ||
| release = await self.service.create_release( | ||
| cmd.name, cmd.to_runtime(), cmd.source_ref, built_by | ||
| ) | ||
| hook = await self.service.get_hook(cmd.name) | ||
| is_live = hook is not None and hook.live_release_id == release.id | ||
|
|
||
| return ReleaseCreated( | ||
| hook_name=release.hook_name, |
There was a problem hiding this comment.
created flag is TOCTOU-racy under concurrent identical submissions
seen_digests is collected via list_releases before the adapter's SELECT … FOR UPDATE row lock is acquired inside create_release. Two concurrent requests carrying the same digest can both read an empty seen_digests, call create_release in series (the lock serializes them correctly for data integrity), and both compute created=True — returning 201 for what should be the second caller's idempotent 200. The simplest fix is to derive created from a flag the adapter already knows: return a (HookRelease, bool) tuple from create_release (or compare the release's built_at to now()), so the determination happens inside the lock.
| started_at=started_at, | ||
| finished_at=finished_at, | ||
| duration_s=duration_by_hook.get(hook.name, 0.0), | ||
| oom_retries=0, |
There was a problem hiding this comment.
oom_retries always recorded as 0 in provenance
HookRun.oom_retries is hardcoded to 0 for every run, even when the hook actually retried after an OOM eviction. The OOM retry loop (which calls with_doubled_memory) lives inside the hook runner infrastructure and the retry count isn't surfaced to HookResult, so this information is currently lost at the provenance layer. At minimum the field should reflect actual retries; if the runner doesn't expose the count yet, a follow-up ticket should be opened to propagate it so the provenance record is accurate.
| oom_retries=0, | |
| oom_retries=0, # TODO: surface actual retry count from hook runner (FR-provenance) |
Addresses two PR review findings on the hook registry.
1. Missing referential integrity on the live pointer: the initial_schema
migration created hooks.live_release_id without the FK to hook_releases.id
(the tables.py comment promised one). Add it as a DEFERRABLE, use_alter FK in
both the SQLAlchemy metadata and the migration (created after both tables
exist, dropped first on downgrade) so the circular dependency is handled and
`alembic check` stays zero-drift. Verified against Postgres: FK present,
DEFERRABLE INITIALLY DEFERRED, no drift.
2. TOCTOU on the release 201-vs-200 hint: the handler read existing digests
before the adapter's row lock, so concurrent identical submissions could both
report "created". Decide `created` inside the lock instead — create_release
now returns a ReleaseOutcome{release, created} (named result, not a tuple),
and the handler maps it straight to the status code with no pre-read.
Tests + integration conftest updated for the new return type. 1155 unit + 120
integration pass; ruff + ty(osa) clean.
Addresses the third PR review finding: HookRun.oom_retries was hardcoded to 0 at every record site, so provenance always claimed zero retries even when a hook OOM-retried with doubled memory. - HookResult gains oom_retries; HookService.run_hook counts each doubled-memory retry inside its loop and sets it on the returned result (success + rejected paths). - Layer-2 cleanup: the ingest handler's parallel status_by_hook/duration_by_hook dicts collapse into one result_by_hook, from which status, duration, and oom_retries are read. - Both record sites (ingest run_hooks, deposition ValidationService) stamp the real count. The two hard-failure paths (OOM-exhausted / other exceptions) keep 0 with a comment — the count isn't surfaced through the exception and those runs produce no feature rows (provenance is best-effort by design). Tests: run_hook counting (0 / 1 / 2 retries with memory doubling) and end-to-end that the recorded HookRun carries the count. 1156 unit + 120 integration pass; ruff + ty(osa) clean.
|
@greptile |
The optional second (machine) JWT issuer now accepts EdDSA/Ed25519 exclusively — the algorithm is no longer configurable. ExtraIssuerConfig drops its `algorithm` field and the verify path pins `algorithms=["EdDSA"]`, so the token header's `alg` is never trusted and can't downgrade to `none` or to HS256 treating the configured PEM as a shared secret. Operators only supply the issuer, Ed25519 public key, audience, and scope claim. Tests cover the EdDSA happy path, wrong-key/wrong-aud rejection, and that an RS256-signed token for the trusted issuer is rejected.
Implements GitHub issue #145 end-to-end: a versioned hook registry, exact per-row provenance, and scoped machine (M2M) deploy credentials. Built on the existing DDD/CQRS layering (API → Command/Query → Service → Repo), following the feature's
architecture.mdanddesign-revisions.md.What's in it
Registry spine (foundational)
Hookidentity (name + fixed output contract) · immutable, integer-versionedHookRelease· live pointer · append-onlyHookRun.HookRegistryadapter: row-locked, gap-free monotonic versions; live-pointer advance/repoint; digest idempotency; live-release snapshot.initial_schemamigration (zero-driftalembic check);records.convention_idtypedConventionSlug; conventions are unversioned, slug-keyed, mutable.US1 — single-call deploy (
POST /conventions): one atomic call creates/versions the schema, upserts hooks + first releases (+ feature tables), and upserts the convention by slug (declarative upsert, no 409).US2 — provenance: every feature row carries
run_id → hook_runs → hook_releases, yielding the exact version/digest/config/source_ref that produced it.run_idis carried via arun.jsonin each hook's output dir (no DB-lookup reconstruction). The live release is resolved once at run start and snapshotted, so in-flight runs are unaffected by a mid-run pointer move.US3 — incremental release (
POST /hooks/{name}/releases): mints vN+1 (201) or returns the existing release on a duplicate digest (200, idempotent), advancing live without touching the convention. Catalog/history/detail reads:GET /hooks,GET /hooks/{name}/releases[/{version}].US4 — rollback (
PUT /hooks/{name}/live): repoints live to a prior release; history and already-produced rows' provenance are preserved.US5 — scoped M2M credentials: optional second JWT issuer (
auth.extra_issuer) withiss-routed verification (RS256/EdDSA).Principalgainsscopes+has_scope; newRequiresScopegate (scope OR ADMIN) wired into the command/query auth metaclasses, startup validation, and identity resolution. Deploy →conventions:write; release/live →hooks:write(ADMIN still accepted). The primary HS256 path is byte-identical when no second issuer is configured.Testing
TDD throughout. 1155 unit + 7 contract + 120 integration tests pass;
ruffandty check osaclean. New this PR: handler/gate/token unit tests, DB-free hook contract tests, and centralized error-mapping unit tests. The integration suite exercises the registry adapter, migration, and provenance chain against real Postgres.Notes