From 52f24507ee87c7d03c1adad3256fd4453d4f9c33 Mon Sep 17 00:00:00 2001 From: Claude Date: Thu, 18 Jun 2026 00:20:29 +0000 Subject: [PATCH 1/7] docs: add contributor guide for idempotency & partition keys Design guidance for external PRs adding pg-boss-style idempotency keys (refs #293) and per-partition serialization. Maps both features onto PgQue's existing layering: sidecar tables (rotation-safe), send wrappers that reduce to insert_event(), consumer-side gating instead of engine changes, maint-cycle expiry to bound bloat. Co-Authored-By: Claude Opus 4.8 (1M context) Claude-Session: https://claude.ai/code/session_01WuaYcu1XXsVEpsnLhF1FFu --- blueprints/IDEMPOTENCY_AND_PARTITIONS.md | 287 +++++++++++++++++++++++ 1 file changed, 287 insertions(+) create mode 100644 blueprints/IDEMPOTENCY_AND_PARTITIONS.md diff --git a/blueprints/IDEMPOTENCY_AND_PARTITIONS.md b/blueprints/IDEMPOTENCY_AND_PARTITIONS.md new file mode 100644 index 00000000..3d96b2ef --- /dev/null +++ b/blueprints/IDEMPOTENCY_AND_PARTITIONS.md @@ -0,0 +1,287 @@ +# Contributor guide: idempotency keys & partition keys + +Status: design guidance for external PRs (refs #293). +Audience: contributors adding pg-boss-style **idempotency keys** and +**per-partition serialization** ("one job at a time per partition key") on top +of PgQue. + +This document is the answer to "how should I approach these two features?". It +maps each feature onto PgQue's existing layering so a PR reduces cleanly to PgQ +primitives, survives table rotation, and does not touch the sacred engine. + +--- + +## 0. The three facts that shape both designs + +Before writing any SQL, internalize these properties of the engine. They are +what make the naive approaches wrong. + +1. **Event data tables rotate and get truncated.** Each queue stores events in + `queue_ntables` (default 3) round-robin data tables + (`_`). Rotation recycles the oldest table with + `TRUNCATE` every `queue_rotation_period` (default 2h). 
Anything you want to + remember *for longer than one rotation* — a dedup ledger, a partition lease — + **cannot live in the event tables.** It must live in its own sidecar table. + The existing `pgque.delayed_events` holding table (see + `sql/experimental/delayed.sql`) is the precedent to copy. + +2. **`send` must reduce to `insert_event`.** Design rule #3 in `CLAUDE.md`: any + producer API must be explainable as "calls `pgque.insert_event(queue, type, + data)` with these args." Both features are *wrappers around* `insert_event`, + not replacements for it. + +3. **The batch/tick/snapshot engine is sacred** (design rule #2). A batch is a + snapshot window: `next_batch` + `get_batch_events` + `batch_event_sql` + deliver *every* event committed in the window to the consumer at once. Do + **not** modify `batch_event_sql`, `next_batch`, rotation, or consumer + tracking. Partition serialization must be built *on top of* these + primitives (consumer-side gating), never inside them. + +### Where code goes + +`build/transform.sh` assembles `sql/pgque.sql` from the transformed PgQ core +plus every file in `sql/pgque-additions/` (shipped in the default install) and +**excludes** `sql/experimental/`. So: + +- Land new features in `sql/experimental/.sql` first (opt-in, not in + the default single-file install), with tests in `tests/` registered in + `tests/run_experimental.sql`. +- Graduate to `sql/pgque-additions/.sql` once the API is settled. +- Either way, regenerate `sql/pgque.sql` via `build/transform.sh` and commit + the source and generated file together (keep them in sync — `CLAUDE.md`). +- Every `SECURITY DEFINER` function pins `SET search_path = pgque, pg_catalog`. + Grant producer-side functions to `pgque_writer`, consumer-side to + `pgque_reader`. Re-run the deny-by-default `revoke ... from public`. +- Red/green TDD: failing `tests/test_*.sql` first, then the implementation. + CI runs PG 14–18. + +Two features → **two PRs.** Keep changes surgical (one feature each). 
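The function conventions in the list above (pinned `search_path`, grant by role, deny-by-default revoke) can be sketched as follows. This is a minimal illustration, not a proposed API: the wrapper name `pgque.example_send` is hypothetical, while `pgque.insert_event` and the `pgque_writer` role are the ones this guide names. It assumes pgque is already installed.

```sql
-- Hypothetical producer-side wrapper: shows the conventions only.
create or replace function pgque.example_send(i_queue text, i_payload text)
returns bigint as $$
begin
    -- reduces to a single insert_event() call (design rule #3)
    return pgque.insert_event(i_queue, 'default', i_payload);
end;
$$ language plpgsql security definer set search_path = pgque, pg_catalog;

-- deny by default, then grant execute to the producer role only
revoke all on function pgque.example_send(text, text) from public;
grant execute on function pgque.example_send(text, text) to pgque_writer;
```

A consumer-side function would follow the same pattern with `pgque_reader` as the grantee.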
+ +--- + +## 1. Idempotency keys (issue #293) + +> "the same send within a timeframe results in a no-op" + +### Why not a unique index on the event table + +That is the obvious move and it is wrong here: the unique index would live on a +rotating data table, so it (a) only dedups within the current table, and (b) is +destroyed on the next `TRUNCATE`. You would get non-deterministic dedup windows +tied to rotation timing. The dedup ledger must be a **separate, non-rotated +table** with an explicit TTL you control. + +### Recommended shape + +A sidecar table keyed by `(queue, idempotency_key)` with an expiry column, plus +a thin `send` wrapper that claims the key atomically before inserting the event. + +```sql +create table if not exists pgque.idempotency_key ( + ik_queue_name text not null, + ik_key text not null, + ik_msg_id bigint, -- event id produced on first send + ik_expires_at timestamptz not null, + constraint idempotency_key_pkey primary key (ik_queue_name, ik_key) +); +create index if not exists ik_expires_idx + on pgque.idempotency_key (ik_expires_at); +``` + +```sql +-- pgque.send_idempotent(queue, key, payload, ttl) +-- First send within the TTL window inserts the event and records the key. +-- Repeat sends with the same (queue, key) inside the window are a no-op and +-- return the original msg_id. Reduces to one insert_event() call. +create or replace function pgque.send_idempotent( + i_queue text, i_key text, i_payload text, + i_ttl interval default '1 hour') +returns bigint as $$ +declare + v_msg_id bigint; + v_now timestamptz := now(); +begin + -- Atomic claim: the unique index is the serialization point. A concurrent + -- duplicate loses the race and takes the "already present" branch. 
+ insert into pgque.idempotency_key (ik_queue_name, ik_key, ik_expires_at) + values (i_queue, i_key, v_now + i_ttl) + on conflict (ik_queue_name, ik_key) do update + -- only "win" the upsert if the prior key has expired + set ik_expires_at = excluded.ik_expires_at, + ik_msg_id = null + where pgque.idempotency_key.ik_expires_at <= v_now + returning ik_msg_id into v_msg_id; + + if not found then + -- live duplicate: row exists and is unexpired, upsert WHERE filtered it + select ik_msg_id into v_msg_id + from pgque.idempotency_key + where ik_queue_name = i_queue and ik_key = i_key; + return v_msg_id; -- no-op, original id (may be the same event) + end if; + + -- we own the claim (fresh or expired-and-reclaimed): produce the event + v_msg_id := pgque.insert_event(i_queue, 'default', i_payload); + update pgque.idempotency_key + set ik_msg_id = v_msg_id + where ik_queue_name = i_queue and ik_key = i_key; + return v_msg_id; +end; +$$ language plpgsql security definer set search_path = pgque, pg_catalog; +``` + +Design decisions to settle in the PR (call them out explicitly): + +- **Return value on duplicate.** pg-boss returns `null` for a rejected + duplicate. PgQue can do better by storing `ik_msg_id` and returning the + original event id, so callers get an idempotent *result*, not just an + idempotent *side effect*. Pick one and document it. +- **TTL semantics.** Is the window "since first send" (above) or "since last + send" (sliding)? The above is fixed-from-first; a sliding window just bumps + `ik_expires_at` on every hit. pg-boss's `singletonKey` is closer to + fixed-window-per-slot — match the semantics you actually need. +- **Transaction visibility.** If the producer rolls back, the key insert rolls + back with it (same transaction) — correct. Document that `send_idempotent` + is meant to run in the caller's transaction. + +### The part that actually fixes their bloat: expiry maintenance + +Their pain is unbounded growth, so the ledger must self-prune. 
Add a maint +step modeled on `maint_deliver_delayed()` and hook it into `pgque.maint()`: + +```sql +create or replace function pgque.maint_expire_idempotency() +returns integer as $$ +declare cnt integer; +begin + delete from pgque.idempotency_key where ik_expires_at <= now(); + get diagnostics cnt = row_count; + return cnt; +end; +$$ language plpgsql security definer set search_path = pgque, pg_catalog; +``` + +This bounds the dedup table to roughly `throughput × TTL` rows regardless of +backlog — which is exactly the property they could not get from pg-boss. + +### Tests (red first) + +- send same `(queue, key)` twice inside TTL → exactly one event in the batch. +- send same key after TTL expiry (or after `maint_expire_idempotency()`) → a + second event is produced. +- two concurrent `send_idempotent` with the same key → exactly one event + (use the `tests/two_session_*.sh` pattern for the race). +- `maint_expire_idempotency()` deletes only expired rows and returns the count. + +--- + +## 2. Partition keys — "one job at a time per partition key" + +> "run 1 job at a time for a given partition key … the batch could contain 1 +> job per partition key" + +This is the harder request because it is about **consumption order / +concurrency control**, which lives in the sacred engine's territory. Split it +into two independent sub-problems and solve them separately. + +### 2a. Carrying the partition key (easy, no engine change) + +An event already has four free passthrough columns (`ev_extra1..ev_extra4`) +that survive batching and are returned by `get_batch_events` / `pgque.receive`. +Carry the partition key in one of them via the existing 7-arg +`insert_event(queue, type, data, extra1..4)`. 
A thin wrapper: + +```sql +-- pgque.send_partitioned(queue, partition_key, payload) +create or replace function pgque.send_partitioned( + i_queue text, i_partition_key text, i_payload text) +returns bigint as $$ +begin + -- partition key rides in ev_extra1; everything else is a normal send + return pgque.insert_event(i_queue, 'default', i_payload, + i_partition_key, null, null, null); +end; +$$ language plpgsql security definer set search_path = pgque, pg_catalog; +``` + +No schema change, still reduces to `insert_event`. The `pgque.message` type +already exposes `extra1`, so consumers see the key without API changes. + +### 2b. Serializing per key (the real work) + +The PgQ batch model hands a consumer *all* events in the tick window at once; +it has no built-in "skip events whose partition is busy." Do **not** try to add +that to `batch_event_sql`. Gate it **consumer-side**, on top of the existing +`next_batch` / `get_batch_events` / `event_retry` primitives. Three viable +approaches, in order of how well they match the request: + +1. **Partition lease table (recommended for true "one at a time").** + A non-rotated sidecar holding the currently in-flight key per queue: + + ```sql + create table if not exists pgque.partition_lease ( + pl_queue_name text not null, + pl_partition_key text not null, + pl_msg_id bigint not null, + pl_leased_at timestamptz not null default now(), + constraint partition_lease_pkey primary key (pl_queue_name, pl_partition_key) + ); + ``` + + A partition-aware receive walks the batch in `ev_id` order and, for each + distinct partition key, tries to claim the lease + (`insert ... on conflict do nothing`). The first event for a free key is + delivered; any further event whose key is already leased is **deferred** + back into retry via `pgque.event_retry(batch, ev_id, delay)` instead of + being returned. 
`ack`/`nack` for a leased message releases the lease + (`delete from partition_lease`), letting the next event for that key through + on a subsequent batch. Net effect: at most one in-flight job per key, across + all workers, with no engine change. Add a TTL/`pl_leased_at` reaper to the + maint cycle so a crashed worker's lease cannot wedge a partition forever. + +2. **Cooperative consumers (already in the tree).** + `sql/pgque-api/cooperative_consumers.sql` lets N members of one logical + consumer split a queue. Hashing `partition_key → member` gives you + *parallelism bounded by key* (all events for a key go to the same member), + which is often what people actually want. It does **not** by itself + guarantee strictly one in-flight per key within a member — combine with + per-key ordering in the worker, or with approach 1, if strictness matters. + +3. **At-most-one-per-key-per-batch filter.** + A `receive` variant that returns at most one event per distinct + `partition_key` in the current batch and defers the rest via `event_retry`. + Simpler than the lease table but only serializes *within a batch*, not + across concurrent consumers — weaker guarantee. Useful as a stepping stone. + +### Ordering caveat to flag in the PR + +PgQ batches are snapshot windows, so cross-batch ordering is by `ev_id` but a +deferred (retried) event reappears in a *later* batch. If they need strict +FIFO *within* a partition, the lease approach must also process a partition's +events in `ev_id` order and not advance the partition past a deferred event. +Make this an explicit, documented guarantee (or non-guarantee) — it is the +subtle part reviewers will care about. + +### Tests (red first) + +- two events, same partition key: first `receive` returns event 1 and leases + the key; event 2 is deferred (not returned) until event 1 is acked. +- two events, different keys: both delivered in the same batch. +- two concurrent consumers, same key: only one gets the event (lease race). 
+- crashed worker (lease never released) → reaper frees it after TTL. + +--- + +## 3. Suggested PR sequence + +1. **PR 1 — idempotency keys.** Sidecar table + `send_idempotent` + + `maint_expire_idempotency` hooked into `maint()`; tests; land in + `sql/experimental/` first. Closes #293. +2. **PR 2 — partition keys.** `send_partitioned` (key in `ev_extra1`) + + partition-aware receive over a lease table + lease reaper; tests. Open a + tracking issue first to settle the ordering guarantee before coding. + +Both follow the same rules: pin `search_path`, grant by role +(`send_*` → `pgque_writer`, partition receive → `pgque_reader`), keep +`pgque.sql` regenerated, never touch the batch/tick/rotation engine, and write +the failing test before the implementation. From 651736bb7fe3555806eb4715582d53d56a79ec36 Mon Sep 17 00:00:00 2001 From: Claude Date: Fri, 19 Jun 2026 05:02:22 +0000 Subject: [PATCH 2/7] docs: add idempotency/dedup design decision note Records why pgque's producer idempotency is a TTL window (log model, not job queue) and why free-once-processed belongs on the consumer side. Includes prior-art survey (SQS/NATS/Rabbit vs pg-boss/Oban/River/ Graphile/Hatchet, pgmq gap) and the producer GC fork. Internal blueprint; basis for the reply to #293. Not yet pushed. Co-Authored-By: Claude Opus 4.8 (1M context) Claude-Session: https://claude.ai/code/session_01WuaYcu1XXsVEpsnLhF1FFu --- blueprints/IDEMPOTENCY_DESIGN.md | 278 +++++++++++++++++++++++++++++++ 1 file changed, 278 insertions(+) create mode 100644 blueprints/IDEMPOTENCY_DESIGN.md diff --git a/blueprints/IDEMPOTENCY_DESIGN.md b/blueprints/IDEMPOTENCY_DESIGN.md new file mode 100644 index 00000000..f851d653 --- /dev/null +++ b/blueprints/IDEMPOTENCY_DESIGN.md @@ -0,0 +1,278 @@ +# Idempotency & dedup: design decision and prior art + +Status: internal design note (not for the README/docs; basis for the Slack +reply to Fabrizio and for the eventual feature PRs). Refs issue #293. 
+ +This note records *why* pgque's idempotency feature has the shape it does. The +short version: pgque is a **log**, not a job queue, and that single fact +determines the entire design. + +--- + +## 1. TL;DR — the decision + +Fabrizio asked for two things, framed as one ("idempotency keys") plus one +("one job at a time per partition key"). After working it through, they are +**two separate features at two different layers**, and the split is forced by +the log model: + +1. **Producer idempotency = a TTL/window dedup, enforced at produce time.** + A duplicate `send` with the same key inside a time window is a no-op that + returns the original event id. Append-only, garbage-collected by the + existing table rotation. This is what SQS, NATS JetStream, and the RabbitMQ + dedup plugin all do — because they are logs/brokers too. + +2. **"Free once processed" (pg-boss `singletonKey`) = a consumer-side, + per-consumer key lease.** This is the "one in-flight per partition key" + feature. It lives on the read side because that is the *only* place where + "processed" is a well-defined fact. + +The thing that *cannot* exist: a producer-side "reject the duplicate until the +prior one is processed" in a log. Section 3 explains why, three independent +ways. Section 5 is the prior-art evidence. + +--- + +## 2. The model: pgque is a log, not a job queue + +PgQ (and therefore pgque) is an append-only event **log** with independent +consumer cursors: + +- Producers append events to the current data table. Events are **never** + updated or deleted on consumption — `finish_batch` only advances a + per-subscription tick cursor (`subscription.sub_last_tick`). Events physically + vanish only when table rotation `TRUNCATE`s their child table. +- A queue can have **many independent consumers** (fan-out). Each has its own + cursor. An event can be done for consumer A and still pending for consumer B. 
+- Rotation recycles the oldest of N child tables (default 3) every + `queue_rotation_period` (default 2h), and only once no consumer still needs + it. **Rotation is the only garbage collector.** + +A job queue (pg-boss, Oban, River, Graphile Worker) is the opposite: each job +is a **mutable row** consumed once by one logical worker pool, carrying a +`state` column that is `UPDATE`d (`created → active → completed`). "Processed" +is a global, singular property of the row. + +That difference is the whole story. + +--- + +## 3. Why "free once processed" cannot be a producer feature in a log + +Three independent arguments, all pointing the same way. + +### 3.1 The model argument + +"Processed" is a **per-consumer** fact. In a fan-out log the question "is key K +processed?" has no answer without naming a consumer — K can be processed by A +and pending for B simultaneously. A producer sits before the fan-out; it has no +single "processed" state to free a key against. The predicate is not just hard +to compute, it is **undefined** at the producer. + +### 3.2 The mechanics argument + +The engine *does* expose one aggregate signal a producer could read: +`min(sub_last_tick)` across all subscriptions (it is exactly what rotation uses +to decide when a table is safe to truncate). So one could, in principle, probe +"has every consumer's cursor passed K's event?" But: + +- **"Free once ALL consumers processed"** means the key stays reserved until the + *slowest* consumer drains it; a lagging or dead consumer **wedges the key** + indefinitely (bounded only by a TTL backstop). +- **"Free once ANY consumer processed"** breaks the guarantee for the laggards: + if B has not yet seen the first K and the producer re-sends K, B now has two + in-flight copies of K — the exact duplicate the feature was meant to prevent. 
+ +Either way, **producer dedup behavior becomes a function of consumer lag.** That +is operationally surprising (a dead consumer silently changes whether your sends +deduplicate) and conceptually backwards for a log, whose entire value is that +producers and consumers are decoupled. + +### 3.3 The prior-art argument + +No system in the field does append-only "free once processed." Every system that +offers it is a job queue that pays for it with a per-row state `UPDATE`. Every +log that does business-key dedup uses a wall-clock TTL window instead. This is +not an oversight — it is structural: "free once processed" must *observe* +"processed," and "processed" is row state. See §5. + +--- + +## 4. Recommended designs + +### 4.1 Producer idempotency — TTL window dedup (variant 1) + +Contract: `send` with an idempotency key is deduplicated against other sends +with the same key **within a time window**. Freeing is by wall clock, not by +consumption — identical to SQS's "tracking continues even after the message has +been received and deleted." + +Why this is the right (and only coherent) producer-side option for a log: §3. + +Why it does **not** reproduce pg-boss's bloat — the point that matters most for +Fabrizio: the dedup state is sized by **`throughput × window`**, not by the +backlog. pg-boss bloats because its state grows with the *pending pile* (millions +of stuck jobs, each an indexed mutable row). A TTL dedup ledger is bounded by the +send rate times a short window, completely independent of how far behind the +consumers are. The failure mode he is fleeing does not exist here even in the +naive implementation. 
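To make the sizing claim concrete, here is a back-of-the-envelope calculation. All three numbers are made-up assumptions for illustration, not measurements, and the worst case of every send carrying a distinct key is assumed.

```sql
-- Illustrative numbers only (assumptions): 200 sends/s, 1-hour dedup window,
-- 5 million events still pending on the consumer side.
with assumed(sends_per_sec, window_len, consumer_backlog) as (
    values (200, interval '1 hour', 5000000)
)
select (sends_per_sec * extract(epoch from window_len))::bigint
           as approx_ledger_rows,   -- rate x window = 720000
       consumer_backlog             -- carried along but never used in the estimate
from assumed;
```

The backlog column is there only to show that it plays no part in the estimate: doubling the pending pile leaves the ledger size unchanged, while doubling the send rate or the window doubles it.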
+ +Shape (pseudocode-level; final SQL is a later PR): + +``` +-- non-rotated sidecar, or a rotation-partitioned sidecar (see GC fork below) +pgque.idempotency (queue, key, ev_id, expires_at) -- unique (queue, key) + +function pgque.send_idempotent(queue, key, payload, ttl): + insert into pgque.idempotency (queue, key, ev_id, expires_at) + values (queue, key, , now() + ttl) + on conflict (queue, key) do nothing + -- if inserted: produce the real event, record ev_id, return (ev_id, deduped=false) + -- if conflict and not expired: return (existing ev_id, deduped=true) + -- if conflict and expired: reclaim the row, produce, return (new ev_id, deduped=false) +``` + +**Return contract.** pg-boss returns `null` on a deduped send (the caller gets +nothing). The log brokers do better — SQS returns a fresh `MessageId`, NATS sets +`PubAck.duplicate = true`. pgque should **return the existing event id plus a +`deduplicated` boolean**: strictly more useful than pg-boss, and free since the +dedup row already stores the id. + +**The one open engineering fork — how the ledger is GC'd:** + +- **(X) Non-partitioned table, global `unique (queue, key)` + `expires_at`, + pruned by a `maint`-cycle DELETE reaper.** Exact, predictable window; dedup is + a single `on conflict`. Cost: per-row delete churn → autovacuum on a small hot + table. (This is the in-tree precedent — `delayed_events` + a `maint_*` step, + and the DLQ's unique-index-on-conflict pattern.) +- **(Y) Rotation-partitioned ledger (or the key carried in the event stream), + GC'd by `TRUNCATE`/`DROP` of old buckets — append-only, zero vacuum.** Cost: + Postgres requires the partition key inside any unique constraint, so + uniqueness is per-bucket → a key can recur across buckets → dedup needs a probe + across the live buckets (the "previous-child probe" / sawtooth window). 
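A rough sketch of what option (Y) could look like with stock PostgreSQL declarative partitioning. Every name here (`idem_ledger`, the `il_*` columns, the two buckets) is hypothetical and not part of pgque, and how the current bucket is chosen and advanced is deliberately left out. The sketch only illustrates the two costs named above: the bucket column must be part of the primary key, so the duplicate check has to probe all live buckets, and GC becomes a `TRUNCATE` of the oldest bucket.

```sql
-- Hypothetical names throughout; not pgque schema.
create table idem_ledger (
    il_queue_name text   not null,
    il_key        text   not null,
    il_ev_id      bigint,
    il_bucket     int    not null,
    -- PostgreSQL requires the partition column in any unique constraint,
    -- so uniqueness holds per bucket, not globally
    primary key (il_queue_name, il_key, il_bucket)
) partition by list (il_bucket);

create table idem_ledger_b0 partition of idem_ledger for values in (0);
create table idem_ledger_b1 partition of idem_ledger for values in (1);

-- first send of a key lands in the current bucket (assumed to be 1 here)
insert into idem_ledger values ('files', 'order-42', 1001, 1)
on conflict do nothing;

-- duplicate check on a later send: probe every live bucket, not just the current one
select il_ev_id
from idem_ledger
where il_queue_name = 'files'
  and il_key = 'order-42'
  and il_bucket in (0, 1)
limit 1;

-- GC: recycle the oldest bucket without per-row deletes or vacuum
truncate idem_ledger_b0;
```

The sketch ignores the race between the probe and the insert across buckets, which a real implementation would have to close, for example by serializing on the key. Option (X) avoids that problem because a single global unique index is the serialization point.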
+ +Net: **vacuum-churn (X) vs probe-cost (Y).** X's churn is window-bounded and +modest here (not pg-boss's monster); Y is append-only but pays a small +multi-bucket read per send and has a ragged window at the rotation boundary. +This is the single decision to make before writing the producer PR. + +### 4.2 Free-once-processed — consumer-side per-key lease (the partition feature) + +This is where "free once processed" legitimately lives, because a single +consumer's in-flight set is that consumer's own concern, small, and well-defined. + +- Carry the partition/idempotency key on the event (`ev_extra1`, no schema + change) via a `send_partitioned(queue, key, payload)` wrapper. +- A per-consumer lease sidecar: when a consumer receives an event for key K, it + claims the lease (`insert ... on conflict do nothing`); a second event for K is + **deferred** (re-queued via `event_retry`) until the first is acked, at which + point the lease is released. Net effect: at most one in-flight job per key per + consumer. Add a lease TTL reaper so a crashed worker cannot wedge a key. +- Policy knob: **drop** the duplicate (idempotency flavor) vs **defer** it + (serialization flavor) — same machinery, two surfaces. + +Because the lease is per-consumer, the fan-out ambiguity of §3.2 disappears: +each consumer enforces its own "one in-flight per key" without reference to any +other consumer. + +--- + +## 5. Prior art (evidence for §3.3) + +Sorted by the log-vs-job-queue axis. All facts are from primary sources +(source DDL / official docs); URLs inline. + +### Logs / brokers that do business-key dedup → all variant-1 (TTL window), produce-side + +| System | Mechanism | Freeing | Notes | +|---|---|---|---| +| **AWS SQS FIFO** | server dedup-ID set per queue (or SHA-256 of body) | fixed **5-min window**, wall-clock | docs: "continues tracking the deduplication ID **even after the message has been received and deleted**" — explicitly *not* free-once-processed. 
Returns success + a fresh `MessageId`. | +| **NATS JetStream** | per-stream table keyed by `Nats-Msg-Id` | configurable `duplicate_window`, **default 2 min** | `PubAck.duplicate = true` on a suppressed write. | +| **RabbitMQ** (noxdafox plugin) | in-mem cache keyed by `x-deduplication-header`, bounded by `x-cache-size` | TTL `x-cache-ttl` | core RabbitMQ has **no** dedup. | +| **Kafka** idempotent producer | per-(PID, partition) sequence numbers | n/a | **not** business-key dedup — only same-producer-instance retry dedup; new PID on restart. | +| **GCP Pub/Sub** | server `message_id` redelivery suppression | per-message, no redeliver after ack | **not** producer-key dedup — two `publish()` of the same logical message are two messages. | + +Sources: SQS [using-messagededuplicationid](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/using-messagededuplicationid-property.html), +[FIFO exactly-once](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/FIFO-queues-exactly-once-processing.html); +NATS [model_deep_dive](https://docs.nats.io/using-nats/developer/develop_jetstream/model_deep_dive); +RabbitMQ plugin [README](https://github.com/noxdafox/rabbitmq-message-deduplication); +Kafka [KIP-98](https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging); +Pub/Sub [exactly-once-delivery](https://cloud.google.com/pubsub/docs/exactly-once-delivery). + +### Job queues that do free-once-processed → all rely on a mutable per-row `state` column + +| System | Mechanism | Freeing | Per-row mutation? | +|---|---|---|---| +| **pg-boss** | partial unique indexes on `(name, COALESCE(singleton_key,''))` **predicated on the mutable `state` column** (`job_i1/i2/i3/i6`, e.g. `WHERE state <= 'active'`) | `UPDATE ... 
SET state='completed'` pushes the row out of the index predicate | **Yes** — and the index-on-mutable-`state` is the documented bloat source (HOT updates defeated; terminal rows linger under retention until `DELETE` + vacuum). Returns `null` on dedup. | +| **Oban** | `pg_try_advisory_xact_lock` + `SELECT` over `state ∈ states` within `period` (no DB constraint) | state leaves the watched set, or `period` (default 60s) elapses | **Yes**, in-place state UPDATE. Docs admit it is "prone to race conditions." | +| **River** (v0.12+) | partial unique index on `unique_key`, predicate over a per-row `unique_states BIT(8)` bitmask | row's `state` leaves the bitmask → drops out of the index, no cleanup job | **Yes** — elegant ("free on completion for free") but works *only because* `state` is an UPDATE'd column. | +| **Graphile Worker** | `UNIQUE (key)` on the job row, `INSERT ... ON CONFLICT (key) DO UPDATE` | job completes → **row DELETEd** | **Yes** (replace/upsert + delete-on-complete). | +| **Hatchet** | side `WorkflowRunDedupe` table, `UNIQUE (tenantId, workflowId, value)`, reject on conflict | run reaches terminal state → dedup row removed | side-table registry as a lock. | + +Sources: pg-boss [`src/plans.js`](https://cdn.jsdelivr.net/npm/pg-boss/src/plans.js) (partial unique indexes + `completeJobs`); +Oban [unique_jobs](https://hexdocs.pm/oban/unique_jobs.html) + `lib/oban/engines/basic.ex`; +River [unique-jobs](https://riverqueue.com/docs/unique-jobs) + migration `006_bulk_unique.up.sql`; +Graphile Worker [job-key](https://worker.graphile.org/docs/job-key); +Hatchet `WorkflowRunDedupe` migration (`20240726160629_v0_40_0.sql`). + +### The peer that has nothing + +**pgmq** — the closest architectural analog to pgque (simple single-extension +Postgres queue, `send`/`read`/`pop`/`archive`) — has **no dedup or idempotency +feature at all**. `send` always inserts; two identical sends yield two messages. 
+Source: [pgmq SQL functions](https://pgmq.github.io/pgmq/latest/api/sql/functions/). + +**Takeaway:** logs do TTL-window dedup; job queues do state-based +free-once-processed and eat the per-row UPDATE for it; nobody does append-only +free-once-processed. pgque's nearest analog ships neither — so both of pgque's +planned features are genuine differentiators, not catch-up. + +--- + +## 6. Open decisions (before writing PRs) + +1. **Producer GC fork: (X) vacuum-reaper vs (Y) rotation-partitioned probe** (§4.1). +2. **Default TTL** for the producer window, and its relation to rotation period. + Hard floor only matters for the consumer-lease variant; for pure window dedup + the TTL is just "how long do duplicate sends collapse." +3. **Return contract** confirmation: existing id + `deduplicated` flag (recommended + over pg-boss's `null`). +4. **Consumer lease**: drop-vs-defer policy surface; lease TTL reaper; whether the + lease key reuses `ev_extra1` or gets a dedicated column. +5. **Two PRs, in order**: producer window-dedup first (self-contained, closes the + spirit of #293), consumer lease second (the partition feature). + +--- + +## 7. Slack-reply-ready summary (for Fabrizio) + +> Great questions, and digging into them surfaced something important: pgque is +> a **log**, not a job queue (PgQ heritage — append-only events, independent +> consumer cursors, no per-row state). That changes how idempotency has to work. +> +> pg-boss's `singletonKey` ("dedupe until the job is processed") is implemented +> with a partial unique index on a mutable `state` column that gets `UPDATE`d to +> `completed` — and that index-on-mutable-state is *exactly* the write +> amplification / bloat you're migrating away from. We don't want to reintroduce +> it. +> +> In a log, "processed" is a per-consumer fact the producer can't see, so +> "free-once-processed" can't be a producer feature. So we'd split it: +> +> 1. 
**Producer idempotency** = a dedup **window** (like SQS's dedup ID or NATS's +> `Nats-Msg-Id` window) — a duplicate `send` with the same key inside the +> window is a no-op returning the original id. Append-only, GC'd by our table +> rotation, and crucially **sized by throughput × window, not by backlog** — +> so it can't bloat the way pg-boss does when consumers fall behind. +> 2. **"One in-flight per key" / free-once-processed** = a **consumer-side** key +> lease (this is also your partitions ask). It lives on the read side because +> that's the only place "processed" is defined, and it's per-consumer so +> fan-out stays clean. +> +> Both are things our closest analog (pgmq) doesn't have, so we're keen on the +> contributions. Happy to pair on the produce-side window dedup first — it's +> self-contained and closes the core of your idempotency issue. + +--- + +(This note is local only — not committed, not pushed.) From 5324db07fbb2a09f49fff09a18a8ea82334caa8a Mon Sep 17 00:00:00 2001 From: Claude Date: Fri, 19 Jun 2026 20:40:48 +0000 Subject: [PATCH 3/7] docs: add partition-keys spec and HTML brief SamoSpec-format spec (blueprints/partition-keys/SPEC.md) for consumer-side ordered, parallel consumption by partition key (Kafka-partition model: order within a key, parallelism across keys, no per-event state). Adds a self-contained on-brand HTML brief at web/public/briefs/partition-keys.html (served by Pages at /briefs/partition-keys.html on merge to main). 
Co-Authored-By: Claude Opus 4.8 (1M context) Claude-Session: https://claude.ai/code/session_01WuaYcu1XXsVEpsnLhF1FFu --- blueprints/partition-keys/SPEC.md | 218 +++++++++++++++++++++ web/public/briefs/partition-keys.html | 272 ++++++++++++++++++++++++++ 2 files changed, 490 insertions(+) create mode 100644 blueprints/partition-keys/SPEC.md create mode 100644 web/public/briefs/partition-keys.html diff --git a/blueprints/partition-keys/SPEC.md b/blueprints/partition-keys/SPEC.md new file mode 100644 index 00000000..32542ffa --- /dev/null +++ b/blueprints/partition-keys/SPEC.md @@ -0,0 +1,218 @@ +# PgQue Partition Keys — Spec + +- **Version:** v0.1 (draft) +- **Status:** draft for review; single-pass lead draft in SamoSpec format + (the live GPT+Claude review panel was not run in this environment) +- **Slug:** partition-keys +- **Scope:** consumer-side ordered, parallel consumption by partition key. + Producer-side idempotency/dedup is a *separate* spec (deferred — see §11). + +--- + +## 1. Goal + +Add a **partition key** to PgQue so that, within one queue, events sharing a key +are consumed **in order by a single consumer at a time**, while events with +different keys are consumed **in parallel**. This is the log-native ("Kafka +partition") model: order *within* a key, parallelism *across* keys. + +Concretely: `send(queue, key, payload)` tags an event with a partition key; +a partition-aware consumer guarantees that for any given key, its events are +delivered in `ev_id` order to exactly one worker at a time. + +## 2. Why it's needed + +PgQue is an **ordered, immutable log**, not a job queue. Real workloads need +**per-entity ordering without global ordering**. The motivating case (Supabase +Storage, evaluating PgQue to replace pg-boss): + +- Millions of file-lifecycle events (`FileCreated`, `FileDeleted`, + `FileOverwritten`). They **must be processed in order per tenant**, but + **order across tenants does not matter**. 
+- A single in-order consumer can't keep up with millions of events; naive + multi-worker consumption breaks per-tenant order. + +Today PgQue offers no way to parallelize a queue while preserving per-key order. +Cooperative consumers exist but distribute events without key affinity, so they +do not preserve order for a key. This spec closes that gap. + +Non-goal restatement: this is **not** "one job at a time per key via locks" +(that was a job-queue framing). Ordering here is achieved by **routing**, with +no per-event lock or mutable state — consistent with PgQue's no-bloat thesis. + +## 3. Scope and ICP + +**In scope (v0.1):** +- Carry a partition key on an event. +- Partition-aware assignment: stable `hash(key) → slot` mapping over a fixed set + of N consumer slots. +- Per-key ordering guarantee across batches. +- A documented failure policy (§7, decision D2). + +**Out of scope (v0.1):** +- Producer idempotency / dedup windows (separate spec). +- Dynamic rebalancing / elastic slot count (fixed N in v0.1; §10 D3). +- Cross-queue / cascaded (multi-node) partitioning. +- Hot-partition mitigation beyond documentation. + +**ICP:** multi-tenant SaaS on managed Postgres (RDS/Aurora/Cloud SQL/AlloyDB/ +Supabase/Neon) running a high-volume per-entity event stream where entity = +partition key (tenant, user, document, device). + +## 4. End-to-end workflow + +``` +producer: pgque.send('files', partition_key => tenant_id, payload => '{...}') + │ (key stored on the event; no new hot-path state) + ▼ +engine: append-only event tables, global ev_id order (UNCHANGED, sacred) + ▼ +consumer: N partition-aware sub-consumers; slot = hash(key) % N + - each slot processes its keys in ev_id order + - a key never spans two slots → per-key order preserved + - different keys → different slots → parallel +``` + +## 5. User stories + +- **US-1 (per-tenant order):** As a consumer, when I read `files`, all events for + `tenant=42` arrive in `ev_id` order, even under N parallel workers. 
+- **US-2 (cross-tenant parallelism):** As an operator, throughput scales with N + workers because distinct tenants are processed concurrently. +- **US-3 (single processor per key):** As a consumer author, I never have two + workers processing `tenant=42` events at the same instant, so I need no + external lock on the tenant's resources. +- **US-4 (no new bloat):** As a DBA, enabling partitions adds **no per-event + UPDATE/DELETE** and no vacuum-dependent side table. +- **US-5 (failure policy is explicit):** As a consumer author, I can choose + whether a failing event **pauses its partition** (strict order) or is + **skipped** (at-least-once, possible reorder). Default per D2. + +## 6. Architecture + + +``` + ┌─────────────────────────────────────────────┐ + producers │ pgque.send(queue, partition_key, payload) │ + └───────────────────────┬─────────────────────┘ + │ key on ev_extra1 (free today) + ▼ + ┌─────────────────────────────────────────────┐ + ENGINE │ append-only event tables · global ev_id │ <-- UNCHANGED + (sacred) │ next_batch / get_batch_events / rotation │ (no edits to + └───────────────────────┬─────────────────────┘ batch_event_sql) + │ batch of events + ▼ + ┌─────────────────────────────────────────────┐ + PARTITION │ assignment: slot = hash(key) % N │ <-- NEW logic, + LAYER │ (rides on cooperative consumers) │ distribution only + └───┬───────────────┬───────────────┬─────────┘ + ▼ ▼ ▼ + slot 0 slot 1 slot N-1 + worker A worker B worker C + keys h%N==0 keys h%N==1 keys h%N==N-1 + in ev_id order in ev_id order in ev_id order +``` + + +**Key property:** the new code lives entirely in the *distribution* step of the +cooperative-consumer layer. The batch/tick/snapshot/rotation engine +(`batch_event_sql`, `next_batch`, rotation) is **not modified** (design rule: +the PgQ engine is sacred). + +## 7. 
Decisions + +| ID | Decision | Choice (v0.1) | Rationale | +|----|----------|---------------|-----------| +| D1 | Where the key lives | `ev_extra1` (no schema change) | Already carried through batching and exposed on `pgque.message`. Dedicated column can come later. | +| D2 | Failure policy (head-of-line) | **Pause the partition** by default; `skip` opt-in | Motivating workload "cares about per-tenant order". PgQ retry re-inserts with a later `ev_id`, which would reorder — so strict order must block the key until the failure resolves. | +| D3 | Elasticity | Fixed N in v0.1 | Rebalancing safely (without reordering across the change) is its own hard problem; defer. | +| D4 | Assignment function | `hashtext(key)` mod N, stable | Deterministic key→slot affinity; standard partition model. | +| D5 | No per-event state | Routing only; no lease table, no advisory lock per event | Preserves the append-only / no-vacuum thesis. | + +## 8. Implementation details + +- **Producer:** `pgque.send(queue, partition_key text, payload …)` wrapper → + `insert_event(queue, type, payload, partition_key /*ev_extra1*/, …)`. + Pure reduction to the existing primitive. +- **Assignment:** extend cooperative-consumer distribution so a sub-consumer N + receives exactly the batch events where `hashtext(ev_extra1) % total = N`. + Filtering happens in the distribution/consume layer, **not** in + `batch_event_sql`. +- **Per-key order across batches:** a key always maps to the same slot, and a + slot processes its events in `ev_id` order, so order holds across ticks. +- **Pause-on-failure (D2):** when an event for key K fails, the slot must not + advance past K for that key until K succeeds or is dead-lettered. Built on the + existing `event_retry` / DLQ primitives plus a per-slot "blocked keys" set held + **in the consumer**, not in a table. (Exact mechanism is the main design risk — + §10.) +- **Security/grants:** producer wrapper → `pgque_writer`; partition consumer → + `pgque_reader`. 
`SECURITY DEFINER` functions pin `search_path = pgque, + pg_catalog`. + +## 9. Tests plan (red/green TDD) + +Write the failing test first, then the implementation. CI matrix PG 14–18. + +- **T1 (order):** interleave events for keys A,B,A,A,B; assert each key delivered + in `ev_id` order under N≥2 slots. *(red first)* +- **T2 (parallelism):** distinct keys land on distinct slots per `hash%N`. +- **T3 (affinity/stability):** same key always → same slot across batches. +- **T4 (single processor):** two workers, same key in one batch → only one slot + ever holds it concurrently. +- **T5 (pause-on-failure):** key A event #2 fails → A#3 is NOT delivered before + #2 resolves; B continues unaffected. +- **T6 (skip mode):** with `skip` policy, A#3 proceeds after A#2 fails (reorder + allowed). +- **T7 (no bloat):** processing M events adds zero rows to any side table and + issues no per-event UPDATE/DELETE (assert via `pg_stat`/row counts). +- **T8 (engine untouched):** `batch_event_sql` text/byte-identical to baseline. + +## 10. Risks and open questions + +- **R1 — pause-on-failure mechanism.** Keeping a key "blocked" without a mutable + table, across crashes and re-delivery, is the hard part. Needs a concrete + design that survives a worker restart (likely: re-derive blocked state from the + presence of an unacked/retrying event for the key at slot start). +- **R2 — cooperative-consumer internals.** Must confirm where assignment hooks in + pgq-coop without touching the engine, and whether coop guarantees a sub-consumer + sees a key consistently. *(Next concrete investigation step.)* +- **R3 — hot partitions.** One very active key saturates its slot. v0.1: document; + no automatic mitigation. +- **R4 — fixed N / rebalancing.** Changing N reshuffles affinity and can reorder + in-flight keys. Out of scope; needs a future spec. + +## 11. 
Relationship to producer idempotency (deferred sibling) + +A separate spec covers producer-side dedup as a **TTL window** (SQS/NATS model), +append-only, GC'd by rotation. It is intentionally decoupled: in a log, +"processed" is a per-consumer fact the producer cannot see, so dedup must be a +producer-side time window, while ordering/serialization is this consumer-side +partition feature. Prior-art and rationale: `blueprints/IDEMPOTENCY_DESIGN.md`. + +## 12. Team of veteran experts (review panel) + +- **Lead (spec author):** drafts and revises. +- **Reviewer A — ops/security:** scope creep, the pause-on-failure crash story, + grants, managed-PG constraints. +- **Reviewer B — QA/testability:** ordering under concurrency, the reorder edge + in skip mode, "engine untouched" assertion. + +*(Live multi-model review loop not run here; reviewer personas listed for when +this is iterated through the actual `samospec` CLI.)* + +## 13. Sprint plan + +1. **S1 — producer + key plumbing:** `send(queue, key, payload)`, key on + `ev_extra1`, exposed on `pgque.message`. Tests T2–T3. +2. **S2 — partition-aware assignment** over cooperative consumers. Tests T1, T4, + T7, T8. Resolves R2. +3. **S3 — pause-on-failure** (D2 default) + `skip` mode. Tests T5, T6. Resolves R1. +4. **S4 — docs + benchmark** (throughput vs N; per-tenant order under load). + +## 14. Changelog + +- **v0.1 (draft):** initial single-pass SamoSpec-format draft. Defines the + partition-key consumer feature, the hash-assignment architecture, the + pause-on-failure default (D2), and the no-per-event-state constraint (D5). + Producer idempotency split out to a sibling spec. diff --git a/web/public/briefs/partition-keys.html b/web/public/briefs/partition-keys.html new file mode 100644 index 00000000..ae8776df --- /dev/null +++ b/web/public/briefs/partition-keys.html @@ -0,0 +1,272 @@ + + + + + +PgQue Brief · Partition Keys + + + + +
+ +
+

PgQue · Design Brief

+

Partition Keys
Ordered, parallel consumption: the log-native way

+
slug: partition-keys · version: v0.1 (draft) · layer: consumer-side · engine: untouched
+
+ +

Within one queue, events that share a partition key are consumed in order by a single worker at a time; events with different keys are consumed in parallel. Order within a key, parallelism across keys: Kafka's partition model, achieved by routing, with no per-event locks or mutable state.

+ +

01 · The problem

+

PgQue is an ordered, immutable log, not a job queue. The motivating workload (a multi-tenant storage service evaluating PgQue to replace pg-boss) emits millions of file-lifecycle events (FileCreated, FileDeleted, FileOverwritten). They must be processed in order per tenant, but order across tenants does not matter.

+

A single in-order consumer can't keep up; naive multi-worker consumption breaks per-tenant order. PgQue has cooperative consumers, but they distribute events without key affinity, so a key's order is not preserved. This brief closes that gap.

+ +

02 · Architecture

+
[Figure: architecture diagram. Producers call pgque.send(queue, partition_key, payload) and the key is stored in ev_extra1 (free today). The engine (append-only event tables, global ev_id order, next_batch / get_batch_events / rotation) is unchanged, with no edits to batch_event_sql. The new partition layer routes each batch of events with slot = hash(partition_key) mod N, riding on cooperative consumers with no per-event state and no lock. Slots 0 to N-1 (workers A, B, C) each process their keys in ev_id order: one key maps to one slot, so per-key order is preserved, and distinct keys run in parallel.]
The new logic lives only in the distribution step. The PgQ engine is not modified.
+
+ +

03 · Scope

+
+
+

In · v0.1

+
    +
  • Partition key carried on an event
  • Stable hash(key) % N assignment
  • Per-key order across batches
  • Explicit failure policy (pause vs skip)
+
+
+

Out · later

+
    +
  • Producer idempotency / dedup window
  • Dynamic rebalancing / elastic N
  • Cross-queue / cascaded partitioning
  • Automatic hot-partition mitigation
+
+
+ +

04 · Key decisions

+ + + + + + + + + +
| ID | Decision | Choice (v0.1) |
|----|----------|---------------|
| D1 | Where the key lives | `ev_extra1` (no schema change) |
| D2 | Failure policy (head-of-line) | Pause the partition (default); `skip` opt-in |
| D3 | Elasticity | Fixed N in v0.1 |
| D4 | Assignment function | `hashtext(key) % N`, stable |
| D5 | Per-event state | None: routing only, no lease, no lock |
+

D2 is the contract-shaping choice: PgQ retry re-delivers a failed event in a later batch, which would reorder a key, so strict order must block the key until the failure resolves.

+ +

05 · Sprint plan

+
S1  producer + key plumbing
S2  partition-aware assignment
S3  pause-on-failure + skip mode
S4  docs + throughput benchmark
+

Biggest open risk: a crash-safe pause-on-failure that holds a key "blocked" with no mutable table; the blocked state is likely re-derived at slot start from the presence of an unacked/retrying event for the key. The next concrete step is confirming where assignment hooks into cooperative consumers without touching the engine.
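
The re-derivation idea in the risk note can be pictured with a small, self-contained sketch. It is only an illustration under stated assumptions: `demo_retry` and `demo_batch` are hypothetical stand-in tables invented here, not PgQue objects, and the real retry bookkeeping may expose different names and columns.

```sql
-- Hypothetical stand-in for "events currently waiting on a retry".
-- The real engine table and its columns may differ.
create temp table demo_retry (ev_id bigint, ev_extra1 text);
insert into demo_retry values (102, 'tenant-42');

-- Hypothetical stand-in for the events of the batch a slot just opened.
create temp table demo_batch (ev_id bigint, ev_extra1 text);
insert into demo_batch values (103, 'tenant-42'), (104, 'tenant-7');

-- Pause policy, sketched: hold back every event whose key still has a
-- pending retry; deliver the rest in ev_id order.
select b.ev_id, b.ev_extra1 as partition_key
from demo_batch b
where not exists (
    select 1
    from demo_retry r
    where r.ev_extra1 = b.ev_extra1
)
order by b.ev_id;
```

Because the blocked set is computed from rows that already exist, nothing new has to be written per event, which is the property the no-per-event-state decision asks for. Whether this holds across a worker crash is exactly the open question in the note.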

+ +
+

Companion to the producer-side decision note (idempotency = TTL window, not free-once-processed) and the full spec. In a log, “processed” is a per-consumer fact the producer can't see, so ordering and serialization live here, on the consumer.

+

PgQue: zero-bloat PostgreSQL queue · pgque.dev · github.com/NikolayS/pgque

+

Drafted in SamoSpec format (single-pass lead draft; the live multi-model review loop was not run for this brief).

+
+ +
+ + From 2fd0f2ad755b8d4a958e86ba32801d972dfaa5ac Mon Sep 17 00:00:00 2001 From: Claude Date: Fri, 19 Jun 2026 21:30:48 +0000 Subject: [PATCH 4/7] =?UTF-8?q?docs(spec):=20partition-keys=20v0.2=20?= =?UTF-8?q?=E2=80=94=20review=20round=201=20applied?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Re-ground the consumer mechanism after ops/security + QA/testability review: drop the (impossible) cooperative-consumer overlay for N independent slot subscriptions filtering via get_batch_cursor extra_where; restate the guarantee as testable G1/G2/G3; correct the retry rationale (ev_id preserved, ev_txid changes); derive pause from existing retry_queue (no new table); fix send-signature collision, ev_extra1/trigger collision, unstable hashtext, fixed-N invariant, slot/owner definition. Add decisions.md and refresh the HTML brief. Remove superseded IDEMPOTENCY_AND_PARTITIONS.md contributor guide. Co-Authored-By: Claude Opus 4.8 (1M context) Claude-Session: https://claude.ai/code/session_01WuaYcu1XXsVEpsnLhF1FFu --- blueprints/IDEMPOTENCY_AND_PARTITIONS.md | 287 ---------------- blueprints/partition-keys/SPEC.md | 395 +++++++++++++---------- blueprints/partition-keys/decisions.md | 61 ++++ web/public/briefs/partition-keys.html | 224 ++++++------- 4 files changed, 379 insertions(+), 588 deletions(-) delete mode 100644 blueprints/IDEMPOTENCY_AND_PARTITIONS.md create mode 100644 blueprints/partition-keys/decisions.md diff --git a/blueprints/IDEMPOTENCY_AND_PARTITIONS.md b/blueprints/IDEMPOTENCY_AND_PARTITIONS.md deleted file mode 100644 index 3d96b2ef..00000000 --- a/blueprints/IDEMPOTENCY_AND_PARTITIONS.md +++ /dev/null @@ -1,287 +0,0 @@ -# Contributor guide: idempotency keys & partition keys - -Status: design guidance for external PRs (refs #293). -Audience: contributors adding pg-boss-style **idempotency keys** and -**per-partition serialization** ("one job at a time per partition key") on top -of PgQue. 
- -This document is the answer to "how should I approach these two features?". It -maps each feature onto PgQue's existing layering so a PR reduces cleanly to PgQ -primitives, survives table rotation, and does not touch the sacred engine. - ---- - -## 0. The three facts that shape both designs - -Before writing any SQL, internalize these properties of the engine. They are -what make the naive approaches wrong. - -1. **Event data tables rotate and get truncated.** Each queue stores events in - `queue_ntables` (default 3) round-robin data tables - (`_`). Rotation recycles the oldest table with - `TRUNCATE` every `queue_rotation_period` (default 2h). Anything you want to - remember *for longer than one rotation* — a dedup ledger, a partition lease — - **cannot live in the event tables.** It must live in its own sidecar table. - The existing `pgque.delayed_events` holding table (see - `sql/experimental/delayed.sql`) is the precedent to copy. - -2. **`send` must reduce to `insert_event`.** Design rule #3 in `CLAUDE.md`: any - producer API must be explainable as "calls `pgque.insert_event(queue, type, - data)` with these args." Both features are *wrappers around* `insert_event`, - not replacements for it. - -3. **The batch/tick/snapshot engine is sacred** (design rule #2). A batch is a - snapshot window: `next_batch` + `get_batch_events` + `batch_event_sql` - deliver *every* event committed in the window to the consumer at once. Do - **not** modify `batch_event_sql`, `next_batch`, rotation, or consumer - tracking. Partition serialization must be built *on top of* these - primitives (consumer-side gating), never inside them. - -### Where code goes - -`build/transform.sh` assembles `sql/pgque.sql` from the transformed PgQ core -plus every file in `sql/pgque-additions/` (shipped in the default install) and -**excludes** `sql/experimental/`. 
So: - -- Land new features in `sql/experimental/.sql` first (opt-in, not in - the default single-file install), with tests in `tests/` registered in - `tests/run_experimental.sql`. -- Graduate to `sql/pgque-additions/.sql` once the API is settled. -- Either way, regenerate `sql/pgque.sql` via `build/transform.sh` and commit - the source and generated file together (keep them in sync — `CLAUDE.md`). -- Every `SECURITY DEFINER` function pins `SET search_path = pgque, pg_catalog`. - Grant producer-side functions to `pgque_writer`, consumer-side to - `pgque_reader`. Re-run the deny-by-default `revoke ... from public`. -- Red/green TDD: failing `tests/test_*.sql` first, then the implementation. - CI runs PG 14–18. - -Two features → **two PRs.** Keep changes surgical (one feature each). - ---- - -## 1. Idempotency keys (issue #293) - -> "the same send within a timeframe results in a no-op" - -### Why not a unique index on the event table - -That is the obvious move and it is wrong here: the unique index would live on a -rotating data table, so it (a) only dedups within the current table, and (b) is -destroyed on the next `TRUNCATE`. You would get non-deterministic dedup windows -tied to rotation timing. The dedup ledger must be a **separate, non-rotated -table** with an explicit TTL you control. - -### Recommended shape - -A sidecar table keyed by `(queue, idempotency_key)` with an expiry column, plus -a thin `send` wrapper that claims the key atomically before inserting the event. 
- -```sql -create table if not exists pgque.idempotency_key ( - ik_queue_name text not null, - ik_key text not null, - ik_msg_id bigint, -- event id produced on first send - ik_expires_at timestamptz not null, - constraint idempotency_key_pkey primary key (ik_queue_name, ik_key) -); -create index if not exists ik_expires_idx - on pgque.idempotency_key (ik_expires_at); -``` - -```sql --- pgque.send_idempotent(queue, key, payload, ttl) --- First send within the TTL window inserts the event and records the key. --- Repeat sends with the same (queue, key) inside the window are a no-op and --- return the original msg_id. Reduces to one insert_event() call. -create or replace function pgque.send_idempotent( - i_queue text, i_key text, i_payload text, - i_ttl interval default '1 hour') -returns bigint as $$ -declare - v_msg_id bigint; - v_now timestamptz := now(); -begin - -- Atomic claim: the unique index is the serialization point. A concurrent - -- duplicate loses the race and takes the "already present" branch. 
- insert into pgque.idempotency_key (ik_queue_name, ik_key, ik_expires_at) - values (i_queue, i_key, v_now + i_ttl) - on conflict (ik_queue_name, ik_key) do update - -- only "win" the upsert if the prior key has expired - set ik_expires_at = excluded.ik_expires_at, - ik_msg_id = null - where pgque.idempotency_key.ik_expires_at <= v_now - returning ik_msg_id into v_msg_id; - - if not found then - -- live duplicate: row exists and is unexpired, upsert WHERE filtered it - select ik_msg_id into v_msg_id - from pgque.idempotency_key - where ik_queue_name = i_queue and ik_key = i_key; - return v_msg_id; -- no-op, original id (may be the same event) - end if; - - -- we own the claim (fresh or expired-and-reclaimed): produce the event - v_msg_id := pgque.insert_event(i_queue, 'default', i_payload); - update pgque.idempotency_key - set ik_msg_id = v_msg_id - where ik_queue_name = i_queue and ik_key = i_key; - return v_msg_id; -end; -$$ language plpgsql security definer set search_path = pgque, pg_catalog; -``` - -Design decisions to settle in the PR (call them out explicitly): - -- **Return value on duplicate.** pg-boss returns `null` for a rejected - duplicate. PgQue can do better by storing `ik_msg_id` and returning the - original event id, so callers get an idempotent *result*, not just an - idempotent *side effect*. Pick one and document it. -- **TTL semantics.** Is the window "since first send" (above) or "since last - send" (sliding)? The above is fixed-from-first; a sliding window just bumps - `ik_expires_at` on every hit. pg-boss's `singletonKey` is closer to - fixed-window-per-slot — match the semantics you actually need. -- **Transaction visibility.** If the producer rolls back, the key insert rolls - back with it (same transaction) — correct. Document that `send_idempotent` - is meant to run in the caller's transaction. - -### The part that actually fixes their bloat: expiry maintenance - -Their pain is unbounded growth, so the ledger must self-prune. 
Add a maint -step modeled on `maint_deliver_delayed()` and hook it into `pgque.maint()`: - -```sql -create or replace function pgque.maint_expire_idempotency() -returns integer as $$ -declare cnt integer; -begin - delete from pgque.idempotency_key where ik_expires_at <= now(); - get diagnostics cnt = row_count; - return cnt; -end; -$$ language plpgsql security definer set search_path = pgque, pg_catalog; -``` - -This bounds the dedup table to roughly `throughput × TTL` rows regardless of -backlog — which is exactly the property they could not get from pg-boss. - -### Tests (red first) - -- send same `(queue, key)` twice inside TTL → exactly one event in the batch. -- send same key after TTL expiry (or after `maint_expire_idempotency()`) → a - second event is produced. -- two concurrent `send_idempotent` with the same key → exactly one event - (use the `tests/two_session_*.sh` pattern for the race). -- `maint_expire_idempotency()` deletes only expired rows and returns the count. - ---- - -## 2. Partition keys — "one job at a time per partition key" - -> "run 1 job at a time for a given partition key … the batch could contain 1 -> job per partition key" - -This is the harder request because it is about **consumption order / -concurrency control**, which lives in the sacred engine's territory. Split it -into two independent sub-problems and solve them separately. - -### 2a. Carrying the partition key (easy, no engine change) - -An event already has four free passthrough columns (`ev_extra1..ev_extra4`) -that survive batching and are returned by `get_batch_events` / `pgque.receive`. -Carry the partition key in one of them via the existing 7-arg -`insert_event(queue, type, data, extra1..4)`. 
A thin wrapper: - -```sql --- pgque.send_partitioned(queue, partition_key, payload) -create or replace function pgque.send_partitioned( - i_queue text, i_partition_key text, i_payload text) -returns bigint as $$ -begin - -- partition key rides in ev_extra1; everything else is a normal send - return pgque.insert_event(i_queue, 'default', i_payload, - i_partition_key, null, null, null); -end; -$$ language plpgsql security definer set search_path = pgque, pg_catalog; -``` - -No schema change, still reduces to `insert_event`. The `pgque.message` type -already exposes `extra1`, so consumers see the key without API changes. - -### 2b. Serializing per key (the real work) - -The PgQ batch model hands a consumer *all* events in the tick window at once; -it has no built-in "skip events whose partition is busy." Do **not** try to add -that to `batch_event_sql`. Gate it **consumer-side**, on top of the existing -`next_batch` / `get_batch_events` / `event_retry` primitives. Three viable -approaches, in order of how well they match the request: - -1. **Partition lease table (recommended for true "one at a time").** - A non-rotated sidecar holding the currently in-flight key per queue: - - ```sql - create table if not exists pgque.partition_lease ( - pl_queue_name text not null, - pl_partition_key text not null, - pl_msg_id bigint not null, - pl_leased_at timestamptz not null default now(), - constraint partition_lease_pkey primary key (pl_queue_name, pl_partition_key) - ); - ``` - - A partition-aware receive walks the batch in `ev_id` order and, for each - distinct partition key, tries to claim the lease - (`insert ... on conflict do nothing`). The first event for a free key is - delivered; any further event whose key is already leased is **deferred** - back into retry via `pgque.event_retry(batch, ev_id, delay)` instead of - being returned. 
`ack`/`nack` for a leased message releases the lease - (`delete from partition_lease`), letting the next event for that key through - on a subsequent batch. Net effect: at most one in-flight job per key, across - all workers, with no engine change. Add a TTL/`pl_leased_at` reaper to the - maint cycle so a crashed worker's lease cannot wedge a partition forever. - -2. **Cooperative consumers (already in the tree).** - `sql/pgque-api/cooperative_consumers.sql` lets N members of one logical - consumer split a queue. Hashing `partition_key → member` gives you - *parallelism bounded by key* (all events for a key go to the same member), - which is often what people actually want. It does **not** by itself - guarantee strictly one in-flight per key within a member — combine with - per-key ordering in the worker, or with approach 1, if strictness matters. - -3. **At-most-one-per-key-per-batch filter.** - A `receive` variant that returns at most one event per distinct - `partition_key` in the current batch and defers the rest via `event_retry`. - Simpler than the lease table but only serializes *within a batch*, not - across concurrent consumers — weaker guarantee. Useful as a stepping stone. - -### Ordering caveat to flag in the PR - -PgQ batches are snapshot windows, so cross-batch ordering is by `ev_id` but a -deferred (retried) event reappears in a *later* batch. If they need strict -FIFO *within* a partition, the lease approach must also process a partition's -events in `ev_id` order and not advance the partition past a deferred event. -Make this an explicit, documented guarantee (or non-guarantee) — it is the -subtle part reviewers will care about. - -### Tests (red first) - -- two events, same partition key: first `receive` returns event 1 and leases - the key; event 2 is deferred (not returned) until event 1 is acked. -- two events, different keys: both delivered in the same batch. -- two concurrent consumers, same key: only one gets the event (lease race). 
-- crashed worker (lease never released) → reaper frees it after TTL. - ---- - -## 3. Suggested PR sequence - -1. **PR 1 — idempotency keys.** Sidecar table + `send_idempotent` + - `maint_expire_idempotency` hooked into `maint()`; tests; land in - `sql/experimental/` first. Closes #293. -2. **PR 2 — partition keys.** `send_partitioned` (key in `ev_extra1`) + - partition-aware receive over a lease table + lease reaper; tests. Open a - tracking issue first to settle the ordering guarantee before coding. - -Both follow the same rules: pin `search_path`, grant by role -(`send_*` → `pgque_writer`, partition receive → `pgque_reader`), keep -`pgque.sql` regenerated, never touch the batch/tick/rotation engine, and write -the failing test before the implementation. diff --git a/blueprints/partition-keys/SPEC.md b/blueprints/partition-keys/SPEC.md index 32542ffa..dda0f4bc 100644 --- a/blueprints/partition-keys/SPEC.md +++ b/blueprints/partition-keys/SPEC.md @@ -1,218 +1,267 @@ # PgQue Partition Keys — Spec -- **Version:** v0.1 (draft) -- **Status:** draft for review; single-pass lead draft in SamoSpec format - (the live GPT+Claude review panel was not run in this environment) +- **Version:** v0.2 (draft) +- **Status:** review round 1 applied (Reviewer A ops/security + Reviewer B + QA/testability). Core mechanism re-grounded against the engine; see §15 + changelog and `decisions.md`. - **Slug:** partition-keys - **Scope:** consumer-side ordered, parallel consumption by partition key. - Producer-side idempotency/dedup is a *separate* spec (deferred — see §11). + Producer-side idempotency/dedup is a *separate* spec (deferred — see §12). --- ## 1. Goal Add a **partition key** to PgQue so that, within one queue, events sharing a key -are consumed **in order by a single consumer at a time**, while events with +are consumed **in order by a single worker at a time**, while events with different keys are consumed **in parallel**. 
This is the log-native ("Kafka partition") model: order *within* a key, parallelism *across* keys. -Concretely: `send(queue, key, payload)` tags an event with a partition key; -a partition-aware consumer guarantees that for any given key, its events are -delivered in `ev_id` order to exactly one worker at a time. - -## 2. Why it's needed +## 2. The guarantee (precise, testable) + +Stated as three independently-testable clauses (this replaces vague "in order" +prose; per Reviewer B): + +- **G1 — per-key affinity + happy-path FIFO.** For a queue whose events carry a + partition key, and a fixed slot count `N`, every event of key `K` maps to + exactly one slot `slot(K) = hashtextextended(K, 0) % N`. Within that slot, + successfully-processed, never-retried events of `K` are delivered in + non-decreasing `ev_id` order, and to **no other slot**. +- **G2 — single in-flight processor per key.** At any instant, at most one worker + holds an unacked event for `K`. +- **G3 — failure boundary.** Under **`pause`** policy, if `K#i` fails, no later + event of `K` is delivered until `K#i` is acked or dead-lettered; other keys are + unaffected. Under **`skip`** policy, later events of `K` MAY be delivered before + `K#i` resolves — so after a failure only *at-least-once* holds, not order. + **Note (engine fact):** a retried event re-enters under a new transaction/tick + (its `ev_id` is preserved but its `ev_txid` is new, so it reappears in a *later* + batch — `event_retry` → `maint_retry_events` → `insert_event_raw`). So G1's + `ev_id` monotonicity holds only *between non-retried* events; across a retry the + only guarantee is G3's pause boundary, never `ev_id` ordering. + +## 3. Why it's needed PgQue is an **ordered, immutable log**, not a job queue. Real workloads need -**per-entity ordering without global ordering**. The motivating case (Supabase -Storage, evaluating PgQue to replace pg-boss): +**per-entity ordering without global ordering**. 
Motivating case (a multi-tenant +storage service evaluating PgQue to replace pg-boss): - Millions of file-lifecycle events (`FileCreated`, `FileDeleted`, - `FileOverwritten`). They **must be processed in order per tenant**, but - **order across tenants does not matter**. -- A single in-order consumer can't keep up with millions of events; naive - multi-worker consumption breaks per-tenant order. - -Today PgQue offers no way to parallelize a queue while preserving per-key order. -Cooperative consumers exist but distribute events without key affinity, so they -do not preserve order for a key. This spec closes that gap. - -Non-goal restatement: this is **not** "one job at a time per key via locks" -(that was a job-queue framing). Ordering here is achieved by **routing**, with -no per-event lock or mutable state — consistent with PgQue's no-bloat thesis. - -## 3. Scope and ICP - -**In scope (v0.1):** -- Carry a partition key on an event. -- Partition-aware assignment: stable `hash(key) → slot` mapping over a fixed set - of N consumer slots. -- Per-key ordering guarantee across batches. -- A documented failure policy (§7, decision D2). - -**Out of scope (v0.1):** -- Producer idempotency / dedup windows (separate spec). -- Dynamic rebalancing / elastic slot count (fixed N in v0.1; §10 D3). + `FileOverwritten`), which **must be ordered per tenant** but **need no ordering + across tenants**. +- One in-order consumer can't keep up; naive multi-worker consumption breaks + per-tenant order. + +## 4. Scope and ICP + +**In scope (v0.1 implementation):** +- Carry a partition key on a `send()`-sourced event. +- N independent **slot consumers**, each filtering the stream to its hash class + (§6). Stable `hashtextextended(key, 0) % N` affinity. +- G1 + G2 always; **`skip` failure policy as the v0.1 default** (sound, simple). + +**Deferred to v0.2 implementation (specified, not built first):** +- **`pause` failure policy** (G3 strict). 
It is fully specified here (§7 D2, §8) + but carries the crash-recovery risk (R1) and ships after `skip`. + +**Out of scope:** +- Producer idempotency / dedup windows (separate spec — §12). +- Dynamic rebalancing / elastic `N` (fixed `N`; §7 D3, R4). +- **Trigger-sourced queues** (`jsontriga`/`logutriga`/`sqltriga` already store the + table name in `ev_extra1` — §7 D1, R5). Partitioned consumption is defined only + for `send()`-sourced queues in v0.1. - Cross-queue / cascaded (multi-node) partitioning. -- Hot-partition mitigation beyond documentation. +- Automatic hot-partition mitigation (documented only). -**ICP:** multi-tenant SaaS on managed Postgres (RDS/Aurora/Cloud SQL/AlloyDB/ -Supabase/Neon) running a high-volume per-entity event stream where entity = -partition key (tenant, user, document, device). +**ICP:** multi-tenant SaaS on managed Postgres running a high-volume per-entity +event stream where entity = partition key (tenant, user, document, device). -## 4. End-to-end workflow +## 5. End-to-end workflow ``` -producer: pgque.send('files', partition_key => tenant_id, payload => '{...}') - │ (key stored on the event; no new hot-path state) +producer: pgque.send('files', 'default', payload, partition_key => tenant_id) + │ key → ev_extra1 (send-sourced queues only, v0.1) ▼ -engine: append-only event tables, global ev_id order (UNCHANGED, sacred) +engine: append-only event tables · global ev_id/ev_txid order (UNCHANGED) + │ full stream ▼ -consumer: N partition-aware sub-consumers; slot = hash(key) % N - - each slot processes its keys in ev_id order - - a key never spans two slots → per-key order preserved - - different keys → different slots → parallel +consumers: N slot consumers, each an INDEPENDENT subscription with its own cursor; + slot k reads the whole stream and server-side-filters to its hash class ``` -## 5. 
User stories - -- **US-1 (per-tenant order):** As a consumer, when I read `files`, all events for - `tenant=42` arrive in `ev_id` order, even under N parallel workers. -- **US-2 (cross-tenant parallelism):** As an operator, throughput scales with N - workers because distinct tenants are processed concurrently. -- **US-3 (single processor per key):** As a consumer author, I never have two - workers processing `tenant=42` events at the same instant, so I need no - external lock on the tenant's resources. -- **US-4 (no new bloat):** As a DBA, enabling partitions adds **no per-event - UPDATE/DELETE** and no vacuum-dependent side table. -- **US-5 (failure policy is explicit):** As a consumer author, I can choose - whether a failing event **pauses its partition** (strict order) or is - **skipped** (at-least-once, possible reorder). Default per D2. - ## 6. Architecture +The v0.1 mechanism is **N independent slot consumers**, *not* a modification of +cooperative consumers. (Review round 1, B1: coop hands each member a disjoint +tick window; it cannot fan one batch to N hash-filtered slots without dropping +events when the shared cursor advances. So we do not use coop distribution.) 
+ ``` - ┌─────────────────────────────────────────────┐ - producers │ pgque.send(queue, partition_key, payload) │ - └───────────────────────┬─────────────────────┘ - │ key on ev_extra1 (free today) - ▼ - ┌─────────────────────────────────────────────┐ - ENGINE │ append-only event tables · global ev_id │ <-- UNCHANGED - (sacred) │ next_batch / get_batch_events / rotation │ (no edits to - └───────────────────────┬─────────────────────┘ batch_event_sql) - │ batch of events - ▼ - ┌─────────────────────────────────────────────┐ - PARTITION │ assignment: slot = hash(key) % N │ <-- NEW logic, - LAYER │ (rides on cooperative consumers) │ distribution only - └───┬───────────────┬───────────────┬─────────┘ - ▼ ▼ ▼ - slot 0 slot 1 slot N-1 - worker A worker B worker C - keys h%N==0 keys h%N==1 keys h%N==N-1 - in ev_id order in ev_id order in ev_id order + producers │ send(queue, type, payload, partition_key => K) + │ key → ev_extra1 + ▼ + ┌──────────────────────────────────────────────────────────┐ + │ ENGINE · sacred — UNCHANGED │ + │ append-only tables · global ev_id/ev_txid · rotation │ + │ next_batch / get_batch_cursor(i_extra_where) / get_events │ + └───────┬───────────────┬───────────────┬──────────────────┘ + │ full stream │ full stream │ full stream + ▼ ▼ ▼ + slot 0 (sub#0/N) slot 1 (sub#1/N) slot N-1 (sub#(N-1)/N) + own cursor own cursor own cursor + extra_where: extra_where: extra_where: + hashext%N=0 hashext%N=1 hashext%N=N-1 + │ │ │ + one worker one worker one worker + keys h%N==0 keys h%N==1 keys h%N==N-1 + in ev_id order in ev_id order in ev_id order ``` -**Key property:** the new code lives entirely in the *distribution* step of the -cooperative-consumer layer. The batch/tick/snapshot/rotation engine -(`batch_event_sql`, `next_batch`, rotation) is **not modified** (design rule: -the PgQ engine is sacred). 
+**How filtering happens without touching the engine:** each slot's receive call +reuses the existing `pgque.get_batch_cursor(..., i_extra_where)` hook +(`pgque.sql` ~line 2229), injecting the predicate +`and hashtextextended(ev_extra1, 0) % N = k`. The predicate is built from +integers `N`, `k` (no user input → injection-safe). `batch_event_sql`, +`next_batch`, and rotation are **not modified**. + +**Each slot is its own subscription**, so it has its own cursor (`sub_last_tick`) +and its own `sub_id` → **no cross-slot data loss** (each slot independently +advances over the full stream) and **retry/DLQ rows are naturally slot-scoped** +(`ev_owner = that slot's sub_id`), which is what makes `pause` re-derivable +(§8). + +**Known cost — read amplification.** Every event is examined by all `N` slot +cursors (each discards `(N-1)/N` after the hash filter). The `extra_where` +push-down keeps *returned* rows minimal, but the index scan over each tick window +is repeated `N` times. Acceptable for moderate `N`; documented, with a +single-reader/dispatch optimization noted as future work (R6). ## 7. Decisions -| ID | Decision | Choice (v0.1) | Rationale | -|----|----------|---------------|-----------| -| D1 | Where the key lives | `ev_extra1` (no schema change) | Already carried through batching and exposed on `pgque.message`. Dedicated column can come later. | -| D2 | Failure policy (head-of-line) | **Pause the partition** by default; `skip` opt-in | Motivating workload "cares about per-tenant order". PgQ retry re-inserts with a later `ev_id`, which would reorder — so strict order must block the key until the failure resolves. | -| D3 | Elasticity | Fixed N in v0.1 | Rebalancing safely (without reordering across the change) is its own hard problem; defer. | -| D4 | Assignment function | `hashtext(key)` mod N, stable | Deterministic key→slot affinity; standard partition model. 
| -| D5 | No per-event state | Routing only; no lease table, no advisory lock per event | Preserves the append-only / no-vacuum thesis. | +| ID | Decision | Choice (v0.2) | Rationale / change | +|----|----------|---------------|--------------------| +| D1 | Where the key lives | `ev_extra1`, **`send()`-sourced queues only** | Trigger producers already use `ev_extra1` for the table name (`pgque.sql:2943`). Restrict, don't collide. (R5) | +| D2 | Failure policy | `skip` default (v0.1); `pause` specified, ships v0.2 | `pause` needs durable-ish blocked-key tracking; deliver the sound `skip` first. Rationale corrected re: retry (§2 note). | +| D3 | Elasticity | Fixed `N`, **persisted per (queue, consumer)** | A worker registering with a mismatched `N` is **rejected**, so "fixed N" is an invariant, not a convention. (N2) | +| D4 | Assignment function | `hashtextextended(key, 0) % N` | `hashtext()` is internal/unstable across PG majors → affinity would break on upgrade. `hashtextextended` is the documented, stable hash. (N1) | +| D5 | State budget | **No new mutable table; happy path writes nothing.** `pause` derives blocked keys from the engine's existing `retry_queue`/`dead_letter`, read only on failure/slot-start | Resolves the D2-vs-"no state" contradiction honestly: failure handling reuses state the engine already keeps, scoped per slot by `sub_id`. (B3/B4) | +| D6 | Producer signature | `send(queue, type, payload, partition_key => text)` (new 4-arg overload) | A 3-arg `send(queue, key, payload)` collides with the existing `send(queue, type, payload)`. (B4/N4) | +| D7 | Slot identity & single-owner | slot = a named consumer `"#k/N"`; single owner enforced by the existing per-consumer receive lock (the `sub_batch`/`FOR UPDATE` path) | Defines what a "slot" is and what makes G2 true and testable. (B5) | ## 8. 
Implementation details -- **Producer:** `pgque.send(queue, partition_key text, payload …)` wrapper → - `insert_event(queue, type, payload, partition_key /*ev_extra1*/, …)`. - Pure reduction to the existing primitive. -- **Assignment:** extend cooperative-consumer distribution so a sub-consumer N - receives exactly the batch events where `hashtext(ev_extra1) % total = N`. - Filtering happens in the distribution/consume layer, **not** in - `batch_event_sql`. -- **Per-key order across batches:** a key always maps to the same slot, and a - slot processes its events in `ev_id` order, so order holds across ticks. -- **Pause-on-failure (D2):** when an event for key K fails, the slot must not - advance past K for that key until K succeeds or is dead-lettered. Built on the - existing `event_retry` / DLQ primitives plus a per-slot "blocked keys" set held - **in the consumer**, not in a table. (Exact mechanism is the main design risk — - §10.) -- **Security/grants:** producer wrapper → `pgque_writer`; partition consumer → - `pgque_reader`. `SECURITY DEFINER` functions pin `search_path = pgque, - pg_catalog`. +- **Producer:** `pgque.send(queue, type, payload, partition_key text default null)` + → `insert_event(queue, type, payload, ev_extra1 => partition_key, …)`. Pure + reduction to the existing primitive (Key Design Rule 3). Explicit + `revoke execute … from public` + `grant … to pgque_writer`, `SECURITY DEFINER` + with `set search_path = pgque, pg_catalog`. +- **Slot registration:** `pgque.subscribe_slot(queue, consumer, k, n)` registers + subscription `"#k/n"` and persists `n` for the consumer; a later + registration with a different `n` is rejected (D3). +- **Partitioned receive:** `pgque.receive_partitioned(queue, consumer, k, n, …)` + → `next_batch` + `get_batch_cursor(..., i_extra_where => + format('and hashtextextended(ev_extra1,0) %% %s = %s', n, k))`. Server-side + filter; engine untouched. 
+- **`pause` blocked-set (v0.2):** within a run, a worker holds back later events + of a key whose head event is unacked/retrying. On (re)start, the blocked set is + rebuilt by querying `retry_queue where ev_owner = ` (existing + state; read once at slot start, not per event). A key unblocks when its head + event is acked **or** dead-lettered (`dead_letter`), so a poison event cannot + wedge a tenant beyond `max_retries` (B5). `skip` mode needs none of this. +- **Grants:** producer overload → `pgque_writer`; `subscribe_slot` / + `receive_partitioned` → `pgque_reader`. Deny-by-default re-applied. ## 9. Tests plan (red/green TDD) -Write the failing test first, then the implementation. CI matrix PG 14–18. - -- **T1 (order):** interleave events for keys A,B,A,A,B; assert each key delivered - in `ev_id` order under N≥2 slots. *(red first)* -- **T2 (parallelism):** distinct keys land on distinct slots per `hash%N`. -- **T3 (affinity/stability):** same key always → same slot across batches. -- **T4 (single processor):** two workers, same key in one batch → only one slot - ever holds it concurrently. -- **T5 (pause-on-failure):** key A event #2 fails → A#3 is NOT delivered before - #2 resolves; B continues unaffected. -- **T6 (skip mode):** with `skip` policy, A#3 proceeds after A#2 fails (reorder - allowed). -- **T7 (no bloat):** processing M events adds zero rows to any side table and - issues no per-event UPDATE/DELETE (assert via `pg_stat`/row counts). -- **T8 (engine untouched):** `batch_event_sql` text/byte-identical to baseline. +Write the failing test first. CI matrix PG 14–18. Map to the guarantee: + +- **T-G1a (affinity):** same key → same slot; assert the **literal integer** + `hashtextextended(K,0) % N` (not "same within one version") on **every** CI PG + version — guards D4 stability. *(red first)* +- **T-G1b (per-key FIFO, happy path):** interleave A,B,A,A,B; assert each key in + `ev_id` order across batches, no key on two slots. 
+- **T-G2 (single owner):** two sessions, same slot → second blocks on the + receive lock (mirror `tests/two_session_receive_lock.sh`). +- **T-no-drop:** keys spanning all slots in one tick window; run all N slots; + assert union delivered = all events, **zero loss** (guards the cursor/filter + interaction, §6). +- **T-G3-pause (order-after-retry):** A#2 nacked (`pause`); assert A#3 withheld + until A#2 acked-or-DLQ'd; B unaffected. +- **T-G3-skip (reorder boundary):** with `skip`, assert the *exact* permitted + reorder after A#2 fails (not just "A#3 proceeds"). +- **T-DLQ-unblock:** A#2 exhausts retries → `dead_letter`; assert A#3 then + proceeds (no permanent wedge). +- **T-slot-crash:** slot-k worker holds A#2 unacked and dies; another worker takes + slot k; assert A#2 redelivered before A#3 and only to slot k. +- **T-empty-slot / T-hot-key:** an empty slot doesn't wedge others; a single hot + key saturates one slot while others still drain (correctness, not perf). +- **T-no-bloat (happy path):** processing M events with all acks adds **zero** + rows to `retry_queue`/`dead_letter` and issues no per-event UPDATE/DELETE. (The + failure path legitimately writes `retry_queue` — out of this test's scope.) +- **T-engine-untouched:** `pg_get_functiondef` of `batch_event_sql` and + `next_batch_custom` byte-identical to baseline (assert on the **definition**, + not the generated SQL — N2). +- **T-idempotent-install:** re-running `pgque.sql` re-creates partition functions + cleanly (mirror `tests/test_install_idempotency.sql`). ## 10. Risks and open questions -- **R1 — pause-on-failure mechanism.** Keeping a key "blocked" without a mutable - table, across crashes and re-delivery, is the hard part. Needs a concrete - design that survives a worker restart (likely: re-derive blocked state from the - presence of an unacked/retrying event for the key at slot start). 
-- **R2 — cooperative-consumer internals.** Must confirm where assignment hooks in - pgq-coop without touching the engine, and whether coop guarantees a sub-consumer - sees a key consistently. *(Next concrete investigation step.)* -- **R3 — hot partitions.** One very active key saturates its slot. v0.1: document; - no automatic mitigation. -- **R4 — fixed N / rebalancing.** Changing N reshuffles affinity and can reorder - in-flight keys. Out of scope; needs a future spec. - -## 11. Relationship to producer idempotency (deferred sibling) - -A separate spec covers producer-side dedup as a **TTL window** (SQS/NATS model), -append-only, GC'd by rotation. It is intentionally decoupled: in a log, -"processed" is a per-consumer fact the producer cannot see, so dedup must be a -producer-side time window, while ordering/serialization is this consumer-side -partition feature. Prior-art and rationale: `blueprints/IDEMPOTENCY_DESIGN.md`. - -## 12. Team of veteran experts (review panel) - -- **Lead (spec author):** drafts and revises. -- **Reviewer A — ops/security:** scope creep, the pause-on-failure crash story, - grants, managed-PG constraints. -- **Reviewer B — QA/testability:** ordering under concurrency, the reorder edge - in skip mode, "engine untouched" assertion. - -*(Live multi-model review loop not run here; reviewer personas listed for when -this is iterated through the actual `samospec` CLI.)* - -## 13. Sprint plan - -1. **S1 — producer + key plumbing:** `send(queue, key, payload)`, key on - `ev_extra1`, exposed on `pgque.message`. Tests T2–T3. -2. **S2 — partition-aware assignment** over cooperative consumers. Tests T1, T4, - T7, T8. Resolves R2. -3. **S3 — pause-on-failure** (D2 default) + `skip` mode. Tests T5, T6. Resolves R1. -4. **S4 — docs + benchmark** (throughput vs N; per-tenant order under load). - -## 14. Changelog - -- **v0.1 (draft):** initial single-pass SamoSpec-format draft. 
Defines the - partition-key consumer feature, the hash-assignment architecture, the - pause-on-failure default (D2), and the no-per-event-state constraint (D5). - Producer idempotency split out to a sibling spec. +- **R1 — `pause` crash recovery.** Rebuilding the blocked set from `retry_queue` + at slot start (D5) is the crux; needs the exact predicate and a test + (T-slot-crash). This is why `pause` ships after `skip`. +- **R2 — read amplification.** N× scans (§6). Bench it; if it bites, R6. +- **R3 — hot partitions.** One hot key saturates its slot; v0.1 documents only. +- **R4 — changing N.** Now an invariant (D3): mismatched workers are rejected, so + no silent reorder. True rebalancing is a future spec. +- **R5 — `ev_extra1` semantics.** Restricted to `send()`-sourced queues (D1); + a dedicated partition-key column is possible future work. +- **R6 — single-reader/dispatch optimization** to remove read amplification + (one reader hash-routes to per-slot staging). Future; adds a hop/state, so out + of v0.1's no-state budget. + +## 11. (reserved) + +## 12. Relationship to producer idempotency (deferred sibling) + +Producer-side dedup is a **TTL window** (SQS/NATS model), append-only, GC'd by +rotation — a separate spec. In a log, "processed" is a per-consumer fact the +producer cannot see, so dedup must be a producer-side time window, while +ordering/serialization is this consumer-side partition feature. Rationale and +prior art: `blueprints/IDEMPOTENCY_DESIGN.md`. + +## 13. Team of veteran experts (review panel) + +- **Lead:** drafts/revises (this document). +- **Reviewer A — ops/security:** failure modes, crash safety, scope. Round 1 + applied (B1–B5, N1–N5). +- **Reviewer B — QA/testability:** ordering precision, slot model, falsifiable + tests. Round 1 applied (B1–B6, N1–N7, the G1/G2/G3 restatement in §2). + +## 14. Sprint plan + +1. **S1 — producer + key plumbing:** `send(…, partition_key =>)`, key on + `ev_extra1` for send-sourced queues. 
Tests T-G1a, T-no-bloat(happy), + T-idempotent-install. +2. **S2 — slot consumers (`skip` default):** `subscribe_slot`, + `receive_partitioned` via `get_batch_cursor` `extra_where`; persisted N (D3); + single-owner (D7). Tests T-G1b, T-G2, T-no-drop, T-G3-skip, T-engine-untouched. +3. **S3 — `pause` policy (v0.2):** blocked-set from `retry_queue`; DLQ-unblock. + Tests T-G3-pause, T-DLQ-unblock, T-slot-crash. +4. **S4 — docs + benchmark:** throughput vs N; read-amplification cost (R2); + per-tenant order under load. + +## 15. Changelog + +- **v0.2 (draft):** review round 1 (Reviewer A + B) applied. **Re-grounded the + core mechanism**: dropped the (impossible) coop-distribution model for **N + independent slot consumers** filtering via `get_batch_cursor` `extra_where` + (§6). Restated the guarantee as testable **G1/G2/G3** (§2). **Corrected** the + retry rationale (ev_id preserved, ev_txid changes). Resolved D2-vs-state with + **D5** (derive `pause` from existing `retry_queue`; no new table). Made `skip` + the v0.1 default, `pause` a specified v0.2 follow. Fixed: `send` signature + collision (D6), `ev_extra1`/trigger collision (D1), unstable `hashtext` (D4), + fixed-N as enforced invariant (D3), slot/owner definition (D7). Added missing + tests (no-drop, order-after-retry, DLQ-unblock, slot-crash, empty/hot, + cross-version affinity). Recorded accepted/rejected items in `decisions.md`. +- **v0.1 (draft):** initial single-pass SamoSpec-format draft. diff --git a/blueprints/partition-keys/decisions.md b/blueprints/partition-keys/decisions.md new file mode 100644 index 00000000..b0b3df76 --- /dev/null +++ b/blueprints/partition-keys/decisions.md @@ -0,0 +1,61 @@ +# Partition Keys — decisions log + +Accepted / rejected / deferred choices, tracked across review rounds. 
+ +## Review round 1 (Reviewer A ops/security · Reviewer B QA/testability) + +### Accepted (changed the spec) + +- **A1 — Drop the cooperative-consumer distribution model.** Both reviewers + proved coop hands each member a *disjoint tick window*, not a hash-filtered + shared batch; a filter overlay would drop other slots' events on cursor + advance. → v0.2 uses **N independent slot subscriptions**, each filtering the + full stream via `get_batch_cursor` `extra_where`. (SPEC §6) +- **A2 — Correct the retry rationale.** `event_retry` preserves `ev_id` and + changes `ev_txid` (event reappears in a later window); the original "later + ev_id" claim was wrong. → guarantee restated as G1/G2/G3 with an explicit + engine note. (SPEC §2) +- **A3 — Resolve D2-vs-"no state".** `pause` derives its blocked-key set from the + engine's existing `retry_queue`/`dead_letter`, scoped per slot by `sub_id` + (each slot is its own subscription); no new mutable table. (SPEC §7 D5, §8) +- **A4 — DLQ must unblock.** A paused key releases when its head event is acked + *or* dead-lettered, so a poison event cannot wedge a tenant past `max_retries`. + (SPEC §8; test T-DLQ-unblock) +- **A5 — `send` signature.** Use a new 4-arg `send(queue, type, payload, + partition_key =>)`; a 3-arg `send(queue, key, payload)` collides with the + existing `send(queue, type, payload)`. (SPEC §7 D6) +- **A6 — `hashtextextended(key, 0)`** instead of `hashtext()` (unstable across PG + majors → affinity would break on upgrade). (SPEC §7 D4) +- **A7 — `ev_extra1` restricted to `send()`-sourced queues** (triggers store the + table name there). (SPEC §7 D1) +- **A8 — Fixed N as an enforced invariant**: persisted per (queue, consumer); a + mismatched-N worker is rejected, not silently misrouting. (SPEC §7 D3) +- **A9 — Define "slot" and single-owner**: slot = named consumer `"#k/N"`; + G2 enforced by the existing per-consumer receive lock. 
(SPEC §7 D7) +- **A10 — Test corrections**: `T-engine-untouched` asserts `pg_get_functiondef` + (not generated SQL); `T-no-bloat` scoped to the happy path; added T-no-drop, + order-after-retry, DLQ-unblock, slot-crash, empty/hot-key, cross-version + affinity. (SPEC §9) + +### Deferred + +- **`pause` policy implementation** ships after `skip` (sound + simple first); + `skip` is the v0.1 default. `pause` is fully specified. (SPEC §4, §7 D2) +- **Read-amplification optimization** (single-reader/dispatch) — R6; adds a + hop/state, out of v0.1's no-state budget. +- **Trigger-sourced queues**, **dynamic N / rebalancing**, **hot-partition + mitigation** — out of scope, documented. + +### Rejected + +- **Lease table / advisory-lock-per-event** for serialization — reintroduces the + per-event churn PgQue exists to avoid (carried over from the superseded + `IDEMPOTENCY_AND_PARTITIONS.md`, now removed). +- **Modifying `batch_event_sql`** to push partitioning into the engine — violates + the sacred-engine rule; the `extra_where` hook achieves filtering without it. + +## Open (for round 2) + +- Exact `retry_queue` predicate for `pause` blocked-set reconstruction at slot + start (R1). +- Whether read amplification at target throughput justifies R6 in v0.1. 
diff --git a/web/public/briefs/partition-keys.html b/web/public/briefs/partition-keys.html index ae8776df..f732b6de 100644 --- a/web/public/briefs/partition-keys.html +++ b/web/public/briefs/partition-keys.html @@ -7,18 +7,9 @@ - producers - pgque.send(queue, partition_key, payload) + send(queue, type, payload, partition_key => K) → ev_extra1 - key → ev_extra1 (free today) - - + engine · sacred — UNCHANGED - append-only event tables · global ev_id order - next_batch / get_batch_events / rotation — no edits to batch_event_sql - - batch of events - - - - partition layer · NEW (routing only) - slot = hash(partition_key) mod N - rides on cooperative consumers · no per-event state, no lock + append-only tables · global ev_id / ev_txid order + next_batch · get_batch_cursor(extra_where) · rotation — no edits to batch_event_sql - - - - + + + + full stream + full stream + full stream - - - slot 0 · worker A - keys where h%N==0 - in ev_id order - tenants 0,3,6… + + slot 0 · sub#0/N + own cursor + extra_where: h%N==0 + keys in ev_id order + one worker - - slot 1 · worker B - keys where h%N==1 - in ev_id order - tenants 1,4,7… + + slot 1 · sub#1/N + own cursor + extra_where: h%N==1 + keys in ev_id order + one worker - - slot N-1 · worker C - keys where h%N==N-1 - in ev_id order - tenants 2,5,8… + + slot N-1 · sub#k/N + own cursor + extra_where: h%N==N-1 + keys in ev_id order + one worker - one key → one slot → per-key order preserved · distinct keys → parallel + each slot = independent subscription · own cursor → no data loss + one key → one slot → per-key order · distinct keys → parallel + cost: N× read amplification (each slot scans the full stream, server-side hash-filtered) -
The new logic lives only in the distribution step. The PgQ engine is not modified.
+ Slots are independent subscriptions filtering via the existing extra_where hook. The PgQ engine is not modified.
- 03 Scope
+ 04 Scope

In · v0.1

-  • Partition key carried on an event
-  • Stable hash(key) % N assignment
-  • Per-key order across batches
-  • Explicit failure policy (pause vs skip)
+  • Partition key on send()-sourced events
+  • Stable hashtextextended(key,0) % N affinity
+  • Per-key order + single processor (G1, G2)
+  • skip failure policy (default)
- Out · later
+ Deferred / out
+  • pause policy: specified, ships v0.2
   • Producer idempotency / dedup window
-  • Dynamic rebalancing / elastic N
-  • Cross-queue / cascaded partitioning
-  • Automatic hot-partition mitigation
+  • Trigger-sourced queues · dynamic N
+  • Hot-partition mitigation · read-amp optimization
- 04 Key decisions
+ 05 Key decisions

- | ID | Decision | Choice (v0.1) |
+ | ID | Decision | Choice (v0.2) |
- | D1 | Where the key lives | ev_extra1 (no schema change) |
- | D2 | Failure policy (head-of-line) | Pause the partition (default); skip opt-in |
- | D3 | Elasticity | Fixed N in v0.1 |
- | D4 | Assignment function | hashtext(key) % N, stable |
- | D5 | Per-event state | None: routing only, no lease, no lock |
+ | D1 | Where the key lives | ev_extra1, send()-sourced queues only |
+ | D2 | Failure policy | skip default; pause ships v0.2 |
+ | D3 | Slot count N | Fixed, persisted & enforced per (queue, consumer) |
+ | D4 | Assignment function | hashtextextended(key,0) % N (version-stable) |
+ | D5 | State budget | No new table; pause derives from existing retry_queue |
+ | D6 | Producer signature | send(queue, type, payload, partition_key =>) |
+ | D7 | Slot & single-owner | slot = consumer "<c>#k/N"; receive lock |
- D2 is the contract-shaping choice: PgQ retry re-inserts a failed event with a later ev_id, which would reorder a key, so strict order must block the key until the failure resolves.
+ Review round 1 re-grounded the mechanism: cooperative consumers hand out disjoint tick windows (not a hash-filtered shared batch), so slots are independent subscriptions, not a coop overlay. The retry path preserves ev_id and changes ev_txid, so order breaks across a retry, which is exactly what the pause boundary (G3) handles.

- 05 Sprint plan
+ 06 Sprint plan

  S1  producer + key plumbing
- S2  partition-aware assignment
- S3  pause-on-failure + skip mode
- S4  docs + throughput benchmark
+ S2  slot consumers (skip)
+ S3  pause policy (v0.2)
+ S4  docs + read-amp benchmark

- Biggest open risk: a crash-safe pause-on-failure that holds a key
- "blocked" with no mutable table, likely re-derived at slot start from the presence of
- an unacked/retrying event for the key. Next concrete step is confirming where assignment
- hooks into cooperative consumers without touching the engine.
+ Open for round 2: the exact retry_queue predicate that rebuilds a slot's
+ blocked-key set after a crash (the crux of crash-safe pause), and whether
+ N× read amplification at target throughput justifies a single-reader/dispatch
+ optimization in v0.1.

  Companion to the producer-side decision note
- (idempotency = TTL window, not free-once-processed) and the full spec.
- In a log, “processed” is a per-consumer fact the producer can't see, so ordering and
- serialization live here, on the consumer.
+ (idempotency = TTL window, not free-once-processed) and the full spec
+ (blueprints/partition-keys/SPEC.md). In a log, “processed” is a
+ per-consumer fact the producer can't see, so ordering and serialization live here,
+ on the consumer.

PgQue — zero-bloat PostgreSQL queue · pgque.dev · github.com/NikolayS/pgque

- Drafted in SamoSpec format (single-pass lead draft; the live multi-model review loop was not run for this brief).
+ Drafted in SamoSpec format; review round 1 (ops/security + QA/testability personas) applied. The live multi-model CLI loop was not run in this environment.

From 18ddb9c96fa28cc41bce8b5911f47a8d446a8d47 Mon Sep 17 00:00:00 2001 From: Claude Date: Fri, 19 Jun 2026 21:32:23 +0000 Subject: [PATCH 5/7] docs: move idempotency design note into blueprints/idempotency/ Relocate IDEMPOTENCY_DESIGN.md -> blueprints/idempotency/DESIGN.md to match the partition-keys/ slug layout; update the cross-reference in the partition spec and refresh the note's footer. Co-Authored-By: Claude Opus 4.8 (1M context) Claude-Session: https://claude.ai/code/session_01WuaYcu1XXsVEpsnLhF1FFu --- blueprints/{IDEMPOTENCY_DESIGN.md => idempotency/DESIGN.md} | 3 ++- blueprints/partition-keys/SPEC.md | 2 +- 2 files changed, 3 insertions(+), 2 deletions(-) rename blueprints/{IDEMPOTENCY_DESIGN.md => idempotency/DESIGN.md} (99%) diff --git a/blueprints/IDEMPOTENCY_DESIGN.md b/blueprints/idempotency/DESIGN.md similarity index 99% rename from blueprints/IDEMPOTENCY_DESIGN.md rename to blueprints/idempotency/DESIGN.md index f851d653..d81e2ba0 100644 --- a/blueprints/IDEMPOTENCY_DESIGN.md +++ b/blueprints/idempotency/DESIGN.md @@ -275,4 +275,5 @@ planned features are genuine differentiators, not catch-up. --- -(This note is local only — not committed, not pushed.) +(Companion: the consumer-side partition feature in +`blueprints/partition-keys/SPEC.md`.) diff --git a/blueprints/partition-keys/SPEC.md b/blueprints/partition-keys/SPEC.md index dda0f4bc..a8a786ae 100644 --- a/blueprints/partition-keys/SPEC.md +++ b/blueprints/partition-keys/SPEC.md @@ -228,7 +228,7 @@ Producer-side dedup is a **TTL window** (SQS/NATS model), append-only, GC'd by rotation — a separate spec. In a log, "processed" is a per-consumer fact the producer cannot see, so dedup must be a producer-side time window, while ordering/serialization is this consumer-side partition feature. Rationale and -prior art: `blueprints/IDEMPOTENCY_DESIGN.md`. +prior art: `blueprints/idempotency/DESIGN.md`. ## 13. 
Team of veteran experts (review panel) From f2297546e9277148a6558a882a5a0b18b2bf820d Mon Sep 17 00:00:00 2001 From: Claude Date: Fri, 19 Jun 2026 21:40:58 +0000 Subject: [PATCH 6/7] =?UTF-8?q?docs(spec):=20partition-keys=20v0.3=20?= =?UTF-8?q?=E2=80=94=20review=20round=202=20applied?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Round 2 verified the model against the engine: G1 ev_id ordering is real (order by 1, preserved through get_batch_cursor) and the G2 single-owner lock is the tested #97 guard. Fixes folded in: - security: receive_partitioned/subscribe_slot are SECURITY DEFINER over the admin-only get_batch_cursor; validated integer-only filter (corrects the "injection-safe" framing). - correctness: pause blocked-set moved off the transient retry_queue to a durable compact partition_block marker (per failing key, not per event); DLQ-unblock predicate made explicit. - bug: modulo sign-normalized to (h%N+N)%N. - R7 rotation wedge (+ pause must not hold the batch open); N persistence + teardown + DLQ-cascade caveat. - tests: retry-affinity, security, N-invariant; split G2 block/parallel; get_batch_cursor in engine-untouched guard. Update decisions.md (round-2 scorecard) and refresh the brief. Co-Authored-By: Claude Opus 4.8 (1M context) Claude-Session: https://claude.ai/code/session_01WuaYcu1XXsVEpsnLhF1FFu --- blueprints/partition-keys/SPEC.md | 393 ++++++++++++++----------- blueprints/partition-keys/decisions.md | 51 +++- web/public/briefs/partition-keys.html | 12 +- 3 files changed, 271 insertions(+), 185 deletions(-) diff --git a/blueprints/partition-keys/SPEC.md b/blueprints/partition-keys/SPEC.md index a8a786ae..81db1f50 100644 --- a/blueprints/partition-keys/SPEC.md +++ b/blueprints/partition-keys/SPEC.md @@ -1,12 +1,12 @@ # PgQue Partition Keys — Spec -- **Version:** v0.2 (draft) -- **Status:** review round 1 applied (Reviewer A ops/security + Reviewer B - QA/testability). 
Core mechanism re-grounded against the engine; see §15 - changelog and `decisions.md`. +- **Version:** v0.3 (draft) +- **Status:** review rounds 1 + 2 applied. Core model verified sound against the + engine (G1 ordering confirmed true; G2 lock confirmed real). Remaining fixes + from round 2 folded in. See §15 changelog and `decisions.md`. - **Slug:** partition-keys - **Scope:** consumer-side ordered, parallel consumption by partition key. - Producer-side idempotency/dedup is a *separate* spec (deferred — see §12). + Producer-side idempotency/dedup is a *separate* spec (deferred — §12). --- @@ -14,66 +14,66 @@ Add a **partition key** to PgQue so that, within one queue, events sharing a key are consumed **in order by a single worker at a time**, while events with -different keys are consumed **in parallel**. This is the log-native ("Kafka -partition") model: order *within* a key, parallelism *across* keys. +different keys are consumed **in parallel** — the log-native ("Kafka partition") +model: order *within* a key, parallelism *across* keys. ## 2. The guarantee (precise, testable) -Stated as three independently-testable clauses (this replaces vague "in order" -prose; per Reviewer B): - -- **G1 — per-key affinity + happy-path FIFO.** For a queue whose events carry a - partition key, and a fixed slot count `N`, every event of key `K` maps to - exactly one slot `slot(K) = hashtextextended(K, 0) % N`. Within that slot, - successfully-processed, never-retried events of `K` are delivered in - non-decreasing `ev_id` order, and to **no other slot**. -- **G2 — single in-flight processor per key.** At any instant, at most one worker - holds an unacked event for `K`. -- **G3 — failure boundary.** Under **`pause`** policy, if `K#i` fails, no later - event of `K` is delivered until `K#i` is acked or dead-lettered; other keys are - unaffected. 
Under **`skip`** policy, later events of `K` MAY be delivered before - `K#i` resolves — so after a failure only *at-least-once* holds, not order. - **Note (engine fact):** a retried event re-enters under a new transaction/tick - (its `ev_id` is preserved but its `ev_txid` is new, so it reappears in a *later* - batch — `event_retry` → `maint_retry_events` → `insert_event_raw`). So G1's - `ev_id` monotonicity holds only *between non-retried* events; across a retry the - only guarantee is G3's pause boundary, never `ev_id` ordering. +- **G1 — per-key affinity + FIFO.** For a queue whose events carry a partition + key, and a fixed slot count `N`, every event of key `K` maps to exactly one + slot `slot(K) = (hashtextextended(K, 0) % N + N) % N` (the `+N` normalization + is mandatory — `hashtextextended` returns `bigint` and `%` of a negative value + is negative; round 2). Within that slot, non-retried events of `K` are + delivered in non-decreasing `ev_id` order, to **no other slot**. + - *Intra-batch order* is guaranteed by the engine's `order by 1` (= `ev_id`) + in `batch_event_sql` (`pgque.sql:440`), preserved through the + `get_batch_cursor` filter wrap (`pgque.sql:2277`). *Cross-batch order* for a + slot follows from sequential consumption of one subscription whose cursor + advances monotonically — **not** from `order by 1` (which is per-batch). +- **G2 — single in-flight processor per key.** At any instant at most one worker + holds an unacked event for `K`. Enforced by the per-subscription receive lock + (`next_batch_custom` … `for update of s`, `pgque.sql:5761`), the same lock the + #97/#125 double-delivery guard relies on (`tests/two_session_receive_lock.sh`). +- **G3 — failure boundary.** Under **`pause`**, no later event of `K` is + delivered until `K`'s failed head event is acked or dead-lettered; other keys + unaffected. Under **`skip`** (v0.1 default), later events of `K` MAY be + delivered before the failure resolves — only *at-least-once* holds. 
+ - **Engine fact:** a retried event keeps its `ev_id` but gets a new `ev_txid` + (re-injected by `maint_retry_events` → `insert_event_raw`, `pgque.sql:859`), + so it reappears in a *later* batch. Thus G1's `ev_id` monotonicity holds only + between non-retried events; across a retry the only guarantee is G3's pause + boundary. The retried event re-routes to the **same slot** because + `ev_extra1` is preserved through the retry path (verified `pgque.sql:2376`, + `:861`). ## 3. Why it's needed -PgQue is an **ordered, immutable log**, not a job queue. Real workloads need +PgQue is an **ordered, immutable log**, not a job queue — workloads need **per-entity ordering without global ordering**. Motivating case (a multi-tenant -storage service evaluating PgQue to replace pg-boss): - -- Millions of file-lifecycle events (`FileCreated`, `FileDeleted`, - `FileOverwritten`), which **must be ordered per tenant** but **need no ordering - across tenants**. -- One in-order consumer can't keep up; naive multi-worker consumption breaks - per-tenant order. +storage service evaluating PgQue to replace pg-boss): millions of file-lifecycle +events that **must be ordered per tenant** but need **no ordering across +tenants**. One in-order consumer can't keep up; naive multi-worker consumption +breaks per-tenant order. ## 4. Scope and ICP **In scope (v0.1 implementation):** -- Carry a partition key on a `send()`-sourced event. +- Partition key on a `send()`-sourced event. - N independent **slot consumers**, each filtering the stream to its hash class - (§6). Stable `hashtextextended(key, 0) % N` affinity. -- G1 + G2 always; **`skip` failure policy as the v0.1 default** (sound, simple). + (§6). Stable `(hashtextextended(key,0) % N + N) % N` affinity. +- G1 + G2 always; **`skip` failure policy as the v0.1 default** (sound, stateless). -**Deferred to v0.2 implementation (specified, not built first):** -- **`pause` failure policy** (G3 strict). 
It is fully specified here (§7 D2, §8) - but carries the crash-recovery risk (R1) and ships after `skip`. +**Deferred to v0.2 implementation (specified here, not built first):** +- **`pause` failure policy** (G3 strict) — uses a compact blocked-key marker + (§7 D5, §8). Carries the crash/rotation risks (R1, R7); ships after `skip`. -**Out of scope:** -- Producer idempotency / dedup windows (separate spec — §12). -- Dynamic rebalancing / elastic `N` (fixed `N`; §7 D3, R4). -- **Trigger-sourced queues** (`jsontriga`/`logutriga`/`sqltriga` already store the - table name in `ev_extra1` — §7 D1, R5). Partitioned consumption is defined only - for `send()`-sourced queues in v0.1. -- Cross-queue / cascaded (multi-node) partitioning. -- Automatic hot-partition mitigation (documented only). +**Out of scope:** producer idempotency (separate spec, §12); dynamic +rebalancing / elastic `N` (§7 D3, R4); **trigger-sourced queues** +(`jsontriga`/`logutriga`/`sqltriga` store the table name in `ev_extra1` — §7 D1, +R5); cross-queue / cascaded partitioning; automatic hot-partition mitigation. -**ICP:** multi-tenant SaaS on managed Postgres running a high-volume per-entity -event stream where entity = partition key (tenant, user, document, device). +**ICP:** multi-tenant SaaS on managed Postgres with a high-volume per-entity +event stream (entity = partition key: tenant, user, document, device). ## 5. End-to-end workflow @@ -91,134 +91,172 @@ consumers: N slot consumers, each an INDEPENDENT subscription with its own curso ## 6. Architecture The v0.1 mechanism is **N independent slot consumers**, *not* a modification of -cooperative consumers. (Review round 1, B1: coop hands each member a disjoint -tick window; it cannot fan one batch to N hash-filtered slots without dropping -events when the shared cursor advances. So we do not use coop distribution.) 
+cooperative consumers (round 1 B1: coop hands each member a disjoint tick window +and cannot fan one batch to N hash-filtered slots without dropping events on the +shared cursor advance; confirmed `cooperative_consumers.sql`, `for update skip +locked` victim-steal at `pgque.sql:6262`). ``` - producers │ send(queue, type, payload, partition_key => K) - │ key → ev_extra1 + producers │ send(queue, type, payload, partition_key => K) → ev_extra1 ▼ ┌──────────────────────────────────────────────────────────┐ │ ENGINE · sacred — UNCHANGED │ │ append-only tables · global ev_id/ev_txid · rotation │ - │ next_batch / get_batch_cursor(i_extra_where) / get_events │ + │ next_batch / get_batch_cursor(i_extra_where) / order by 1 │ └───────┬───────────────┬───────────────┬──────────────────┘ │ full stream │ full stream │ full stream ▼ ▼ ▼ - slot 0 (sub#0/N) slot 1 (sub#1/N) slot N-1 (sub#(N-1)/N) + slot 0 (sub#0/N) slot 1 (sub#1/N) slot N-1 own cursor own cursor own cursor - extra_where: extra_where: extra_where: - hashext%N=0 hashext%N=1 hashext%N=N-1 - │ │ │ + filter h%N=0 filter h%N=1 filter h%N=N-1 one worker one worker one worker - keys h%N==0 keys h%N==1 keys h%N==N-1 - in ev_id order in ev_id order in ev_id order ``` -**How filtering happens without touching the engine:** each slot's receive call -reuses the existing `pgque.get_batch_cursor(..., i_extra_where)` hook -(`pgque.sql` ~line 2229), injecting the predicate -`and hashtextextended(ev_extra1, 0) % N = k`. The predicate is built from -integers `N`, `k` (no user input → injection-safe). `batch_event_sql`, -`next_batch`, and rotation are **not modified**. - -**Each slot is its own subscription**, so it has its own cursor (`sub_last_tick`) -and its own `sub_id` → **no cross-slot data loss** (each slot independently -advances over the full stream) and **retry/DLQ rows are naturally slot-scoped** -(`ev_owner = that slot's sub_id`), which is what makes `pause` re-derivable -(§8). 
- -**Known cost — read amplification.** Every event is examined by all `N` slot -cursors (each discards `(N-1)/N` after the hash filter). The `extra_where` -push-down keeps *returned* rows minimal, but the index scan over each tick window -is repeated `N` times. Acceptable for moderate `N`; documented, with a +**Filtering without touching the engine.** Each slot's receive reuses the +existing admin-only `pgque.get_batch_cursor(…, i_extra_where)` hook +(`pgque.sql:2229`), injecting `and (hashtextextended(ev_extra1,0) % N + N) % N = +k`. The fragment is assembled **only** from the validated integers `N`, `k` +(§8) — never a caller string. `batch_event_sql`, `next_batch`, rotation are +**not modified**; `get_batch_cursor`'s `order by 1` re-wrap preserves G1. + +**Trust boundary (round 2).** `get_batch_cursor`'s `i_extra_where` is a +documented *trusted-SQL sink*, revoked from `public/pgque_reader/pgque_writer` +and granted to `pgque_admin` only (`pgque.sql:2221`, `:4852`). Therefore +`receive_partitioned` and `subscribe_slot` are **`SECURITY DEFINER`, owned by the +installer** (which holds the admin grant), exactly like `receive`/`nack` +(`receive.sql`). The reader role gets EXECUTE on the *wrappers*, never on +`get_batch_cursor`. Safety rests on the wrapper validating and integer-casting +`N`,`k` and interpolating no caller-supplied value — **not** on "the string +contains only integers." + +**Each slot is its own subscription** → its own cursor (`sub_last_tick`) and +`sub_id` → no cross-slot data loss, and retry/DLQ rows are naturally slot-scoped +(`ev_owner = sub_id`, `pgque.sql:2374`). + +**Known cost — read amplification.** Every event is scanned by all `N` slot +cursors (each discards `(N-1)/N` after the hash filter; the filter is applied +*after* the engine materializes the per-slot window, `pgque.sql:2277`, so it +reduces returned rows, not scan work). 
Steady state ≈ **N×**; **up to ~2N×** +during a rotation-overlap window (the engine's multi-table `union all`); a +**stalled slot scans an ever-widening window** each poll. Documented; single-reader/dispatch optimization noted as future work (R6). ## 7. Decisions -| ID | Decision | Choice (v0.2) | Rationale / change | -|----|----------|---------------|--------------------| -| D1 | Where the key lives | `ev_extra1`, **`send()`-sourced queues only** | Trigger producers already use `ev_extra1` for the table name (`pgque.sql:2943`). Restrict, don't collide. (R5) | -| D2 | Failure policy | `skip` default (v0.1); `pause` specified, ships v0.2 | `pause` needs durable-ish blocked-key tracking; deliver the sound `skip` first. Rationale corrected re: retry (§2 note). | -| D3 | Elasticity | Fixed `N`, **persisted per (queue, consumer)** | A worker registering with a mismatched `N` is **rejected**, so "fixed N" is an invariant, not a convention. (N2) | -| D4 | Assignment function | `hashtextextended(key, 0) % N` | `hashtext()` is internal/unstable across PG majors → affinity would break on upgrade. `hashtextextended` is the documented, stable hash. (N1) | -| D5 | State budget | **No new mutable table; happy path writes nothing.** `pause` derives blocked keys from the engine's existing `retry_queue`/`dead_letter`, read only on failure/slot-start | Resolves the D2-vs-"no state" contradiction honestly: failure handling reuses state the engine already keeps, scoped per slot by `sub_id`. (B3/B4) | -| D6 | Producer signature | `send(queue, type, payload, partition_key => text)` (new 4-arg overload) | A 3-arg `send(queue, key, payload)` collides with the existing `send(queue, type, payload)`. (B4/N4) | -| D7 | Slot identity & single-owner | slot = a named consumer `"#k/N"`; single owner enforced by the existing per-consumer receive lock (the `sub_batch`/`FOR UPDATE` path) | Defines what a "slot" is and what makes G2 true and testable. 
(B5) | +| ID | Decision | Choice (v0.3) | Rationale / round-2 change | +|----|----------|---------------|----------------------------| +| D1 | Where the key lives | `ev_extra1`, **`send()`-sourced queues only** | Triggers use `ev_extra1` for the table name. | +| D2 | Failure policy | `skip` default (v0.1); `pause` ships v0.2 | `pause` needs the blocked-key marker (D5). | +| D3 | Elasticity & N | Fixed `N`, **persisted in a `pgque.partition_consumer(queue, consumer, n)` row**; a worker registering a different `n` is **rejected** | "Fixed N" is an enforced invariant, not a convention. N lives outside the slot names (which encode it) so a *new* slot can be validated. | +| D4 | Assignment | `(hashtextextended(key, 0) % N + N) % N` | `hashtextextended` is the stable, documented hash; the `+N` normalizes the sign (round 2). | +| D5 | State budget | **Happy path & `skip`: no state.** `pause`: a compact `pgque.partition_block(sub_id, partition_key, head_ev_id)` marker, written on first failure, cleared on ack-or-DLQ | Round 2 (B-R2-2) proved `retry_queue` is transient (deleted by `maint_retry_events`, `pgque.sql:863`) so it can't be the durable blocked-set. The marker is O(*concurrently-failing keys*) — proportional to failures, **not** throughput, so it is not pg-boss-style per-event churn. Honestly reopens "no new table," scoped to `pause`. | +| D6 | Producer signature | `send(queue, type, payload, partition_key => text)` (new 4-arg overload) | Avoids collision with `send(queue, type, payload)`. | +| D7 | Slot identity & single-owner | slot = consumer `"#k/N"`; G2 via the per-subscription receive lock. `receive_partitioned`/`subscribe_slot` are `SECURITY DEFINER` (§6) | Defines what a slot is and what makes G2 true and reader-callable. | ## 8. Implementation details -- **Producer:** `pgque.send(queue, type, payload, partition_key text default null)` - → `insert_event(queue, type, payload, ev_extra1 => partition_key, …)`. 
Pure - reduction to the existing primitive (Key Design Rule 3). Explicit - `revoke execute … from public` + `grant … to pgque_writer`, `SECURITY DEFINER` - with `set search_path = pgque, pg_catalog`. -- **Slot registration:** `pgque.subscribe_slot(queue, consumer, k, n)` registers - subscription `"#k/n"` and persists `n` for the consumer; a later - registration with a different `n` is rejected (D3). -- **Partitioned receive:** `pgque.receive_partitioned(queue, consumer, k, n, …)` - → `next_batch` + `get_batch_cursor(..., i_extra_where => - format('and hashtextextended(ev_extra1,0) %% %s = %s', n, k))`. Server-side - filter; engine untouched. -- **`pause` blocked-set (v0.2):** within a run, a worker holds back later events - of a key whose head event is unacked/retrying. On (re)start, the blocked set is - rebuilt by querying `retry_queue where ev_owner = ` (existing - state; read once at slot start, not per event). A key unblocks when its head - event is acked **or** dead-lettered (`dead_letter`), so a poison event cannot - wedge a tenant beyond `max_retries` (B5). `skip` mode needs none of this. +- **Producer:** `pgque.send(queue, type, payload, partition_key text default + null)` → `insert_event(queue, type, payload, ev_extra1 => partition_key, …)`. + `SECURITY DEFINER set search_path = pgque, pg_catalog`; revoke from public, + grant `pgque_writer`. +- **Slot registration:** `pgque.subscribe_slot(queue text, consumer text, k int, + n int)` — validates `n >= 1 and 0 <= k < n` (raises otherwise), upserts the + persisted `n` for `(queue, consumer)` and **rejects a changed `n`** (D3), + registers subscription `"#k/n"`. Idempotent for the same `(k,n)`. +- **Partitioned receive:** `pgque.receive_partitioned(queue, consumer, k int, n + int, …)` — after validating/casting `k,n` to int, calls `next_batch` + + `get_batch_cursor(…, i_extra_where => format('and (hashtextextended(ev_extra1,0) + %% %s + %s) %% %s = %s', n, n, n, k))`. 
Under `pause`, the wrapper also + withholds events whose key has an open `partition_block` row for this `sub_id` + with `head_ev_id < ev_id`. `SECURITY DEFINER` (§6); granted `pgque_reader`. +- **`pause` lifecycle (v0.2):** on nack of `K#i`, upsert + `partition_block(sub_id, K, head_ev_id => ev_id)`. The row is **durable**, so + it survives a crash with no reconstruction. Clear it when `K#i` is acked + (success after retry) **or** dead-lettered — unblock predicate: ack, OR a + `dead_letter` row exists for `(dl_consumer_id = this slot, ev_id = K#i)` + (`event_dead`, `dlq.sql`). So a poison key cannot wedge past `max_retries`. +- **`pause` must not hold the batch open** (R7): it acks the batch and tracks + blocked keys via `partition_block`, so the slot's cursor keeps advancing past + *non-blocked* keys and does not pin rotation. +- **Teardown:** `pgque.unsubscribe_slot(queue, consumer, k, n)` removes the slot + subscription; full-consumer teardown removes all N + the + `partition_consumer`/`partition_block` rows. **Caveat:** `unregister_consumer` + cascades `dead_letter` (`on delete cascade`), so tearing down a slot drops that + slot's DLQ audit — documented. - **Grants:** producer overload → `pgque_writer`; `subscribe_slot` / - `receive_partitioned` → `pgque_reader`. Deny-by-default re-applied. + `unsubscribe_slot` / `receive_partitioned` → `pgque_reader`. Deny-by-default + re-applied. `get_batch_cursor` stays revoked from all app roles. ## 9. Tests plan (red/green TDD) -Write the failing test first. CI matrix PG 14–18. Map to the guarantee: - -- **T-G1a (affinity):** same key → same slot; assert the **literal integer** - `hashtextextended(K,0) % N` (not "same within one version") on **every** CI PG - version — guards D4 stability. *(red first)* -- **T-G1b (per-key FIFO, happy path):** interleave A,B,A,A,B; assert each key in - `ev_id` order across batches, no key on two slots. 
-- **T-G2 (single owner):** two sessions, same slot → second blocks on the - receive lock (mirror `tests/two_session_receive_lock.sh`). +CI matrix PG 14–18. Write the failing test first. + +- **T-G1a (affinity, stable):** assert the literal `(hashtextextended(K,0)%N+N)%N` + on **every** CI PG version, pinning one concrete `(K, expected)` pair so a hash + drift is caught even if all versions move together. *(red first)* +- **T-G1b (per-key FIFO):** interleave A,B,A,A,B; assert each key in `ev_id` order + across batches, no key on two slots. *(No existing test guards intra-batch + ev_id order — this is new and load-bearing.)* +- **T-retry-affinity (new, round 2):** nack a keyed event; run + `maint_retry_events()` + `force_next_tick` + `ticker()`; assert it redelivers + to the **same** slot `k` and no other. *(The core correctness property under + retry.)* +- **T-G2-block:** two workers on the **same** slot → second blocks on the + subscription lock (mirror `tests/two_session_receive_lock.sh`). +- **T-G2-parallel:** two workers on **different** slots → neither blocks (the + parallelism half). - **T-no-drop:** keys spanning all slots in one tick window; run all N slots; - assert union delivered = all events, **zero loss** (guards the cursor/filter - interaction, §6). -- **T-G3-pause (order-after-retry):** A#2 nacked (`pause`); assert A#3 withheld - until A#2 acked-or-DLQ'd; B unaffected. -- **T-G3-skip (reorder boundary):** with `skip`, assert the *exact* permitted - reorder after A#2 fails (not just "A#3 proceeds"). + union delivered = all events, zero loss. +- **T-security (new, round 2):** a bare `pgque_reader` can call + `receive_partitioned`/`subscribe_slot` end-to-end; and `pgque_reader` **cannot** + call `get_batch_cursor` directly (mirror + `tests/test_security_get_batch_cursor.sql`); `receive_partitioned` rejects a + non-integer/out-of-range `n`/`k`. 
+- **T-G3-pause (order-after-retry):** A#2 nacked (`pause`); drive + `maint_retry_events()` + tick; assert A#3 withheld until A#2 acked-or-DLQ'd; B + unaffected. - **T-DLQ-unblock:** A#2 exhausts retries → `dead_letter`; assert A#3 then - proceeds (no permanent wedge). -- **T-slot-crash:** slot-k worker holds A#2 unacked and dies; another worker takes - slot k; assert A#2 redelivered before A#3 and only to slot k. -- **T-empty-slot / T-hot-key:** an empty slot doesn't wedge others; a single hot - key saturates one slot while others still drain (correctness, not perf). -- **T-no-bloat (happy path):** processing M events with all acks adds **zero** - rows to `retry_queue`/`dead_letter` and issues no per-event UPDATE/DELETE. (The - failure path legitimately writes `retry_queue` — out of this test's scope.) -- **T-engine-untouched:** `pg_get_functiondef` of `batch_event_sql` and - `next_batch_custom` byte-identical to baseline (assert on the **definition**, - not the generated SQL — N2). -- **T-idempotent-install:** re-running `pgque.sql` re-creates partition functions - cleanly (mirror `tests/test_install_idempotency.sql`). + proceeds. +- **T-slot-crash:** worker holding A#2 (and its `partition_block` row) dies; a + new worker takes slot k; drive maint+tick; assert A#2 redelivered before A#3 + and only to slot k. *(Crash specifically in the post-`maint` window where the + `retry_queue` row is already gone — the round-2 hole.)* +- **T-G3-skip (reorder boundary):** with `skip`, assert the exact permitted + reorder after A#2 fails. +- **T-N-invariant (new, round 2):** `subscribe_slot(…,k,n)` is idempotent; + `subscribe_slot(…,k,n2≠n)` raises. +- **T-empty-slot / T-hot-key:** an empty slot doesn't wedge others; a hot key + saturates one slot while others drain. +- **T-no-bloat (happy path):** all-ack processing of M events adds zero + `retry_queue`/`dead_letter`/`partition_block` rows and no per-event + UPDATE/DELETE. 
(Failure path legitimately writes — out of scope here.) +- **T-engine-untouched:** `pg_get_functiondef` of `batch_event_sql`, + `next_batch_custom`, **and `get_batch_cursor`** (round 2 — the slot model + depends on its `order by 1` re-wrap) byte-identical to baseline. +- **T-idempotent-install:** re-running `pgque.sql` re-creates the partition + functions/tables cleanly. ## 10. Risks and open questions -- **R1 — `pause` crash recovery.** Rebuilding the blocked set from `retry_queue` - at slot start (D5) is the crux; needs the exact predicate and a test - (T-slot-crash). This is why `pause` ships after `skip`. -- **R2 — read amplification.** N× scans (§6). Bench it; if it bites, R6. -- **R3 — hot partitions.** One hot key saturates its slot; v0.1 documents only. -- **R4 — changing N.** Now an invariant (D3): mismatched workers are rejected, so - no silent reorder. True rebalancing is a future spec. -- **R5 — `ev_extra1` semantics.** Restricted to `send()`-sourced queues (D1); - a dedicated partition-key column is possible future work. -- **R6 — single-reader/dispatch optimization** to remove read amplification - (one reader hash-routes to per-slot staging). Future; adds a hop/state, so out - of v0.1's no-state budget. +- **R1 — `pause` crash safety: resolved by D5's durable `partition_block` + marker** (round 2 B-R2-2). Test: T-slot-crash with the crash in the + post-`maint` window. +- **R2 — read amplification:** N× steady, ~2N× during rotation overlap, + unbounded-width for a stalled slot (§6). Benchmark the stalled case, not just + uniform N. +- **R3 — hot partitions:** one hot key saturates its slot; documented only. +- **R4 — changing N:** an enforced invariant (D3); true rebalancing is future. +- **R5 — `ev_extra1` semantics:** restricted to `send()`-sourced queues; a + dedicated partition-key column is future work. +- **R6 — single-reader/dispatch optimization** to remove read amplification; + adds a hop/state, out of v0.1's budget. 
+- **R7 — rotation wedge (round 2):** rotation waits for `min(sub_last_tick)` over + ALL subscriptions (`pgque.sql:910`); N slots lower that floor to the slowest + slot, and a wedged `pause` slot could pin rotation for the whole queue → + unbounded data growth. Mitigated by §8 ("`pause` does not hold the batch open"; + cursor advances past non-blocked keys) + an alert on per-slot staleness. ## 11. (reserved) @@ -226,42 +264,47 @@ Write the failing test first. CI matrix PG 14–18. Map to the guarantee: Producer-side dedup is a **TTL window** (SQS/NATS model), append-only, GC'd by rotation — a separate spec. In a log, "processed" is a per-consumer fact the -producer cannot see, so dedup must be a producer-side time window, while -ordering/serialization is this consumer-side partition feature. Rationale and -prior art: `blueprints/idempotency/DESIGN.md`. +producer cannot see, so dedup is a producer-side time window while +ordering/serialization is this consumer-side feature. Rationale and prior art: +`blueprints/idempotency/DESIGN.md`. ## 13. Team of veteran experts (review panel) - **Lead:** drafts/revises (this document). -- **Reviewer A — ops/security:** failure modes, crash safety, scope. Round 1 - applied (B1–B5, N1–N5). -- **Reviewer B — QA/testability:** ordering precision, slot model, falsifiable - tests. Round 1 applied (B1–B6, N1–N7, the G1/G2/G3 restatement in §2). +- **Reviewer A — ops/security:** rounds 1 + 2 applied (security trust boundary, + blocked-set durability, rotation wedge, modulo sign). +- **Reviewer B — QA/testability:** rounds 1 + 2 applied (confirmed G1 ordering + + G2 lock; grant/DEFINER wiring; retry-affinity, security, N-invariant tests). ## 14. Sprint plan -1. **S1 — producer + key plumbing:** `send(…, partition_key =>)`, key on - `ev_extra1` for send-sourced queues. Tests T-G1a, T-no-bloat(happy), - T-idempotent-install. -2. 
**S2 — slot consumers (`skip` default):** `subscribe_slot`, - `receive_partitioned` via `get_batch_cursor` `extra_where`; persisted N (D3); - single-owner (D7). Tests T-G1b, T-G2, T-no-drop, T-G3-skip, T-engine-untouched. -3. **S3 — `pause` policy (v0.2):** blocked-set from `retry_queue`; DLQ-unblock. - Tests T-G3-pause, T-DLQ-unblock, T-slot-crash. -4. **S4 — docs + benchmark:** throughput vs N; read-amplification cost (R2); - per-tenant order under load. +1. **S1 — producer + key plumbing:** `send(…, partition_key =>)` on + send-sourced queues. Tests T-G1a, T-no-bloat(happy), T-idempotent-install. +2. **S2 — slot consumers (`skip` default):** `subscribe_slot` (persisted N, D3), + `receive_partitioned` via `get_batch_cursor` `extra_where` (`SECURITY + DEFINER`, D7), teardown. Tests T-G1b, T-retry-affinity, T-G2-block, + T-G2-parallel, T-no-drop, T-security, T-N-invariant, T-G3-skip, + T-engine-untouched. +3. **S3 — `pause` policy (v0.2):** `partition_block` marker; DLQ-unblock; "no held + batch" (R7). Tests T-G3-pause, T-DLQ-unblock, T-slot-crash. +4. **S4 — docs + benchmark:** throughput vs N; read-amp (steady, rotation, + stalled); per-tenant order under load. ## 15. Changelog -- **v0.2 (draft):** review round 1 (Reviewer A + B) applied. **Re-grounded the - core mechanism**: dropped the (impossible) coop-distribution model for **N - independent slot consumers** filtering via `get_batch_cursor` `extra_where` - (§6). Restated the guarantee as testable **G1/G2/G3** (§2). **Corrected** the - retry rationale (ev_id preserved, ev_txid changes). Resolved D2-vs-state with - **D5** (derive `pause` from existing `retry_queue`; no new table). Made `skip` - the v0.1 default, `pause` a specified v0.2 follow. Fixed: `send` signature - collision (D6), `ev_extra1`/trigger collision (D1), unstable `hashtext` (D4), - fixed-N as enforced invariant (D3), slot/owner definition (D7). 
Added missing - tests (no-drop, order-after-retry, DLQ-unblock, slot-crash, empty/hot, - cross-version affinity). Recorded accepted/rejected items in `decisions.md`. +- **v0.3 (draft):** review round 2 applied. **Confirmed G1 ordering is true** + (engine `order by 1`, preserved through `get_batch_cursor`) and the G2 lock is + real/tested. Fixed: (security) `receive_partitioned`/`subscribe_slot` are + `SECURITY DEFINER` over the admin-only `get_batch_cursor`, with integer + validation and the real trust-boundary argument (B-R2-1); (correctness) + `pause` blocked-set moved from the transient `retry_queue` to a durable compact + `partition_block` marker (B-R2-2 / D5); (bug) modulo normalized + `(h%N+N)%N` (D4); added R7 rotation-wedge + "no held batch"; specified N + persistence + teardown + DLQ-cascade caveat (D3); explicit DLQ-unblock + predicate. Added tests: T-retry-affinity, T-security, T-N-invariant, split + T-G2 block/parallel, `get_batch_cursor` in T-engine-untouched, pinned hash + pair. Round-2 decisions in `decisions.md`. +- **v0.2 (draft):** review round 1 — re-grounded the mechanism to N independent + slot subscriptions; restated G1/G2/G3; corrected retry rationale; `skip` + default. (Full detail in `decisions.md`.) - **v0.1 (draft):** initial single-pass SamoSpec-format draft. diff --git a/blueprints/partition-keys/decisions.md b/blueprints/partition-keys/decisions.md index b0b3df76..5f2ed0ef 100644 --- a/blueprints/partition-keys/decisions.md +++ b/blueprints/partition-keys/decisions.md @@ -54,8 +54,51 @@ Accepted / rejected / deferred choices, tracked across review rounds. - **Modifying `batch_event_sql`** to push partitioning into the engine — violates the sacred-engine rule; the `extra_where` hook achieves filtering without it. -## Open (for round 2) +## Review round 2 (both personas, verified against the engine) -- Exact `retry_queue` predicate for `pause` blocked-set reconstruction at slot - start (R1). 
-- Whether read amplification at target throughput justifies R6 in v0.1. +### Confirmed sound (no change needed) +- **G1 `ev_id` ordering is true.** `batch_event_sql` emits `order by 1` + (`pgque.sql:440`); `get_batch_cursor` re-wraps the filtered stream with + `order by 1` (`pgque.sql:2277`) → per-key order survives the filter, no + consumer sort. (Reviewer B headline.) +- **G2 single-owner lock is real and tested** — `next_batch_custom … for update + of s` (`pgque.sql:5761`), guarded by `two_session_receive_lock.sh`. +- **Retry affinity holds** — `ev_extra1` preserved through `event_retry` / + `maint_retry_events`, so a retried event re-routes to the same slot. +- **Coop genuinely hands disjoint windows** (`for update skip locked`, + `pgque.sql:6262`) — confirms round-1 A1. + +### Accepted (changed the spec → v0.3) +- **B-R2-2 — `pause` blocked-set must be durable.** `retry_queue` is transient + (`maint_retry_events` deletes the row on re-injection, `pgque.sql:863`), + leaving a crash hole that violates G3. → durable + `partition_block(sub_id, partition_key, head_ev_id)` marker, O(failing keys), + cleared on ack-or-DLQ. Honestly reopens "no new table," scoped to `pause`. + (D5, §8, R1) +- **B-R2-1 — security trust boundary.** `get_batch_cursor.extra_where` is an + admin-only trusted-SQL sink (`pgque.sql:2221`, `:4852`). → `receive_partitioned` + / `subscribe_slot` are `SECURITY DEFINER` installer-owned; integers `n,k` + validated + cast; no caller string interpolated. Reframed the "injection-safe" + prose. (§6, §8, D7; test T-security) +- **Negative modulo bug** — `hashtextextended` returns `bigint`; bare `% N` can + be negative → `(h % N + N) % N`. (D4, §6, §8) +- **R7 rotation wedge** — N slots lower the rotation floor to the slowest slot; a + wedged `pause` slot could pin rotation for the whole queue. → `pause` must not + hold the batch open; cursor advances past non-blocked keys. 
(R7, §8) +- **N persistence + teardown** — N persisted in `partition_consumer`; + `unsubscribe_slot` + DLQ-cascade caveat. (D3, §8) +- **Tests** — added T-retry-affinity, T-security, T-N-invariant; split T-G2 into + block/parallel; added `get_batch_cursor` to T-engine-untouched; pinned a + concrete hash pair. (§9) + +### Round-1 closure scorecard (both reviewers) +- B1 (coop model) ✅ closed · B2 (retry rationale) ✅ closed · B3 (crash-derive + blocked set) ♻️ reopened in v0.2, now ✅ closed via durable marker (D5) · + B4 (D2-vs-state / send sig) ✅ closed · B5 (DLQ-unblock / slot definition) + ✅ closed; spawned B-R2-1 (now fixed) · B6 ✅ closed. + +## Still open (for round 3, if run) +- Bench numbers for read amplification (steady N× vs ~2N× rotation vs stalled + slot) to decide if R6 (single-reader/dispatch) is needed in v0.1. +- Exact `partition_block` withhold predicate wording in `receive_partitioned` + (server-side `not exists` vs worker-side filter). diff --git a/web/public/briefs/partition-keys.html b/web/public/briefs/partition-keys.html index f732b6de..6a412c81 100644 --- a/web/public/briefs/partition-keys.html +++ b/web/public/briefs/partition-keys.html @@ -76,8 +76,8 @@

Partition Keys
Ordered, parallel consumption — the log-native way

 slug partition-keys
-version v0.2 (draft)
-review round 1 applied
+version v0.3 (draft)
+review rounds 1+2 applied
 engine untouched
@@ -198,13 +198,13 @@

05 Key decisions

 D1 | Where the key lives | ev_extra1, send()-sourced queues only
 D2 | Failure policy | skip default; pause ships v0.2
 D3 | Slot count N | Fixed, persisted & enforced per (queue, consumer)
-D4 | Assignment function | hashtextextended(key,0) % N (version-stable)
-D5 | State budget | No new table; pause derives from existing retry_queue
+D4 | Assignment function | (hashtextextended(key,0) % N + N) % N (stable, sign-safe)
+D5 | State budget | Happy path & skip: none. pause: a compact blocked-key marker (per failing key, not per event)
 D6 | Producer signature | send(queue, type, payload, partition_key =>)
 D7 | Slot & single-owner | slot = consumer "<c>#k/N"; receive lock
-

Review round 1 re-grounded the mechanism: cooperative consumers hand out disjoint tick windows (not a hash-filtered shared batch), so slots are independent subscriptions, not a coop overlay. The retry path preserves ev_id and changes ev_txid — so order breaks across a retry, which is exactly what the pause boundary (G3) handles.

+

Round 1 re-grounded the mechanism (slots are independent subscriptions, not a coop overlay). Round 2 verified it against the engine: G1's ev_id ordering is real (order by 1, preserved through the filter) and the G2 single-owner lock is the tested #97 double-delivery guard. It also corrected the security trust boundary (the filter hook is admin-only, so the consumer wrappers are SECURITY DEFINER) and moved pause's blocked-set to a durable per-failing-key marker.

06 Sprint plan

@@ -232,7 +232,7 @@

06 Sprint plan

PgQue — zero-bloat PostgreSQL queue · pgque.dev · github.com/NikolayS/pgque

-

Drafted in SamoSpec format; review round 1 (ops/security + QA/testability personas) applied. The live multi-model CLI loop was not run in this environment.

+

Drafted in SamoSpec format; review rounds 1 + 2 (ops/security + QA/testability personas) applied. The live multi-model CLI loop was not run in this environment.
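
Reviewer-style aside on the D4 change recorded in this patch (bare `% N` replaced by `(h % N + N) % N`): a minimal sketch of why the normalization matters. It assumes PostgreSQL 11 or later for `hashtextextended`; the slot count `4` and the tenant keys are hypothetical illustration values, not taken from the spec.

```sql
-- PostgreSQL's % truncates toward zero, so a negative left operand
-- yields a negative (or zero) remainder.
select (-7)::bigint % 4           as bare_mod,    -- -3
       ((-7)::bigint % 4 + 4) % 4 as normalized;  -- 1

-- Slot assignment in the D4 form, with a hypothetical N = 4 and made-up keys.
-- The normalized result always falls in 0..3, whatever the sign of the hash.
select k as partition_key,
       (hashtextextended(k, 0) % 4 + 4) % 4 as slot
from (values ('tenant-a'), ('tenant-b'), ('tenant-c')) as t(k);
```

With the bare form, any key whose hash is negative could map to a slot number below zero, which no slot consumer filters for, so those events would never be delivered.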

From d8a0f4d83e42f7e42e38a088ba3c261881f12183 Mon Sep 17 00:00:00 2001 From: Claude Date: Fri, 19 Jun 2026 21:55:04 +0000 Subject: [PATCH 7/7] =?UTF-8?q?docs(spec):=20partition-keys=20v0.4=20?= =?UTF-8?q?=E2=80=94=20round=203,=20Phase=201=20converged?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Round 3 convergence: Phase 1 (skip-default partition consumption) declared implementation-ready; pause split into Phase 2 with explicit open items (O1 defer-without-retry-increment primitive; O2 hot-blocked-key cost). Corrected the SECURITY DEFINER model to the co-ownership invariant (not pgque_admin) + non-superuser-owner security test. Fixed DLQ-unblock sub_id<->co_id join; partition_block FK-cascade/index/revoked-from-roles, created empty in Phase 1; tightened tests (engine-untouched /4 overload, in-order-after-unblock, marker-clear-via-DLQ, marker durability, hot-blocked-key). Update decisions.md and brief. Co-Authored-By: Claude Opus 4.8 (1M context) Claude-Session: https://claude.ai/code/session_01WuaYcu1XXsVEpsnLhF1FFu --- blueprints/partition-keys/SPEC.md | 460 ++++++++++++------------- blueprints/partition-keys/decisions.md | 51 ++- web/public/briefs/partition-keys.html | 26 +- 3 files changed, 281 insertions(+), 256 deletions(-) diff --git a/blueprints/partition-keys/SPEC.md b/blueprints/partition-keys/SPEC.md index 81db1f50..8b336c3e 100644 --- a/blueprints/partition-keys/SPEC.md +++ b/blueprints/partition-keys/SPEC.md @@ -1,100 +1,96 @@ # PgQue Partition Keys — Spec -- **Version:** v0.3 (draft) -- **Status:** review rounds 1 + 2 applied. Core model verified sound against the - engine (G1 ordering confirmed true; G2 lock confirmed real). Remaining fixes - from round 2 folded in. See §15 changelog and `decisions.md`. +- **Version:** v0.4 (draft) +- **Status:** review rounds 1–3 applied. 
**Phase 1 (`skip`-default partition + consumption) is converged / implementation-ready.** Phase 2 (`pause` strict + ordering) is specified but has open design items (§11) and is a deliberate + follow-up. See §15 changelog and `decisions.md`. - **Slug:** partition-keys - **Scope:** consumer-side ordered, parallel consumption by partition key. - Producer-side idempotency/dedup is a *separate* spec (deferred — §12). + Producer-side idempotency/dedup is a separate spec (deferred — §12). --- ## 1. Goal -Add a **partition key** to PgQue so that, within one queue, events sharing a key -are consumed **in order by a single worker at a time**, while events with -different keys are consumed **in parallel** — the log-native ("Kafka partition") -model: order *within* a key, parallelism *across* keys. +Within one queue, events sharing a partition key are consumed **in order by a +single worker at a time**; events with different keys are consumed **in +parallel** — the log-native ("Kafka partition") model: order *within* a key, +parallelism *across* keys. ## 2. The guarantee (precise, testable) - **G1 — per-key affinity + FIFO.** For a queue whose events carry a partition - key, and a fixed slot count `N`, every event of key `K` maps to exactly one - slot `slot(K) = (hashtextextended(K, 0) % N + N) % N` (the `+N` normalization - is mandatory — `hashtextextended` returns `bigint` and `%` of a negative value - is negative; round 2). Within that slot, non-retried events of `K` are - delivered in non-decreasing `ev_id` order, to **no other slot**. - - *Intra-batch order* is guaranteed by the engine's `order by 1` (= `ev_id`) - in `batch_event_sql` (`pgque.sql:440`), preserved through the - `get_batch_cursor` filter wrap (`pgque.sql:2277`). *Cross-batch order* for a - slot follows from sequential consumption of one subscription whose cursor - advances monotonically — **not** from `order by 1` (which is per-batch). 
-- **G2 — single in-flight processor per key.** At any instant at most one worker - holds an unacked event for `K`. Enforced by the per-subscription receive lock - (`next_batch_custom` … `for update of s`, `pgque.sql:5761`), the same lock the - #97/#125 double-delivery guard relies on (`tests/two_session_receive_lock.sh`). -- **G3 — failure boundary.** Under **`pause`**, no later event of `K` is - delivered until `K`'s failed head event is acked or dead-lettered; other keys - unaffected. Under **`skip`** (v0.1 default), later events of `K` MAY be - delivered before the failure resolves — only *at-least-once* holds. - - **Engine fact:** a retried event keeps its `ev_id` but gets a new `ev_txid` + key and a fixed slot count `N`, every event of key `K` maps to one slot + `slot(K) = (hashtextextended(K, 0) % N + N) % N` (the `+N` normalizes the sign; + `hashtextextended` returns `bigint`). Within that slot, non-retried events of + `K` are delivered in non-decreasing `ev_id` order, to **no other slot**. + Intra-batch order is the engine's `order by 1` (`pgque.sql:440`), preserved + through `get_batch_cursor`'s filter re-wrap (`pgque.sql:2277`); cross-batch + order follows from one subscription's monotonically-advancing cursor. +- **G2 — single in-flight processor per key.** At most one worker holds an + unacked event for `K`. Enforced by the per-subscription receive lock + (`next_batch_custom … for update of s`, `pgque.sql:5761` — the #97/#125 guard). +- **G3 — failure boundary (Phase 2 / `pause`).** Under `pause`, no later event of + `K` is delivered until `K`'s failed head event is acked or dead-lettered, and + after it resolves the deferred events deliver in `ev_id` order, exactly once. + Under `skip` (Phase 1 default), later events of `K` MAY arrive before the + failure resolves — only at-least-once holds. 
+ - **Engine fact:** a retried event keeps its `ev_id`, gets a new `ev_txid` (re-injected by `maint_retry_events` → `insert_event_raw`, `pgque.sql:859`), - so it reappears in a *later* batch. Thus G1's `ev_id` monotonicity holds only - between non-retried events; across a retry the only guarantee is G3's pause - boundary. The retried event re-routes to the **same slot** because - `ev_extra1` is preserved through the retry path (verified `pgque.sql:2376`, - `:861`). + and re-routes to the **same slot** because `ev_extra1` is preserved + (`pgque.sql:861`). So G1's `ev_id` monotonicity holds only between non-retried + events; across a retry the only ordering guarantee is G3's pause boundary. ## 3. Why it's needed -PgQue is an **ordered, immutable log**, not a job queue — workloads need -**per-entity ordering without global ordering**. Motivating case (a multi-tenant -storage service evaluating PgQue to replace pg-boss): millions of file-lifecycle -events that **must be ordered per tenant** but need **no ordering across -tenants**. One in-order consumer can't keep up; naive multi-worker consumption -breaks per-tenant order. +PgQue is an ordered, immutable **log**, not a job queue — workloads need +per-entity ordering without global ordering. Motivating case (a multi-tenant +storage service evaluating PgQue vs pg-boss): millions of file-lifecycle events +that **must be ordered per tenant** but need **no ordering across tenants**. -## 4. Scope and ICP +## 4. Scope and phasing -**In scope (v0.1 implementation):** -- Partition key on a `send()`-sourced event. +**Phase 1 — converged, build now:** +- Partition key on a `send()`-sourced event (D1, D6). - N independent **slot consumers**, each filtering the stream to its hash class - (§6). Stable `(hashtextextended(key,0) % N + N) % N` affinity. -- G1 + G2 always; **`skip` failure policy as the v0.1 default** (sound, stateless). + via `get_batch_cursor` `extra_where` (§6). Stable affinity (D4). +- G1 + G2. 
**`skip` failure policy** (stateless, sound). +- Persisted/enforced `N` (D3); slot identity + single-owner (D7); SECURITY + DEFINER ownership model (§6). -**Deferred to v0.2 implementation (specified here, not built first):** -- **`pause` failure policy** (G3 strict) — uses a compact blocked-key marker - (§7 D5, §8). Carries the crash/rotation risks (R1, R7); ships after `skip`. +**Phase 2 — specified, follow-up (NOT converged):** +- **`pause` failure policy** (G3 strict). Needs a *defer-without-retry-increment* + primitive that does not exist today (§11 O1), a durable blocked-key marker + (D5), and carries the hot-blocked-key cost (§11 O2). Build after Phase 1. -**Out of scope:** producer idempotency (separate spec, §12); dynamic -rebalancing / elastic `N` (§7 D3, R4); **trigger-sourced queues** -(`jsontriga`/`logutriga`/`sqltriga` store the table name in `ev_extra1` — §7 D1, -R5); cross-queue / cascaded partitioning; automatic hot-partition mitigation. +**Out of scope:** producer idempotency (§12); dynamic `N` / rebalancing (R4); +**trigger-sourced queues** (triggers use `ev_extra1` for the table name — D1, R5); +cascaded/multi-node; automatic hot-partition mitigation. **ICP:** multi-tenant SaaS on managed Postgres with a high-volume per-entity -event stream (entity = partition key: tenant, user, document, device). +event stream (entity = partition key). ## 5. End-to-end workflow ``` producer: pgque.send('files', 'default', payload, partition_key => tenant_id) - │ key → ev_extra1 (send-sourced queues only, v0.1) + │ key → ev_extra1 (send-sourced queues only) ▼ -engine: append-only event tables · global ev_id/ev_txid order (UNCHANGED) +engine: append-only tables · global ev_id/ev_txid order (UNCHANGED) │ full stream ▼ consumers: N slot consumers, each an INDEPENDENT subscription with its own cursor; - slot k reads the whole stream and server-side-filters to its hash class + slot k reads the whole stream, server-side-filtered to its hash class ``` ## 6. 
Architecture -The v0.1 mechanism is **N independent slot consumers**, *not* a modification of -cooperative consumers (round 1 B1: coop hands each member a disjoint tick window -and cannot fan one batch to N hash-filtered slots without dropping events on the -shared cursor advance; confirmed `cooperative_consumers.sql`, `for update skip -locked` victim-steal at `pgque.sql:6262`). +The mechanism is **N independent slot consumers**, not a modification of +cooperative consumers (round 1 B1: coop hands disjoint tick windows; confirmed +`pgque.sql:6262`). Each slot is its own subscription → own cursor + `sub_id` → no +cross-slot data loss; retry/DLQ rows are slot-scoped (`ev_owner = sub_id`, +`pgque.sql:2374`). ``` @@ -105,206 +101,194 @@ locked` victim-steal at `pgque.sql:6262`). │ append-only tables · global ev_id/ev_txid · rotation │ │ next_batch / get_batch_cursor(i_extra_where) / order by 1 │ └───────┬───────────────┬───────────────┬──────────────────┘ - │ full stream │ full stream │ full stream - ▼ ▼ ▼ + ▼ full stream ▼ full stream ▼ full stream slot 0 (sub#0/N) slot 1 (sub#1/N) slot N-1 own cursor own cursor own cursor filter h%N=0 filter h%N=1 filter h%N=N-1 - one worker one worker one worker ``` **Filtering without touching the engine.** Each slot's receive reuses the -existing admin-only `pgque.get_batch_cursor(…, i_extra_where)` hook -(`pgque.sql:2229`), injecting `and (hashtextextended(ev_extra1,0) % N + N) % N = -k`. The fragment is assembled **only** from the validated integers `N`, `k` -(§8) — never a caller string. `batch_event_sql`, `next_batch`, rotation are -**not modified**; `get_batch_cursor`'s `order by 1` re-wrap preserves G1. - -**Trust boundary (round 2).** `get_batch_cursor`'s `i_extra_where` is a -documented *trusted-SQL sink*, revoked from `public/pgque_reader/pgque_writer` -and granted to `pgque_admin` only (`pgque.sql:2221`, `:4852`). 
Therefore -`receive_partitioned` and `subscribe_slot` are **`SECURITY DEFINER`, owned by the -installer** (which holds the admin grant), exactly like `receive`/`nack` -(`receive.sql`). The reader role gets EXECUTE on the *wrappers*, never on -`get_batch_cursor`. Safety rests on the wrapper validating and integer-casting -`N`,`k` and interpolating no caller-supplied value — **not** on "the string -contains only integers." - -**Each slot is its own subscription** → its own cursor (`sub_last_tick`) and -`sub_id` → no cross-slot data loss, and retry/DLQ rows are naturally slot-scoped -(`ev_owner = sub_id`, `pgque.sql:2374`). - -**Known cost — read amplification.** Every event is scanned by all `N` slot -cursors (each discards `(N-1)/N` after the hash filter; the filter is applied -*after* the engine materializes the per-slot window, `pgque.sql:2277`, so it -reduces returned rows, not scan work). Steady state ≈ **N×**; **up to ~2N×** -during a rotation-overlap window (the engine's multi-table `union all`); a -**stalled slot scans an ever-widening window** each poll. Documented; -single-reader/dispatch optimization noted as future work (R6). +admin-only `pgque.get_batch_cursor(…, i_extra_where)` hook (`pgque.sql:2229`, +the 4-arg overload), injecting `and (hashtextextended(ev_extra1,0) % N + N) % N = +k`, assembled only from the validated integers `N`,`k` (§8). `batch_event_sql`, +`next_batch`, rotation are not modified; the filter re-wrap preserves G1. + +**SECURITY DEFINER ownership (round 3 — corrected).** `get_batch_cursor`'s +`extra_where` is a trusted-SQL sink, revoked from `public/pgque_reader/ +pgque_writer`, admin-only (`pgque.sql:2221`, `:4852`). 
`receive_partitioned` and +`subscribe_slot` reach it **because they are owned by the same role that owns +`get_batch_cursor` (the install owner) — a function owner may execute its own +functions regardless of grants.** This is *not* the `receive`/`nack` pattern +(those never call `get_batch_cursor`; they call reader-granted internals), and it +does *not* depend on the owner holding `pgque_admin`. **Invariant:** the +partition functions MUST be created by the same role that ran `\i pgque.sql`. On +managed Postgres the installer is a non-superuser admin role; co-ownership (not a +grant) is what makes the wrapper work — state and test this explicitly. + +**Read amplification.** Every event is scanned by all `N` slot cursors (filter +applied after the engine materializes the window, `pgque.sql:2277` — reduces +returned rows, not scan work). ≈ N× steady; up to ~2N× during rotation overlap; +a stalled slot scans an ever-widening window. Documented; single-reader/dispatch +optimization is future (R6). ## 7. Decisions -| ID | Decision | Choice (v0.3) | Rationale / round-2 change | -|----|----------|---------------|----------------------------| -| D1 | Where the key lives | `ev_extra1`, **`send()`-sourced queues only** | Triggers use `ev_extra1` for the table name. | -| D2 | Failure policy | `skip` default (v0.1); `pause` ships v0.2 | `pause` needs the blocked-key marker (D5). | -| D3 | Elasticity & N | Fixed `N`, **persisted in a `pgque.partition_consumer(queue, consumer, n)` row**; a worker registering a different `n` is **rejected** | "Fixed N" is an enforced invariant, not a convention. N lives outside the slot names (which encode it) so a *new* slot can be validated. | -| D4 | Assignment | `(hashtextextended(key, 0) % N + N) % N` | `hashtextextended` is the stable, documented hash; the `+N` normalizes the sign (round 2). 
| -| D5 | State budget | **Happy path & `skip`: no state.** `pause`: a compact `pgque.partition_block(sub_id, partition_key, head_ev_id)` marker, written on first failure, cleared on ack-or-DLQ | Round 2 (B-R2-2) proved `retry_queue` is transient (deleted by `maint_retry_events`, `pgque.sql:863`) so it can't be the durable blocked-set. The marker is O(*concurrently-failing keys*) — proportional to failures, **not** throughput, so it is not pg-boss-style per-event churn. Honestly reopens "no new table," scoped to `pause`. | -| D6 | Producer signature | `send(queue, type, payload, partition_key => text)` (new 4-arg overload) | Avoids collision with `send(queue, type, payload)`. | -| D7 | Slot identity & single-owner | slot = consumer `"#k/N"`; G2 via the per-subscription receive lock. `receive_partitioned`/`subscribe_slot` are `SECURITY DEFINER` (§6) | Defines what a slot is and what makes G2 true and reader-callable. | +| ID | Decision | Choice (v0.4) | Notes | +|----|----------|---------------|-------| +| D1 | Key location | `ev_extra1`, `send()`-sourced queues only | Triggers use `ev_extra1` for table name. | +| D2 | Failure policy | `skip` default (Phase 1); `pause` is Phase 2 (§11) | `pause` has open mechanics. | +| D3 | N | Fixed, persisted in `pgque.partition_consumer(queue, consumer, n)` (written inside SECURITY DEFINER `subscribe_slot`; table revoked from app roles); changed `n` rejected | Enforced invariant, not convention. | +| D4 | Assignment | `(hashtextextended(key,0) % N + N) % N` | Stable, sign-safe. | +| D5 | State budget | **Phase 1 / happy / `skip`: no state, no per-event writes.** **Phase 2 `pause`:** durable `pgque.partition_block(sub_id, partition_key, head_ev_id)` marker (FK `sub_id → subscription on delete cascade`; index `(sub_id, partition_key)`). Blocked keys additionally incur defer churn (§11 O1) — so "no per-event churn" is a Phase-1/non-blocked-key claim only. | Round 3 corrected the churn framing. 
| +| D6 | Producer signature | `send(queue, type, payload, partition_key => text)` | Avoids `send(queue,type,payload)` collision. | +| D7 | Slot & single-owner | slot = consumer `"#k/N"`; G2 via per-subscription receive lock; functions SECURITY DEFINER co-owned with `get_batch_cursor` (§6) | Reader-callable, owner-reachable. | ## 8. Implementation details -- **Producer:** `pgque.send(queue, type, payload, partition_key text default - null)` → `insert_event(queue, type, payload, ev_extra1 => partition_key, …)`. - `SECURITY DEFINER set search_path = pgque, pg_catalog`; revoke from public, - grant `pgque_writer`. -- **Slot registration:** `pgque.subscribe_slot(queue text, consumer text, k int, - n int)` — validates `n >= 1 and 0 <= k < n` (raises otherwise), upserts the - persisted `n` for `(queue, consumer)` and **rejects a changed `n`** (D3), - registers subscription `"#k/n"`. Idempotent for the same `(k,n)`. -- **Partitioned receive:** `pgque.receive_partitioned(queue, consumer, k int, n - int, …)` — after validating/casting `k,n` to int, calls `next_batch` + - `get_batch_cursor(…, i_extra_where => format('and (hashtextextended(ev_extra1,0) - %% %s + %s) %% %s = %s', n, n, n, k))`. Under `pause`, the wrapper also - withholds events whose key has an open `partition_block` row for this `sub_id` - with `head_ev_id < ev_id`. `SECURITY DEFINER` (§6); granted `pgque_reader`. -- **`pause` lifecycle (v0.2):** on nack of `K#i`, upsert - `partition_block(sub_id, K, head_ev_id => ev_id)`. The row is **durable**, so - it survives a crash with no reconstruction. Clear it when `K#i` is acked - (success after retry) **or** dead-lettered — unblock predicate: ack, OR a - `dead_letter` row exists for `(dl_consumer_id = this slot, ev_id = K#i)` - (`event_dead`, `dlq.sql`). So a poison key cannot wedge past `max_retries`. 
-- **`pause` must not hold the batch open** (R7): it acks the batch and tracks - blocked keys via `partition_block`, so the slot's cursor keeps advancing past - *non-blocked* keys and does not pin rotation. -- **Teardown:** `pgque.unsubscribe_slot(queue, consumer, k, n)` removes the slot - subscription; full-consumer teardown removes all N + the - `partition_consumer`/`partition_block` rows. **Caveat:** `unregister_consumer` - cascades `dead_letter` (`on delete cascade`), so tearing down a slot drops that - slot's DLQ audit — documented. -- **Grants:** producer overload → `pgque_writer`; `subscribe_slot` / - `unsubscribe_slot` / `receive_partitioned` → `pgque_reader`. Deny-by-default - re-applied. `get_batch_cursor` stays revoked from all app roles. - -## 9. Tests plan (red/green TDD) - -CI matrix PG 14–18. Write the failing test first. - -- **T-G1a (affinity, stable):** assert the literal `(hashtextextended(K,0)%N+N)%N` - on **every** CI PG version, pinning one concrete `(K, expected)` pair so a hash - drift is caught even if all versions move together. *(red first)* -- **T-G1b (per-key FIFO):** interleave A,B,A,A,B; assert each key in `ev_id` order - across batches, no key on two slots. *(No existing test guards intra-batch - ev_id order — this is new and load-bearing.)* -- **T-retry-affinity (new, round 2):** nack a keyed event; run - `maint_retry_events()` + `force_next_tick` + `ticker()`; assert it redelivers - to the **same** slot `k` and no other. *(The core correctness property under - retry.)* -- **T-G2-block:** two workers on the **same** slot → second blocks on the - subscription lock (mirror `tests/two_session_receive_lock.sh`). -- **T-G2-parallel:** two workers on **different** slots → neither blocks (the - parallelism half). -- **T-no-drop:** keys spanning all slots in one tick window; run all N slots; - union delivered = all events, zero loss. 
-- **T-security (new, round 2):** a bare `pgque_reader` can call - `receive_partitioned`/`subscribe_slot` end-to-end; and `pgque_reader` **cannot** - call `get_batch_cursor` directly (mirror - `tests/test_security_get_batch_cursor.sql`); `receive_partitioned` rejects a - non-integer/out-of-range `n`/`k`. -- **T-G3-pause (order-after-retry):** A#2 nacked (`pause`); drive - `maint_retry_events()` + tick; assert A#3 withheld until A#2 acked-or-DLQ'd; B - unaffected. -- **T-DLQ-unblock:** A#2 exhausts retries → `dead_letter`; assert A#3 then - proceeds. -- **T-slot-crash:** worker holding A#2 (and its `partition_block` row) dies; a - new worker takes slot k; drive maint+tick; assert A#2 redelivered before A#3 - and only to slot k. *(Crash specifically in the post-`maint` window where the - `retry_queue` row is already gone — the round-2 hole.)* -- **T-G3-skip (reorder boundary):** with `skip`, assert the exact permitted - reorder after A#2 fails. -- **T-N-invariant (new, round 2):** `subscribe_slot(…,k,n)` is idempotent; - `subscribe_slot(…,k,n2≠n)` raises. -- **T-empty-slot / T-hot-key:** an empty slot doesn't wedge others; a hot key - saturates one slot while others drain. -- **T-no-bloat (happy path):** all-ack processing of M events adds zero - `retry_queue`/`dead_letter`/`partition_block` rows and no per-event - UPDATE/DELETE. (Failure path legitimately writes — out of scope here.) +- **Producer:** `send(queue, type, payload, partition_key text default null)` → + `insert_event(…, ev_extra1 => partition_key, …)`. SECURITY DEFINER, pinned + search_path; revoke public, grant `pgque_writer`. +- **Tables (created in Phase 1, `if not exists`):** `partition_consumer` + (N persistence) and `partition_block` (Phase-2 marker, empty in Phase 1 so + test assertions are well-formed). Both revoked from app roles (the + `dead_letter` pattern, `dlq.sql:236`); written only inside SECURITY DEFINER + functions. 
+- **`subscribe_slot(queue, consumer, k int, n int)`:** validate `n>=1 and + 0<=k<n`; persist `n` and reject a changed `n` (D3); register subscription `"<c>#k/n"`. Idempotent for the same `(k,n)`. +- **`receive_partitioned(queue, consumer, k int, n int, …)`:** after casting + `k,n` to int, `next_batch` + `get_batch_cursor(…, i_extra_where => + format('and (hashtextextended(ev_extra1,0) %% %s + %s) %% %s = %s', n,n,n,k))`. + SECURITY DEFINER (§6); granted `pgque_reader`. +- **`pause` (Phase 2):** on nack of `K#i`, upsert `partition_block(sub_id, K, + head_ev_id => ev_id)`. A later event of a blocked key (open marker with + `head_ev_id < ev_id`) is **deferred** (see §11 O1 for the missing primitive), + not server-side-dropped (dropping + cursor-advance would lose it). Clear the + marker when `K#i` is acked, **or** when it is dead-lettered — DLQ-unblock + predicate: a `dead_letter` row exists for `ev_id = K#i` and `dl_consumer_id` + equal to this slot's `co_id`, where the slot's `co_id` is obtained by joining + `subscription` (`partition_block.sub_id → subscription.sub_consumer = + dead_letter.dl_consumer_id`) — `sub_id` and `co_id` are different ID spaces + (`dlq.sql:24,75-85`, `pgque.sql:170-183`); do not compare them directly. +- **Teardown:** `unsubscribe_slot` removes the slot subscription (the + `partition_block` FK cascades). Note `unregister_consumer` cascades + `dead_letter` (`dlq.sql:24`), so dropping a slot drops its DLQ audit — + documented. +- **Grants:** producer → `pgque_writer`; `subscribe_slot`/`unsubscribe_slot`/ + `receive_partitioned` → `pgque_reader`; `partition_consumer`/`partition_block` + revoked from all app roles; `get_batch_cursor` stays admin-only. Deny-by-default + re-applied. + +## 9. Tests plan (red/green TDD), CI PG 14–18 + +**Phase 1 (must pass to ship):** +- **T-G1a:** literal `(hashtextextended(K,0)%N+N)%N` on every CI version, pinning + one concrete `(K, expected)` pair. *(red first)* +- **T-G1b:** interleave A,B,A,A,B → each key in `ev_id` order across batches, no + key on two slots. 
(No existing test guards intra-batch `ev_id` order.) +- **T-retry-affinity:** nack a keyed event; `maint_retry_events()` + + `force_next_tick` + `ticker()`; assert redelivery to the **same** slot only. +- **T-G2-block / T-G2-parallel:** same slot → second worker blocks (mirror + `two_session_receive_lock.sh`); different slots → neither blocks. +- **T-no-drop:** keys across all slots in one window; all N slots; union = all + events, zero loss. +- **T-security:** run against an install whose owner is a **non-superuser, + non-`pgque_admin` role** — a bare `pgque_reader` can call + `receive_partitioned`/`subscribe_slot` end-to-end, and **cannot** call + `get_batch_cursor` directly (`42501`, mirror `test_security_get_batch_cursor.sql`); + non-integer/out-of-range `n`,`k` rejected. +- **T-N-invariant:** `subscribe_slot(…,k,n)` idempotent; `(…,k,n2≠n)` raises. +- **T-no-bloat (happy path):** all-ack of M events → zero `retry_queue`/ + `dead_letter`/`partition_block` rows (guard the `partition_block` clause with + `to_regclass('pgque.partition_block') is not null`) and no per-event + UPDATE/DELETE. - **T-engine-untouched:** `pg_get_functiondef` of `batch_event_sql`, - `next_batch_custom`, **and `get_batch_cursor`** (round 2 — the slot model - depends on its `order by 1` re-wrap) byte-identical to baseline. -- **T-idempotent-install:** re-running `pgque.sql` re-creates the partition - functions/tables cleanly. - -## 10. Risks and open questions - -- **R1 — `pause` crash safety: resolved by D5's durable `partition_block` - marker** (round 2 B-R2-2). Test: T-slot-crash with the crash in the - post-`maint` window. -- **R2 — read amplification:** N× steady, ~2N× during rotation overlap, - unbounded-width for a stalled slot (§6). Benchmark the stalled case, not just - uniform N. -- **R3 — hot partitions:** one hot key saturates its slot; documented only. -- **R4 — changing N:** an enforced invariant (D3); true rebalancing is future. 
-- **R5 — `ev_extra1` semantics:** restricted to `send()`-sourced queues; a - dedicated partition-key column is future work. -- **R6 — single-reader/dispatch optimization** to remove read amplification; - adds a hop/state, out of v0.1's budget. -- **R7 — rotation wedge (round 2):** rotation waits for `min(sub_last_tick)` over - ALL subscriptions (`pgque.sql:910`); N slots lower that floor to the slowest - slot, and a wedged `pause` slot could pin rotation for the whole queue → - unbounded data growth. Mitigated by §8 ("`pause` does not hold the batch open"; - cursor advances past non-blocked keys) + an alert on per-slot staleness. - -## 11. (reserved) + `next_batch_custom`, and `get_batch_cursor/4` (pin the 4-arg overload) + byte-identical to baseline. +- **T-idempotent-install:** re-running `pgque.sql` re-creates functions + the two + new tables (`create table/unique index if not exists`) cleanly. + +**Phase 2 (`pause`) — write when O1/O2 (§11) resolve:** +- **T-G3-pause:** A#2 nacked; drive maint+tick; A#3 withheld until A#2 + acked-or-DLQ'd; **and after unblock A#2 then A#3 deliver in `ev_id` order, + exactly once**; B unaffected. +- **T-DLQ-unblock:** A#2 exhausts retries → `dead_letter`; assert the + `partition_block` row for A drops to 0 **via the DLQ branch (no ack ever + occurred)** and A#3 then proceeds. +- **T-slot-crash:** worker holding A#2 dies; assert the `partition_block` row is + present after the crash and before the new worker's first receive (durability); + drive maint+tick; A#2 redelivered before A#3, only to slot k. Crash in the + post-`maint` window (retry_queue row already gone). +- **T-hot-blocked-key:** hot `K` blocked under `pause` → other slots unaffected; + defer cost is bounded by `K`'s backlog until DLQ, not total throughput. + +## 10. Risks + +- **R2 — read amplification:** N× steady, ~2N× rotation overlap, widening for a + stalled slot. Benchmark the stalled case. +- **R3 — hot partitions:** documented only. 
+- **R4 — changing N:** enforced invariant (D3); rebalancing is future. +- **R5 — `ev_extra1`:** send-sourced queues only. +- **R6 — single-reader/dispatch** to remove read amplification; future. +- **R7 — rotation pressure:** rotation waits for `min(sub_last_tick)` over ALL + subscriptions (`pgque.sql:910`); N slots lower the floor to the slowest slot. A + `pause`-blocked slot does *not* pin rotation (deferred events go to retry, not + the held cursor), **but** a hot blocked key keeps that slot perpetually lagging, + so a per-slot staleness alert cannot distinguish "wedged" from "hot key under + pause" — documented, not auto-mitigated. + +## 11. Open design items — Phase 2 `pause` (why it's a follow-up) + +- **O1 — defer-without-retry-increment primitive (blocking for `pause`).** + `finish_batch` acks the whole batch, so withholding `K#i+1` requires removing it + from the batch. A **server-side filter that lets the cursor advance would lose + it** (round-2 data-loss). `event_retry` preserves it but **increments + `ev_retry`**, so a long-blocked key's deferred events would falsely march toward + `max_retries`/DLQ. `pause` therefore needs a new "re-queue without counting as a + retry" path (or a hold-cursor design that doesn't wedge rotation). Undecided. +- **O2 — hot-blocked-key cost.** Until O1 is settled, a hot blocked key makes its + slot re-defer a growing backlog each poll (per-event churn for that key, + bounded by the head's time-to-DLQ). Acceptable for rare failures (the migration + ICP); needs documentation + T-hot-blocked-key before `pause` ships. ## 12. Relationship to producer idempotency (deferred sibling) -Producer-side dedup is a **TTL window** (SQS/NATS model), append-only, GC'd by -rotation — a separate spec. In a log, "processed" is a per-consumer fact the -producer cannot see, so dedup is a producer-side time window while -ordering/serialization is this consumer-side feature. Rationale and prior art: -`blueprints/idempotency/DESIGN.md`. 
+Producer dedup is a TTL window (SQS/NATS), append-only, GC'd by rotation — a
+separate spec. Rationale: `blueprints/idempotency/DESIGN.md`.
 
-## 13. Team of veteran experts (review panel)
+## 13. Review panel
 
-- **Lead:** drafts/revises (this document).
-- **Reviewer A — ops/security:** rounds 1 + 2 applied (security trust boundary,
-  blocked-set durability, rotation wedge, modulo sign).
-- **Reviewer B — QA/testability:** rounds 1 + 2 applied (confirmed G1 ordering +
-  G2 lock; grant/DEFINER wiring; retry-affinity, security, N-invariant tests).
+- **Lead:** drafts/revises.
+- **Reviewer A — ops/security** · **Reviewer B — QA/testability.** Rounds 1–3
+  applied. Round 3 verdict: Phase 1 converged; Phase 2 (`pause`) has open items
+  O1/O2 → split out as follow-up.
 
 ## 14. Sprint plan
 
-1. **S1 — producer + key plumbing:** `send(…, partition_key =>)` on
-   send-sourced queues. Tests T-G1a, T-no-bloat(happy), T-idempotent-install.
-2. **S2 — slot consumers (`skip` default):** `subscribe_slot` (persisted N, D3),
-   `receive_partitioned` via `get_batch_cursor` `extra_where` (`SECURITY
-   DEFINER`, D7), teardown. Tests T-G1b, T-retry-affinity, T-G2-block,
-   T-G2-parallel, T-no-drop, T-security, T-N-invariant, T-G3-skip,
-   T-engine-untouched.
-3. **S3 — `pause` policy (v0.2):** `partition_block` marker; DLQ-unblock; "no held
-   batch" (R7). Tests T-G3-pause, T-DLQ-unblock, T-slot-crash.
-4. **S4 — docs + benchmark:** throughput vs N; read-amp (steady, rotation,
-   stalled); per-tenant order under load.
+1. **S1 — producer + key plumbing** (+ the two tables, empty). T-G1a,
+   T-no-bloat(happy), T-idempotent-install.
+2. **S2 — slot consumers (`skip`), SECURITY DEFINER + co-ownership, persisted N.**
+   T-G1b, T-retry-affinity, T-G2-block/parallel, T-no-drop, T-security,
+   T-N-invariant, T-engine-untouched. **← Phase 1 ships here.**
+3. **S3 — `pause` (Phase 2), gated on O1/O2.** T-G3-pause, T-DLQ-unblock,
+   T-slot-crash, T-hot-blocked-key.
+4. **S4 — docs + benchmark** (read-amp: steady/rotation/stalled; per-tenant order).
 
 ## 15. Changelog
 
-- **v0.3 (draft):** review round 2 applied. **Confirmed G1 ordering is true**
-  (engine `order by 1`, preserved through `get_batch_cursor`) and the G2 lock is
-  real/tested. Fixed: (security) `receive_partitioned`/`subscribe_slot` are
-  `SECURITY DEFINER` over the admin-only `get_batch_cursor`, with integer
-  validation and the real trust-boundary argument (B-R2-1); (correctness)
-  `pause` blocked-set moved from the transient `retry_queue` to a durable compact
-  `partition_block` marker (B-R2-2 / D5); (bug) modulo normalized
-  `(h%N+N)%N` (D4); added R7 rotation-wedge + "no held batch"; specified N
-  persistence + teardown + DLQ-cascade caveat (D3); explicit DLQ-unblock
-  predicate. Added tests: T-retry-affinity, T-security, T-N-invariant, split
-  T-G2 block/parallel, `get_batch_cursor` in T-engine-untouched, pinned hash
-  pair. Round-2 decisions in `decisions.md`.
-- **v0.2 (draft):** review round 1 — re-grounded the mechanism to N independent
-  slot subscriptions; restated G1/G2/G3; corrected retry rationale; `skip`
-  default. (Full detail in `decisions.md`.)
-- **v0.1 (draft):** initial single-pass SamoSpec-format draft.
+- **v0.4 (draft):** review round 3. **Phase 1 declared converged /
+  implementation-ready; `pause` split into Phase 2** with explicit open items
+  (§11 O1 defer-without-retry-increment, O2 hot-blocked-key). Corrected the
+  SECURITY DEFINER justification to the **co-ownership** invariant (not
+  `pgque_admin`; not "like receive/nack") + non-superuser-owner security test
+  (round 3 B1). Fixed the DLQ-unblock `sub_id`↔`co_id` join; added
+  `partition_block` FK-cascade + index + revoked-from-roles; tables created empty
+  in Phase 1; `T-no-bloat` guarded with `to_regclass`; `T-engine-untouched` pins
+  the `/4` overload; `T-G3-pause` now asserts in-order-exactly-once after unblock;
+  `T-DLQ-unblock` asserts marker-clear-via-DLQ; `T-slot-crash` asserts marker
+  durability; added `T-hot-blocked-key`. Round-3 detail in `decisions.md`.
+- **v0.3:** round 2 — confirmed G1 ordering + G2 lock real; SECURITY DEFINER
+  wiring; durable `partition_block`; modulo sign fix; R7.
+- **v0.2:** round 1 — N independent slot subscriptions; G1/G2/G3; `skip` default.
+- **v0.1:** initial SamoSpec-format draft.
diff --git a/blueprints/partition-keys/decisions.md b/blueprints/partition-keys/decisions.md
index 5f2ed0ef..5b2d2270 100644
--- a/blueprints/partition-keys/decisions.md
+++ b/blueprints/partition-keys/decisions.md
@@ -97,8 +97,49 @@ Accepted / rejected / deferred choices, tracked across review rounds.
 B4 (D2-vs-state / send sig) ✅ closed · B5 (DLQ-unblock / slot definition) ✅ closed;
 spawned B-R2-1 (now fixed) · B6 ✅ closed.
 
-## Still open (for round 3, if run)
-- Bench numbers for read amplification (steady N× vs ~2N× rotation vs stalled
-  slot) to decide if R6 (single-reader/dispatch) is needed in v0.1.
-- Exact `partition_block` withhold predicate wording in `receive_partitioned`
-  (server-side `not exists` vs worker-side filter).
+## Review round 3 (convergence; both personas verified against the engine)
+
+### Verdict
+**Phase 1 (`skip`-default partition consumption) CONVERGED / implementation-ready.
+Phase 2 (`pause`) NOT converged — split out as a follow-up** with open items O1/O2.
+Both reviewers agreed the round-2 engine-anchor and security *posture* are solid;
+the remaining gaps are all in the new `pause`/DLQ surface.
+
+### Accepted → v0.4
+- **B1 (security ownership, affects Phase 1).** The "SECURITY DEFINER owned like
+  receive/nack" justification was wrong: `receive`/`nack` never call
+  `get_batch_cursor`. The real mechanism is **co-ownership** — a function owner
+  may execute its own functions regardless of grants, so `receive_partitioned`
+  reaches the admin-only `get_batch_cursor` only because the install owner owns
+  both. Not `pgque_admin` membership. Invariant: partition functions created by
+  the `\i pgque.sql` owner. Test under a non-superuser owner. (§6, D7, T-security)
+- **The `pause` withhold mechanic is genuinely unsolved (O1).** Combining the two
+  reviewers: a server-side filter that advances the cursor **loses** the withheld
+  event (data loss); `event_retry` preserves it but **increments `ev_retry`**, so
+  deferred events of a long-blocked key march toward false DLQ. `pause` needs a
+  *defer-without-retry-increment* primitive that does not exist. → `pause` is
+  Phase 2; O1 is its blocking open item. (§11 O1)
+- **Hot-blocked-key cost (O2).** Until O1, a hot blocked key re-defers a growing
+  backlog per poll (bounded by head's time-to-DLQ). Document + T-hot-blocked-key.
+- **DLQ-unblock ID-space join.** Marker keyed on `sub_id`; `dead_letter.dl_consumer_id`
+  is `co_id`. Must join `subscription` to map; do not compare directly. (§8)
+- **`partition_block` hygiene:** FK `sub_id → subscription on delete cascade`
+  (no orphans), index `(sub_id, partition_key)`, revoked from app roles, created
+  empty in Phase 1 so `T-no-bloat` is well-formed (guard with `to_regclass`). (§8, D5)
+- **Test tightening:** `T-engine-untouched` pins the `get_batch_cursor/4` overload;
+  `T-G3-pause` asserts in-order-exactly-once after unblock; `T-DLQ-unblock`
+  asserts marker-clear-via-DLQ-branch (no ack); `T-slot-crash` asserts marker
+  durability; `T-security` runs under a non-superuser owner. (§9)
+- **N persistence writer/grants:** `partition_consumer` written inside
+  SECURITY DEFINER `subscribe_slot`; table revoked from app roles. (D3, §8)
+
+### Round closure
+B-R2-1 security: posture closed, ownership prose corrected (B1). B-R2-2 durable
+marker: direction correct; hygiene (FK/index/grants) + the withhold mechanic (O1)
+now specified/flagged. All round-1 + round-2 *Phase-1* items closed.
+
+## Still open (Phase 2 `pause`, before it can be built)
+- **O1** — choose the defer-without-retry-increment mechanism (new primitive vs
+  hold-cursor-without-wedging-rotation).
+- **O2** — bound + document the hot-blocked-key degradation.
+- Read-amplification bench numbers (R2) to decide if R6 is needed.
diff --git a/web/public/briefs/partition-keys.html b/web/public/briefs/partition-keys.html
index 6a412c81..e35ff604 100644
--- a/web/public/briefs/partition-keys.html
+++ b/web/public/briefs/partition-keys.html
@@ -76,8 +76,8 @@

 Partition Keys
 Ordered, parallel consumption — the log-native way

 slug partition-keys
-version v0.3 (draft)
-review rounds 1+2 applied
+version v0.4 (draft)
+Phase 1 converged
 engine untouched
@@ -172,18 +172,18 @@

03 Architecture

04 Scope

-In · v0.1
+Phase 1 · converged, build now

   • Partition key on send()-sourced events
-  • Stable hashtextextended(key,0) % N affinity
+  • N independent slot consumers, stable affinity
   • Per-key order + single processor (G1, G2)
   • skip failure policy (default)

-Deferred / out
+Phase 2 · follow-up / out

-  • pause policy — specified, ships v0.2
+  • pause strict order — open mechanic (defer w/o retry-count)
   • Producer idempotency / dedup window
   • Trigger-sourced queues · dynamic N
   • Hot-partition mitigation · read-amp optimization

@@ -196,7 +196,7 @@

05 Key decisions

 ID | Decision | Choice (v0.2)
 D1 | Where the key lives | ev_extra1, send()-sourced queues only
-D2 | Failure policy | skip default; pause ships v0.2
+D2 | Failure policy | skip default (Phase 1); pause = Phase 2
 D3 | Slot count N | Fixed, persisted & enforced per (queue, consumer)
 D4 | Assignment function | (hashtextextended(key,0) % N + N) % N (stable, sign-safe)
 D5 | State budget | Happy path & skip: none. pause: a compact blocked-key marker (per failing key, not per event)

@@ -204,7 +204,7 @@
05 Key decisions

 D7 | Slot & single-owner | slot = consumer "<c>#k/N"; receive lock

-Round 1 re-grounded the mechanism (slots are independent subscriptions, not a coop overlay). Round 2 verified it against the engine: G1's ev_id ordering is real (order by 1, preserved through the filter) and the G2 single-owner lock is the tested #97 double-delivery guard. It also corrected the security trust boundary (the filter hook is admin-only, so the consumer wrappers are SECURITY DEFINER) and moved pause's blocked-set to a durable per-failing-key marker.
+Three review rounds against the real engine. Round 2 verified G1's ev_id ordering and the G2 single-owner lock are real. Round 3 declared Phase 1 (skip) converged / implementation-ready and split pause out as a follow-up: withholding a blocked key's later event has an unsolved mechanic — a server-side drop would lose the event, and event_retry would wrongly count it toward DLQ — so strict pause needs a defer-without-retry-increment primitive that doesn't exist yet. It also corrected the security model: the filter hook is admin-only, so the consumer wrappers reach it by co-ownership with the installer, not by a role grant.
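
The sign-safe slot assignment in decision D4 can be illustrated with a small model. This is a Python sketch, not PgQue code: it assumes the hash is a signed integer and that `%` truncates toward zero, as PostgreSQL's integer `%` does. `trunc_mod`, `naive_slot`, and `slot_for_hash` are hypothetical names, and the hash value is passed in directly because `hashtextextended` is not reimplemented here.

```python
def trunc_mod(a: int, n: int) -> int:
    """Remainder whose sign follows the dividend (truncating division).

    Python's own % floors instead, so the truncating behaviour is emulated.
    """
    r = abs(a) % n
    return -r if a < 0 else r


def naive_slot(h: int, n: int) -> int:
    """h % N under truncation: can be negative, so not a valid slot index."""
    return trunc_mod(h, n)


def slot_for_hash(h: int, n: int) -> int:
    """(h % N + N) % N: always lands in 0..N-1, even for negative hashes."""
    if n <= 0:
        raise ValueError("slot count N must be positive")
    return trunc_mod(trunc_mod(h, n) + n, n)


if __name__ == "__main__":
    for h in (-7, -1, 0, 5, 9):
        print(h, naive_slot(h, 4), slot_for_hash(h, 4))
```

Under these assumptions a negative hash such as `-7` with `N = 4` gives `-3` from the naive form and slot `1` from the normalized form, which is the case the D4 normalization guards against.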

06 Sprint plan

@@ -214,10 +214,10 @@
06 Sprint plan

 S4  docs + read-amp benchmark

-Open for round 2: the exact retry_queue predicate that rebuilds a slot's
-blocked-key set after a crash (the crux of crash-safe pause), and whether
-N× read amplification at target throughput justifies a single-reader/dispatch
-optimization in v0.1.
+Phase 1 (S1–S2) is the converged, build-now slice. Phase 2 (S3, pause)
+is gated on two open items: a defer-without-retry-increment primitive (O1) and
+bounding the hot-blocked-key cost (O2). Phase 1 alone covers the per-tenant
+ordered, parallel consumption the motivating workload needs.
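
Why O1 blocks `pause` can be seen in a toy simulation. The sketch below is Python, not PgQue code, and every name in it is hypothetical: `Event.retries` stands in for `ev_retry`, `defer_counting_retry` models deferring through a retry path that counts each deferral, and `defer_without_increment` models the primitive that does not exist yet. It also assumes an event is dead-lettered once its count exceeds `max_retries`; the real threshold semantics may differ.

```python
from dataclasses import dataclass


@dataclass
class Event:
    key: str
    seq: int
    retries: int = 0  # hypothetical stand-in for ev_retry


def defer_counting_retry(event: Event, max_retries: int) -> str:
    """Defer via a retry path that counts every deferral as a retry."""
    event.retries += 1
    # Assumption: dead-letter once the count exceeds max_retries.
    return "dead_letter" if event.retries > max_retries else "deferred"


def defer_without_increment(event: Event, max_retries: int) -> str:
    """Hypothetical O1 primitive: re-queue and leave the retry count alone."""
    return "deferred"


def poll_while_blocked(defer, polls: int, max_retries: int) -> tuple[str, int]:
    """Defer K#3 once per poll while K#2 stays blocked; stop at dead-letter."""
    event = Event(key="K", seq=3)
    outcome = "deferred"
    for _ in range(polls):
        outcome = defer(event, max_retries)
        if outcome == "dead_letter":
            break
    return outcome, event.retries


if __name__ == "__main__":
    print(poll_while_blocked(defer_counting_retry, polls=10, max_retries=3))
    print(poll_while_blocked(defer_without_increment, polls=10, max_retries=3))
```

In this model the counting path dead-letters K#3 on the fourth poll although no handler ever ran for it, while the non-counting path keeps it deferred with a retry count of zero for as long as the key stays blocked.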