docs(rfc): RFC 0001 bulk data plane (load/export interfaces and scale path)#219
docs(rfc): RFC 0001 bulk data plane (load/export interfaces and scale path)#219ragnorc wants to merge 1 commit into
Conversation
… path Proposes canonicalizing the bulk-write surface as `load` on every layer, a transport-agnostic batch core, streamed NDJSON and Arrow transports, snapshot-pinned and compressed export, and the large-scale-load strategy (branch-load, incremental staging, resumability). Documents the per-mode correctness guarantees and gaps, and a substrate-trajectory section on Lance native Multi-Table Transactions retiring the manifest-CAS/sidecar stack. Public RFC track, Status: Proposed.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8a81fea1f0
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| One isolation interaction must land with this item. Today's loads exhibit a snapshot-isolation write skew on edge referential integrity: edge endpoints are validated against a snapshot plus pending batches, but the publish-time expected-version CAS covers only the tables this load writes (`crates/omnigraph/src/exec/staging.rs:501-514`, `db/manifest/publisher.rs:353-368`), so a concurrent mutation of a node table this load only reads can land between validation and publish, committing edges to nodes that no longer exist. Incremental staging lengthens exactly that window, so phase 2 must pair it with a remedy rather than widening the anomaly: either include the RI-read node tables' versions in the expected map (turning the race into a loud publish conflict, at the cost of false conflicts on unrelated node writes) or re-validate RI at commit time under the touched-table queues. See unresolved question 7. | ||
| - **Multipart Arrow IPC load.** `Content-Type: multipart/mixed`, one part per table, each part one Arrow IPC stream (IPC streams are single-schema by spec, so multi-table loads need framing; multipart keeps one request = one `load_as` = one publish, with no cross-call state). Schema compatibility is checked against the catalog per part; ID resolution and all validators run on the decoded batches. Note `allow_64bit` defaults to false in IPC writers, so individual field lengths cap at signed 32-bit, relevant for `Blob` properties and documented as a limit. | ||
| - **Load-by-reference.** `POST /load` with a JSON body `{"uri": …}` instead of inline data; the server streams the artifact from object storage through the same decoder. Fetch scope is restricted to the graph's own bucket or a server-issued presigned upload prefix; arbitrary URIs are rejected (SSRF). | ||
| - **Idempotency keys.** Optional `Idempotency-Key` request header, recorded in the published commit's metadata in `_graph_commits`. On a retry, the server scans recent commits on the target branch for the key and returns the original `LoadOutput` instead of re-applying. Derived from the commit graph: no side table, no drift. This directly closes the retry-after-roll-forward double-apply window. |
There was a problem hiding this comment.
Persist idempotency keys through recovery
Storing the key only in _graph_commits does not cover the retry window this RFC is meant to close: current RecoverySidecar data has actor/writer/table pins but no request key, and normal publishes also advance the manifest before record_graph_commit (graph_publish.before_commit_append). If a load crashes after per-table commits or after the manifest publish and is then rolled forward/retried, a commit-graph scan will not find the original key, so append-mode POST /load can still double-apply with the same Idempotency-Key. Please make the key part of the durable recovery/publish boundary, or avoid claiming this closes the double-apply window.
Useful? React with 👍 / 👎.
| - **The manifest-CAS + sidecar + sweep stack is removable** once MTT lands. This is the natural closure of the known gap that `invariants.md` and `AGENTS.md` already document ("multi-table commit = manifest CAS + recovery sidecars; *not* a single Lance primitive"). It is tracked by `iss-863` ("Track Lance namespace batch commit API and retire recovery sidecar when available"); that issue should be re-pointed at #7264 as the concrete upstream design. The per-table staging primitive (`stage_append` and its siblings) survives the swap; what MTT replaces is the multi-table coordination above it, since MTT itself stages each table before its atomic flip. So the loader and batch core, which call through the staging trait, are insulated from the change. | ||
| - **The cross-process write fence comes for free.** MTT's leased barrier is the proper version of the lease this RFC deliberately declines to hand-roll. The single-coordinator topology rule is the bridge until then. | ||
| - **Write skew does not go away.** MTT's stated isolation level is snapshot isolation with write skew *explicitly permitted* ("Disjoint concurrent transactions may both commit"). So the edge-RI write-skew anomaly (phase 2 / unresolved Q7, `iss-ri-write-skew-dangling-edges`) is an omnigraph application-level integrity concern that survives the migration; it is not a sidecar-era artifact and must be solved in omnigraph regardless of which substrate carries the commit. | ||
| - **MTT addresses commit coordination; large-load scale is separate.** Its scaling win keeps the catalog out of the per-commit hot path, which matters for commit *frequency* (many small transactions serializing on the catalog) rather than commit *size* (one large load is a single cheap commit either way, even under the removed `table_version_storage` approach). MTT replaces the cheapest part of a large load (the O(tables) flip) and leaves the expensive parts (fragment writes, validation, RI) untouched. The large-load levers all live above it (incremental staging, branch isolation, resumability) and are substrate-independent. So large-load work should not block on #7264. | ||
| - **Two MTT open questions bear on omnigraph directly.** Pointer-swap vs fast-forward-main (lance#7264 open question 1) decides whether a table's physical identity stays stable: pointer-swap means the catalog becomes the sole resolver and any omnigraph code that opens a dataset by physical path must re-resolve after every commit; fast-forward-main keeps identity stable at the cost of a per-transaction intent record (structurally the sidecar we already have). And the barrier's safety needs a generation/epoch, not a TTL alone, or a slow rebase that outruns the lease can still lose an update. Both are flagged in the upstream discussion. | ||
|
|
||
| Until #7264 ships, the interim machinery stays exactly as today; this RFC adds no new sidecar kind and no new writer that advances Lance HEAD before manifest publish. | ||
|
|
||
| A scope note to prevent a common confusion: Lance's MemTable/WAL (MemWAL) LSM is a separate substrate primitive on a different axis (single-table streaming write latency), not multi-table commit. It does not apply to bulk loads, which write large immutable fragments directly, and is tracked separately for hot-table write latency (`iss-681`). Do not conflate it with MTT. | ||
|
|
||
| ### Error behavior | ||
|
|
||
| All new failure modes are typed and loud: budget exceeded (names the budget and observed value), unsupported media type (406/415 with the supported list), multipart part schema mismatch (names the table and field), idempotency-key replay (200 with the original output, flagged in the body), export job expiry (410 with the job id). Streamed-load failures before the publish leave the manifest untouched, as today; aborted streams surface the abort reason in the response trailer or the connection error, never a partial commit. | ||
|
|
||
| ## Guarantees and known correctness gaps | ||
|
|
||
| What a load promises today, stated per mode so the contract is explicit: | ||
|
|
||
| - **Atomicity (all modes).** A load publishes one manifest commit; readers see all of it or none. Merge, append, and overwrite all run the same stage → `commit_staged` → manifest-flip tail. Two caveats: a fork-if-missing load is two commits (the branch creation publishes before the load, so a failed load leaves an empty created branch), and a crash after the per-table commits but before the manifest publish is rolled forward on the next open, so a client that saw an error may find the load applied. | ||
| - **Durability (all modes).** Lance fragments and the manifest are persisted before the response is sent. The roll-forward above is durability-preserving recovery, not loss. | ||
| - **Isolation (snapshot).** Readers hold one snapshot for a query; writers get first-committer-wins per table via the expected-version CAS. The gap is cross-table write skew on edge RI (below), which snapshot isolation permits and which survives the Lance MTT migration. | ||
| - **Consistency (validated at write, with mode-specific gaps).** Value, enum, intra-batch unique, edge-endpoint, and edge-cardinality checks run identically for all three modes before any commit. The remaining gaps are mode-specific and listed below. | ||
|
|
||
| Known correctness gaps on the load path, each tracked: | ||
|
|
||
| | Gap | Modes | Tracking | | ||
| |---|---|---| | ||
| | Merge replaces the whole matched row; omitted fields become null (silent data loss) | merge | `iss-986`, Q1 | | ||
| | Edge-RI write skew: a concurrent node mutation between validation and publish commits dangling edges | all | `iss-ri-write-skew-dangling-edges`, Q7 | | ||
| | Overwrite of a node table orphans committed edges in edge tables the load did not touch | overwrite | `iss-overwrite-orphans-committed-edges` | | ||
| | Append does not check incoming ids against committed rows, so it can create duplicate keys | append | `iss-714` | | ||
| | `@unique` / `@key` enforced intra-batch only, not against already-committed rows | all | `iss-714` | |
There was a problem hiding this comment.
Opaque internal issue IDs throughout a public RFC
The docs/rfcs/ track is explicitly open to external contributors (per docs/rfcs/README.md), but the document contains multiple internal tracking IDs that are opaque to anyone outside the team: iss-863 (line 169), iss-681 (line 177), iss-986 (line 196), iss-714 (lines 199–200), iss-cedar-context-load-mode (line 121), iss-ri-write-skew-dangling-edges (lines 127, 172, 197, 211, 237), iss-overwrite-orphans-committed-edges (line 198). An external contributor reading the Guarantees table cannot look up iss-714 or iss-986 to understand the tracked work or its status.
Per AGENTS.md rule 5, these should be replaced with GitHub issue links (e.g. https://github.com/modernrelay/omnigraph/issues/NNN) or with inline prose rationale. For slugs like iss-ri-write-skew-dangling-edges that don't map to a number, expanding them to a brief "tracked in a follow-up issue" sentence (or, when the GitHub issue exists, linking it) preserves the context without tying the document to a private tracker. The same applies to the iss-863 note in the Substrate trajectory section — that issue should become a GitHub link once it's in the public tracker.
Context Used: AGENTS.md (source)
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
| ### Server: phase 1 transport | ||
|
|
||
| - `POST /load`: raw `application/x-ndjson` body, parameters in the query string. Chunks feed the decoder as they arrive. Budget enforcement is incremental and loud (typed error naming the exceeded budget), replacing the silent-by-comparison 413. | ||
| - Admission control gains a concurrent-streaming-loads-per-actor cap alongside the existing per-actor workload caps. | ||
| - `LoadOutput` (renamed from `IngestOutput`, old name kept as an OpenAPI alias) gains `graph_commit_id`, threaded from the publish result. These DTOs now live in the `omnigraph-api-types` crate (extracted in RFC-009 Phase 2, `4821e72`), so the field additions land there and are consumed by both the server handler and the CLI's `GraphClient` arms, ending the triple DTO construction. | ||
| - `GET /export`: query-string parameters (`branch`, `at`, `types`), `Accept`-negotiated body. `application/vnd.apache.arrow.stream` is produced per table schema; phase 1 limits IPC output to single-schema responses (query results, single-type exports) and keeps NDJSON for mixed whole-branch exports. `POST /export` stays as the compatibility alias. | ||
| - The CLI `load` remote path switches from `fs::read_to_string` + JSON envelope to streaming the file as the request body. The CLI `export` gains `--at` and `--format arrow`. | ||
|
|
There was a problem hiding this comment.
TypeScript SDK and MCP server migration path not addressed
The RFC describes retiring POST /ingest to a deprecated alias in favour of POST /load, but does not mention the TypeScript SDK (packages/sdk/src/client.ts) or the MCP server (packages/mcp/src/server.ts) as consumers that will need migration. Currently og.ingest() calls POST /ingest and the MCP tool is named "ingest" — both will keep working via the deprecated alias, but Phase 1 implementors will need to add og.load() / a "load" tool and deprecate the old names in those repos. Because omnigraph-ts is the primary programmatic interface for remote callers (the same audience the naming unification targets), the RFC would be more complete with at least a sentence noting this cross-repo migration obligation.
| - **Incremental staging.** `stage_append` already chains via `prior_stages: &[StagedWrite]` (`crates/omnigraph/src/table_store.rs:856-860`). The loader stages decoded batches as they arrive instead of accumulating them in `MutationStaging`; resident memory drops to the current batch plus unique-key and node-ID hash sets. Commit semantics are unchanged: one `commit_staged` per table, one sidecar, one publish. An aborted stream leaves unreferenced staged data files; `cleanup` garbage-collects them (state derived from observable storage, reconciler-shaped). Merge mode's cross-batch upsert semantics under repeated `stage_merge_insert` need verification; if they conflict, streaming ships for append/overwrite first and merge keeps the accumulate path. | ||
|
|
||
| One isolation interaction must land with this item. Today's loads exhibit a snapshot-isolation write skew on edge referential integrity: edge endpoints are validated against a snapshot plus pending batches, but the publish-time expected-version CAS covers only the tables this load writes (`crates/omnigraph/src/exec/staging.rs:501-514`, `db/manifest/publisher.rs:353-368`), so a concurrent mutation of a node table this load only reads can land between validation and publish, committing edges to nodes that no longer exist. Incremental staging lengthens exactly that window, so phase 2 must pair it with a remedy rather than widening the anomaly: either include the RI-read node tables' versions in the expected map (turning the race into a loud publish conflict, at the cost of false conflicts on unrelated node writes) or re-validate RI at commit time under the touched-table queues. See unresolved question 7. | ||
| - **Multipart Arrow IPC load.** `Content-Type: multipart/mixed`, one part per table, each part one Arrow IPC stream (IPC streams are single-schema by spec, so multi-table loads need framing; multipart keeps one request = one `load_as` = one publish, with no cross-call state). Schema compatibility is checked against the catalog per part; ID resolution and all validators run on the decoded batches. Note `allow_64bit` defaults to false in IPC writers, so individual field lengths cap at signed 32-bit, relevant for `Blob` properties and documented as a limit. |
There was a problem hiding this comment.
allow_64bit description conflates the buffer-offset flag with a per-field size limit
"Note allow_64bit defaults to false in IPC writers, so individual field lengths cap at signed 32-bit" isn't quite right. allow_64bit in the Arrow IPC format controls whether record-batch message length fields use 32-bit or 64-bit values — it gates batches larger than 2 GiB total, not individual field widths. The relevant constraint for large binary columns (e.g. Blob properties) is whether the IPC writer uses LargeBinary / LargeUtf8 offsets (64-bit) vs Binary / Utf8 offsets (32-bit, 2 GiB per column buffer). Reframing this as "binary column buffers default to 32-bit offsets, capping each column at 2 GiB per batch; Blob columns may require LargeBinary encoding and an IPC writer configured to emit it" would avoid misleading implementors who hit this in a multi-column batch rather than a single oversized field.
- P1: end-state taxonomy `schema apply` annotation said "Open Q10" — now points at the resolved Decision 10 (cluster graphs via cluster apply). - P1: add the `alias <name>` verb (Decision 4) to the end-state taxonomy's local section — it was claimed "full command set" but omitted. - P2: Decision 11's bulk-data-plane reference now carries the "PR #219, not yet merged" caveat (matches the Relationship section). - P2: footnote now states the `check`→`lint` argv-shim is removed (its end-state disposition was unspecified). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…228) * docs(rfc): RFC-011 — CLI refactoring (one addressing & config model) A maintainer-internal RFC (Status: Proposed) for the post-omnigraph.yaml CLI: one ontology (store/server/cluster; cluster vs operator config; catalog; context; capability); addressing = scope + --graph with the access path *derived*; served is the default front door and direct storage is privileged (admin/break-glass); stateless per command; definitions named, payloads passed. Includes the full end-state command taxonomy (by capability), a current-state appendix, migration, invariants check, and the resolved Decisions (with two deferred). Completes the config/CLI lineage RFC-007 → RFC-008 → RFC-009 → RFC-010. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * docs(rfc): RFC-011 — address Greptile review (4 doc fixes) - P1: end-state taxonomy `schema apply` annotation said "Open Q10" — now points at the resolved Decision 10 (cluster graphs via cluster apply). - P1: add the `alias <name>` verb (Decision 4) to the end-state taxonomy's local section — it was claimed "full command set" but omitted. - P2: Decision 11's bulk-data-plane reference now carries the "PR #219, not yet merged" caveat (matches the Relationship section). - P2: footnote now states the `check`→`lint` argv-shim is removed (its end-state disposition was unspecified). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
What & why
Adds RFC 0001, the bulk data plane proposal: canonicalizes the bulk-write surface as
loadon every layer, defines a transport-agnostic batch core, streamed NDJSON / Arrow IPC / compressed transports, snapshot-pinned and presigned export, and the large-scale-load strategy (branch-load, incremental staging, resumability). It also documents the per-mode correctness guarantees and gaps, the multi-writer topology rule, and a substrate-trajectory section on Lance native Multi-Table Transactions (lance#7264) retiring the manifest-CAS/sidecar stack. Docs-only; no code change.Backing issue / RFC
Checklist
scripts/check-agents-md.shpassesNotes for reviewers
Status: Proposed. The RFC reflects the v0.7.0 state of
main(the engine and CLIload/ingestunification already shipped; the HTTP route, streamed transport, response/export fields, and all phase-2 scale work remain). Highest-value review targets: the per-mode correctness gaps in the Guarantees section, the RI write-skew remedy (Q7), and the MTT-coupling open questions (Q8). Internal tracking lives in the modernrelay dev graph (epicepc-bulk-data-plane, SpecFilespc-rfc-0001-bulk-data-plane).Note
Low Risk
Documentation-only RFC with no code, config, or API behavior changes in this PR.
Overview
Adds RFC 0001 (
docs/rfcs/0001-bulk-data-plane.md), a proposed design for the bulk data plane. It does not change runtime behavior.The RFC proposes canonical
load/exporton every surface (HTTP still onPOST /ingesttoday), a transport-agnosticload_batches_asbatch core, and a two-phase scale path: phase 1 streamed NDJSON,graph_commit_idon write responses, snapshot-pinnedGET /export, and Arrow IPC viaAccept; phase 2 incremental staging, multipart Arrow ingest, load-by-reference,Idempotency-Key, and presigned export jobs. It documents branch-load → merge for large ingest, a single-coordinator multi-writer rule (replacing an earlier grace-window idea; in-process safety from #203), guarantees vs known load-path gaps, invariant/deny-list alignment, and Lance MTT (#7264) as the path to retire manifest-CAS/sidecars. Nine unresolved questions flag merge streaming, RI write-skew remedies, and resumability.Reviewed by Cursor Bugbot for commit 8a81fea. Bugbot is set up for automated code reviews on this repo. Configure here.
Greptile Summary
Adds RFC 0001 (
docs/rfcs/0001-bulk-data-plane.md), proposing to canonicalize bulk-write asload/exporton every layer, replace the buffered 32 MB NDJSON envelope with streamed transport, and stage Arrow IPC, idempotency keys, load-by-reference, and presigned export jobs as phase 2.POST /loadwith streamedapplication/x-ndjson,graph_commit_idin write responses, snapshot-pinnedGET /export, Arrow IPC viaAccept, and the documented single-coordinator multi-writer topology rule (in-process half already closed by Recovery liveness, storage fault-injection matrix, and one storage implementation over object_store #203).Idempotency-Keyheader backed by the commit graph, and presigned export jobs with TTL GC.Confidence Score: 4/5
Docs-only change; no code, config, or runtime behavior is modified by this PR.
The RFC is thorough and well-aligned with the codebase invariants. The main concern is that it uses opaque internal issue shorthand (iss-986, iss-714, iss-cedar-context-load-mode) throughout a document the RFC README explicitly describes as the public, externally-authorable track. Once merged, those labels become permanent in the commit history without any way for outside contributors to resolve them. A secondary gap is that the fork-if-missing two-commit atomicity caveat is noted in prose but absent from the known-gaps table.
docs/rfcs/0001-bulk-data-plane.md — the internal tracker references and the missing fork-if-missing gap table entry both need attention before merge.
Important Files Changed
Sequence Diagram
sequenceDiagram participant Client participant Server as omnigraph-server participant Engine as omnigraph (engine) participant Store as Object Store (Lance) Note over Client,Store: Phase 1 - Streamed NDJSON Load Client->>Server: "POST /load?branch=review/x&mode=merge" Server->>Engine: load_batches_as(branch, base, Stream, mode, actor) loop Per decoded batch Engine->>Store: stage_append / stage_merge_insert end Engine->>Store: commit_staged (per table) Engine->>Store: write recovery sidecar Engine->>Store: ManifestBatchPublisher publish (CAS) Engine->>Store: delete recovery sidecar Engine-->>Server: "LoadResult { graph_commit_id }" Server-->>Client: "200 OK { graph_commit_id }" Note over Client,Store: Phase 1 - Snapshot-Pinned Export Client->>Server: "GET /export?branch=main&at=commit_id" Server->>Engine: export(branch, at, types) Engine->>Store: open snapshot at commit_id Store-->>Engine: Arrow RecordBatches Engine-->>Server: Arrow IPC stream Server-->>Client: 200 application/vnd.apache.arrow.stream Note over Client,Store: Phase 2 - Export Job (presigned) Client->>Server: "POST /exports { branch, at, format }" Server->>Engine: pin snapshot, materialize parts Engine->>Store: write Arrow/Parquet part-files Engine-->>Server: job_id, parts[] Server-->>Client: "202 { job_id, parts presigned URLs, expires_at }" Client->>Store: GET presigned URL (direct) Store-->>Client: part dataComments Outside Diff (3)
docs/rfcs/0001-bulk-data-plane.md, line 121 (link)Internal tracker IDs in a public-facing document
This RFC lives in
docs/rfcs/— the track theREADME.mdexplicitly describes as "open to anyone, including external contributors." It uses opaque internal issue shorthand throughout:iss-cedar-context-load-modehere,iss-ri-write-skew-dangling-edges,iss-overwrite-orphans-committed-edges,iss-986,iss-714,iss-863, andiss-681elsewhere. An outside contributor reading this has no way to look these up — there is no URL, no GitHub issue number, no description of what was decided. AGENTS.md §5 is explicit: "Prefer stable public identifiers… do not reference private ticket systems, internal codenames, or planning shorthand that an outside contributor cannot inspect."Each should either be replaced with a public GitHub issue link, or the reference should be dropped in favour of the prose that already describes the concern. The pattern repeats across the Guarantees table (lines 200–207), the Substrate Trajectory section (line 175,
iss-863; line 177,iss-681), and the Policy section (this line,iss-cedar-context-load-mode).Context Used: CLAUDE.md (source)
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
docs/rfcs/0001-bulk-data-plane.md, line 193-198 (link)Fork-if-missing atomicity gap absent from the Known Correctness Gaps table
The Guarantees section (two paragraphs above this table) explicitly documents a second atomicity caveat: "a fork-if-missing load is two commits (the branch creation publishes before the load, so a failed load leaves an empty created branch)." This is a real, observable post-failure state — an empty branch dangling on
main— but it does not appear in this table and carries no tracking issue. Readers consulting the table for a complete picture of load-path correctness gaps will miss it. Consider adding a row for this, similar to the others.docs/rfcs/0001-bulk-data-plane.md, line 114 (link)References to internal-track RFCs without links or descriptions
This line cites "RFC-009 Phase 2,
4821e72" as the source of theomnigraph-api-typescrate extraction. RFC-009 (docs/dev/rfc-009-unify-access-paths.md) lives in the maintainer-internal track. Thedocs/rfcs/README.mdexplicitly warns: "Two tracks, don't conflate them." An external author reading this RFC has no way to locate RFC-009 or understand what "Phase 2" means. RFC-005 and RFC-008 are similarly referenced in the Status note at the top. At minimum, add a relative path or a brief inline description of what each internal RFC introduced.Context Used: CLAUDE.md (source)
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Reviews (1): Last reviewed commit: "docs(rfc): RFC 0001 bulk data plane, loa..." | Re-trigger Greptile
Context used: