
docs: propose portable corpus format ("git for knowledge bases")#1771

Open
JSv4 wants to merge 1 commit into main from claude/zen-goldberg-oU0Nj


@JSv4
Collaborator

@JSv4 JSv4 commented May 24, 2026

Summary

Adds docs/architecture/portable-corpus-format.md — a design proposal for replacing the V1/V2 export ZIP with a content-addressed corpus bundle modeled on the matter2 VCS, enabling local-first LLM workflows (clone, query offline, push back) while keeping the server on Postgres + S3 with a narrow, fast-forward-only push contract.

Status: proposal / draft for discussion. No code changes. Reactions, objections, and alternative framings explicitly welcome.

What it proposes

  • One canonical bundle format ("version": "3.0") that replaces V1 and V2 ZIPs and is byte-identical whether produced by oc export, materialized by oc clone, or shipped back by oc push.
  • Server stays on Postgres + S3 (no production data migration to ship this). It transcodes between native storage and the bundle at the boundary. A new current_HEAD_hash per corpus is the only server-side state addition.
  • Client is a real matter2-shaped repo on disk — local DAG, local commits, staging area, rebasable. Uses matter2 as a library for the VCS substrate.
  • Push protocol is fast-forward only. Server accepts iff parent_HEAD == current_HEAD_hash, else rejects with a 409. Client owns all merge sophistication, so the server contract is frozen while the client can grow incrementally (v1: surface conflict; v1.5: auto-merge additive; v2: full conflict UX).
  • JSON manifests + content-addressed blobs as the wire format; SQLite (with sqlite-vss) as a lazily-built client-side query cache. No SQLite ever in the wire.
  • pipeline.lock carries producer fingerprints (parsers, embedders, post-processors) for every derived artifact — generalizes today's Corpus.preferred_embedder pattern to the whole pipeline.
  • Local Python SDK exposes the same agent-tool API the MCP server does, against the local bundle. This is the headline win — "pip install, point at a clone, run offline RAG."
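
The fast-forward-only push rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the proposed server code: `PushRejected`, `accept_push`, and the in-memory `heads` dict are hypothetical stand-ins for the real endpoint and the `current_HEAD_hash` column.

```python
from typing import Dict


class PushRejected(Exception):
    """Hypothetical stand-in for the HTTP 409 the server would return."""

    status_code = 409


def accept_push(heads: Dict[str, str], corpus_id: str, parent_head: str, new_head: str) -> str:
    """Advance HEAD only if the client built on the current server HEAD."""
    current = heads[corpus_id]
    if parent_head != current:
        # The client is behind: it must fetch and rebase before retrying.
        raise PushRejected(f"parent_HEAD {parent_head} != current HEAD {current}")
    heads[corpus_id] = new_head
    return new_head


heads = {"corpus-1": "aaa"}
accept_push(heads, "corpus-1", parent_head="aaa", new_head="bbb")  # fast-forward: accepted

try:
    # A second client still based on "aaa" is rejected.
    accept_push(heads, "corpus-1", parent_head="aaa", new_head="ccc")
except PushRejected as exc:
    print(exc.status_code, exc)
```

Because the server only ever compares two hashes, all merge logic stays on the client, which is what lets the server contract stay frozen while the client evolves.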

Why this shape

After debating "full git-style branching for knowledge bases" vs. "just a better ZIP," the doc lands on a middle position: take matter2's storage substrate and DAG model, lift it one rung higher to track typed entities (annotations, notes, etc.) instead of just files, but ship without a branching/merging user surface. The substrate supports it; the UX waits for user pull. The clone/push loop alone is the v1 product.

Performance was the design's near-deal-breaker. The resolution is a three-way split: JSON manifests for wire (server emits via bulk json_agg queries, never touches SQLite), content-addressed blobs for O(Δ) clones and pushes after the first sync, SQLite as a client-side cache rebuilt lazily. See §11 of the doc.
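
To make the O(Δ) claim concrete, here is a small sketch of how content addressing turns a sync into a set difference. It assumes blobs are keyed by the SHA-256 of their bytes; `blob_id`, `build_store`, and `missing_blobs` are hypothetical helper names, and the actual bundle layout is defined in the doc, not here.

```python
import hashlib
from typing import Dict, Iterable, List


def blob_id(data: bytes) -> str:
    """Content address: the hex SHA-256 of the blob's bytes."""
    return hashlib.sha256(data).hexdigest()


def build_store(blobs: Iterable[bytes]) -> Dict[str, bytes]:
    """Index blobs by content address; identical content collapses to one entry."""
    return {blob_id(b): b for b in blobs}


def missing_blobs(wanted_ids: Iterable[str], local_store: Dict[str, bytes]) -> List[str]:
    """Return the ids that must be transferred: wanted but not already held locally."""
    return sorted(set(wanted_ids) - set(local_store))


server = build_store([b"doc-1 text", b"doc-2 text", b"doc-3 text"])
local = build_store([b"doc-1 text", b"doc-2 text"])  # state after an earlier clone

to_fetch = missing_blobs(server.keys(), local)
print(len(to_fetch))  # only the one new blob crosses the wire
```

After the first sync, transfer cost scales with the number of changed blobs rather than with corpus size.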

Phased shipping plan

Each phase is independently shippable:

  1. Bundle format + server transcoding (faster, smaller export/import)
  2. Clone endpoint + read-only SDK (the LLM-workflow headline)
  3. Local mutation (offline annotator workflows)
  4. Push + dumb client-side rebase (full single-user loop)
  5. Smart rebase for additive changes (multi-user practical)
  6. Full conflict-resolution UX (collaborative annotation)

Stopping after any of these leaves a real product on the table.

Decisions taken (now fixed assumptions)

See §16 of the doc for the full list with rationale. Highlights:

  • JSON wire, SQLite client cache — never the reverse
  • Server stays on Postgres + S3
  • Fast-forward-only push contract on the server
  • Mutable-locally set is annotations / notes / relationships / new labels / folders / documents (all additive or last-write-wins-friendly)
  • No git-style branching UX in v1
  • Slug-based user references (issue "PR 1600 Follow-up", #1608) are a prerequisite

Open questions

§15 lists seven open questions that should be debated before Phase 1 starts. The top three:

  1. Ship embeddings in the bundle by default or regenerate on materialize?
  2. Standalone oc CLI vs. extending matter2's CLI?
  3. Clone-is-fork or clone-is-lease for private corpora?

Test plan

  • Read the doc end-to-end and react
  • Debate the open questions in §15
  • Confirm or push back on the fixed decisions in §16
  • If aligned: scope Phase 1 into a concrete implementation plan, with the bulk json_agg query work as the natural first step (it's also a worthwhile V2 optimization regardless of the rest)

Generated by Claude Code

Adds docs/architecture/portable-corpus-format.md — a design proposal for
replacing the V1/V2 export ZIP with a content-addressed corpus bundle
modeled on matter2, enabling local-first LLM workflows (clone, query
offline, push back) while keeping the server on Postgres + S3 with a
narrow fast-forward-only push contract. Draft for discussion.
@claude

claude Bot commented May 24, 2026

Code Review — PR #1771: Portable Corpus Format Proposal

This is a documentation-only PR adding docs/architecture/portable-corpus-format.md, a design proposal for a content-addressed V3 bundle format and local-first SDK. No code is changed.


Overview

The proposal is well-structured and technically ambitious. It addresses a real gap — the absence of a local-checkout/offline-workflow story — and proposes a coherent path from the current V1/V2 ZIP formats to a git-flavored, content-addressed bundle with a local Python SDK. The phased shipping plan is a strength: each phase is independently shippable and the stopping criterion at each boundary is clear.


Strengths

  • Structured decision log (§16). Rationale is recorded alongside the decision, not just the outcome. This is important for a proposal that will take multiple quarters to implement.
  • Honest non-goals (§4) and open questions (§15). Deferring branching/merging UX and federation explicitly saves this proposal from overreach.
  • Appendix B (what this doesn't solve) is excellent. Calling out annotation re-anchoring and schema evolution as unsolved hard problems rather than sweeping them under the rug earns credibility.
  • Incremental fetch and push as the key performance win (§11.4 lever 2). The O(Δ) story is the right framing for why this isn't just a "fancier ZIP."
  • Companion doc links resolve. All four referenced docs (corpus_export_import_v2.md, corpus-export-format-spec.md, corpus_forking.md, embeddings_creation_and_retrieval.md) and opencontractserver/constants/zip_import.py exist in the current tree.

Issues and Concerns

1. matter2 is undefined — this is the highest-priority gap

matter2 is referenced 16 times but is never defined, linked, or evaluated. It's described as a VCS library providing "the object store, commit lifecycle, and DAG semantics" that OpenContracts would consume as a library. Before this proposal can be treated as anything more than a sketch, the following must be addressed:

  • What is it? A link or brief characterization (author, repo, maturity level) is required.
  • License compatibility. OpenContracts is MIT. If matter2 is GPL or proprietary, the dependency is a blocker.
  • Maintenance status. Depending on an abandoned or low-activity library for the VCS substrate of a production feature is an architectural risk.
  • Fallback. If matter2 isn't production-ready, does the proposal still hold? The object store and DAG semantics are not complex and could be implemented directly, but the doc treats matter2 as an unquestioned given.

2. Server atomicity on push is under-specified

§7.1 says the server "atomically advances HEAD" after ingesting new objects. In a Postgres + Django context this requires explicit transactional locking, most likely a SELECT ... FOR UPDATE on the corpus row (select_for_update() in the Django ORM) to serialize concurrent pushes to the same corpus. Without this, two simultaneous pushes with the same parent_HEAD could both pass the check and produce a split-brain HEAD. The proposal should either:

  • Specify the locking strategy (row-level lock on the corpus row), or
  • Explicitly defer it to Phase 4 with a note that it's a known gap.
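
One way to close this race without holding a lock across the whole request is a single conditional UPDATE (compare-and-swap), shown below as an alternative to the row lock suggested above. This sketch uses the standard-library `sqlite3` module in place of Postgres purely so it is self-contained; the `corpus` table, the `current_head_hash` column name, and `advance_head` are assumptions for illustration.

```python
import sqlite3


def advance_head(conn: sqlite3.Connection, corpus_id: int, parent_head: str, new_head: str) -> bool:
    """Move HEAD to new_head only if it still equals parent_head.

    The check and the write happen in one statement, so two pushes with the
    same parent cannot both succeed.
    """
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            "UPDATE corpus SET current_head_hash = ? "
            "WHERE id = ? AND current_head_hash = ?",
            (new_head, corpus_id, parent_head),
        )
    return cur.rowcount == 1  # zero rows updated means HEAD had already moved


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE corpus (id INTEGER PRIMARY KEY, current_head_hash TEXT NOT NULL)")
conn.execute("INSERT INTO corpus (id, current_head_hash) VALUES (1, 'aaa')")
conn.commit()

first = advance_head(conn, 1, parent_head="aaa", new_head="bbb")   # wins the race
second = advance_head(conn, 1, parent_head="aaa", new_head="ccc")  # stale parent, would map to a 409
print(first, second)
```

Whichever strategy the proposal picks, the point is the same: the comparison against `parent_HEAD` and the HEAD update must not be separable by another writer.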

3. pipeline.lock format inconsistency

The bundle is JSON throughout, but pipeline.lock is TOML. This creates a parsing dependency (Python's tomllib / tomli) that nothing else in the bundle requires. No rationale is given for the divergence. Consider JSON here for consistency, or add a rationale (e.g., human-editability, Cargo-lock familiarity).

4. sqlite-vss vs sqlite-vec

§11.2 specifies sqlite-vss for local vector search. sqlite-vss has been largely unmaintained since mid-2024 and its successor is sqlite-vec (from the same author). Building the local SDK's vector search on a deprecated library is a risk: either flag it in the open questions section or update the recommendation to sqlite-vec.

5. refs/server update semantics not specified

§7.2 says refs/server holds the "server-side HEAD at clone time." §8.2 says it's updated after a successful push (refs/server = C_new). But the document doesn't say whether oc fetch (§13) also updates refs/server. If it does, the next push's parent_HEAD is the post-fetch value, which is the desired behavior. If it doesn't, a user who fetches without pushing is left with a stale refs/server, and their next push will be rejected whenever the server HEAD has moved since the clone. This is worth one sentence of clarification.
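
The two possible behaviors can be compared with a toy model. Everything here is hypothetical: `LocalRepo`, `fetch`, and `push_passes_check` only model the ref bookkeeping, assuming the client sends refs/server as `parent_HEAD`; they are not the `oc` CLI or any matter2 API.

```python
from dataclasses import dataclass


@dataclass
class LocalRepo:
    server_ref: str  # models refs/server, sent as parent_HEAD on push


def fetch(repo: LocalRepo, server_head: str, update_ref: bool) -> None:
    """Model `oc fetch`; update_ref selects which of the two semantics applies."""
    if update_ref:
        repo.server_ref = server_head


def push_passes_check(repo: LocalRepo, server_head: str) -> bool:
    """The server's fast-forward check: parent_HEAD must equal the current HEAD."""
    return repo.server_ref == server_head


server_head = "C2"  # the server advanced from C1 to C2 after the clone

stale = LocalRepo(server_ref="C1")
fetch(stale, server_head, update_ref=False)
print(push_passes_check(stale, server_head))  # rejected: refs/server still points at C1

fresh = LocalRepo(server_ref="C1")
fetch(fresh, server_head, update_ref=True)
print(push_passes_check(fresh, server_head))  # passes the check once local commits are rebased
```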

6. Push authorization is deferred but the default is implicit

Open question #3 asks "what does write access mean for push?" but the endpoints in §7.1 don't mention authentication at all. At minimum the document should note that all three endpoints require authentication via the existing Django auth mechanism, and that push requires at least update_corpus (the permission level already tentatively suggested in §15). Without that statement, a reader implementing Phase 4 might ship an unauthenticated push endpoint.

7. Embedding format for blobs is unspecified

§10 says embeddings are "content-addressed by sha256(producer_id || input_text_hash)" and stored as blobs. But what is the blob format? Raw float32 bytes? NumPy .npy? JSON array? This matters for portability (a TypeScript SDK can't use NumPy format) and for the sqlite-vss/sqlite-vec ingestion path. This should be fixed before Phase 1 — it's hard to change once published bundles exist in the wild.
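
For discussion, one candidate is raw little-endian float32 with the dimension implied by blob length, which any language can read without a NumPy dependency. The sketch below is an assumption, not what the doc specifies: in particular, how `producer_id || input_text_hash` is byte-encoded (UTF-8 producer id followed by the raw digest) is a guess, and `embedding_key`, `encode_vector`, and `decode_vector` are hypothetical names.

```python
import hashlib
import struct
from typing import List, Sequence


def embedding_key(producer_id: str, input_text: str) -> str:
    """sha256(producer_id || input_text_hash), with an assumed byte encoding."""
    text_hash = hashlib.sha256(input_text.encode("utf-8")).digest()
    return hashlib.sha256(producer_id.encode("utf-8") + text_hash).hexdigest()


def encode_vector(values: Sequence[float]) -> bytes:
    """Pack floats as consecutive little-endian float32 values (4 bytes each)."""
    return struct.pack(f"<{len(values)}f", *values)


def decode_vector(blob: bytes) -> List[float]:
    """Inverse of encode_vector; the dimension is len(blob) // 4."""
    if len(blob) % 4 != 0:
        raise ValueError("blob length is not a multiple of 4 bytes")
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


vector = [0.5, -1.25, 2.0]  # values exactly representable in float32
blob = encode_vector(vector)
key = embedding_key("example-embedder@1", "some annotated text")
print(key, len(blob))
```

A fixed-endianness raw layout keeps the blob byte-identical across producers, which content addressing requires; a self-describing header (dtype, dimension) would be the main alternative worth weighing.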

8. CHANGELOG not updated

Per project conventions, CHANGELOG.md should be updated for significant changes. While this is a proposal doc (not an implementation), it represents a meaningful architectural commitment and a new capability direction. At minimum a one-liner under [Unreleased] / Added would satisfy the convention — e.g., "Added portable corpus format proposal (V3 bundle design)."

9. PR body attribution

The PR description ends with _Generated by [Claude Code]_. CLAUDE.md Baseline Commit Rule #3 states: "Never credit Claude or Claude Code in commit messages, PR messages, comments, or any other artifacts." This should be removed from the PR description.


Minor / Nits

  • §8.3 step 4 says "rebases its local commits (C_old → C_new) onto C_old′" — the notation is slightly ambiguous. C_old → C_new looks like a range, but it could be read as "the commit that moves the tree from C_old to C_new." A sentence clarifying "rebase the diff represented by the local commits" would help.
  • Appendix A table, "Identity preservation" row: "UUID identity" implies that OpenContracts entity PKs are UUIDs. Most models do use UUIDField, but a note confirming this assumption or listing any exceptions (e.g., LabelSet) would prevent Phase 1 surprises.
  • §12 V2 migration: "Continue to accept on import indefinitely" for V2 is a long commitment to maintain an importer shim. Suggest adding an explicit review trigger (e.g., "revisit at V3 GA + 18 months").

Summary

The proposal is coherent, the phased plan is practical, and the non-goals and open questions sections are well-handled. The three things that need to be resolved before this can drive Phase 1 scoping are:

  1. Define/link/evaluate matter2 — this is load-bearing and currently a black box.
  2. Specify the embedding blob format: it is hard to change once bundles are published.
  3. Specify server push atomicity — a correctness gap, not just a detail.

Everything else in the issues list is clarification or improvement rather than a blocker.
