docs: propose portable corpus format ("git for knowledge bases") #1771
JSv4 wants to merge 1 commit into
Code Review: PR #1771, Portable Corpus Format Proposal
This is a documentation-only PR.
Overview
The proposal is well-structured and technically ambitious. It addresses a real gap, the absence of a local-checkout/offline-workflow story, and proposes a coherent path from the current V1/V2 ZIP formats to a git-flavored, content-addressed bundle with a local Python SDK. The phased shipping plan is a strength: each phase is independently shippable and the stopping criterion at each boundary is clear.
Strengths
Issues and Concerns
1.
Summary
Adds `docs/architecture/portable-corpus-format.md`, a design proposal for replacing the V1/V2 export ZIP with a content-addressed corpus bundle modeled on the matter2 VCS, enabling local-first LLM workflows (clone, query offline, push back) while keeping the server on Postgres + S3 with a narrow, fast-forward-only push contract.
Status: proposal / draft for discussion. No code changes. Reactions, objections, and alternative framings explicitly welcome.
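To make "content-addressed" concrete, here is a minimal sketch of why repeat clones and pushes can stay small: if each blob is named by a hash of its bytes, a client only has to transfer the ids it does not already hold. The use of SHA-256 and the helper names `blob_id` and `missing_blobs` are assumptions for illustration, not details taken from the proposal.

```python
from __future__ import annotations

import hashlib


def blob_id(data: bytes) -> str:
    """Content address: hex digest of the blob's bytes (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def missing_blobs(wanted_ids: set[str], held_ids: set[str]) -> set[str]:
    """Ids that still need to be transferred; everything else is already present."""
    return wanted_ids - held_ids


# Hypothetical corpus contents before and after one note is edited.
v1 = {blob_id(b"contract.pdf bytes"), blob_id(b"notes v1")}
v2 = {blob_id(b"contract.pdf bytes"), blob_id(b"notes v2")}

# Only the changed note needs to move; the unchanged PDF is skipped.
to_fetch = missing_blobs(v2, v1)
```

Identical bytes always map to the same id, so unchanged blobs drop out of the transfer set without any per-file comparison.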
What it proposes
- One bundle format (`"version": "3.0"`) that replaces V1 and V2 ZIPs and is byte-identical whether produced by `oc export`, materialized by `oc clone`, or shipped back by `oc push`.
- `current_HEAD_hash` per corpus is the only server-side state addition.
- The server accepts a push only if `parent_HEAD == current_HEAD_hash`, else rejects with a 409. Client owns all merge sophistication, so the server contract is frozen while the client can grow incrementally (v1: surface conflict; v1.5: auto-merge additive; v2: full conflict UX).
- `pipeline.lock` carries producer fingerprints (parsers, embedders, post-processors) for every derived artifact, generalizing today's `Corpus.preferred_embedder` pattern to the whole pipeline.
Why this shape
After debating "full git-style branching for knowledge bases" vs. "just a better ZIP," the doc lands on a middle position: take matter2's storage substrate and DAG model, lift it one rung higher to track typed entities (annotations, notes, etc.) instead of just files, but ship without a branching/merging user surface. The substrate supports it; the UX waits for user pull. The clone/push loop alone is the v1 product.
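The fast-forward-only push rule listed under "What it proposes" can be sketched as a single comparison. This is a hedged illustration, not the proposed implementation: `CorpusState`, `handle_push`, the placeholder hashes, and the 200 success code are all hypothetical; only the `parent_HEAD == current_HEAD_hash` check and the 409 rejection come from the proposal.

```python
from dataclasses import dataclass


@dataclass
class CorpusState:
    # Stands in for the proposal's per-corpus current_HEAD_hash.
    current_head_hash: str


def handle_push(state: CorpusState, parent_head: str, new_head: str) -> int:
    """Return an HTTP-style status code for a push attempt."""
    if parent_head != state.current_head_hash:
        # The client built on a stale head; it must fetch and merge locally.
        return 409
    # Fast-forward: the pushed head becomes the corpus head.
    state.current_head_hash = new_head
    return 200  # success code is an assumption


state = CorpusState(current_head_hash="aaa")
first = handle_push(state, parent_head="aaa", new_head="bbb")  # builds on the current head
stale = handle_push(state, parent_head="aaa", new_head="ccc")  # built on the old head
```

In this sketch the first push advances the head and the second is rejected, which is where the client-side merge handling would take over. A real server would presumably need to make the compare-and-update step atomic.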
Performance was the design's near-deal-breaker. The resolution is a three-way split: JSON manifests for the wire (the server emits them via bulk `json_agg` queries and never touches SQLite), content-addressed blobs for O(Δ) clones and pushes after the first sync, and SQLite as a client-side cache rebuilt lazily. See §11 of the doc.
Phased shipping plan
Each phase is independently shippable:
Stopping after any of these leaves a real product on the table.
Decisions taken (now fixed assumptions)
See §16 of the doc for the full list with rationale. Highlights:
Open questions
§15 lists seven open questions that should be debated before Phase 1 starts. The top three:
- `oc` CLI vs. extending matter2's CLI?
Test plan
- `json_agg` query work as the natural first step (it's also a worthwhile V2 optimization regardless of the rest)
Generated by Claude Code