feat: stabilize cache codec with a versioned envelope #7163

Open
wjones127 wants to merge 8 commits into lance-format:main from wjones127:feat/stable-cache-codec-envelope

Conversation

@wjones127 wjones127 commented Jun 8, 2026

Contributor

Implements #7160. Cache entries (FTS posting lists, scalar/vector index state) were serialized with an ad-hoc, unversioned format only safe to read in the same process that wrote it. This stabilizes the format so entries can live in a node-agnostic, restart-surviving cache backend.

Wire format

Each entry is an envelope followed by a body:

[magic "LCE1"][envelope_version: u8][type_id][type_version: u32]   # envelope
<body: optional protobuf header, then sections in a fixed, version-keyed order>
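
As a rough illustration of the envelope layout above, here is a minimal sketch of writing it by hand. The PR does not show how `type_id` is encoded or which byte order is used, so this assumes a u16-length-prefixed UTF-8 string and little-endian integers. `write_envelope`, `starts_with_magic`, and the `ENVELOPE_VERSION` value are hypothetical names and values for illustration, not the API added by this PR.

```rust
/// Magic bytes, as given in the PR description.
const MAGIC: &[u8; 4] = b"LCE1";
/// Assumed value; the PR only says this field is a u8.
const ENVELOPE_VERSION: u8 = 1;

/// Append an envelope to `out`.
///
/// Assumptions (not stated in the PR): `type_id` is written as a
/// u16 length prefix followed by UTF-8 bytes, and every integer is
/// little-endian.
fn write_envelope(out: &mut Vec<u8>, type_id: &str, type_version: u32) {
    assert!(type_id.len() <= u16::MAX as usize, "type_id too long");
    let id_len = type_id.len() as u16;

    out.extend_from_slice(MAGIC);
    out.push(ENVELOPE_VERSION);
    out.extend_from_slice(&id_len.to_le_bytes());
    out.extend_from_slice(type_id.as_bytes());
    out.extend_from_slice(&type_version.to_le_bytes());
}

/// Cheap check a backend could run before attempting a full decode.
fn starts_with_magic(bytes: &[u8]) -> bool {
    bytes.starts_with(MAGIC)
}
```

Because the envelope is fixed-shape and hand-framed, a reader can reject foreign bytes after looking at only the first four bytes, without invoking a protobuf parser.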

Body sections, each self-delimiting:

HEADER    : [len: u32][protobuf bytes]
ARROW_IPC : [pad to 64B][self-delimiting IPC stream]
RAW_BLOB  : [len: u64][bytes]
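
The section framing above could be sketched roughly as follows. This is an illustration under assumptions, not the `CacheEntryWriter`/`CacheEntryReader` implementation: the function names are hypothetical, integers are assumed little-endian, and alignment is assumed to be measured from the start of the output buffer. The Arrow IPC stream itself is left out; its bytes would be appended right after the padding.

```rust
const ALIGN: usize = 64;

/// HEADER: [len: u32][bytes]. In the PR the bytes are an encoded
/// protobuf message; here they are opaque.
fn write_header(out: &mut Vec<u8>, header: &[u8]) {
    assert!(header.len() <= u32::MAX as usize, "header too large");
    out.extend_from_slice(&(header.len() as u32).to_le_bytes());
    out.extend_from_slice(header);
}

/// Zero-pad `out` so the next section starts at a multiple of 64 bytes.
/// Assumption: offsets are measured from the start of `out`.
fn pad_to_alignment(out: &mut Vec<u8>) {
    let padded = (out.len() + ALIGN - 1) / ALIGN * ALIGN;
    out.resize(padded, 0);
}

/// RAW_BLOB: [len: u64][bytes]
fn write_raw_blob(out: &mut Vec<u8>, blob: &[u8]) {
    out.extend_from_slice(&(blob.len() as u64).to_le_bytes());
    out.extend_from_slice(blob);
}

/// Read a RAW_BLOB starting at `*pos` and advance the cursor past it.
/// Returns `None` if the buffer is truncated, leaving `*pos` unchanged.
fn read_raw_blob<'a>(buf: &'a [u8], pos: &mut usize) -> Option<&'a [u8]> {
    let len_end = pos.checked_add(8)?;
    let mut len_bytes = [0u8; 8];
    len_bytes.copy_from_slice(buf.get(*pos..len_end)?);
    let len = u64::from_le_bytes(len_bytes);
    if len > usize::MAX as u64 {
        return None; // cannot be addressed on this platform
    }
    let end = len_end.checked_add(len as usize)?;
    let blob = buf.get(len_end..end)?;
    *pos = end;
    Some(blob)
}
```

Each section carries its own length (or, for IPC, its own end-of-stream marker), so a reader can walk the body in order without a separate index.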

Why this shape

  • The envelope is hand-framed, not protobuf. It's the most stability-critical part: it must parse robustly against any bytes (including old, pre-stabilization blobs) and never change shape. The magic is chosen so no prior blob can collide with it.
  • Decode returns Hit/Miss, never a hard error. Wrong/absent magic, an unknown envelope version, a type_id mismatch, a future type_version, or a body decode failure all become Miss → recompute. Old, foreign, or corrupt bytes self-heal with zero migration code.
  • Bodies use protobuf headers. Field-number evolution lets us add fields without a format break; only changes protobuf can't express transparently (reordering sections, changing a raw-blob encoding) bump type_version, which the reader branches on.
  • Arrow IPC sections are 64-byte aligned so concatenated sections decode zero-copy instead of incurring a realigning memcpy on every read. This guards the FTS WAND hot path.
  • RAW_BLOB is reserved for payloads with their own portable, self-describing encoding (roaring bitmaps, the shared position stream).
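
The "Hit/Miss, never a hard error" rule from the list above might look roughly like this. It is a sketch, not the PR's `CacheCodec::deserialize`: the `Decode` and `MissReason` types and the `decode` function are hypothetical stand-ins, and the envelope encoding (u16-length-prefixed `type_id`, little-endian integers, envelope version 1) is an assumption made for the example.

```rust
#[derive(Debug, PartialEq)]
enum MissReason {
    InvalidEnvelope,
    TypeMismatch,
    VersionTooNew,
    BodyError,
}

#[derive(Debug, PartialEq)]
enum Decode<T> {
    Hit(T),
    Miss(MissReason),
}

const MAGIC: &[u8] = b"LCE1";
const ENVELOPE_VERSION: u8 = 1; // assumed value

/// Split an entry into (type_id, type_version, body).
/// Returns `None` for anything that is not a well-formed envelope.
fn parse_envelope<'a>(bytes: &'a [u8]) -> Option<(&'a str, u32, &'a [u8])> {
    let rest = bytes.strip_prefix(MAGIC)?;
    let (&version, rest) = rest.split_first()?;
    if version != ENVELOPE_VERSION || rest.len() < 2 {
        return None;
    }
    // Assumed encoding: u16 length prefix, then UTF-8 type_id.
    let id_len = u16::from_le_bytes([rest[0], rest[1]]) as usize;
    let rest = &rest[2..];
    if rest.len() < id_len + 4 {
        return None;
    }
    let type_id = std::str::from_utf8(&rest[..id_len]).ok()?;
    let v = &rest[id_len..id_len + 4];
    let type_version = u32::from_le_bytes([v[0], v[1], v[2], v[3]]);
    Some((type_id, type_version, &rest[id_len + 4..]))
}

/// Decode an entry. Every failure becomes a `Miss`, so the caller
/// recomputes the value instead of handling an error.
fn decode<T>(
    bytes: &[u8],
    expected_type: &str,
    current_version: u32,
    decode_body: impl Fn(u32, &[u8]) -> Option<T>,
) -> Decode<T> {
    let (type_id, type_version, body) = match parse_envelope(bytes) {
        Some(parts) => parts,
        None => return Decode::Miss(MissReason::InvalidEnvelope),
    };
    if type_id != expected_type {
        return Decode::Miss(MissReason::TypeMismatch);
    }
    if type_version > current_version {
        return Decode::Miss(MissReason::VersionTooNew);
    }
    // The body decoder receives the stored version so it can branch
    // on older layouts.
    match decode_body(type_version, body) {
        Some(value) => Decode::Hit(value),
        None => Decode::Miss(MissReason::BodyError),
    }
}
```

Pre-stabilization blobs carry no magic, so they fall out at the first check and are simply recomputed; no migration path is needed.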

A codec with no scalar metadata (e.g. bitmap) simply omits the header — sections are positional, so nothing is written for an absent header.

Scope

All cache codecs migrated: FTS posting lists (compressed/plain/positions + groups), scalar indices (BTree/Bitmap/Flat/LabelList/RowAddrTreeMap), and the five IVF quantizer partitions + IVF state. The cache protos live in lance-index/protos-cache/cache.proto (package lance.index.cache) — they describe library serialization, not the on-disk format spec.

Tests

Envelope round-trip and every miss path; per-codec round-trip + through-envelope zero-copy alignment (incl. RabitQ Matrix rotation, multi-batch SQ, nested bitmap in a label-list entry); additive proto-field compat; existing IVF build+search suites pass through the migrated path.

Closes #7160.

🤖 Generated with Claude Code

@github-actions github-actions Bot added the labels A-index (Vector index, linalg, tokenizer), A-format (On-disk format: protos and format spec docs), and enhancement (New feature or request) Jun 8, 2026
Comment thread protos/cache_fts.proto Outdated
@wjones127 wjones127 force-pushed the feat/stable-cache-codec-envelope branch 2 times, most recently from 5608c32 to b0d2622 Compare June 10, 2026 16:53
wjones127 and others added 6 commits June 11, 2026 08:09
Wraps every serialized cache entry in a hand-framed envelope (magic
`LCE1`, envelope version, author-assigned `type_id`, per-type
`type_version`) so a persistent backend can validate an entry before
trusting it. `CacheCodec::deserialize` now returns `CacheDecode::{Hit,
Miss}`: absent/wrong magic, unknown envelope version, `type_id`
mismatch, an unsupported (future) `type_version`, or a body decode error
all become a `Miss`, so a backend recomputes rather than trusting bad
bytes. Data written by the pre-stabilization format has no magic and
uniformly self-heals to a miss.

`CacheCodecImpl` gains `TYPE_ID`/`CURRENT_VERSION` and writes its body
through `CacheEntryWriter`/`CacheEntryReader` (the reader exposes
`version()` for backward-compat branching). IPC sections are padded to
64 bytes via new `lance_arrow::ipc` helpers so concatenated sections
decode zero-copy instead of triggering a realigning memcpy on every read.

Converts the `CompressedPostingList` codec to the stabilized form: a
protobuf header (`protos/cache_fts.proto`, with the tail/position codecs
and position-storage kind as enums), a 64-byte-aligned Arrow IPC section
for `blocks`, and a raw blob for the shared position stream. The
remaining codecs are migrated to the new trait signature but keep their
existing body framing pending their own conversion.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Completes the stabilization started in the envelope foundation: every
remaining CacheCodec now uses protobuf headers (discriminants as proto
enums) and 64-byte-aligned Arrow IPC sections / raw blobs, so entries are
portable and decode zero-copy.

- lance-arrow: add aligned multi-batch IPC section helpers
  (write_ipc_section_batches / read_ipc_section_batches_at) and matching
  CacheEntryWriter::write_ipc_batches / CacheEntryReader::read_ipc_batches.
- FTS (cache_fts.proto): plain posting list and standalone Positions move
  from JSON+u8-tag framing to proto headers + aligned IPC; shared the
  position-section read/write helpers with the compressed path.
- Scalar (cache_scalar.proto): BTreeIndexState header moves from ad-hoc
  binary to proto; BitmapIndexState, FlatIndex, and (nested) LabelListIndex
  switch to write_raw + aligned write_ipc.
- Vector (cache_vector.proto): the five IVF PartitionEntry quantizer headers
  (PQ/Flat/FlatBin/SQ/RabitQ) and IvfStateEntryBox move from JSON to proto,
  distance/rotation types become proto enums, and storage batches use
  aligned multi-batch IPC.

Adds roundtrip + through-envelope zero-copy tests for each migrated codec
(including RabitQ Matrix rotation and multi-batch SQ storage).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Self-review cleanup: remove the unused `CacheCodec::type_id()` getter and
`CacheDecode::is_miss()` (test call sites use `hit().is_none()` instead),
use the existing `IPC_CONTINUATION` constant in `parse_ipc_message_prefix`,
and align the label_list wire-format doc with the section vocabulary used by
the other migrated codecs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The `protos/` folder is the on-disk Lance format spec. The cache-entry
headers are library-internal serialization, not spec, so they don't belong
there. Consolidate the three files (cache_fts/cache_scalar/cache_vector)
into a single `lance-index/protos-cache/cache.proto` under one package
`lance.index.cache`, generating a single `lance_index::cache_pb` module.

The `lance` crate no longer compiles its own vector cache proto; its IVF
codec imports the headers from `lance_index::cache_pb`. Message field
numbers are unchanged, so the on-wire proto bytes are identical.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Addresses review feedback on the cache codec stabilization:

- Export `MAGIC` and a `has_cache_envelope()` helper so backends can
  recognize Lance cache entries without hardcoding the magic bytes.
- Carry a `CacheMissReason` on `CacheDecode::Miss` (InvalidEnvelope,
  TypeMismatch, VersionTooNew, BodyError) so backends can emit targeted
  metrics. Drop the vestigial outer `Result` from `deserialize` — reading
  from an in-memory `Bytes` cannot do I/O, so a miss is the only non-`Hit`
  outcome.
- Reword the `raw_writer()`/`body()` docs to describe their legitimate use
  (a self-framed whole-body payload such as a roaring bitmap) rather than
  framing them as transitional.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Rebasing onto main pulled in lance-format#7078 (raw-query IVF RabitQ search), which
added a `query_estimator` field to `RabitQuantizationMetadata` and persisted
it in the RabitQ partition cache. Our migration replaced that serde header
with a proto header that predated the field, which would have silently
dropped it — a cached partition would always reload as `ResidualQuery` and
break raw-query search.

Add `query_estimator` to the proto `RabitPartitionHeader` (field 6) as a new
`RabitQueryEstimator` enum, map it on the serialize/deserialize paths, and
add a round-trip test exercising the non-default `RawQuery` value.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@wjones127 wjones127 force-pushed the feat/stable-cache-codec-envelope branch from b0d2622 to cc615fb Compare June 11, 2026 15:26
@wjones127 wjones127 removed the A-format (On-disk format: protos and format spec docs) label Jun 11, 2026
wjones127 and others added 2 commits June 11, 2026 10:27
…t variant miss

Self-review pass on the cache codec stabilization:

- Fix a stale `.unwrap()` in the lance-table codec test left over from the
  `deserialize -> CacheDecode` signature change (was an E0599 build error).
- Extract the duplicated IPC section alignment-padding prologue into a
  `write_section_padding` helper shared by both write_ipc_section variants.
- Add `unknown_posting_variant_is_miss`: a valid envelope whose body leads with
  an out-of-range variant tag must self-heal to a `BodyError` miss, not panic.
- Move a misplaced Matrix-rotation doc comment onto the matrix roundtrip test.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

`CacheCodecImpl` is not imported in entry_io.rs, so the unqualified
`[CacheCodecImpl::serialize]` link failed to resolve under rustdoc's
`-D warnings`. Qualify it with the `super::` path the sibling links use.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@wjones127 wjones127 marked this pull request as ready for review June 11, 2026 18:29
@wjones127 wjones127 requested a review from westonpace June 11, 2026 18:29
Labels

A-index (Vector index, linalg, tokenizer), enhancement (New feature or request)

Development

Successfully merging this pull request may close these issues.

Make CacheCodec serialization format stable

1 participant