feat: stabilize cache codec with a versioned envelope #7163
Open
wjones127 wants to merge 8 commits into
wjones127 commented Jun 8, 2026
5608c32 to b0d2622
Wraps every serialized cache entry in a hand-framed envelope (magic
`LCE1`, envelope version, author-assigned `type_id`, per-type
`type_version`) so a persistent backend can validate an entry before
trusting it. `CacheCodec::deserialize` now returns `CacheDecode::{Hit,
Miss}`: absent/wrong magic, unknown envelope version, `type_id`
mismatch, an unsupported (future) `type_version`, or a body decode error
all become a `Miss`, so a backend recomputes rather than trusting bad
bytes. Data written by the pre-stabilization format has no magic and
uniformly self-heals to a miss.
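A minimal sketch of the validation order described above, not the PR's actual code. Only the `LCE1` magic and the Hit/Miss outcome come from the commit message; the field widths, little-endian byte order, the `ENVELOPE_VERSION` value, and the names `Decode`, `write_envelope`, and `read_envelope` are assumptions made for illustration.

```rust
/// Magic bytes named in the commit message.
const MAGIC: &[u8] = b"LCE1";
/// Assumed for illustration: a one-byte envelope version with value 1.
const ENVELOPE_VERSION: u8 = 1;
/// Assumed layout: magic (4) + envelope version (1) + type_id (4) + type_version (4).
const HEADER_LEN: usize = 4 + 1 + 4 + 4;

/// Hypothetical stand-in for the Hit/Miss result.
#[derive(Debug, PartialEq)]
enum Decode<'a> {
    Hit { type_version: u32, body: &'a [u8] },
    Miss,
}

/// Read a little-endian u32 from the first four bytes of `b`.
fn le_u32(b: &[u8]) -> u32 {
    u32::from_le_bytes([b[0], b[1], b[2], b[3]])
}

/// Frame `body` with the assumed envelope layout.
fn write_envelope(type_id: u32, type_version: u32, body: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(HEADER_LEN + body.len());
    out.extend_from_slice(MAGIC);
    out.push(ENVELOPE_VERSION);
    out.extend_from_slice(&type_id.to_le_bytes());
    out.extend_from_slice(&type_version.to_le_bytes());
    out.extend_from_slice(body);
    out
}

/// Validate the envelope; every failed check is a miss, never an error.
fn read_envelope<'a>(bytes: &'a [u8], expected_type_id: u32, max_type_version: u32) -> Decode<'a> {
    // Too short or wrong magic: covers pre-stabilization and foreign bytes.
    if bytes.len() < HEADER_LEN || &bytes[..4] != MAGIC {
        return Decode::Miss;
    }
    // Unknown envelope version.
    if bytes[4] != ENVELOPE_VERSION {
        return Decode::Miss;
    }
    let type_id = le_u32(&bytes[5..9]);
    let type_version = le_u32(&bytes[9..13]);
    // Wrong type, or a type_version newer than this reader supports.
    if type_id != expected_type_id || type_version > max_type_version {
        return Decode::Miss;
    }
    Decode::Hit { type_version, body: &bytes[HEADER_LEN..] }
}
```

A caller would recompute on `Decode::Miss` and overwrite the entry, which is how old or corrupt bytes heal without migration code.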
`CacheCodecImpl` gains `TYPE_ID`/`CURRENT_VERSION` and writes its body
through `CacheEntryWriter`/`CacheEntryReader` (the reader exposes
`version()` for backward-compat branching). IPC sections are padded to
64 bytes via new `lance_arrow::ipc` helpers so concatenated sections
decode zero-copy instead of triggering a realigning memcpy on every read.
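The padding arithmetic behind the 64-byte alignment can be sketched as follows. This is an illustration under assumptions, not the `lance_arrow::ipc` helpers themselves: `padding_for` and `write_aligned_section` are hypothetical names, and the sketch aligns offsets relative to the start of the buffer, so the reader still needs the buffer itself to be suitably aligned in memory.

```rust
/// Alignment stated in the commit message.
const ALIGN: usize = 64;

/// Number of zero bytes needed to advance `offset` to the next 64-byte boundary.
/// Returns 0 when `offset` is already aligned.
fn padding_for(offset: usize) -> usize {
    (ALIGN - offset % ALIGN) % ALIGN
}

/// Append `section` to `buf`, zero-padding first so the section starts on a
/// 64-byte boundary. Returns the offset at which the section begins.
/// (Hypothetical helper; the real writer also has to record section lengths.)
fn write_aligned_section(buf: &mut Vec<u8>, section: &[u8]) -> usize {
    let pad = padding_for(buf.len());
    buf.resize(buf.len() + pad, 0);
    let start = buf.len();
    buf.extend_from_slice(section);
    start
}
```

Because every section starts at a multiple of 64, a reader can hand a slice of the entry straight to the Arrow IPC decoder without first copying it into an aligned buffer.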
Converts the `CompressedPostingList` codec to the stabilized form: a
protobuf header (`protos/cache_fts.proto`, with the tail/position codecs
and position-storage kind as enums), a 64-byte-aligned Arrow IPC section
for `blocks`, and a raw blob for the shared position stream. The
remaining codecs are migrated to the new trait signature but keep their
existing body framing pending their own conversion.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Completes the stabilization started in the envelope foundation: every remaining CacheCodec now uses protobuf headers (discriminants as proto enums) and 64-byte-aligned Arrow IPC sections / raw blobs, so entries are portable and decode zero-copy.
- lance-arrow: add aligned multi-batch IPC section helpers (write_ipc_section_batches / read_ipc_section_batches_at) and matching CacheEntryWriter::write_ipc_batches / CacheEntryReader::read_ipc_batches.
- FTS (cache_fts.proto): plain posting list and standalone Positions move from JSON+u8-tag framing to proto headers + aligned IPC; shared the position-section read/write helpers with the compressed path.
- Scalar (cache_scalar.proto): BTreeIndexState header moves from ad-hoc binary to proto; BitmapIndexState, FlatIndex, and (nested) LabelListIndex switch to write_raw + aligned write_ipc.
- Vector (cache_vector.proto): the five IVF PartitionEntry quantizer headers (PQ/Flat/FlatBin/SQ/RabitQ) and IvfStateEntryBox move from JSON to proto, distance/rotation types become proto enums, and storage batches use aligned multi-batch IPC.
Adds roundtrip + through-envelope zero-copy tests for each migrated codec (including RabitQ Matrix rotation and multi-batch SQ storage).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Self-review cleanup: remove the unused `CacheCodec::type_id()` getter and `CacheDecode::is_miss()` (test call sites use `hit().is_none()` instead), use the existing `IPC_CONTINUATION` constant in `parse_ipc_message_prefix`, and align the label_list wire-format doc with the section vocabulary used by the other migrated codecs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The `protos/` folder is the on-disk Lance format spec. The cache-entry headers are library-internal serialization, not spec, so they don't belong there. Consolidate the three files (cache_fts/cache_scalar/cache_vector) into a single `lance-index/protos-cache/cache.proto` under one package `lance.index.cache`, generating a single `lance_index::cache_pb` module. The `lance` crate no longer compiles its own vector cache proto; its IVF codec imports the headers from `lance_index::cache_pb`. Message field numbers are unchanged, so the on-wire proto bytes are identical. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Addresses review feedback on the cache codec stabilization:
- Export `MAGIC` and a `has_cache_envelope()` helper so backends can recognize Lance cache entries without hardcoding the magic bytes.
- Carry a `CacheMissReason` on `CacheDecode::Miss` (InvalidEnvelope, TypeMismatch, VersionTooNew, BodyError) so backends can emit targeted metrics. Drop the vestigial outer `Result` from `deserialize`: reading from an in-memory `Bytes` cannot do I/O, so a miss is the only non-`Hit` outcome.
- Reword the `raw_writer()`/`body()` docs to describe their legitimate use (a self-framed whole-body payload such as a roaring bitmap) rather than framing them as transitional.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
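As a rough illustration of how a backend might use the miss reason for targeted metrics: the four variant names below come from the commit message, but the tuple shape of `Miss`, the generic `CacheDecode<T>`, and the `MissMetrics` type with its `get_or_compute` method are assumptions for this sketch, not the crate's API.

```rust
use std::collections::HashMap;

/// Variant names taken from the commit message; the derive set is assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum CacheMissReason {
    InvalidEnvelope,
    TypeMismatch,
    VersionTooNew,
    BodyError,
}

/// Assumed shape: a miss carries its reason, and there is no outer `Result`.
enum CacheDecode<T> {
    Hit(T),
    Miss(CacheMissReason),
}

/// Hypothetical backend-side counter keyed by miss reason.
#[derive(Default)]
struct MissMetrics {
    counts: HashMap<CacheMissReason, u64>,
}

impl MissMetrics {
    /// Return the cached value on a hit; on a miss, record why and recompute.
    fn get_or_compute<T>(&mut self, decoded: CacheDecode<T>, recompute: impl FnOnce() -> T) -> T {
        match decoded {
            CacheDecode::Hit(value) => value,
            CacheDecode::Miss(reason) => {
                *self.counts.entry(reason).or_insert(0) += 1;
                recompute()
            }
        }
    }
}
```

With only two outcomes to match on, the caller has no error path to propagate, which is the practical effect of dropping the outer `Result`.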
Rebasing onto main pulled in lance-format#7078 (raw-query IVF RabitQ search), which added a `query_estimator` field to `RabitQuantizationMetadata` and persisted it in the RabitQ partition cache. Our migration replaced that serde header with a proto header that predated the field, which would have silently dropped it — a cached partition would always reload as `ResidualQuery` and break raw-query search. Add `query_estimator` to the proto `RabitPartitionHeader` (field 6) as a new `RabitQueryEstimator` enum, map it on the serialize/deserialize paths, and add a round-trip test exercising the non-default `RawQuery` value. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
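The failure mode described above follows from how proto3 treats an absent enum field: it reads back as the zero value. A small sketch of the mapping, assuming `ResidualQuery` is wire value 0 and `RawQuery` is 1 (the numbering is a guess; only the two value names appear in the commit message), with hypothetical `to_wire`/`from_wire` helpers:

```rust
/// Rust-side estimator kind; variant names follow the commit message.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum QueryEstimator {
    ResidualQuery,
    RawQuery,
}

/// Map to the proto enum's integer value (numbering assumed for illustration).
fn to_wire(estimator: QueryEstimator) -> i32 {
    match estimator {
        QueryEstimator::ResidualQuery => 0,
        QueryEstimator::RawQuery => 1,
    }
}

/// Map back from the wire value. A header written before the field existed
/// decodes as 0, so it comes back as `ResidualQuery`. An out-of-range value
/// returns `None`, which a caller could surface as a body-decode miss.
fn from_wire(value: i32) -> Option<QueryEstimator> {
    match value {
        0 => Some(QueryEstimator::ResidualQuery),
        1 => Some(QueryEstimator::RawQuery),
        _ => None,
    }
}
```

A round-trip test that only uses the default value cannot catch a dropped field, which is why the commit exercises the non-default `RawQuery`.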
b0d2622 to cc615fb
…t variant miss
Self-review pass on the cache codec stabilization:
- Fix a stale `.unwrap()` in the lance-table codec test left over from the `deserialize -> CacheDecode` signature change (was an E0599 build error).
- Extract the duplicated IPC section alignment-padding prologue into a `write_section_padding` helper shared by both write_ipc_section variants.
- Add `unknown_posting_variant_is_miss`: a valid envelope whose body leads with an out-of-range variant tag must self-heal to a `BodyError` miss, not panic.
- Move a misplaced Matrix-rotation doc comment onto the matrix roundtrip test.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
`CacheCodecImpl` is not imported in entry_io.rs, so the unqualified `[CacheCodecImpl::serialize]` link failed to resolve under rustdoc's `-D warnings`. Qualify it with the `super::` path the sibling links use. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Implements #7160. Cache entries (FTS posting lists, scalar/vector index state) were serialized with an ad-hoc, unversioned format only safe to read in the same process that wrote it. This stabilizes the format so entries can live in a node-agnostic, restart-surviving cache backend.
Wire format
Each entry is an envelope followed by a body:
Body sections, each self-delimiting:
Why this shape
- Hit/Miss, never a hard error. Wrong/absent magic, an unknown envelope version, a `type_id` mismatch, a future `type_version`, or a body decode failure all become `Miss` → recompute. Old, foreign, or corrupt bytes self-heal with zero migration code.
- […] `type_version`, which the reader branches on.
- […] `memcpy` on every read; this guards the FTS WAND hot path.
- `RAW_BLOB` is reserved for payloads with their own portable, self-describing encoding (roaring bitmaps, the shared position stream).

A codec with no scalar metadata (e.g. bitmap) simply omits the header: sections are positional, so nothing is written for an absent header.
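Branching on `type_version` inside a codec might look like the sketch below. Everything here is hypothetical: the `Entry` type, its two fields, the version numbers, and the raw little-endian body stand in for a real codec's proto header and sections.

```rust
/// Hypothetical decoded entry: version 1 bodies carry only `rows`,
/// version 2 bodies add `flags`.
#[derive(Debug, PartialEq)]
struct Entry {
    rows: u32,
    flags: u32,
}

/// Read a little-endian u32 at `at`, or `None` if the body is too short.
fn read_u32(body: &[u8], at: usize) -> Option<u32> {
    let s = body.get(at..at + 4)?;
    Some(u32::from_le_bytes([s[0], s[1], s[2], s[3]]))
}

/// Decode a body according to the version recorded in the envelope.
/// `None` stands for a miss: the caller recomputes the entry.
fn decode_body(version: u32, body: &[u8]) -> Option<Entry> {
    match version {
        // Older writers: fill the newer field with its default.
        1 => Some(Entry { rows: read_u32(body, 0)?, flags: 0 }),
        // Current writers: both fields present.
        2 => Some(Entry { rows: read_u32(body, 0)?, flags: read_u32(body, 4)? }),
        // A version this reader does not know: miss rather than guess.
        _ => None,
    }
}
```

Old entries stay readable after a format bump, while entries from a newer writer fall through to the miss arm instead of being misread.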
Scope
All cache codecs migrated: FTS posting lists (compressed/plain/positions + groups), scalar indices (BTree/Bitmap/Flat/LabelList/RowAddrTreeMap), and the five IVF quantizer partitions + IVF state. The cache protos live in `lance-index/protos-cache/cache.proto` (package `lance.index.cache`); they describe library serialization, not the on-disk format spec.
Tests
Envelope round-trip and every miss path; per-codec round-trip + through-envelope zero-copy alignment (incl. RabitQ Matrix rotation, multi-batch SQ, nested bitmap in a label-list entry); additive proto-field compat; existing IVF build+search suites pass through the migrated path.
Closes #7160.
🤖 Generated with Claude Code