fix: deduplicate BTree flat page row addresses #7235
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b6c2fbe6f3
    if let Some(archive) = Self::load_from_files(&base, object_store.clone(), config).await? {
        return Ok(archive);
Merge retained archives before writing the next archive
When `max_entries` is reached, `load_or_new` seeds the next write from only the newest archive file, even though `max_archive_files` keeps older archive files around. In repeated cleanups, entries that were drained from the newest file but still exist in an older retained archive are not carried forward; once `cleanup_old_archives` deletes that older file, metadata for cleaned versions is lost despite the retention setting. Use the existing all-file scan/merge path (or equivalent) when preparing the archive to flush.
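A minimal sketch of the suggested merge, under stated assumptions: the `Archive` alias and the `merge_retained` function are hypothetical names used only for illustration and are not taken from the repository. It models each retained archive file as a map from cleaned version to metadata and folds all of them, oldest to newest, into the map that would seed the next write.

```rust
use std::collections::BTreeMap;

/// Hypothetical archive model: cleaned version number -> metadata string.
pub type Archive = BTreeMap<u64, String>;

/// Merge every retained archive into one map.
/// Assumption: `archives` is ordered oldest to newest, so on a key
/// collision the entry from the newer archive overwrites the older one.
pub fn merge_retained(archives: &[Archive]) -> Archive {
    let mut merged = Archive::new();
    for archive in archives {
        for (version, meta) in archive {
            // Later archives overwrite earlier ones for the same version.
            merged.insert(*version, meta.clone());
        }
    }
    merged
}
```

Seeding from the merged map rather than from the newest file alone means an entry that survives only in an older retained file is still present after that file is deleted.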
Summary
Fixes #7230.
BTree flat pages are expected to contain at most one entry for each row address. During index update / remap / delta merge flows, duplicate `(value, row address)` pairs can be produced. The flat page currently sorts by row id and then builds `RowAddrTreeMap`, which fails on duplicated row ids with the misleading `from_sorted_iter called with non-sorted input` error.

This PR normalizes BTree flat page batches by sorting by row id and then scanning adjacent duplicate row ids.
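The sort-then-scan idea can be sketched on plain vectors. This is an illustration only, not the PR's implementation: `Entry` and `normalize_by_row_addr` are hypothetical names, and the sketch assumes the first entry seen for a row address is the one kept, which may differ from the rule the PR applies when two entries share a row address but carry different values.

```rust
/// Hypothetical stand-in for one flat-page entry: an indexed value plus its row address.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Entry {
    pub value: i64,
    pub row_addr: u64,
}

/// Sort entries by row address, then drop adjacent entries that repeat a
/// row address, so the output is strictly increasing in `row_addr`.
/// Assumption: the first entry for each row address wins.
pub fn normalize_by_row_addr(mut entries: Vec<Entry>) -> Vec<Entry> {
    // The stable sort keeps input order among entries sharing a row address.
    entries.sort_by_key(|e| e.row_addr);
    // `dedup_by_key` removes consecutive elements with an equal key, keeping the first.
    entries.dedup_by_key(|e| e.row_addr);
    entries
}
```

After this step the row addresses are strictly increasing, which is the precondition a sorted-input map builder typically checks.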
The normalization is applied in both `FlatIndex::try_new()` and `FlatIndex::remap_batch()` so existing/read-time pages and optimize/remap output are protected.

Tests
- `cargo fmt --all`
- `cargo test -p lance-index scalar::btree::flat::tests`
- `cargo test -p lance test_btree_reordered_merge_insert_index_delta_merge_deduplicates_row_addr`
- `cargo clippy --all --tests --benches -- -D warnings`
- `cd python && uv run --frozen python ../repros/btree_duplicate_row_addr_from_table.py`