fix: deduplicate BTree flat page row addresses #7235

Draft
majin1102 wants to merge 1 commit into lance-format:main from majin1102:codex/fix-btree-duplicate-row-addresses

@majin1102

Contributor

Summary

Fixes #7230.

BTree flat pages are expected to contain at most one entry per row address. During index update, remap, and delta merge flows, duplicate (value, row address) pairs can be produced. The flat page currently sorts by row id and then builds a RowAddrTreeMap, which fails on duplicated row ids with the misleading error "from_sorted_iter called with non-sorted input".

This PR normalizes BTree flat page batches by sorting by row id and then scanning adjacent duplicate row ids:

  • duplicate row id with the same indexed value: keep one entry
  • duplicate row id with a conflicting indexed value: return a clear internal error

The normalization is applied in both FlatIndex::try_new() and FlatIndex::remap_batch() so existing/read-time pages and optimize/remap output are protected.
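
The normalization described above can be sketched roughly as follows. This is a minimal illustration over plain (value, row id) tuples, not the code in this PR: the function name normalize, the ConflictingValue error type, and the use of i64 values are assumptions made for the example, and the real implementation operates on Arrow record batches.

```rust
/// Hypothetical error for this sketch; not Lance's error type.
#[derive(Debug, PartialEq)]
struct ConflictingValue {
    row_id: u64,
}

/// Sort (value, row_id) pairs by row id, keep one entry per identical pair,
/// and reject a row id that maps to two different values.
fn normalize(mut entries: Vec<(i64, u64)>) -> Result<Vec<(i64, u64)>, ConflictingValue> {
    // Sorting by row id makes duplicates adjacent, so one linear scan suffices.
    entries.sort_by_key(|&(_, row_id)| row_id);

    let mut out: Vec<(i64, u64)> = Vec::with_capacity(entries.len());
    for (value, row_id) in entries {
        match out.last() {
            Some(&(prev_value, prev_id)) if prev_id == row_id => {
                if prev_value != value {
                    // Same row id, different indexed value: surface a clear error.
                    return Err(ConflictingValue { row_id });
                }
                // Same row id, same value: drop the duplicate.
            }
            _ => out.push((value, row_id)),
        }
    }
    Ok(out)
}
```

After this pass the row ids are strictly increasing, which is the property a builder that expects sorted, unique input depends on.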

Tests

  • cargo fmt --all
  • cargo test -p lance-index scalar::btree::flat::tests
  • cargo test -p lance test_btree_reordered_merge_insert_index_delta_merge_deduplicates_row_addr
  • cargo clippy --all --tests --benches -- -D warnings
  • cd python && uv run --frozen python ../repros/btree_duplicate_row_addr_from_table.py

@github-actions github-actions Bot added the A-python (Python bindings), A-index (Vector index, linalg, tokenizer), A-ci (CI / build workflows), and bug (Something isn't working) labels Jun 11, 2026
@majin1102 majin1102 force-pushed the codex/fix-btree-duplicate-row-addresses branch from b6c2fbe to c7051a2 Compare June 11, 2026 09:14

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b6c2fbe6f3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread rust/lance/src/dataset/archive.rs Outdated
Comment on lines +537 to +538
if let Some(archive) = Self::load_from_files(&base, object_store.clone(), config).await? {
    return Ok(archive);

P2: Merge retained archives before writing the next archive

When max_entries is reached, load_or_new seeds the next write from only the newest archive file, even though max_archive_files keeps older archive files around. In repeated cleanups, entries that were drained from the newest file but still exist in an older retained archive are not carried forward; once cleanup_old_archives deletes that older file, metadata for cleaned versions is lost despite the retention setting. Use the existing all-file scan/merge path (or equivalent) when preparing the archive to flush.
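
The merge suggested in this comment could look something like the sketch below. It is only an illustration under assumed types: ArchiveFile (a map from version to a metadata string) and merge_retained are hypothetical names, and the real archive.rs code and its all-file scan path are not shown here.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for one archive file: version -> metadata.
type ArchiveFile = BTreeMap<u64, String>;

/// Merge every retained archive file (ordered oldest to newest) so that the
/// next write carries forward entries that only exist in older files.
fn merge_retained(files_oldest_first: &[ArchiveFile]) -> ArchiveFile {
    let mut merged = ArchiveFile::new();
    for file in files_oldest_first {
        for (version, meta) in file {
            // A newer file overwrites an older one for the same version.
            merged.insert(*version, meta.clone());
        }
    }
    merged
}
```

Seeding the next archive from the merged map, rather than from the newest file alone, means that deleting the oldest retained file does not drop entries that were never copied forward.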

@majin1102 majin1102 force-pushed the codex/fix-btree-duplicate-row-addresses branch 2 times, most recently from 2e90802 to 5150dcc Compare June 11, 2026 11:46
@majin1102 majin1102 marked this pull request as draft June 11, 2026 11:57
@codecov

codecov Bot commented Jun 11, 2026

Codecov Report

❌ Patch coverage is 86.44068% with 8 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance-index/src/scalar/btree/flat.rs | 86.44% | 0 Missing and 8 partials ⚠️ |

@majin1102 majin1102 force-pushed the codex/fix-btree-duplicate-row-addresses branch 3 times, most recently from 94e7126 to 853bb57 Compare June 11, 2026 17:53
@majin1102 majin1102 force-pushed the codex/fix-btree-duplicate-row-addresses branch from 853bb57 to d4ffd8a Compare June 12, 2026 03:29
Labels

A-ci (CI / build workflows), A-index (Vector index, linalg, tokenizer), A-python (Python bindings), bug (Something isn't working)

Development

Successfully merging this pull request may close these issues.

BTree scalar index can contain duplicate row addresses