
feat(compaction): introduce RowAddrRemap structure to avoid remap OOM caused by HashMap #7237

Open
zhangyue19921010 wants to merge 2 commits into lance-format:main from zhangyue19921010:remap-oom-optimize


@zhangyue19921010

Contributor

Closes: #7150

Compact Row-Address Remapping for Compaction

Compaction rewrites rows into new fragments, so indices that store physical row addresses must translate each old address to its new one. This change provides that translation without building an O(total rows) HashMap<u64, Option<u64>>.

Layout

Old Rows

old_fragment_id -> (old_offsets, old_rows_before)

  • old_offsets: the offsets of the rows in this old fragment that were rewritten (deleted rows are absent).
  • old_rows_before: the number of rewritten rows in all old fragments read before this one.

New Rows

Ordered new-fragment ranges:

(fragment_id, new_rows_before, physical_rows)

  • new_rows_before: the number of rows written to all new fragments before this one.
  • physical_rows: the number of rows written to this new fragment.
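Both layouts are running prefix sums over ordered inputs. The following is a minimal sketch of how they could be built, not the code in this PR: the names `OldFragLayout`, `NewFragRange`, and `build_layout` are hypothetical, and a sorted `Vec<u32>` stands in for whatever bitmap type the real implementation uses for old_offsets.

```rust
use std::collections::HashMap;

/// Hypothetical per-old-fragment entry: (old_offsets, old_rows_before).
#[derive(Debug, PartialEq)]
pub struct OldFragLayout {
    /// Ascending offsets of the rewritten rows (assumption: a sorted Vec
    /// stands in for a compressed bitmap).
    pub old_offsets: Vec<u32>,
    /// Rewritten row count in all old fragments read before this one.
    pub old_rows_before: u64,
}

/// Hypothetical new-fragment range: (fragment_id, new_rows_before, physical_rows).
#[derive(Debug, PartialEq)]
pub struct NewFragRange {
    pub fragment_id: u32,
    pub new_rows_before: u64,
    pub physical_rows: u64,
}

/// Builds both layouts with running prefix sums.
/// `old` is (old_fragment_id, rewritten offsets) in read order;
/// `new` is (new_fragment_id, physical_rows) in write order.
pub fn build_layout(
    old: Vec<(u32, Vec<u32>)>,
    new: &[(u32, u64)],
) -> (HashMap<u32, OldFragLayout>, Vec<NewFragRange>) {
    let mut old_layout = HashMap::new();
    let mut rows_before = 0u64;
    for (frag_id, mut offsets) in old {
        // Keep offsets ascending and unique so a position lookup is a rank.
        offsets.sort_unstable();
        offsets.dedup();
        let count = offsets.len() as u64;
        old_layout.insert(
            frag_id,
            OldFragLayout { old_offsets: offsets, old_rows_before: rows_before },
        );
        rows_before += count;
    }

    let mut new_layout = Vec::with_capacity(new.len());
    let mut rows_before = 0u64;
    for &(fragment_id, physical_rows) in new {
        new_layout.push(NewFragRange {
            fragment_id,
            new_rows_before: rows_before,
            physical_rows,
        });
        rows_before += physical_rows;
    }
    (old_layout, new_layout)
}
```

In this sketch the memory cost is one entry per fragment plus the offset sets, rather than one map entry per row.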

Lookup

  • An address whose fragment was not rewritten returns None.

  • For an address whose fragment was rewritten:

    1. Read (old_offsets, old_rows_before) from the old-row layout.

    2. If offset is not in old_offsets, return Some(None) because the row was deleted.

    3. Otherwise, old_offsets.rank(offset) - 1 is this row's 0-based position among rewritten old rows in this old fragment.

    4. Add old_rows_before to get k, the row's 0-based position among all rewritten old rows.

    5. In the new-row layout, find the range (fragment_id, new_rows_before, physical_rows) where:

      new_rows_before <= k < new_rows_before + physical_rows

    6. The new address is:

      (fragment_id, k - new_rows_before)
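The lookup steps above can be sketched as follows. This is an illustration under stated assumptions, not the implementation in this PR: `RowAddrRemapSketch` and `addr` are hypothetical names, a sorted `Vec<u32>` with binary search stands in for a bitmap's rank operation, and a row address is assumed to be packed as the fragment id in the upper 32 bits and the row offset in the lower 32 bits.

```rust
use std::collections::HashMap;

/// Assumed address packing: fragment id in the high 32 bits, offset in the low 32.
pub fn addr(fragment_id: u32, offset: u32) -> u64 {
    ((fragment_id as u64) << 32) | offset as u64
}

/// Hypothetical stand-in for the structure described above.
pub struct RowAddrRemapSketch {
    /// old_fragment_id -> (old_offsets sorted ascending, old_rows_before)
    old: HashMap<u32, (Vec<u32>, u64)>,
    /// (fragment_id, new_rows_before, physical_rows), in write order
    new: Vec<(u32, u64, u64)>,
}

impl RowAddrRemapSketch {
    pub fn new(old: HashMap<u32, (Vec<u32>, u64)>, new: Vec<(u32, u64, u64)>) -> Self {
        Self { old, new }
    }

    /// None: the fragment was not rewritten.
    /// Some(None): the fragment was rewritten but the row was deleted.
    /// Some(Some(a)): the row now lives at address `a`.
    pub fn lookup(&self, old_addr: u64) -> Option<Option<u64>> {
        let frag_id = (old_addr >> 32) as u32;
        let offset = old_addr as u32;

        // Step 1: read the old-row layout; `?` yields None for untouched fragments.
        let (old_offsets, old_rows_before) = self.old.get(&frag_id)?;

        // Steps 2-3: on a hit, the index in the sorted Vec equals rank(offset) - 1.
        let Ok(pos) = old_offsets.binary_search(&offset) else {
            return Some(None);
        };

        // Step 4: position among all rewritten old rows.
        let k = old_rows_before + pos as u64;

        // Step 5: first range whose end is greater than k.
        let idx = self
            .new
            .partition_point(|&(_, before, rows)| before + rows <= k);
        let &(new_frag, new_rows_before, _) = self
            .new
            .get(idx)
            .expect("k must fall inside a new-fragment range");

        // Step 6: offset within the new fragment.
        Some(Some(addr(new_frag, (k - new_rows_before) as u32)))
    }
}
```

Each lookup costs two binary searches in this sketch: one over the fragment's offsets and one over the new-fragment ranges.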
      

Ordering

Compact remap does not store each old-to-new row mapping. It computes k from the old-row layout, then maps it to the k-th row written to the new fragments.

This requires the reader-to-writer pipeline to preserve row order.

  • old_frag_ids must match the order old fragments are read.

    • Within each old fragment, rewritten rows are interpreted by ascending old row offset.
  • new_frags must match the order new rows are written.

  • Current compaction satisfies this because it scans selected fragments in order and writes the resulting stream without reordering rows.
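To make the ordering requirement concrete, the sketch below simulates an order-preserving pipeline and emits the explicit old-to-new pairs that the compact remap avoids storing. It is only an illustration: `simulate_rewrite` is a hypothetical helper, and it assumes the new fragments together hold exactly the rewritten rows.

```rust
/// Simulates an order-preserving rewrite: old fragments are read in the given
/// order, rows within a fragment by ascending offset, and each row is written
/// to the next free slot of the new fragments in write order.
/// Returns ((old_fragment_id, old_offset), (new_fragment_id, new_offset)) pairs.
pub fn simulate_rewrite(
    old_frags: &[(u32, Vec<u32>)], // (old_fragment_id, rewritten offsets ascending)
    new_frags: &[(u32, u64)],      // (new_fragment_id, physical_rows)
) -> Vec<((u32, u32), (u32, u32))> {
    // Enumerate every slot of the new fragments in write order.
    let mut new_slots = new_frags
        .iter()
        .flat_map(|&(id, rows)| (0..rows).map(move |off| (id, off as u32)));

    let mut pairs = Vec::new();
    for (old_id, offsets) in old_frags {
        for &off in offsets {
            // The k-th rewritten old row lands in the k-th new slot.
            let slot = new_slots
                .next()
                .expect("new fragments must hold every rewritten row");
            pairs.push(((*old_id, off), slot));
        }
    }
    pairs
}
```

Reading the same fragments in a different order assigns different new addresses, which is why the order recorded in old_frag_ids has to match the order the reader actually used.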


@github-actions github-actions Bot added A-index Vector index, linalg, tokenizer enhancement New feature or request labels Jun 11, 2026


Development

Successfully merging this pull request may close these issues.

Compaction remap OOM on the master: the O(rows) HashMap materialization
