[Feature] Properly deal with global index in MergeInto

### Search before asking

- [x] I searched in the [issues](https://github.com/apache/paimon/issues) and found nothing similar.


### Motivation

Currently, Paimon indexes are mainly designed for historical partitioned data. On the other hand, Data Evolution (DE) tables are built on top of an append-only table model. As a result, the current index implementation does not yet take deletes or updates into account.

However, DE tables are not strictly append-only. With MERGE INTO, we can update column values. If a column is updated after a global index has already been built for the affected data, but the index is not updated accordingly (or the corresponding index entries are not invalidated/removed), subsequent index-based queries may return incorrect results.

### Solution

## Introduce DeletedRowIds for index

In Paimon, the index currently serves only as a coarse pre-filter, and a full filtering pass still runs afterwards. Given that, we could simply persist the deleted row ranges (DeletedRowIds) in the data file and, after the normal index lookup finishes, apply an OR (union) between the index result and DeletedRowIds, so that rows whose index entries were deleted or invalidated always remain candidates for the full filtering pass. As illustrated below:

<img width="1068" height="768" alt="Image" src="https://github.com/user-attachments/assets/d536ee05-8454-4305-a5e8-709d2931c761" />

The key points are:
1. During MERGE INTO, for any data files being modified, add the corresponding row ranges directly to DeletedRowIds.
2. When an index update/rebuild happens, remove the row ranges covered by that update/rebuild from DeletedRowIds.
3. Accordingly, IndexedSplitScan should be able to decide, based on the input row range, whether to push the row range down to the file format reader or to fall back to a full scan.
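
To make the three key points concrete, here is a minimal sketch of the bookkeeping. It is only an illustration, not existing Paimon code: the class `DeletedRowRanges`, its method names, and the selectivity threshold in `shouldPushDown` are all assumptions made for this example, and row ranges are modelled as inclusive `[start, end]` pairs of non-negative row ids.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: row-id ranges whose index entries are stale. Not a Paimon API. */
final class DeletedRowRanges {
    // start -> end (both inclusive); ranges are kept disjoint and non-adjacent.
    private final TreeMap<Long, Long> ranges = new TreeMap<>();

    /** Key point 1: MERGE INTO modified rows [start, end], so mark them as stale. */
    void add(long start, long end) {
        Map.Entry<Long, Long> e = ranges.floorEntry(start);
        // Extend to the left if an existing range overlaps or touches the new one.
        if (e != null && e.getValue() >= start - 1) {
            start = e.getKey();
            end = Math.max(end, e.getValue());
        }
        // Absorb every range that starts inside (or right after) the new one.
        e = ranges.ceilingEntry(start);
        while (e != null && e.getKey() <= end + 1) {
            end = Math.max(end, e.getValue());
            ranges.remove(e.getKey());
            e = ranges.ceilingEntry(start);
        }
        ranges.put(start, end);
    }

    /** Key point 2: the index was rebuilt for rows [start, end], so they are no longer stale. */
    void remove(long start, long end) {
        Map.Entry<Long, Long> e = ranges.floorEntry(end);
        while (e != null && e.getValue() >= start) {
            long s = e.getKey();
            long t = e.getValue();
            ranges.remove(s);
            if (t > end) ranges.put(end + 1, t);     // keep the part right of the rebuilt range
            if (s < start) ranges.put(s, start - 1); // keep the part left of the rebuilt range
            e = ranges.lowerEntry(s);
        }
    }

    /** The OR step: index hits plus stale ranges become the candidates for the full filter. */
    List<long[]> union(List<long[]> indexHits) {
        DeletedRowRanges merged = new DeletedRowRanges();
        merged.ranges.putAll(ranges);
        for (long[] hit : indexHits) merged.add(hit[0], hit[1]);
        return merged.toList();
    }

    List<long[]> toList() {
        List<long[]> out = new ArrayList<>();
        for (Map.Entry<Long, Long> e : ranges.entrySet()) {
            out.add(new long[] {e.getKey(), e.getValue()});
        }
        return out;
    }

    /** Key point 3 (assumed heuristic): push ranges down only when they select few rows. */
    static boolean shouldPushDown(List<long[]> candidates, long fileRowCount, double maxFraction) {
        long selected = 0;
        for (long[] r : candidates) selected += r[1] - r[0] + 1;
        return selected <= fileRowCount * maxFraction;
    }
}
```

Because the result is only ever a superset of the truly matching rows, the later full filtering pass still removes false positives; the union just guarantees that updated rows are never skipped. How IndexedSplitScan should actually choose between push-down and full scan is open for discussion; the fraction-based rule above is just one possible criterion.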

## Introduce options for Merge Into
As suggested in this [comment](https://github.com/apache/paimon/pull/7028#pullrequestreview-3655396693), we could add an option for MERGE INTO that controls what happens when indexed columns are updated:
1. THROWS_AN_ERROR: perform a partition-level check and fail the commit with an error.
2. DROP_PARTITION_INDEX: in the same commit, drop the index files for all partitions that were modified.
3. DROP_FILE_INDEX: mark the row ranges of the affected data files as deleted/invalidated (as described above).
4. UPDATE_INDEX: add an index-update operator in the downstream MERGE INTO pipeline to update the index incrementally.
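
As a rough illustration of how such an option could be wired in, the sketch below models the four values as an enum and dispatches on it once an update to an indexed column is detected. The names `GlobalIndexUpdateAction` and `MergeIntoIndexCheck`, the string parsing, and the returned descriptions are hypothetical; the real option key and the commit-time plumbing are not defined in this issue.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical enum mirroring the four proposed behaviours. Not a Paimon API. */
enum GlobalIndexUpdateAction {
    THROWS_AN_ERROR,
    DROP_PARTITION_INDEX,
    DROP_FILE_INDEX,
    UPDATE_INDEX;

    /** Accepts e.g. "drop-file-index" or "DROP_FILE_INDEX" (assumed spelling rules). */
    static GlobalIndexUpdateAction parse(String value) {
        return valueOf(value.trim().toUpperCase(Locale.ROOT).replace('-', '_'));
    }
}

final class MergeIntoIndexCheck {
    /** Describes what the commit would do; a real implementation would act instead. */
    static String plan(
            GlobalIndexUpdateAction action, Set<String> updatedColumns, Set<String> indexedColumns) {
        Set<String> hit = new TreeSet<>(updatedColumns);
        hit.retainAll(indexedColumns);
        if (hit.isEmpty()) {
            return "no indexed column updated: commit as usual";
        }
        switch (action) {
            case THROWS_AN_ERROR:
                throw new IllegalStateException("MERGE INTO updates indexed columns " + hit);
            case DROP_PARTITION_INDEX:
                return "drop index files of all modified partitions in the same commit";
            case DROP_FILE_INDEX:
                return "mark row ranges of affected data files as deleted/invalidated";
            case UPDATE_INDEX:
                return "run a downstream index-update operator";
            default:
                throw new IllegalArgumentException("unknown action " + action);
        }
    }
}
```

The check only triggers when the updated columns intersect the indexed columns, so MERGE INTO statements that touch unindexed columns are unaffected by whichever value is configured.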

### Anything else?

I think we could introduce THROWS_AN_ERROR and DROP_PARTITION_INDEX first, to fix the potential inconsistency between data and index. The detailed designs for partial index deletion (DROP_FILE_INDEX) and incremental index update (UPDATE_INDEX) can be discussed further.

### Are you willing to submit a PR?

- [x] I'm willing to submit a PR!

[Feature] Properly deal with global index in MergeInto #7079
