Improve scan pruning observability and IN predicate stats pruning #385

@hhhizzz

Search before asking

  • I searched in the issues and found nothing similar.

Motivation

Paimon Rust scan planning has several metadata pruning paths, including partition pruning, bucket pruning, file min/max stats pruning, LIMIT split reduction, COUNT(*) statistics rewrite, and time-travel snapshot selection.

Today these pruning decisions are hard to inspect from tests or physical plan output. This makes it difficult to identify cases that still do full scans or produce too many splits.

There is also a specific gap for non-partition IN predicates: file min/max stats can prove that some files cannot match, but IN currently fails open, so those files are kept and scanned anyway.

Proposal

Add lightweight scan planning trace counters so tests and DataFusion physical plan display can show how many manifests, manifest entries, splits, and files survive each pruning stage.
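As a rough illustration of what such counters could look like, here is a minimal sketch. All names (`StageCount`, `ScanTrace`, `filter`) are hypothetical and are not taken from the Paimon Rust codebase. Each stage records how many items entered and how many survived, and the `Display` impl shows one possible single-line form for plan output.

```rust
use std::fmt;

/// Hypothetical counter for one pruning stage (not an existing Paimon type).
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct StageCount {
    /// Items seen by this stage.
    pub input: usize,
    /// Items that survived this stage.
    pub kept: usize,
}

impl StageCount {
    /// Filters `items` with `keep` and records how many went in and came out.
    pub fn filter<T>(&mut self, items: Vec<T>, keep: impl Fn(&T) -> bool) -> Vec<T> {
        self.input += items.len();
        let kept: Vec<T> = items.into_iter().filter(|item| keep(item)).collect();
        self.kept += kept.len();
        kept
    }
}

/// Hypothetical trace collected during one scan planning run.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct ScanTrace {
    pub manifests: StageCount,
    pub manifest_entries: StageCount,
    pub files: StageCount,
    /// Number of splits produced after all pruning.
    pub splits: usize,
}

impl fmt::Display for ScanTrace {
    /// Renders each stage as `kept/input` on a single line.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "manifests={}/{}, entries={}/{}, files={}/{}, splits={}",
            self.manifests.kept,
            self.manifests.input,
            self.manifest_entries.kept,
            self.manifest_entries.input,
            self.files.kept,
            self.files.input,
            self.splits
        )
    }
}
```

A test could then assert on the counters directly (for example, that `trace.files.kept` is smaller than `trace.files.input` for a selective predicate) rather than inferring pruning from the query result.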

Use the trace to add self-contained pruning baselines for:

  • partition pruning
  • bucket-key pruning
  • SQL BETWEEN partition pruning
  • LIMIT split reduction
  • COUNT(*) statistics rewrite
  • time-travel snapshot selection

Separately, improve non-partition IN stats pruning by checking whether any IN literal overlaps the file min/max range. Keep conservative behavior for NOT IN, missing stats, corrupt stats, and unsupported comparisons.
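The overlap check could look roughly like the sketch below. It is simplified to `i64` values, and `ColumnStats` and `in_may_match` are hypothetical names, not the existing Paimon Rust API. A real implementation would also need to handle typed datums and NULL literals. Missing or inconsistent stats stand in here for the "missing stats" and "corrupt stats" cases.

```rust
/// Hypothetical file-level min/max stats for one column.
/// `None` means the value is missing or could not be decoded.
#[derive(Debug, Clone, Copy)]
pub struct ColumnStats {
    pub min: Option<i64>,
    pub max: Option<i64>,
}

/// Returns `true` when the file may contain matching rows and must be kept,
/// and `false` only when the stats prove that no IN literal can match.
pub fn in_may_match(stats: &ColumnStats, literals: &[i64], negated: bool) -> bool {
    // NOT IN is kept conservative, as the proposal asks: never prune on it.
    if negated {
        return true;
    }
    let (min, max) = match (stats.min, stats.max) {
        (Some(min), Some(max)) if min <= max => (min, max),
        // Missing or inconsistent stats: fail open and keep the file.
        _ => return true,
    };
    // Keep the file if at least one literal falls inside [min, max].
    literals.iter().any(|v| min <= *v && *v <= max)
}
```

For a file with `min = 10` and `max = 20`, `x IN (1, 15, 30)` keeps the file because 15 lies inside the range, while `x IN (1, 30)` allows it to be pruned.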

Solution

Anything else?

No response

Willingness to contribute

  • I'm willing to submit a PR!
