feat: add conservative ORC predicate pushdown #388

Open
liujiwen-up wants to merge 1 commit into apache:main from liujiwen-up:feat/orc-predicate-pushdown
Conversation

@liujiwen-up

Contributor

Purpose

Linked issue: none

This PR adds conservative predicate pushdown support for ORC reads.

Before this change, ORC reads ignored file predicates completely. This meant ORC files could not benefit from reader-level row-group pruning, and filtering on non-projected columns could fail once the ORC reader needed those columns for predicate evaluation.

This change enables safe ORC row-group pruning for a first set of scalar predicates while preserving Paimon's existing residual-filter semantics. ORC predicate pushdown is treated as a conservative optimization only, not as exact filtering.
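The difference between conservative pruning and exact filtering can be sketched in a few lines. This is a minimal illustration, not the PR's code: `RowGroupStats`, `may_contain_eq`, `pruned_scan`, and `residual_filter` are hypothetical names, and the statistics are modeled as a plain min/max pair on a single i64 column.

```rust
/// Hypothetical min/max statistics for one row group of an i64 column.
#[derive(Debug, Clone, Copy)]
struct RowGroupStats {
    min: i64,
    max: i64,
}

/// A row group can contain `col = target` only if target lies in [min, max].
/// Returning true means "maybe", never "definitely".
fn may_contain_eq(stats: RowGroupStats, target: i64) -> bool {
    stats.min <= target && target <= stats.max
}

/// Conservative scan: skip row groups that cannot match, but return
/// every row of the groups that survive, matching or not.
fn pruned_scan(groups: &[(RowGroupStats, Vec<i64>)], target: i64) -> Vec<i64> {
    groups
        .iter()
        .filter(|(stats, _)| may_contain_eq(*stats, target))
        .flat_map(|(_, rows)| rows.iter().copied())
        .collect()
}

/// Residual filter applied above the scan to restore exact semantics.
fn residual_filter(rows: Vec<i64>, target: i64) -> Vec<i64> {
    rows.into_iter().filter(|v| *v == target).collect()
}
```

With two row groups covering [1, 5] and [10, 20], a predicate `col = 3` skips the second group entirely, yet the scan still returns 1, 3, and 5 from the first. Only the residual filter reduces that to the single matching row.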

Brief change log

  • Translate supported Paimon predicates into orc-rust predicates for ORC row-group pruning.
  • Support ORC pushdown for:
    • boolean equality
    • tinyint/smallint/int/bigint comparisons
    • string comparisons
    • small IN predicates
    • IS NOT NULL on supported scalar types
  • Keep unsupported predicates fail-open so they are not pushed down unsafely.
  • Read predicate-only columns internally when they are not part of the requested projection, then project output batches back to the requested columns.
  • Document that ORC predicate pushdown is conservative and still requires residual filtering above the scan for exact semantics.
  • Add unit tests for predicate translation, fail-open behavior, nested compound predicates, and projection restoration.
  • Add integration coverage for ORC reads with predicate-only column projection, supported scalar predicate types, conservative semantics, and unsupported date predicates remaining residual.
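The predicate-only column handling in the list above can be sketched as follows. This is an illustration under simplifying assumptions, not the reader's actual code: `Batch`, `read_columns`, and `project_back` are hypothetical names, a batch is modeled as named i64 columns, and the requested projection is assumed to contain no duplicate names.

```rust
/// A toy batch: an ordered list of (column name, values).
type Batch = Vec<(String, Vec<i64>)>;

/// Columns to read from the file: the requested projection first,
/// followed by any predicate columns not already requested.
fn read_columns(requested: &[&str], predicate_cols: &[&str]) -> Vec<String> {
    let mut cols: Vec<String> = requested.iter().map(|c| c.to_string()).collect();
    for c in predicate_cols {
        if !cols.iter().any(|x| x.as_str() == *c) {
            cols.push(c.to_string());
        }
    }
    cols
}

/// Restore the requested projection: keep only requested columns,
/// in the requested order, and drop the predicate-only extras.
fn project_back(batch: Batch, requested: &[&str]) -> Batch {
    let mut batch = batch;
    requested
        .iter()
        .filter_map(|name| {
            // Look up each requested column by name and move it out.
            let idx = batch.iter().position(|(n, _)| n.as_str() == *name)?;
            Some(batch.swap_remove(idx))
        })
        .collect()
}
```

For a query that selects `id` and `score` but filters on `age`, `read_columns` would widen the read to all three columns, and `project_back` would return only `id` and `score` to the caller.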

Unsupported in this PR

The following predicate pushdown cases are intentionally not enabled yet:

  • FLOAT = literal and DOUBLE = literal
    • Reason: ORC equality pruning may use bloom filters. We need to verify that orc-rust float/double bloom hashing is compatible with ORC files produced by Spark/Java ORC writers before enabling this safely. A false-negative bloom filter lookup could incorrectly skip a row group that contains matching rows.
  • Date and timestamp predicates
    • Reason: this first PR keeps the supported type surface small. Date/timestamp ORC statistics need separate validation for encoding, timezone, and unit semantics before enabling.
  • Decimal, binary, nested types, and complex predicates
    • Reason: these require additional type-specific validation and test fixtures.
  • NOT, NOT IN, !=, and partially supported predicates under OR
    • Reason: OR predicates must be pushed down only when all branches are safely representable. Otherwise row groups could be incorrectly pruned.

These unsupported cases fail open and remain residual filters. They can be added incrementally in follow-up PRs after dedicated compatibility tests are available.
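A fail-open translation of this kind might look like the sketch below. The types `Pred` and `Pushdown` and the function `translate` are hypothetical stand-ins, not Paimon or orc-rust types. The sketch also assumes that an AND may keep only its translatable conjuncts, which is safe because pruning on a weaker predicate never skips a matching row group; the PR text does not spell out how AND is handled.

```rust
/// Hypothetical engine-side predicate (stand-in for a Paimon predicate).
#[derive(Debug, Clone, PartialEq)]
enum Pred {
    IntEq(String, i64),
    IsNotNull(String),
    DateEq(String, i32), // treated as unsupported in this sketch
    Not(Box<Pred>),      // treated as unsupported in this sketch
    And(Vec<Pred>),
    Or(Vec<Pred>),
}

/// Hypothetical reader-side predicate used only for row-group pruning.
#[derive(Debug, Clone, PartialEq)]
enum Pushdown {
    IntEq(String, i64),
    IsNotNull(String),
    And(Vec<Pushdown>),
    Or(Vec<Pushdown>),
}

/// Returns None when nothing can be pushed down safely (fail open).
fn translate(p: &Pred) -> Option<Pushdown> {
    match p {
        Pred::IntEq(c, v) => Some(Pushdown::IntEq(c.clone(), *v)),
        Pred::IsNotNull(c) => Some(Pushdown::IsNotNull(c.clone())),
        // Unsupported leaves and negations are never pushed down.
        Pred::DateEq(..) | Pred::Not(_) => None,
        // AND: keep the conjuncts that translate, drop the rest.
        Pred::And(children) => {
            let kept: Vec<Pushdown> = children.iter().filter_map(translate).collect();
            if kept.is_empty() {
                None
            } else {
                Some(Pushdown::And(kept))
            }
        }
        // OR: every branch must translate, otherwise push nothing.
        Pred::Or(children) => {
            let all: Option<Vec<Pushdown>> = children.iter().map(translate).collect();
            all.map(Pushdown::Or)
        }
    }
}
```

In every case the original predicate stays in place as the residual filter; a `None` result only means the reader prunes nothing for that predicate.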

Tests

  • cargo fmt --check: passed
  • cargo test -p paimon arrow::format::orc::tests --lib: passed
  • cargo test -p paimon-integration-tests test_read_orc --no-run: passed

Note: the ORC integration tests require the provisioned integration-test warehouse to run fully. In this local environment they were only compiled (--no-run), not executed, because default.full_types_table was not present.

API and Format

No public API changes.

No table format, snapshot, manifest, or persisted metadata format changes.

Documentation

Inline reader documentation was updated to clarify that ORC predicate pushdown is conservative row-group pruning, not exact row-level filtering.
