feat(datafusion): push down isnan over NaN-preserving numeric expressions by huan233usc · Pull Request #2592 · apache/iceberg-rust

huan233usc · 2026-06-05T01:21:20Z

Which issue does this PR close?

Part of #2154.

What changes are included in this PR?

Previously, scalar-function pushdown only handled isnan(...) when the argument was a bare column reference; complex numeric arguments such as isnan(qux + 1) were silently dropped.

This PR adds resolve_nan_preserving_reference, which resolves an isnan argument down to a single column Reference through transformations that preserve NaN-ness, so isnan(<expr>) can be pushed down as <col> IS NAN:

negation -x
numeric casts (date casts still rejected)
x + c, c + x, x - c, c - x for a finite literal c
x * c, c * x, x / c for a finite, non-zero literal c
arbitrary nesting of the above (e.g. isnan(-(qux + 1) * 3))

Resolving scalar-function arguments (e.g. abs(x)) is intentionally left for a follow-up PR to keep this change small; isnan(abs(x)) is not pushed down here.
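
The resolution rules listed above can be sketched on a toy expression tree. This is a minimal, std-only illustration, not the PR's code: the `Expr` and `Op` enums and both function names are hypothetical stand-ins for DataFusion's expression types, literals are modelled as plain `f64`, and numeric casts are left out for brevity.

```rust
/// Hypothetical, simplified expression tree (the real code walks DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(f64),
    Negative(Box<Expr>),
    Binary(Box<Expr>, Op, Box<Expr>),
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Op {
    Add,
    Sub,
    Mul,
    Div,
}

/// Returns the literal's value only if it is finite (neither NaN nor ±inf).
fn finite_literal(expr: &Expr) -> Option<f64> {
    match expr {
        Expr::Literal(v) if v.is_finite() => Some(*v),
        _ => None,
    }
}

/// Resolves `expr` to the single column whose NaN-ness it mirrors, or `None`
/// when the expression is not one of the supported NaN-preserving forms.
fn resolve_nan_preserving(expr: &Expr) -> Option<&str> {
    match expr {
        Expr::Column(name) => Some(name.as_str()),
        Expr::Literal(_) => None,
        Expr::Negative(inner) => resolve_nan_preserving(inner),
        Expr::Binary(left, op, right) => match op {
            // x + c, c + x, x - c, c - x: a finite literal on either side.
            Op::Add | Op::Sub => {
                if finite_literal(right).is_some() {
                    resolve_nan_preserving(left)
                } else if finite_literal(left).is_some() {
                    resolve_nan_preserving(right)
                } else {
                    None
                }
            }
            // x * c, c * x: the literal must also be non-zero.
            Op::Mul => {
                if finite_literal(right).is_some_and(|c| c != 0.0) {
                    resolve_nan_preserving(left)
                } else if finite_literal(left).is_some_and(|c| c != 0.0) {
                    resolve_nan_preserving(right)
                } else {
                    None
                }
            }
            // x / c only: the column must be the dividend.
            Op::Div => {
                if finite_literal(right).is_some_and(|c| c != 0.0) {
                    resolve_nan_preserving(left)
                } else {
                    None
                }
            }
        },
    }
}
```

In this model, `-(qux + 1) * 3` resolves to `qux`, while `qux * 0` and `1 / qux` resolve to `None` and would simply not be pushed down.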

Why is this sound?

Filter pushdown is reported as Inexact, so DataFusion re-applies the original predicate after scanning. The pushed-down predicate therefore only needs to be implied by the original filter (it may match extra rows, but must never drop a matching one). Every supported transformation keeps NaN-ness exactly equivalent — the result is NaN iff the wrapped column is NaN — so both isnan(...) and NOT isnan(...) remain correct.
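
The "NaN iff the wrapped column is NaN" property can be spot-checked with plain `f64` arithmetic. The helper below is a hypothetical illustration over a handful of IEEE-754 corner values; it is a sanity check on samples, not a proof and not part of the PR.

```rust
/// Sample inputs covering the IEEE-754 corner cases that matter here.
const SAMPLES: [f64; 8] = [
    f64::NAN,
    f64::INFINITY,
    f64::NEG_INFINITY,
    0.0,
    -0.0,
    1.0,
    -2.5,
    f64::MAX,
];

/// True if `f(x)` is NaN exactly when `x` is NaN, for every sample value.
fn preserves_nan_ness(f: impl Fn(f64) -> f64) -> bool {
    SAMPLES.iter().all(|&x| f(x).is_nan() == x.is_nan())
}
```

For example, `preserves_nan_ness(|x| -(x + 1.0) * 3.0)` holds on these samples, which is why both `isnan(...)` and `NOT isnan(...)` can be rewritten against the bare column.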

Cases that do not preserve NaN-ness are intentionally rejected (and covered by tests): x * 0 (±inf * 0 is NaN), x / 0 and c / x (0 / 0 is NaN), and multi-column expressions.
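
Each rejected arithmetic form has a concrete counterexample in which the input is not NaN but the result is. The function below is a hypothetical illustration that lists one such input per form using ordinary `f64` arithmetic.

```rust
/// One counterexample per rejected form: `(description, input x, result)`.
/// In every row the input is not NaN, yet the result is.
fn counterexamples() -> [(&'static str, f64, f64); 3] {
    let zero = 0.0_f64;
    [
        ("x * 0 with x = inf", f64::INFINITY, f64::INFINITY * zero),
        ("x / 0 with x = 0", zero, zero / zero),
        ("c / x with c = 0, x = 0", zero, zero / zero),
    ]
}
```

Pushing any of these down as `<col> IS NAN` would drop rows that the original `isnan(...)` filter matches, which an Inexact pushdown must never do.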

Are these changes tested?

Yes. Updated the existing isnan(qux + 1) "unsupported" test and added unit tests covering negation, additive and multiplicative forms, nested expressions, combination with other predicates, and the rejected/deferred cases. All unit tests in expr_to_predicate pass, along with clippy and rustfmt.

…ions Previously `isnan(...)` was only pushed down to Iceberg when its argument was a bare column; complex numeric arguments such as `isnan(qux + 1)` were silently dropped. This resolves the argument through NaN-preserving transformations - negation, `abs`, numeric casts, and arithmetic with finite literals - down to a single column reference, so `isnan(<expr>)` is pushed down as `<col> IS NAN`. Filter pushdown is reported as Inexact, so DataFusion re-applies the original predicate after scanning. Every supported transformation keeps NaN-ness exactly equivalent (result is NaN iff the column is NaN), so both `isnan(...)` and `NOT isnan(...)` stay correct. Unsound cases (`x * 0`, `x / 0`, `c / x`, multi-column expressions) are explicitly rejected and covered by tests. Closes apache#2154

Replace the terse `(nonzero, literal_allowed_left)` tuple with explicit per-operator handling, and simplify the literal helpers to return booleans (`is_finite_literal` / `is_finite_nonzero_literal`). No behavior change.

Narrow this PR's scope to references, negation, numeric casts and arithmetic with finite literals. Resolving scalar-function arguments such as `abs(x)` is left for a separate PR. `isnan(abs(x))` is no longer pushed down (covered by the unsupported-args test).

Re-add `abs(x)` as a NaN-preserving argument to `isnan`, and merge the literal f64 extraction into a single `literal_as_f64` helper. No other behavior change.

…tion Replace the hand-rolled numeric type match with DataFusion's `ScalarValue::cast_to(Float64)`. Less code and covers all numeric literal types (e.g. decimals) without enumerating them. No behavior change for the supported cases.

Use the existing `Reference::is_nan()` builder instead of constructing the `Predicate::Unary(UnaryExpression::new(...))` by hand.

…iteral Fold the finiteness check into `finite_literal` and inline the non-zero check at the multiply/divide call sites, removing the two thin boolean wrappers.
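
A rough sketch of the shape this commit describes: one helper that converts a literal to `f64` and returns it only when finite, with the non-zero check written at the multiply/divide call site. The `Literal` enum and the function names are hypothetical; the real code operates on DataFusion's `ScalarValue`, and treating a null literal as "not finite" is an assumption of this sketch.

```rust
/// Hypothetical stand-in for a literal value (the real code uses `ScalarValue`).
#[derive(Debug, Clone, Copy)]
enum Literal {
    Int64(i64),
    Float64(f64),
    Null,
}

/// Converts the literal to `f64` and returns it only if it is finite.
fn finite_literal(lit: Literal) -> Option<f64> {
    let value = match lit {
        Literal::Int64(v) => v as f64,
        Literal::Float64(v) => v,
        // Assumption: a null literal never qualifies.
        Literal::Null => return None,
    };
    value.is_finite().then_some(value)
}

/// Illustrates the multiply/divide call site: the non-zero check is inlined
/// next to the `finite_literal` call rather than hidden in another helper.
fn accepts_multiplier(lit: Literal) -> bool {
    finite_literal(lit).is_some_and(|c| c != 0.0)
}
```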

CTTY

Thanks for the contribution! I have some questions regarding this change

CTTY · 2026-06-12T00:00:10Z

+/// Resolves the column reference from an arithmetic expression that combines a
+/// single column with a finite literal while preserving NaN-ness. See
+/// [`resolve_nan_preserving_reference`] for the soundness argument.
+fn resolve_nan_preserving_binary(binary: &BinaryExpr) -> Option<Reference> {


Do we want to support column references on both sides? e.g. (x + 1) * (x - 2)

I think the example above will not be pushed down with the existing logic, since we assume there is a finite literal on at least one side?

Yes, (x + 1) * (x - 2) won't be pushed down in this PR.

It requires more work to support it safely. To support that we need to:

Recurse into both operands and confirm they reduce to the same single column — otherwise e.g. x + y reduces to two different columns and can't be expressed as a single col IS NAN (it would need x IS NAN OR y IS NAN, and picking just one would drop rows).

Verify the operator combination is NaN-preserving (result is NaN iff that column is NaN) — e.g. (x + 1) * (x - 2) is NaN iff x is NaN, so it's safe, but (x + 1) - (x - 2) is NaN when x = inf (inf - inf) even though x isn't, so it is not.

Maybe we could do it in a follow-up, but IIUC Spark won't push it down either, so in practice this case might be fairly uncommon.
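
The two examples in this reply can be checked numerically. The snippet below is a hypothetical illustration over a few sample values: it collects inputs for which the expression is NaN although `x` is not.

```rust
/// Sample inputs, including the infinities that break the subtraction form.
const SAMPLES: [f64; 7] = [
    f64::NAN,
    f64::INFINITY,
    f64::NEG_INFINITY,
    -1.0,
    0.0,
    2.0,
    f64::MAX,
];

/// `(x + 1) * (x - 2)`: NaN only when `x` is NaN on these samples.
fn product_form(x: f64) -> f64 {
    (x + 1.0) * (x - 2.0)
}

/// `(x + 1) - (x - 2)`: `inf - inf` is NaN even though `inf` is not.
fn difference_form(x: f64) -> f64 {
    (x + 1.0) - (x - 2.0)
}

/// Inputs where `f(x)` is NaN although `x` itself is not NaN.
fn spurious_nans(f: impl Fn(f64) -> f64) -> Vec<f64> {
    SAMPLES
        .iter()
        .copied()
        .filter(|&x| !x.is_nan() && f(x).is_nan())
        .collect()
}
```

On these samples `spurious_nans(product_form)` is empty, while `spurious_nans(difference_form)` contains both infinities, so a sound two-sided resolver would have to reason about the operator combination and not just the column identity.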

I agree that we don't have to support it in this PR, can we add a note here in the comment?

Done -- added the note with a TODO referencing #2154 for the two-sided column case.

CTTY · 2026-06-12T00:04:08Z

+
+        // `x / c` is NaN iff `x` is NaN, for a finite non-zero literal `c`.
+        // `c / x` is rejected because it is not NaN-preserving (e.g. `0 / 0` is
+        // NaN while `0` is not), so the column must be the dividend (left side).


I found these comments really hard to follow, e.g. "`0 / 0` is NaN while `0` is not".

Reworded in a7e81bc to spell out the IEEE-754 facts (0 is not NaN, but 0 / 0 is NaN).

CTTY · 2026-06-12T00:05:54Z

+        }
+
+        // `x * c` and `c * x` are NaN iff `x` is NaN, but only when `c` is
+        // non-zero: `±inf * 0` is NaN even though `±inf` is not. The column may


I think this comment is hard to follow, how about just:
according to IEEE-754
inf is not NaN
inf * 0 is NaN

Done in a7e81bc — reworded to: per IEEE-754, inf is not NaN but inf * 0 is NaN, so multiplying by zero is rejected.

Address review feedback: explain the multiply/divide rejection using explicit IEEE-754 facts (inf is not NaN but inf * 0 is NaN; 0 is not NaN but 0 / 0 is NaN) instead of the previous terse wording.

CTTY

Generally LGTM! Please add a note to clarify the behavior a bit

CTTY · 2026-06-12T21:16:05Z

+/// Resolves the column reference from an arithmetic expression that combines a
+/// single column with a finite literal while preserving NaN-ness. See
+/// [`resolve_nan_preserving_reference`] for the soundness argument.
+fn resolve_nan_preserving_binary(binary: &BinaryExpr) -> Option<Reference> {


I agree that we don't have to support it in this PR, can we add a note here in the comment?

huan233usc added 8 commits June 4, 2026 18:20

refactor(datafusion): clarify NaN-preserving arithmetic resolution

a10973d

docs(datafusion): add TODO for deferred scalar-function arg support

b33d08d

feat(datafusion): support abs argument and fold literal helper

4c92291

refactor(datafusion): build IsNan via Reference::is_nan builder

3675bd1

refactor(datafusion): collapse literal helpers into a single finite_l…

8814873

CTTY reviewed Jun 12, 2026

docs(datafusion): clarify IEEE-754 rationale in arithmetic comments

a7e81bc

huan233usc requested a review from CTTY June 12, 2026 01:32

CTTY reviewed Jun 12, 2026

huan233usc added 4 commits June 12, 2026 14:44

docs(datafusion): note why two-column arithmetic is not resolved for …

517f91b

…isnan pushdown

docs(datafusion): link two-column arithmetic follow-up to issue 2154

674fb44

docs(datafusion): use TODO(issue-2154) form for two-column arithmetic…

366c472

… follow-up

docs(datafusion): match repo TODO style for issue-2154 follow-up note

5743c7a

huan233usc requested a review from CTTY June 12, 2026 22:58

huan233usc wants to merge 13 commits into apache:main from huan233usc:fix/2154-isnan-complex-arg-pushdown
