Skip to content

feat(datafusion): push down isnan over NaN-preserving numeric expressions#2592

Open
huan233usc wants to merge 13 commits into
apache:mainfrom
huan233usc:fix/2154-isnan-complex-arg-pushdown
Open

feat(datafusion): push down isnan over NaN-preserving numeric expressions#2592
huan233usc wants to merge 13 commits into
apache:mainfrom
huan233usc:fix/2154-isnan-complex-arg-pushdown

Conversation

@huan233usc

@huan233usc huan233usc commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Part of #2154.

What changes are included in this PR?

Previously, scalar-function pushdown only handled isnan(...) when the argument was a bare column reference; complex numeric arguments such as isnan(qux + 1) were silently dropped.

This PR adds resolve_nan_preserving_reference, which resolves an isnan argument down to a single column Reference through transformations that preserve NaN-ness, so isnan(<expr>) can be pushed down as <col> IS NAN:

  • negation -x
  • numeric casts (date casts still rejected)
  • x + c, c + x, x - c, c - x for a finite literal c
  • x * c, c * x, x / c for a finite, non-zero literal c
  • arbitrary nesting of the above (e.g. isnan(-(qux + 1) * 3))

Resolving scalar-function arguments (e.g. abs(x)) is intentionally left for a follow-up PR to keep this change small; isnan(abs(x)) is not pushed down here.

Why is this sound?

Filter pushdown is reported as Inexact, so DataFusion re-applies the original predicate after scanning. The pushed-down predicate therefore only needs to be implied by the original filter (it may match extra rows, but must never drop a matching one). Every supported transformation keeps NaN-ness exactly equivalent — the result is NaN iff the wrapped column is NaN — so both isnan(...) and NOT isnan(...) remain correct.

Cases that do not preserve NaN-ness are intentionally rejected (and covered by tests): x * 0 / x / 0 (±inf * 0 is NaN), c / x (0 / 0 is NaN), and multi-column expressions.

Are these changes tested?

Yes. Updated the existing isnan(qux + 1) "unsupported" test and added unit tests covering negation, additive and multiplicative forms, nested expressions, combination with other predicates, and the rejected/deferred cases. All unit tests in expr_to_predicate pass, along with clippy and rustfmt.

…ions

Previously `isnan(...)` was only pushed down to Iceberg when its argument
was a bare column; complex numeric arguments such as `isnan(qux + 1)` were
silently dropped.

This resolves the argument through NaN-preserving transformations - negation,
`abs`, numeric casts, and arithmetic with finite literals - down to a single
column reference, so `isnan(<expr>)` is pushed down as `<col> IS NAN`.

Filter pushdown is reported as Inexact, so DataFusion re-applies the original
predicate after scanning. Every supported transformation keeps NaN-ness exactly
equivalent (result is NaN iff the column is NaN), so both `isnan(...)` and
`NOT isnan(...)` stay correct. Unsound cases (`x * 0`, `x / 0`, `c / x`,
multi-column expressions) are explicitly rejected and covered by tests.

Closes apache#2154
Replace the terse `(nonzero, literal_allowed_left)` tuple with explicit
per-operator handling, and simplify the literal helpers to return booleans
(`is_finite_literal` / `is_finite_nonzero_literal`). No behavior change.
Narrow this PR's scope to references, negation, numeric casts and arithmetic
with finite literals. Resolving scalar-function arguments such as `abs(x)` is
left for a separate PR. `isnan(abs(x))` is no longer pushed down (covered by
the unsupported-args test).
Re-add `abs(x)` as a NaN-preserving argument to `isnan`, and merge the literal
f64 extraction into a single `literal_as_f64` helper. No other behavior change.
…tion

Replace the hand-rolled numeric type match with DataFusion's
`ScalarValue::cast_to(Float64)`. Less code and covers all numeric literal
types (e.g. decimals) without enumerating them. No behavior change for the
supported cases.
Use the existing `Reference::is_nan()` builder instead of constructing the
`Predicate::Unary(UnaryExpression::new(...))` by hand.
…iteral

Fold the finiteness check into `finite_literal` and inline the non-zero check
at the multiply/divide call sites, removing the two thin boolean wrappers.

@CTTY CTTY left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution! I have some questions regarding this change

/// Resolves the column reference from an arithmetic expression that combines a
/// single column with a finite literal while preserving NaN-ness. See
/// [`resolve_nan_preserving_reference`] for the soundness argument.
fn resolve_nan_preserving_binary(binary: &BinaryExpr) -> Option<Reference> {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to support column references on both sides? e.g. (x + 1) * (x - 2)

I think the example above will not be pushed down with the existing logic since we assume it's a finite literals on one side at least?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, (x + 1) * (x - 2) won't be pushed down in this PR.

It requires more work to support it safely. To support that we need to:

  1. Recurse into both operands and confirm they reduce to the same single column — otherwise e.g. x + y reduces to two different columns and can't be expressed as a single col IS NAN (it would need x IS NAN OR y IS NAN, and picking just one would drop rows).
  2. Verify the operator combination is NaN-preserving (result is NaN iff that column is NaN) — e.g. (x + 1) * (x - 2) is NaN iff x is NaN, so it's safe, but (x + 1) - (x - 2) is NaN when x = inf (inf - inf) even though x isn't, so it is not.

Maybe we could do it in the follow up, but iiuc Spark won't push down it as well so in practice this might be a little bit less seen.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we don't have to support it in this PR, can we add a note here in the comment?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done -- added the note with a TODO referencing #2154 for the two-sided column case.


// `x / c` is NaN iff `x` is NaN, for a finite non-zero literal `c`.
// `c / x` is rejected because it is not NaN-preserving (e.g. `0 / 0` is
// NaN while `0` is not), so the column must be the dividend (left side).

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found these comments really hard to follow

e.g. 0 / 0 is
// NaN while 0 is not

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reworded in a7e81bc to spell out the IEEE-754 facts (0 is not NaN, but 0 / 0 is NaN).

}

// `x * c` and `c * x` are NaN iff `x` is NaN, but only when `c` is
// non-zero: `±inf * 0` is NaN even though `±inf` is not. The column may

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment is hard to follow, how about just:
according to IEEE-754
inf is not NaN
inf * 0 is NaN

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in a7e81bc — reworded to: per IEEE-754, inf is not NaN but inf * 0 is NaN, so multiplying by zero is rejected.

Address review feedback: explain the multiply/divide rejection using explicit
IEEE-754 facts (inf is not NaN but inf * 0 is NaN; 0 is not NaN but 0 / 0 is
NaN) instead of the previous terse wording.
@huan233usc huan233usc requested a review from CTTY June 12, 2026 01:32

@CTTY CTTY left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM! Please add a note to clarify the behavior a bit

/// Resolves the column reference from an arithmetic expression that combines a
/// single column with a finite literal while preserving NaN-ness. See
/// [`resolve_nan_preserving_reference`] for the soundness argument.
fn resolve_nan_preserving_binary(binary: &BinaryExpr) -> Option<Reference> {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we don't have to support it in this PR, can we add a note here in the comment?

@huan233usc huan233usc requested a review from CTTY June 12, 2026 22:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants