feat(datafusion): push down isnan over NaN-preserving numeric expressions#2592
feat(datafusion): push down isnan over NaN-preserving numeric expressions#2592huan233usc wants to merge 13 commits into
Conversation
…ions Previously `isnan(...)` was only pushed down to Iceberg when its argument was a bare column; complex numeric arguments such as `isnan(qux + 1)` were silently dropped. This resolves the argument through NaN-preserving transformations - negation, `abs`, numeric casts, and arithmetic with finite literals - down to a single column reference, so `isnan(<expr>)` is pushed down as `<col> IS NAN`. Filter pushdown is reported as Inexact, so DataFusion re-applies the original predicate after scanning. Every supported transformation keeps NaN-ness exactly equivalent (result is NaN iff the column is NaN), so both `isnan(...)` and `NOT isnan(...)` stay correct. Unsound cases (`x * 0`, `x / 0`, `c / x`, multi-column expressions) are explicitly rejected and covered by tests. Closes apache#2154
Replace the terse `(nonzero, literal_allowed_left)` tuple with explicit per-operator handling, and simplify the literal helpers to return booleans (`is_finite_literal` / `is_finite_nonzero_literal`). No behavior change.
Narrow this PR's scope to references, negation, numeric casts and arithmetic with finite literals. Resolving scalar-function arguments such as `abs(x)` is left for a separate PR. `isnan(abs(x))` is no longer pushed down (covered by the unsupported-args test).
Re-add `abs(x)` as a NaN-preserving argument to `isnan`, and merge the literal f64 extraction into a single `literal_as_f64` helper. No other behavior change.
…tion Replace the hand-rolled numeric type match with DataFusion's `ScalarValue::cast_to(Float64)`. Less code and covers all numeric literal types (e.g. decimals) without enumerating them. No behavior change for the supported cases.
Use the existing `Reference::is_nan()` builder instead of constructing the `Predicate::Unary(UnaryExpression::new(...))` by hand.
…iteral Fold the finiteness check into `finite_literal` and inline the non-zero check at the multiply/divide call sites, removing the two thin boolean wrappers.
CTTY
left a comment
There was a problem hiding this comment.
Thanks for the contribution! I have some questions regarding this change
| /// Resolves the column reference from an arithmetic expression that combines a | ||
| /// single column with a finite literal while preserving NaN-ness. See | ||
| /// [`resolve_nan_preserving_reference`] for the soundness argument. | ||
| fn resolve_nan_preserving_binary(binary: &BinaryExpr) -> Option<Reference> { |
There was a problem hiding this comment.
Do we want to support column references on both sides? e.g. (x + 1) * (x - 2)
I think the example above will not be pushed down with the existing logic since we assume it's a finite literals on one side at least?
There was a problem hiding this comment.
Yes, (x + 1) * (x - 2) won't be pushed down in this PR.
It requires more work to support it safely. To support that we need to:
- Recurse into both operands and confirm they reduce to the same single column — otherwise e.g.
x + yreduces to two different columns and can't be expressed as a singlecol IS NAN(it would needx IS NAN OR y IS NAN, and picking just one would drop rows). - Verify the operator combination is NaN-preserving (result is NaN iff that column is NaN) — e.g.
(x + 1) * (x - 2)is NaN iffxis NaN, so it's safe, but(x + 1) - (x - 2)is NaN whenx = inf(inf - inf) even thoughxisn't, so it is not.
Maybe we could do it in the follow up, but iiuc Spark won't push down it as well so in practice this might be a little bit less seen.
There was a problem hiding this comment.
I agree that we don't have to support it in this PR, can we add a note here in the comment?
There was a problem hiding this comment.
Done -- added the note with a TODO referencing #2154 for the two-sided column case.
|
|
||
| // `x / c` is NaN iff `x` is NaN, for a finite non-zero literal `c`. | ||
| // `c / x` is rejected because it is not NaN-preserving (e.g. `0 / 0` is | ||
| // NaN while `0` is not), so the column must be the dividend (left side). |
There was a problem hiding this comment.
I found these comments really hard to follow
e.g.
0 / 0is
// NaN while0is not
There was a problem hiding this comment.
Reworded in a7e81bc to spell out the IEEE-754 facts (0 is not NaN, but 0 / 0 is NaN).
| } | ||
|
|
||
| // `x * c` and `c * x` are NaN iff `x` is NaN, but only when `c` is | ||
| // non-zero: `±inf * 0` is NaN even though `±inf` is not. The column may |
There was a problem hiding this comment.
I think this comment is hard to follow, how about just:
according to IEEE-754
inf is not NaN
inf * 0 is NaN
There was a problem hiding this comment.
Done in a7e81bc — reworded to: per IEEE-754, inf is not NaN but inf * 0 is NaN, so multiplying by zero is rejected.
Address review feedback: explain the multiply/divide rejection using explicit IEEE-754 facts (inf is not NaN but inf * 0 is NaN; 0 is not NaN but 0 / 0 is NaN) instead of the previous terse wording.
CTTY
left a comment
There was a problem hiding this comment.
Generally LGTM! Please add a note to clarify the behavior a bit
| /// Resolves the column reference from an arithmetic expression that combines a | ||
| /// single column with a finite literal while preserving NaN-ness. See | ||
| /// [`resolve_nan_preserving_reference`] for the soundness argument. | ||
| fn resolve_nan_preserving_binary(binary: &BinaryExpr) -> Option<Reference> { |
There was a problem hiding this comment.
I agree that we don't have to support it in this PR, can we add a note here in the comment?
Which issue does this PR close?
Part of #2154.
What changes are included in this PR?
Previously, scalar-function pushdown only handled
isnan(...)when the argument was a bare column reference; complex numeric arguments such asisnan(qux + 1)were silently dropped.This PR adds
resolve_nan_preserving_reference, which resolves anisnanargument down to a single columnReferencethrough transformations that preserve NaN-ness, soisnan(<expr>)can be pushed down as<col> IS NAN:-xx + c,c + x,x - c,c - xfor a finite literalcx * c,c * x,x / cfor a finite, non-zero literalcisnan(-(qux + 1) * 3))Resolving scalar-function arguments (e.g.
abs(x)) is intentionally left for a follow-up PR to keep this change small;isnan(abs(x))is not pushed down here.Why is this sound?
Filter pushdown is reported as
Inexact, so DataFusion re-applies the original predicate after scanning. The pushed-down predicate therefore only needs to be implied by the original filter (it may match extra rows, but must never drop a matching one). Every supported transformation keeps NaN-ness exactly equivalent — the result is NaN iff the wrapped column is NaN — so bothisnan(...)andNOT isnan(...)remain correct.Cases that do not preserve NaN-ness are intentionally rejected (and covered by tests):
x * 0/x / 0(±inf * 0is NaN),c / x(0 / 0is NaN), and multi-column expressions.Are these changes tested?
Yes. Updated the existing
isnan(qux + 1)"unsupported" test and added unit tests covering negation, additive and multiplicative forms, nested expressions, combination with other predicates, and the rejected/deferred cases. All unit tests inexpr_to_predicatepass, along withclippyandrustfmt.