Is your feature request related to a problem or challenge?
DataFusion currently loses expression-level statistics when computing plan metadata:
Projections: any expression that isn't a bare column or literal gets NDV = Absent, even for simple cases like col + 1 where NDV is trivially derivable from the input
Filters: when interval analysis cannot handle a predicate (check_support returns false), selectivity falls back to a hardcoded 20% regardless of available column statistics
Custom UDFs: there is no way for users to provide statistics metadata for their functions, making all UDFs opaque to the optimizer
Without expression-level statistics, the optimizer lacks the information it needs for join ordering, cardinality estimation, and cost-based decisions involving computed columns or UDFs. Projects embedding DataFusion currently have no extension point to provide this information for their own functions.
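To make the cost of the fixed fallback concrete, here is a minimal sketch comparing it with an NDV-based estimate for an equality predicate. The function names and the choice of an equality predicate are assumptions made for illustration, not DataFusion APIs; the 20% constant is the fallback described above, and the `1 / NDV` rule is the classic uniform-distribution heuristic.

```rust
/// Fixed selectivity used when a predicate cannot be analyzed (the 20% described above).
const FALLBACK_SELECTIVITY: f64 = 0.2;

/// Hypothetical helper: selectivity of `col = literal`, assuming values are
/// uniformly distributed over `ndv` distinct values. Falls back to the fixed
/// constant when the NDV is unknown or zero.
fn equality_selectivity(ndv: Option<u64>) -> f64 {
    match ndv {
        Some(n) if n > 0 => 1.0 / n as f64,
        _ => FALLBACK_SELECTIVITY,
    }
}

/// Hypothetical helper: estimated output rows of a filter.
fn estimated_rows(input_rows: u64, selectivity: f64) -> u64 {
    (input_rows as f64 * selectivity).round() as u64
}
```

For a 1,000,000-row input whose column has 1,000 distinct values, the NDV-based rule estimates about 1,000 output rows, while the fixed fallback estimates 200,000. A gap of that size is enough to change a join-order decision.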
Related: this was previously raised in #992 (closed as non-actionable at the time).
Describe the solution you'd like
A pluggable chain-of-responsibility framework for expression-level statistics, covering:
Selectivity (predicate filtering fraction)
NDV (number of distinct values)
Min/max bounds
Null fraction
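As a rough illustration of what tracking these four quantities per expression could look like, the sketch below bundles them into one struct and derives the statistics of `col + k` from those of `col`. The `ExprStats` type, its fields, and `add_constant` are hypothetical names invented for this example, and the column is assumed to be an integer column.

```rust
/// Hypothetical bundle of the four statistics listed above; `None` means unknown.
#[derive(Debug, Clone, PartialEq)]
struct ExprStats {
    /// Fraction of rows for which a boolean expression is true.
    selectivity: Option<f64>,
    /// Number of distinct values.
    ndv: Option<u64>,
    /// Lower and upper bounds (integers only in this sketch).
    min: Option<i64>,
    max: Option<i64>,
    /// Fraction of rows that are NULL.
    null_fraction: Option<f64>,
}

/// Derive statistics for `col + k` from the statistics of `col`.
fn add_constant(input: &ExprStats, k: i64) -> ExprStats {
    ExprStats {
        // `col + k` is not a predicate, so selectivity does not apply.
        selectivity: None,
        // Adding a constant maps distinct values to distinct values.
        ndv: input.ndv,
        // Bounds shift by `k`; give up (unknown) if the shift would overflow.
        min: input.min.and_then(|m| m.checked_add(k)),
        max: input.max.and_then(|m| m.checked_add(k)),
        // NULL + k is NULL, so the null fraction is unchanged.
        null_fraction: input.null_fraction,
    }
}
```

With an input of NDV 50, bounds [0, 99] and a 10% null fraction, `add_constant(&stats, 1)` keeps the NDV and null fraction and moves the bounds to [1, 100].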
The framework should:
Ship with a default Selinger-style analyzer handling columns, literals, binary expressions (AND/OR/NOT/comparisons), and arithmetic
Include built-in analyzers for common function families (string, math, date_part/date_trunc)
Allow users to register custom analyzers via SessionState for UDF-specific or domain-specific estimation (e.g., histogram-based, geometry-aware)
Integrate into physical operators that need expression-level statistics (projections, filters, joins, aggregates, etc.)
Be non-breaking and purely additive
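The requirements above could be met in several ways. The following is a minimal, self-contained sketch of the chain-of-responsibility idea, restricted to selectivity. Everything in it is an assumption for illustration: the toy `Expr` tree stands in for real physical expressions, the trait method signature is invented, `st_within` is a made-up UDF, and its 0.05 selectivity is an arbitrary constant. The default analyzer applies the usual Selinger-style rules under an independence assumption: `1 / NDV` for `col = literal`, `s1 * s2` for AND, `s1 + s2 - s1 * s2` for OR, and `1 - s` for NOT.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified expression tree (not DataFusion's `PhysicalExpr`).
#[derive(Debug, Clone)]
enum Expr {
    Column(String),
    Literal(i64),
    Eq(Box<Expr>, Box<Expr>),
    And(Box<Expr>, Box<Expr>),
    Or(Box<Expr>, Box<Expr>),
    Not(Box<Expr>),
    Udf(String, Vec<Expr>),
}

/// Input statistics for this sketch: NDV per column name.
type InputNdv = HashMap<String, u64>;

/// An analyzer either answers or passes the expression to the next link.
enum Analysis {
    Computed(f64),
    Delegate,
}

trait ExpressionAnalyzer {
    fn selectivity(&self, expr: &Expr, input: &InputNdv, chain: &AnalyzerChain) -> Analysis;
}

/// Ordered chain: analyzers earlier in the list are consulted first.
struct AnalyzerChain {
    analyzers: Vec<Box<dyn ExpressionAnalyzer>>,
}

impl AnalyzerChain {
    /// Returns the first computed estimate, or `None` if every analyzer delegates.
    fn selectivity(&self, expr: &Expr, input: &InputNdv) -> Option<f64> {
        self.analyzers.iter().find_map(|a| match a.selectivity(expr, input, self) {
            Analysis::Computed(s) => Some(s.clamp(0.0, 1.0)),
            Analysis::Delegate => None,
        })
    }
}

/// Selinger-style default: equality on a column, plus AND / OR / NOT.
struct DefaultAnalyzer;

impl ExpressionAnalyzer for DefaultAnalyzer {
    fn selectivity(&self, expr: &Expr, input: &InputNdv, chain: &AnalyzerChain) -> Analysis {
        // Children are resolved through the whole chain, so a custom analyzer
        // can answer for a sub-expression such as a UDF inside an AND.
        let estimate = match expr {
            Expr::Eq(l, r) => match (l.as_ref(), r.as_ref()) {
                (Expr::Column(c), Expr::Literal(_)) | (Expr::Literal(_), Expr::Column(c)) => {
                    input.get(c).filter(|n| **n > 0).map(|n| 1.0 / *n as f64)
                }
                _ => None,
            },
            Expr::And(l, r) => chain
                .selectivity(l, input)
                .zip(chain.selectivity(r, input))
                .map(|(a, b)| a * b),
            Expr::Or(l, r) => chain
                .selectivity(l, input)
                .zip(chain.selectivity(r, input))
                .map(|(a, b)| a + b - a * b),
            Expr::Not(e) => chain.selectivity(e, input).map(|s| 1.0 - s),
            _ => None,
        };
        estimate.map_or(Analysis::Delegate, Analysis::Computed)
    }
}

/// Hypothetical user analyzer for a geometry UDF named `st_within`.
struct GeoAnalyzer;

impl ExpressionAnalyzer for GeoAnalyzer {
    fn selectivity(&self, expr: &Expr, _input: &InputNdv, _chain: &AnalyzerChain) -> Analysis {
        match expr {
            // Made-up constant; a real analyzer might consult spatial histograms.
            Expr::Udf(name, _) if name == "st_within" => Analysis::Computed(0.05),
            _ => Analysis::Delegate,
        }
    }
}

/// User analyzers are placed first so they can override the default.
fn example_chain() -> AnalyzerChain {
    AnalyzerChain { analyzers: vec![Box::new(GeoAnalyzer), Box::new(DefaultAnalyzer)] }
}
```

In this sketch, a predicate such as `a = 7 AND st_within(...)` is estimated by the default analyzer, which asks the chain for each child; the UDF child is answered by `GeoAnalyzer` and the equality by the default rule. When no analyzer can answer, the chain returns `None`, leaving the caller free to apply whatever fallback it already uses.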
Describe alternatives you've considered
Extending PhysicalExpr::evaluate_statistics() (StatisticsV2: initial statistics framework redesign #14699): this provides per-expression statistics but doesn't support chain delegation or user-registered overrides, and would require changes to the PhysicalExpr trait
Hardcoding heuristics in each operator (the status quo): does not scale as more expressions and operators need statistics, and provides no extension point for users
Distribution-based estimation (… Distribution from Precision #14896, StatisticsV2: initial statistics framework redesign #14699): more powerful but significantly more complex to implement and adopt; ExpressionAnalyzer can serve as the foundation, with distribution-based estimation plugged in as a custom analyzer
Planned work
Framework
Built-in analyzers for common functions
Operator integration
Additional context
… Distribution from Precision #14896 (expression statistics tracking)