Pluggable expression-level statistics estimation (ExpressionAnalyzer) #21120

@asolimando

Is your feature request related to a problem or challenge?

DataFusion currently loses expression-level statistics when computing plan metadata:

  • Projections: any expression that isn't a bare column or literal gets NDV = Absent, even for simple cases like col + 1 where NDV is trivially derivable from the input
  • Filters: when interval analysis cannot handle a predicate (check_support returns false), selectivity falls back to a hardcoded 20% regardless of available column statistics
  • Custom UDFs: there is no way for users to provide statistics metadata for their functions, making all UDFs opaque to the optimizer
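To make the first two points concrete, here is a minimal, self-contained sketch (not DataFusion code; every name in it is hypothetical). It shows that `col + 1` keeps the distinct count of `col`, and how much a filter estimate can differ between the 20% fallback and a selectivity derived from column statistics.

```rust
use std::collections::HashSet;

/// Exact distinct count of a slice, used here only for illustration.
fn ndv(values: &[i64]) -> usize {
    values.iter().collect::<HashSet<_>>().len()
}

/// Evaluates `col + 1` for every value. The mapping is injective
/// (ignoring overflow), so it cannot change the distinct count.
fn add_one(values: &[i64]) -> Vec<i64> {
    values.iter().map(|v| v + 1).collect()
}

/// The fallback selectivity mentioned above (20%).
const DEFAULT_SELECTIVITY: f64 = 0.2;

/// Estimated filter output rows: uses the supplied selectivity when one
/// is available, otherwise the hardcoded fallback.
fn estimate_filter_rows(input_rows: u64, selectivity: Option<f64>) -> u64 {
    let s = selectivity.unwrap_or(DEFAULT_SELECTIVITY).clamp(0.0, 1.0);
    (input_rows as f64 * s).round() as u64
}
```

For a 1000-row input, the fallback estimates 200 rows, while an equality predicate on a column with 100 distinct values (selectivity 1/100 under a uniformity assumption) estimates 10.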

Without expression-level statistics, the optimizer lacks the information it needs for join ordering, cardinality estimation, and cost-based decisions involving computed columns or UDFs. Projects embedding DataFusion currently have no extension point to provide this information for their own functions.

Related: this was previously raised in #992 (closed as non-actionable at the time).

Describe the solution you'd like

A pluggable chain-of-responsibility framework for expression-level statistics, covering:

  1. Selectivity (fraction of input rows a predicate retains)
  2. NDV (number of distinct values)
  3. Min/max bounds
  4. Null fraction

The framework should:

  • Ship with a default Selinger-style analyzer handling columns, literals, binary expressions (AND/OR, comparisons, arithmetic), and NOT
  • Include built-in analyzers for common function families (string, math, date_part/date_trunc)
  • Allow users to register custom analyzers via SessionState for UDF-specific or domain-specific estimation (e.g., histogram-based, geometry-aware)
  • Integrate into physical operators that need expression-level statistics (projections, filters, joins, aggregates, etc.)
  • Be non-breaking and purely additive
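A rough sketch of what such a chain could look like, reduced to NDV only. This is an illustration under assumptions, not the proposed API: `Expr`, `AnalysisResult`, `AnalyzerRegistry`, the method name `distinct_count`, and the UDF `bucket_of_10` are all made up for this example, and the input is simplified to a single NDV value.

```rust
/// Hypothetical, simplified expression tree (a stand-in for real expressions).
#[derive(Debug, Clone)]
enum Expr {
    Column(String),
    Literal(i64),
    Function { name: String, args: Vec<Expr> },
}

/// Outcome of asking one analyzer for an estimate.
enum AnalysisResult<T> {
    /// The analyzer produced an estimate.
    Computed(T),
    /// The analyzer does not handle this expression; try the next one.
    Delegate,
}

/// One link in the chain. Only NDV is shown; selectivity, min/max, and
/// null fraction would follow the same pattern.
trait ExpressionAnalyzer {
    fn distinct_count(&self, expr: &Expr, input_ndv: u64) -> AnalysisResult<u64>;
}

/// Built-in fallback: knows columns and literals only.
struct DefaultAnalyzer;

impl ExpressionAnalyzer for DefaultAnalyzer {
    fn distinct_count(&self, expr: &Expr, input_ndv: u64) -> AnalysisResult<u64> {
        match expr {
            Expr::Column(_) => AnalysisResult::Computed(input_ndv),
            Expr::Literal(_) => AnalysisResult::Computed(1),
            Expr::Function { .. } => AnalysisResult::Delegate,
        }
    }
}

/// User-supplied analyzer for a made-up UDF that maps values into 10 buckets.
struct BucketUdfAnalyzer;

impl ExpressionAnalyzer for BucketUdfAnalyzer {
    fn distinct_count(&self, expr: &Expr, input_ndv: u64) -> AnalysisResult<u64> {
        match expr {
            Expr::Function { name, .. } if name.as_str() == "bucket_of_10" => {
                AnalysisResult::Computed(input_ndv.min(10))
            }
            _ => AnalysisResult::Delegate,
        }
    }
}

/// Ordered chain: the first analyzer that computes a value wins.
struct AnalyzerRegistry {
    analyzers: Vec<Box<dyn ExpressionAnalyzer>>,
}

impl AnalyzerRegistry {
    /// Returns `None` when every analyzer delegates (statistic stays unknown).
    fn distinct_count(&self, expr: &Expr, input_ndv: u64) -> Option<u64> {
        self.analyzers.iter().find_map(|a| match a.distinct_count(expr, input_ndv) {
            AnalysisResult::Computed(v) => Some(v),
            AnalysisResult::Delegate => None,
        })
    }
}
```

Putting user analyzers ahead of the default one lets them override built-in behavior, and an expression nobody recognizes simply stays unknown, which is what keeps the scheme additive.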

Describe alternatives you've considered

Planned work

Framework

  • ExpressionAnalyzer trait, chain-of-responsibility registry, SessionState integration
  • Default analyzer with Selinger-style heuristics (columns, literals, binary expressions, NOT)
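For reference, the textbook Selinger-style selectivity rules can be sketched as below. This is an assumption about what "Selinger-style" would mean here, not the planned implementation: the `Predicate` type is hypothetical, and the constants 1/10 and 1/3 are the classic System R defaults for equality without statistics and for range comparisons.

```rust
/// Hypothetical predicate shape, just enough to show the rules.
enum Predicate {
    /// `col = literal`, with the column's NDV if known.
    Eq { ndv: Option<u64> },
    /// `col < literal` and similar, with no usable bounds.
    Range,
    And(Box<Predicate>, Box<Predicate>),
    Or(Box<Predicate>, Box<Predicate>),
    Not(Box<Predicate>),
}

/// Selectivity in [0, 1], assuming uniform values and independent predicates.
fn selectivity(p: &Predicate) -> f64 {
    match p {
        // Uniformity: each distinct value matches 1/NDV of the rows.
        Predicate::Eq { ndv: Some(n) } if *n > 0 => 1.0 / *n as f64,
        // No statistics: classic default for equality.
        Predicate::Eq { .. } => 0.1,
        // Classic default for open range comparisons.
        Predicate::Range => 1.0 / 3.0,
        // Independence: P(A and B) = P(A) * P(B).
        Predicate::And(a, b) => selectivity(a) * selectivity(b),
        // Inclusion-exclusion under independence.
        Predicate::Or(a, b) => {
            let (x, y) = (selectivity(a), selectivity(b));
            x + y - x * y
        }
        // Complement.
        Predicate::Not(a) => 1.0 - selectivity(a),
    }
}
```

With NDV 4 and NDV 2 on two columns, `a = x AND b = y` comes out at 0.125 and `a = x OR b = y` at 0.625.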

Built-in analyzers for common functions

  • String functions (UPPER, LOWER, TRIM, SUBSTRING, REPLACE, ...)
  • Math functions (FLOOR, CEIL, ROUND, ABS, EXP, LN, ...)
  • Date/time functions (date_part, date_trunc)
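As an illustration of the kind of reasoning these analyzers could encode (a sketch with hypothetical helper names, not the planned code): a deterministic function never produces more distinct values than its input, some functions have a small fixed output domain, and monotonically non-decreasing functions such as FLOOR or CEIL carry min/max bounds through directly.

```rust
/// NDV upper bound for `date_part(part, col)`: the output domain of the
/// part caps the distinct count. Returns `None` for parts not covered here.
fn date_part_ndv(part: &str, input_ndv: u64) -> Option<u64> {
    let cap: u64 = match part.to_ascii_lowercase().as_str() {
        "quarter" => 4,
        "dow" => 7,
        "month" => 12,
        "hour" => 24,
        "minute" => 60,
        _ => return None,
    };
    // Never more distinct outputs than distinct inputs.
    Some(input_ndv.min(cap))
}

/// Bounds for a monotonically non-decreasing function `f`: the input
/// bounds map directly to the output bounds.
fn monotone_bounds(f: impl Fn(f64) -> f64, min: f64, max: f64) -> (f64, f64) {
    (f(min), f(max))
}
```

Non-monotonic functions such as ABS need more care: for input bounds [-3, 2] the output bounds are [0, 3], not [abs(-3), abs(2)].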

Operator integration

  • Projection: propagate statistics through projected expressions
  • Filter: use analyzer selectivity when interval analysis is not applicable
  • Joins: expression-aware cardinality estimation for join key expressions
  • Aggregates: NDV-based output row estimation for GROUP BY expressions
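To show how expression-level NDV would feed the last two operators, here is a sketch using the standard textbook formulas. These are assumptions for illustration, not necessarily what the integration will implement, and the function names are hypothetical.

```rust
/// Inner equi-join estimate: |L| * |R| / max(ndv(L.key), ndv(R.key)),
/// where the NDVs are those of the (possibly computed) join key expressions.
fn estimate_inner_join_rows(left_rows: u64, right_rows: u64, left_ndv: u64, right_ndv: u64) -> u64 {
    // Guard against division by zero when an NDV is reported as 0.
    let denom = left_ndv.max(right_ndv).max(1) as u128;
    // Use u128 for the product so large inputs do not overflow.
    let rows = (left_rows as u128 * right_rows as u128) / denom;
    rows.min(u64::MAX as u128) as u64
}

/// GROUP BY estimate: the product of the key NDVs, capped by the input
/// row count (there cannot be more groups than rows).
fn estimate_group_by_rows(input_rows: u64, key_ndvs: &[u64]) -> u64 {
    let product = key_ndvs.iter().fold(1u64, |acc, &n| acc.saturating_mul(n));
    product.min(input_rows)
}
```

Both formulas only work when the key expression has an NDV at all, which is why losing NDV in projections (for example on `col + 1`) hurts estimates further up the plan.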

Additional context
