feat: Add pluggable StatisticsRegistry for operator-level statistics propagation #21483

Open
asolimando wants to merge 8 commits into apache:main from asolimando:asolimando/statistics-planner-prototype

Conversation

@asolimando
Member

Which issue does this PR close?

Rationale for this change

DataFusion's built-in statistics propagation has no extension point: downstream projects cannot inject external catalog stats, override built-in estimation, or plug in custom strategies without forking.

This PR introduces StatisticsRegistry, a pluggable chain-of-responsibility for operator-level statistics following the same pattern as RelationPlanner for SQL parsing and ExpressionAnalyzer (#21120) for expression-level stats. See #21443 for full motivation and design context.

What changes are included in this PR?

  1. Framework (operator_statistics/mod.rs): StatisticsProvider trait, StatisticsRegistry (chain-of-responsibility), ExtendedStatistics (Statistics + type-erased extension map), DefaultStatisticsProvider. PhysicalOptimizerContext trait with optimize_with_context dispatch. SessionState integration.

  2. Built-in providers for Filter, Projection, Passthrough (sort/repartition/etc), Aggregate, Join (hash/sort-merge/nested-loop/cross), Limit, and Union. NDV utilities: num_distinct_vals, ndv_after_selectivity.

  3. ClosureStatisticsProvider: closure-based provider for test injection and cardinality feedback.

  4. JoinSelection integration: use_statistics_registry config flag (default false), registry-aware optimize_with_context, SLT test demonstrating plan difference on skewed data.
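For readers unfamiliar with the pattern, the chain-of-responsibility dispatch described in item 1 can be sketched roughly as follows. This is a minimal, self-contained illustration, not the code in this PR: `PlanNode`, `OverrideProvider`, the two `StatisticsResult` variants, and the row-count-only return value are hypothetical simplifications, and the real trait works on physical plan nodes and `ExtendedStatistics`.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a physical plan node (not DataFusion's ExecutionPlan).
pub struct PlanNode {
    pub name: &'static str,
    /// Row count the operator would report on its own, if any.
    pub rows: Option<usize>,
}

/// Assumed shape of a provider's answer: either it handled the node,
/// or it passes the request to the next provider in the chain.
pub enum StatisticsResult {
    Computed(Option<usize>),
    Delegate,
}

pub trait StatisticsProvider {
    fn compute(&self, node: &PlanNode) -> StatisticsResult;
}

/// Example custom provider: injects external row counts by operator name.
pub struct OverrideProvider {
    pub overrides: HashMap<&'static str, usize>,
}

impl StatisticsProvider for OverrideProvider {
    fn compute(&self, node: &PlanNode) -> StatisticsResult {
        match self.overrides.get(node.name) {
            Some(rows) => StatisticsResult::Computed(Some(*rows)),
            None => StatisticsResult::Delegate,
        }
    }
}

/// Fallback provider: always answers with the operator's own estimate.
pub struct DefaultProvider;

impl StatisticsProvider for DefaultProvider {
    fn compute(&self, node: &PlanNode) -> StatisticsResult {
        StatisticsResult::Computed(node.rows)
    }
}

/// Providers are tried in order; the first one that does not delegate wins.
pub struct StatisticsRegistry {
    providers: Vec<Box<dyn StatisticsProvider>>,
}

impl StatisticsRegistry {
    pub fn new(providers: Vec<Box<dyn StatisticsProvider>>) -> Self {
        Self { providers }
    }

    pub fn compute(&self, node: &PlanNode) -> Option<usize> {
        for provider in &self.providers {
            if let StatisticsResult::Computed(rows) = provider.compute(node) {
                return rows;
            }
        }
        None // every provider delegated
    }
}
```

Registering `OverrideProvider` ahead of `DefaultProvider` means an external estimate takes priority for the operators it knows about, while all other nodes fall through to the default.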

Are these changes tested?

  • 39 unit tests covering all providers, NDV utilities, chain priority, and edge cases (Inexact precision, Absent propagation, Partial aggregate delegation, GROUPING SETS delegation, join-type bounds, multi-key NDV, exact Cartesian product, CrossJoin, GlobalLimit skip+fetch)
  • 1 SLT test (statistics_registry.slt): three-table join on skewed data (8:1:1 customer_id distribution) where the built-in NDV formula estimates 33 rows (wrong; actual=66) and the registry conservatively estimates 100, producing the correct build-side swap
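The two estimates in the SLT test can be reproduced with a small sketch. It assumes the textbook equi-join formula `left_rows * right_rows / max(ndv_left, ndv_right)` with a Cartesian fallback when no key NDV is available; the function name and signature are hypothetical and only meant to show the arithmetic, not the providers' actual code.

```rust
/// Estimate equi-join output rows from input row counts and optional key NDVs.
/// Hypothetical helper: falls back to the Cartesian product when NDV is unknown.
pub fn equi_join_rows(left_rows: u64, right_rows: u64, key_ndv: Option<(u64, u64)>) -> u64 {
    let cartesian = left_rows.saturating_mul(right_rows);
    match key_ndv {
        // Divide by the larger NDV of the two join keys (integer division).
        Some((left_ndv, right_ndv)) if left_ndv.max(right_ndv) > 0 => {
            cartesian / left_ndv.max(right_ndv)
        }
        // No usable NDV: conservative upper bound.
        _ => cartesian,
    }
}
```

With 10 rows on each side and a key NDV of 3, this gives 33; without NDV it gives 100, the conservative figure that triggers the build-side swap in the test.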

Are there any user-facing changes?

New public API (purely additive, non-breaking):

  • StatisticsProvider trait and StatisticsRegistry in datafusion-physical-plan
  • ExtendedStatistics, StatisticsResult types; built-in provider structs; num_distinct_vals, ndv_after_selectivity utilities
  • PhysicalOptimizerContext trait and ConfigOnlyContext in datafusion-physical-optimizer
  • SessionState::statistics_registry(), SessionStateBuilder::with_statistics_registry()
  • Config: datafusion.optimizer.use_statistics_registry (default false)

Default behavior is unchanged. The registry is only consulted when the flag is explicitly enabled.

Known limitations:


Disclaimer: I used AI to assist with code generation. I have manually reviewed the output, and it matches my intention and understanding.

Introduces a chain-of-responsibility architecture for statistics
computation on physical plan nodes:

- StatisticsProvider trait: chain element computing stats for operators
- StatisticsRegistry: chains providers, with fast-path for empty config
- ExtendedStatistics: Statistics with type-erased extension map
- DefaultStatisticsProvider: delegates to each operator's partition_statistics()
- PhysicalOptimizerContext trait with optimize_with_context() dispatch
- ConfigOnlyContext for backward-compatible rule invocation
- SessionState integration with statistics_registry field and builder

Adds a set of default StatisticsProvider implementations that cover the
most common physical operators:

- FilterStatisticsProvider: selectivity-based row count, reuses the same
  pre-enhanced child statistics, with post-filter NDV adjustment
- ProjectionStatisticsProvider: column mapping through projections
- PassthroughStatisticsProvider: passthrough for cardinality-preserving operators
  (Sort, Repartition, Window, etc.) via CardinalityEffect
- AggregateStatisticsProvider: NDV-product estimation for GROUP BY,
  delegates for Partial mode and multiple grouping sets (apache#20926)
- JoinStatisticsProvider: NDV-based join output estimation (hash, sort-merge,
  nested-loop, cross) with join-type-aware cardinality bounds and correct
  key-column NDV lookup
- LimitStatisticsProvider: caps output at the fetch limit (local and global)
- UnionStatisticsProvider: sums input row counts
- DefaultStatisticsProvider: fallback to partition_statistics(None)
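As a rough illustration of the two simplest providers in this list, the limit and union rules might look like the sketch below. `RowCount` is a simplified stand-in modeled on the Exact/Inexact/Absent precision levels mentioned in the test list; the exact rules here (for example, treating the fetch value as an inexact bound when the input is unknown) are assumptions, not a description of the PR's implementation.

```rust
/// Simplified stand-in for a row-count estimate with a precision level.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RowCount {
    Exact(usize),
    Inexact(usize),
    Absent,
}

/// Limit: drop `skip` rows, then cap the remainder at `fetch` (if any).
pub fn limit_rows(input: RowCount, skip: usize, fetch: Option<usize>) -> RowCount {
    let cap = |n: usize| {
        let remaining = n.saturating_sub(skip);
        fetch.map_or(remaining, |f| remaining.min(f))
    };
    match input {
        RowCount::Exact(n) => RowCount::Exact(cap(n)),
        RowCount::Inexact(n) => RowCount::Inexact(cap(n)),
        // Unknown input: the fetch value is still an upper bound (assumption).
        RowCount::Absent => fetch.map_or(RowCount::Absent, RowCount::Inexact),
    }
}

/// Union: sum the inputs; the result is exact only if every input is exact,
/// and unknown if any input is unknown.
pub fn union_rows(inputs: &[RowCount]) -> RowCount {
    let mut total = 0usize;
    let mut exact = true;
    for input in inputs {
        match *input {
            RowCount::Exact(n) => total = total.saturating_add(n),
            RowCount::Inexact(n) => {
                total = total.saturating_add(n);
                exact = false;
            }
            RowCount::Absent => return RowCount::Absent,
        }
    }
    if exact {
        RowCount::Exact(total)
    } else {
        RowCount::Inexact(total)
    }
}
```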

@github-actions bot added the documentation, optimizer, core, sqllogictest, common, and physical-plan labels on Apr 8, 2026
@asolimando force-pushed the asolimando/statistics-planner-prototype branch from 74753b3 to 0e25387 on Apr 8, 2026 19:51
@asolimando force-pushed the asolimando/statistics-planner-prototype branch from 0e25387 to 8132e06 on Apr 8, 2026 20:08

Adds a pluggable statistics path for JoinSelection that uses the
StatisticsRegistry instead of each operator's built-in partition_statistics.

- Add optimizer.use_statistics_registry config flag (default=false)
- Override optimize_with_context in JoinSelection to pass the registry
  to should_swap_join_order when the flag is enabled; if no registry is
  set on SessionState the built-in default is constructed lazily
- Add statistics_registry.slt demonstrating how the registry produces
  more conservative join estimates for skewed data (10*10=100 cartesian
  fallback vs 10*10/3=33 range-NDV estimate), triggering the correct
  build-side swap that the built-in estimator misses

@asolimando force-pushed the asolimando/statistics-planner-prototype branch from 8132e06 to 5cadfd3 on Apr 8, 2026 20:16
Contributor

@kosiew left a comment

@asolimando
Visiting this from #21122.
I found one issue that looks blocking before this can land. I also left a couple of non-blocking suggestions that seem worth considering while this is fresh.

} else if let Some(smj) = plan.downcast_ref::<SortMergeJoinExec>() {
    let est = equi_join_estimate(smj.on(), left, right, left_rows, right_rows);
    (est, false, smj.join_type())
} else if let Some(nl_join) = plan.downcast_ref::<NestedLoopJoinExec>() {

Thanks for adding the NestedLoopJoinExec handling here. I think this branch is a bit too aggressive though.

Right now it treats every NL join as a Cartesian product with left_rows * right_rows, but NestedLoopJoinExec::partition_statistics intentionally returns unknown row counts for non-equi or filter-only joins because an arbitrary JoinFilter can be very selective.

With datafusion.optimizer.use_statistics_registry = true, a filtered NLJ would suddenly look much larger than before, and that can push join reordering or plan selection in the wrong direction.

Could we delegate when nl_join.filter().is_some() instead, or otherwise add a real selectivity model plus regression coverage? As written, this seems like it will overestimate filtered NLJs in a way that changes optimizer behavior.
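A sketch of the delegation option, to make the suggestion concrete. The types are hypothetical stand-ins (`Estimate`, a plain `has_filter` flag in place of `nl_join.filter().is_some()`); the point is only that a filtered nested-loop join declines to estimate rather than reporting the Cartesian product.

```rust
/// Hypothetical provider outcome: a row estimate, or "let the next provider decide".
#[derive(Debug, PartialEq, Eq)]
pub enum Estimate {
    Rows(u64),
    Delegate,
}

/// Nested-loop join estimate that only claims the Cartesian product when
/// there is no join filter; otherwise it delegates, since an arbitrary
/// filter can be very selective and no selectivity model is available.
pub fn nested_loop_join_rows(left_rows: u64, right_rows: u64, has_filter: bool) -> Estimate {
    if has_filter {
        Estimate::Delegate
    } else {
        Estimate::Rows(left_rows.saturating_mul(right_rows))
    }
}
```

With this shape, enabling the registry would leave filtered NLJ estimates as they are today, and only unfiltered nested-loop joins would get the Cartesian estimate.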

}
}

/// Get statistics for a plan node, using the registry if available.

One thing that stood out to me here is that get_stats() appears to call reg.compute(plan) from scratch for each lookup, while JoinSelection asks for stats multiple times on overlapping subtrees during transform_up.

That makes the registry-enabled path look potentially quadratic on larger join trees. It might be worth adding a small per-pass cache or precomputed stats map here so the feature does not add avoidable optimizer overhead.
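Something along these lines is what I have in mind for the per-pass cache. It is a toy model under stated assumptions: `Node`, its `id` field, and the product-of-children estimate are hypothetical, and a real version would need a stable key for plan nodes. The `computations` counter is only there to show that each node is estimated once per pass.

```rust
use std::collections::HashMap;

/// Hypothetical plan node with a stable id usable as a cache key.
pub struct Node {
    pub id: usize,
    pub rows: u64,
    pub children: Vec<Node>,
}

/// Per-pass memo of computed row estimates, keyed by node id.
pub struct StatsCache {
    cache: HashMap<usize, u64>,
    /// Number of times an estimate was actually computed (cache misses).
    pub computations: usize,
}

impl StatsCache {
    pub fn new() -> Self {
        Self { cache: HashMap::new(), computations: 0 }
    }

    /// Return the cached estimate, or compute it once from the children.
    pub fn get_stats(&mut self, node: &Node) -> u64 {
        if let Some(&rows) = self.cache.get(&node.id) {
            return rows;
        }
        self.computations += 1;
        let rows = if node.children.is_empty() {
            node.rows
        } else {
            // Toy estimate: product of the children's estimates.
            let mut product = 1u64;
            for child in &node.children {
                product = product.saturating_mul(self.get_stats(child));
            }
            product
        };
        self.cache.insert(node.id, rows);
        rows
    }
}
```

In a bottom-up pass that asks for stats at every join, the subtrees below each join are already cached, so the total work stays linear in the number of nodes instead of growing with the number of overlapping lookups.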

// are lost here because partition_statistics(None) re-fetches raw child
// stats internally. Once #20184 lands, pass enhanced child_stats so the
// operator's built-in column mapping uses them instead.
let mut base = Arc::unwrap_or_clone(plan.partition_statistics(None)?);

Would it make sense to move this partition_statistics(None) plus rescale_byte_size(...) sequence into a helper?

The same pattern now shows up in multiple providers, and centralizing it would make the base-stats preservation logic easier to audit and less likely to drift as more providers get added.
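Roughly the shape I am picturing, as a sketch only. `Stats` is a simplified stand-in, the helper name `base_stats_with_rows` is hypothetical, and I am assuming `rescale_byte_size` scales the byte size proportionally to the new row count, which may not match what the PR's function actually does.

```rust
/// Simplified stand-in for operator statistics.
#[derive(Debug, Clone, PartialEq)]
pub struct Stats {
    pub num_rows: Option<u64>,
    pub total_byte_size: Option<u64>,
}

/// Assumed behavior: scale the byte size in proportion to the new row count.
/// Returns None when the base row count or byte size is unknown or zero rows.
pub fn rescale_byte_size(base: &Stats, new_rows: u64) -> Option<u64> {
    match (base.num_rows, base.total_byte_size) {
        (Some(rows), Some(bytes)) if rows > 0 => {
            let scaled = bytes as u128 * new_rows as u128 / rows as u128;
            Some(scaled.min(u64::MAX as u128) as u64)
        }
        _ => None,
    }
}

/// Hypothetical shared helper: fetch the operator's base stats once, then
/// overwrite the row count and rescale the byte size in a single place.
pub fn base_stats_with_rows(fetch_base: impl FnOnce() -> Stats, new_rows: u64) -> Stats {
    let base = fetch_base(); // stands in for plan.partition_statistics(None)
    let total_byte_size = rescale_byte_size(&base, new_rows);
    Stats { num_rows: Some(new_rows), total_byte_size }
}
```

Each provider would then call the one helper with its own row estimate, so any later change to how base stats are preserved happens in a single function.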

Labels

common, core, documentation, optimizer, physical-plan, sqllogictest

2 participants