feat: Add pluggable StatisticsRegistry for operator-level statistics propagation by asolimando · Pull Request #21483 · apache/datafusion

asolimando · 2026-04-08T19:32:42Z

Which issue does this PR close?

Part of #21443 (Pluggable operator-level statistics propagation (StatisticsRegistry))
Part of #8227 (Epic: Statistics improvements)

Rationale for this change

DataFusion's built-in statistics propagation has no extension point: downstream projects cannot inject external catalog stats, override built-in estimation, or plug in custom strategies without forking.

This PR introduces StatisticsRegistry, a pluggable chain-of-responsibility for operator-level statistics following the same pattern as RelationPlanner for SQL parsing and ExpressionAnalyzer (#21120) for expression-level stats. See #21443 for full motivation and design context.
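
For readers unfamiliar with the pattern, here is a minimal standalone sketch of a chain-of-responsibility registry. The names `Outcome`, `Provider`, `CatalogOverride`, `Fallback`, and `Registry` are illustrative assumptions and plan nodes are reduced to strings; the actual StatisticsProvider and StatisticsRegistry signatures in this PR may differ.

```rust
// Illustrative only: simplified stand-ins, not the PR's actual types.

/// Result of asking one provider for a row-count estimate.
enum Outcome {
    /// The provider produced an estimate; the chain stops here.
    Computed(u64),
    /// The provider declines; the next provider in the chain is asked.
    Delegate,
}

/// One link in the chain. `op` stands in for a physical plan node.
trait Provider {
    fn estimate(&self, op: &str, child_rows: &[u64]) -> Outcome;
}

/// Example override: injects an external row count for one table scan.
struct CatalogOverride;

impl Provider for CatalogOverride {
    fn estimate(&self, op: &str, _child_rows: &[u64]) -> Outcome {
        if op == "scan:orders" {
            Outcome::Computed(1_000)
        } else {
            Outcome::Delegate
        }
    }
}

/// Fallback that always answers (a stand-in for the operator's built-in
/// estimate); here it simply sums the child row counts.
struct Fallback;

impl Provider for Fallback {
    fn estimate(&self, _op: &str, child_rows: &[u64]) -> Outcome {
        Outcome::Computed(child_rows.iter().sum())
    }
}

/// Providers are consulted in order; the first `Computed` wins.
struct Registry {
    providers: Vec<Box<dyn Provider>>,
}

impl Registry {
    fn compute(&self, op: &str, child_rows: &[u64]) -> Option<u64> {
        self.providers
            .iter()
            .find_map(|p| match p.estimate(op, child_rows) {
                Outcome::Computed(rows) => Some(rows),
                Outcome::Delegate => None,
            })
    }
}

/// Builds a two-link chain: the override first, then the fallback.
fn example_registry() -> Registry {
    Registry {
        providers: vec![Box::new(CatalogOverride), Box::new(Fallback)],
    }
}
```

In this sketch, `example_registry().compute("scan:orders", &[])` is answered by the override, while any other operator falls through to the fallback.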

What changes are included in this PR?

Framework (operator_statistics/mod.rs): StatisticsProvider trait, StatisticsRegistry (chain-of-responsibility), ExtendedStatistics (Statistics + type-erased extension map), DefaultStatisticsProvider. PhysicalOptimizerContext trait with optimize_with_context dispatch. SessionState integration.
Built-in providers for Filter, Projection, Passthrough (sort/repartition/etc), Aggregate, Join (hash/sort-merge/nested-loop/cross), Limit, and Union. NDV utilities: num_distinct_vals, ndv_after_selectivity.
ClosureStatisticsProvider: closure-based provider for test injection and cardinality feedback.
JoinSelection integration: use_statistics_registry config flag (default false), registry-aware optimize_with_context, SLT test demonstrating plan difference on skewed data.

Are these changes tested?

39 unit tests covering all providers, NDV utilities, chain priority, and edge cases (Inexact precision, Absent propagation, Partial aggregate delegation, GROUPING SETS delegation, join-type bounds, multi-key NDV, exact Cartesian product, CrossJoin, GlobalLimit skip+fetch)
1 SLT test (statistics_registry.slt): three-table join on skewed data (8:1:1 customer_id distribution) where the built-in NDV formula estimates 33 rows (wrong; actual=66) and the registry conservatively estimates 100, producing the correct build-side swap

Are there any user-facing changes?

New public API (purely additive, non-breaking):

StatisticsProvider trait and StatisticsRegistry in datafusion-physical-plan
ExtendedStatistics, StatisticsResult types; built-in provider structs; num_distinct_vals, ndv_after_selectivity utilities
PhysicalOptimizerContext trait and ConfigOnlyContext in datafusion-physical-optimizer
SessionState::statistics_registry(), SessionStateBuilder::with_statistics_registry()
Config: datafusion.optimizer.use_statistics_registry (default false)

Default behavior is unchanged. The registry is only consulted when the flag is explicitly enabled.

Known limitations:

Column-level stats (NDV, min/max) at Join/Aggregate/Union/Limit boundaries are not improved: these operators call partition_statistics(None) internally, re-fetching raw child stats and discarding registry enrichment. 4 TODO comments mark the affected call sites; #20184 (Let partition_statistics accept pre-computed children statistics) would close this gap.
No ExpressionAnalyzer integration yet (#21122, Add ExpressionAnalyzer for pluggable expression-level statistics estimation).

Disclaimer: I used AI to assist in the code generation. I have manually reviewed the output, and it matches my intention and understanding.

Introduces a chain-of-responsibility architecture for statistics computation on physical plan nodes:
- StatisticsProvider trait: chain element computing stats for operators
- StatisticsRegistry: chains providers, with fast-path for empty config
- ExtendedStatistics: Statistics with type-erased extension map
- DefaultStatisticsProvider: delegates to each operator's partition_statistics()
- PhysicalOptimizerContext trait with optimize_with_context() dispatch
- ConfigOnlyContext for backward-compatible rule invocation
- SessionState integration with statistics_registry field and builder
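
A type-erased extension map of the kind mentioned for ExtendedStatistics can be sketched as follows. This is a minimal illustration keyed by `TypeId`; `ExtendedStats`, its fields, and the `Histogram` payload are hypothetical names, and the PR's real layout may differ.

```rust
use std::any::{Any, TypeId};
use std::collections::HashMap;

/// Hypothetical stand-in: core statistics plus typed extensions.
#[derive(Default)]
struct ExtendedStats {
    /// Core statistic, reduced to a row count for brevity.
    num_rows: Option<u64>,
    /// At most one value per concrete type, stored type-erased.
    extensions: HashMap<TypeId, Box<dyn Any + Send + Sync>>,
}

impl ExtendedStats {
    /// Stores `value`, replacing any earlier value of the same type.
    fn insert<T: Any + Send + Sync>(&mut self, value: T) {
        self.extensions.insert(TypeId::of::<T>(), Box::new(value));
    }

    /// Returns the stored value of type `T`, if one was inserted.
    fn get<T: Any + Send + Sync>(&self) -> Option<&T> {
        self.extensions
            .get(&TypeId::of::<T>())
            .and_then(|boxed| boxed.downcast_ref::<T>())
    }
}

/// Hypothetical extension payload a custom provider might attach.
#[derive(Debug, PartialEq)]
struct Histogram {
    buckets: Vec<u64>,
}
```

Keying by type keeps the core struct unaware of what downstream providers attach, at the cost of allowing only one value per type.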

Adds a set of default StatisticsProvider implementations that cover the most common physical operators:
- FilterStatisticsProvider: selectivity-based row count, reuses the same pre-enhanced child statistics, with post-filter NDV adjustment
- ProjectionStatisticsProvider: column mapping through projections
- PassthroughStatisticsProvider: passthrough for cardinality-preserving operators (Sort, Repartition, Window, etc.) via CardinalityEffect
- AggregateStatisticsProvider: NDV-product estimation for GROUP BY, delegates for Partial mode and multiple grouping sets (apache#20926)
- JoinStatisticsProvider: NDV-based join output estimation (hash, sort-merge, nested-loop, cross) with join-type-aware cardinality bounds and correct key-column NDV lookup
- LimitStatisticsProvider: caps output at the fetch limit (local and global)
- UnionStatisticsProvider: sums input row counts
- DefaultStatisticsProvider: fallback to partition_statistics(None)
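
The two simplest rules in this list, limit capping and union summing, can be illustrated with plain row counts. The functions below are a hypothetical sketch, not the providers' code; in particular, treating a union as unknown when any input is unknown is an assumption.

```rust
/// Rows produced by a limit: drop `skip` rows, then keep at most `fetch`.
/// `fetch == None` means there is no upper bound.
fn limit_rows(input_rows: u64, skip: u64, fetch: Option<u64>) -> u64 {
    let remaining = input_rows.saturating_sub(skip);
    match fetch {
        Some(limit) => remaining.min(limit),
        None => remaining,
    }
}

/// Rows produced by a union: the sum of all inputs.
/// Assumption: if any input is unknown (`None`), the result is unknown.
fn union_rows(inputs: &[Option<u64>]) -> Option<u64> {
    inputs
        .iter()
        .try_fold(0u64, |acc, &rows| rows.map(|n| acc.saturating_add(n)))
}
```

For example, a limit with skip 10 and fetch 20 over 15 input rows yields 5 rows, because only 5 rows remain after the skip.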


Adds a pluggable statistics path for JoinSelection that uses the StatisticsRegistry instead of each operator's built-in partition_statistics.
- Add optimizer.use_statistics_registry config flag (default=false)
- Override optimize_with_context in JoinSelection to pass the registry to should_swap_join_order when the flag is enabled; if no registry is set on SessionState the built-in default is constructed lazily
- Add statistics_registry.slt demonstrating how the registry produces more conservative join estimates for skewed data (10*10=100 cartesian fallback vs 10*10/3=33 range-NDV estimate), triggering the correct build-side swap that the built-in estimator misses
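
The two numbers quoted for the SLT case can be reproduced with the textbook equi-join formula, rows = left * right / max(ndv_left, ndv_right), falling back to the Cartesian product when no NDV is available. The function below is a sketch under that assumption; `equi_join_rows` is a hypothetical name, and the exact formula used by the built-in estimator and by JoinStatisticsProvider may differ.

```rust
/// Textbook equi-join cardinality estimate (assumed formula):
/// `left * right / max(ndv_left, ndv_right)`, using integer division.
/// Falls back to the Cartesian product when either NDV is unknown or
/// the larger NDV is zero.
fn equi_join_rows(
    left_rows: u64,
    right_rows: u64,
    left_ndv: Option<u64>,
    right_ndv: Option<u64>,
) -> u64 {
    let cartesian = left_rows.saturating_mul(right_rows);
    match (left_ndv, right_ndv) {
        (Some(l), Some(r)) if l.max(r) > 0 => cartesian / l.max(r),
        _ => cartesian,
    }
}
```

With 10 rows on each side and an NDV of 3, this gives 10*10/3 = 33; without NDV information it gives the conservative 10*10 = 100.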


kosiew

@asolimando
Visiting this from 21122.
I found one issue that looks blocking before this can land. I also left a couple of non-blocking suggestions that seem worth considering while this is fresh.

kosiew · 2026-04-09T05:33:40Z

datafusion/physical-plan/src/operator_statistics/mod.rs

+        } else if let Some(smj) = plan.downcast_ref::<SortMergeJoinExec>() {
+            let est = equi_join_estimate(smj.on(), left, right, left_rows, right_rows);
+            (est, false, smj.join_type())
+        } else if let Some(nl_join) = plan.downcast_ref::<NestedLoopJoinExec>() {


Thanks for adding the NestedLoopJoinExec handling here. I think this branch is a bit too aggressive though.

Right now it treats every NL join as a Cartesian product with left_rows * right_rows, but NestedLoopJoinExec::partition_statistics intentionally returns unknown row counts for non-equi or filter-only joins because an arbitrary JoinFilter can be very selective.

With datafusion.optimizer.use_statistics_registry = true, a filtered NLJ would suddenly look much larger than before, and that can push join reordering or plan selection in the wrong direction.

Could we delegate when nl_join.filter().is_some() instead, or otherwise add a real selectivity model plus regression coverage? As written, this seems like it will overestimate filtered NLJs in a way that changes optimizer behavior.
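
A rough sketch of the delegation option, using simplified stand-in types rather than the real NestedLoopJoinExec or StatisticsResult (the `Estimate` enum, the `NestedLoopJoin` struct, and its `has_filter` field are hypothetical):

```rust
/// Hypothetical stand-in for a provider's result.
#[derive(Debug, PartialEq)]
enum Estimate {
    /// The provider commits to a row count.
    Rows(u64),
    /// The provider declines so the next one in the chain can answer.
    Delegate,
}

/// Hypothetical stand-in for a nested-loop join node.
struct NestedLoopJoin {
    /// True when the join carries a JoinFilter.
    has_filter: bool,
}

/// Only an unfiltered nested-loop join is treated as a Cartesian product.
/// A filtered join delegates, because an arbitrary filter can be very
/// selective and `left * right` would overestimate the output.
fn nested_loop_join_rows(
    join: &NestedLoopJoin,
    left_rows: u64,
    right_rows: u64,
) -> Estimate {
    if join.has_filter {
        Estimate::Delegate
    } else {
        Estimate::Rows(left_rows.saturating_mul(right_rows))
    }
}
```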

kosiew · 2026-04-09T05:33:40Z

datafusion/physical-optimizer/src/join_selection.rs

    }
 }

+/// Get statistics for a plan node, using the registry if available.


One thing that stood out to me here is that get_stats() appears to call reg.compute(plan) from scratch for each lookup, while JoinSelection asks for stats multiple times on overlapping subtrees during transform_up.

That makes the registry-enabled path look potentially quadratic on larger join trees. It might be worth adding a small per-pass cache or precomputed stats map here so the feature does not add avoidable optimizer overhead.
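
Something along these lines is what I have in mind, shown on a toy plan tree. `Node`, `StatsCache`, and the integer node id used as the cache key are hypothetical; how to key real plan nodes is left open here.

```rust
use std::collections::HashMap;

/// Hypothetical plan node: a stable id, its own row contribution,
/// and its children.
struct Node {
    id: usize,
    rows: u64,
    children: Vec<Node>,
}

/// Per-pass memo of computed statistics, keyed by node id.
struct StatsCache {
    memo: HashMap<usize, u64>,
    /// Counts real computations, to show that repeats hit the cache.
    computations: usize,
}

impl StatsCache {
    fn new() -> Self {
        StatsCache {
            memo: HashMap::new(),
            computations: 0,
        }
    }

    /// Toy estimate: a node's rows plus the estimates of its children.
    /// Each node is computed at most once per cache instance.
    fn rows(&mut self, node: &Node) -> u64 {
        if let Some(&cached) = self.memo.get(&node.id) {
            return cached;
        }
        let mut total = node.rows;
        for child in &node.children {
            total = total.saturating_add(self.rows(child));
        }
        self.computations += 1;
        self.memo.insert(node.id, total);
        total
    }
}
```

Repeated lookups on overlapping subtrees then cost one hash lookup instead of a full recomputation, so a pass over the tree stays linear in the number of nodes.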

kosiew · 2026-04-09T05:33:40Z

datafusion/physical-plan/src/operator_statistics/mod.rs

+        // are lost here because partition_statistics(None) re-fetches raw child
+        // stats internally. Once #20184 lands, pass enhanced child_stats so the
+        // operator's built-in column mapping uses them instead.
+        let mut base = Arc::unwrap_or_clone(plan.partition_statistics(None)?);


Would it make sense to move this partition_statistics(None) plus rescale_byte_size(...) sequence into a helper?

The same pattern now shows up in multiple providers, and centralizing it would make the base-stats preservation logic easier to audit and less likely to drift as more providers get added.

asolimando added 3 commits April 8, 2026 19:17

Add ClosureStatisticsProvider for test injection and cardinality feedback

17b5653

github-actions bot added the documentation, optimizer, core, sqllogictest, common, and physical-plan labels Apr 8, 2026

asolimando force-pushed the asolimando/statistics-planner-prototype branch from 74753b3 to 0e25387 April 8, 2026 19:51

asolimando mentioned this pull request Apr 8, 2026

Add ExpressionAnalyzer for pluggable expression-level statistics estimation #21122

Open

asolimando force-pushed the asolimando/statistics-planner-prototype branch from 0e25387 to 8132e06 April 8, 2026 20:08

asolimando force-pushed the asolimando/statistics-planner-prototype branch from 8132e06 to 5cadfd3 April 8, 2026 20:16

asolimando added 4 commits April 8, 2026 22:20

style: fix cargo fmt in operator_statistics/mod.rs

1bf4e8b

style: fix broken rustdoc link to RelationPlanner in operator_statistics

4ef5889

fix: move use_statistics_registry to correct alphabetical position in information_schema.slt

214eb67

fix: relax aggregate delegation test assertions for force_hash_collisions compatibility

6a313b3

kosiew requested changes Apr 9, 2026

asolimando wants to merge 8 commits into apache:main from asolimando:asolimando/statistics-planner-prototype
