feat: Add pluggable StatisticsRegistry for operator-level statistics propagation #21483
asolimando wants to merge 8 commits into apache:main
Introduces a chain-of-responsibility architecture for statistics computation on physical plan nodes:
- StatisticsProvider trait: chain element computing stats for operators
- StatisticsRegistry: chains providers, with fast-path for empty config
- ExtendedStatistics: Statistics with type-erased extension map
- DefaultStatisticsProvider: delegates to each operator's partition_statistics()
- PhysicalOptimizerContext trait with optimize_with_context() dispatch
- ConfigOnlyContext for backward-compatible rule invocation
- SessionState integration with statistics_registry field and builder
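The chain-of-responsibility idea in this commit can be illustrated with a small, self-contained sketch. It does not use the PR's actual types: `Outcome`, `Provider`, `LimitProvider`, `FallbackProvider`, and `Registry` are hypothetical stand-ins, and a plan node is reduced to an operator name plus an input row count. Each provider either answers or delegates, and the registry returns the first answer.

```rust
/// Hypothetical result type: a provider either answers or passes to the next one.
#[derive(Debug, PartialEq)]
enum Outcome {
    Computed(u64),
    Delegate,
}

/// Hypothetical chain element (the PR's real trait is `StatisticsProvider`).
trait Provider {
    fn compute(&self, op: &str, input_rows: u64) -> Outcome;
}

/// Answers only for "limit" operators by capping rows at `fetch`.
struct LimitProvider {
    fetch: u64,
}

impl Provider for LimitProvider {
    fn compute(&self, op: &str, input_rows: u64) -> Outcome {
        if op == "limit" {
            Outcome::Computed(input_rows.min(self.fetch))
        } else {
            Outcome::Delegate
        }
    }
}

/// Last link in the chain: always answers with the input row count.
struct FallbackProvider;

impl Provider for FallbackProvider {
    fn compute(&self, _op: &str, input_rows: u64) -> Outcome {
        Outcome::Computed(input_rows)
    }
}

/// Walks the providers in order and returns the first computed answer.
struct Registry {
    providers: Vec<Box<dyn Provider>>,
}

impl Registry {
    fn compute(&self, op: &str, input_rows: u64) -> Option<u64> {
        for provider in &self.providers {
            if let Outcome::Computed(rows) = provider.compute(op, input_rows) {
                return Some(rows);
            }
        }
        // No provider answered; an empty chain always ends up here.
        None
    }
}
```

With `LimitProvider { fetch: 5 }` ahead of `FallbackProvider`, a "limit" node over 100 rows resolves to 5, while any other operator falls through to the fallback and keeps 100.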
Adds a set of default StatisticsProvider implementations that cover the most common physical operators:
- FilterStatisticsProvider: selectivity-based row count, reuses the same pre-enhanced child statistics, with post-filter NDV adjustment
- ProjectionStatisticsProvider: column mapping through projections
- PassthroughStatisticsProvider: passthrough for cardinality-preserving operators (Sort, Repartition, Window, etc.) via CardinalityEffect
- AggregateStatisticsProvider: NDV-product estimation for GROUP BY, delegates for Partial mode and multiple grouping sets (apache#20926)
- JoinStatisticsProvider: NDV-based join output estimation (hash, sort-merge, nested-loop, cross) with join-type-aware cardinality bounds and correct key-column NDV lookup
- LimitStatisticsProvider: caps output at the fetch limit (local and global)
- UnionStatisticsProvider: sums input row counts
- DefaultStatisticsProvider: fallback to partition_statistics(None)
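As an illustration of the "NDV-product estimation for GROUP BY" item, here is a minimal sketch of that kind of estimate. The function name `estimate_group_by_rows` is hypothetical, and both the cap at the input row count and the bail-out on an unknown NDV are assumptions about a reasonable design, not a description of what `AggregateStatisticsProvider` does.

```rust
/// Estimates GROUP BY output rows as the product of the grouping keys' NDVs,
/// capped at the number of input rows (assumed cap, see the text above).
/// Returns `None` when any key's NDV is unknown, so a caller could delegate.
fn estimate_group_by_rows(input_rows: u64, key_ndvs: &[Option<u64>]) -> Option<u64> {
    let mut product: u64 = 1;
    for ndv in key_ndvs {
        // `?` bails out with `None` as soon as one NDV is missing.
        product = product.saturating_mul((*ndv)?);
    }
    Some(product.min(input_rows))
}
```

For example, keys with NDVs 10 and 5 over 1000 input rows give 50 groups, while the same keys over only 30 input rows are capped at 30.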
Adds a pluggable statistics path for JoinSelection that uses the StatisticsRegistry instead of each operator's built-in partition_statistics.
- Add optimizer.use_statistics_registry config flag (default=false)
- Override optimize_with_context in JoinSelection to pass the registry to should_swap_join_order when the flag is enabled; if no registry is set on SessionState the built-in default is constructed lazily
- Add statistics_registry.slt demonstrating how the registry produces more conservative join estimates for skewed data (10*10=100 cartesian fallback vs 10*10/3=33 range-NDV estimate), triggering the correct build-side swap that the built-in estimator misses
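The two figures quoted in this commit message (100 vs 33) can be reproduced with a short sketch. The function names are hypothetical and the formula is the textbook `|L| * |R| / NDV` equi-join estimate with a Cartesian fallback, which matches the arithmetic in the message but is not taken from the PR's code. The second helper computes the true output size from per-key frequencies; applying it to an 8:1:1 distribution on both inputs is an assumption based on the PR description.

```rust
/// Equi-join estimate: |L| * |R| / NDV when the key NDV is known and non-zero,
/// otherwise the Cartesian product as a conservative fallback.
fn estimate_equi_join_rows(left_rows: u64, right_rows: u64, key_ndv: Option<u64>) -> u64 {
    let cartesian = left_rows.saturating_mul(right_rows);
    match key_ndv {
        Some(ndv) if ndv > 0 => cartesian / ndv,
        _ => cartesian,
    }
}

/// Exact equi-join size when both sides' per-key frequencies are known:
/// each key contributes (left count) * (right count) output rows.
fn exact_equi_join_rows(left_freq: &[u64], right_freq: &[u64]) -> u64 {
    left_freq.iter().zip(right_freq).map(|(l, r)| l * r).sum()
}
```

With 10 rows per side, an NDV of 3 yields 33 and an unknown NDV yields 100. If both sides really do carry the 8:1:1 key skew, the exact size is 8*8 + 1*1 + 1*1 = 66, which is why the uniform-distribution estimate of 33 undershoots.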
… information_schema.slt
…ions compatibility
@asolimando
Visiting this from #21122
I found one issue that looks blocking before this can land. I also left a couple of non-blocking suggestions that seem worth considering while this is fresh.
    } else if let Some(smj) = plan.downcast_ref::<SortMergeJoinExec>() {
        let est = equi_join_estimate(smj.on(), left, right, left_rows, right_rows);
        (est, false, smj.join_type())
    } else if let Some(nl_join) = plan.downcast_ref::<NestedLoopJoinExec>() {
Thanks for adding the NestedLoopJoinExec handling here. I think this branch is a bit too aggressive though.
Right now it treats every NL join as a Cartesian product with left_rows * right_rows, but NestedLoopJoinExec::partition_statistics intentionally returns unknown row counts for non-equi or filter-only joins because an arbitrary JoinFilter can be very selective.
With datafusion.optimizer.use_statistics_registry = true, a filtered NLJ would suddenly look much larger than before, and that can push join reordering or plan selection in the wrong direction.
Could we delegate when nl_join.filter().is_some() instead, or otherwise add a real selectivity model plus regression coverage? As written, this seems like it will overestimate filtered NLJs in a way that changes optimizer behavior.
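The delegation suggested above could look roughly like the following sketch. `NlJoin` and `Outcome` are hypothetical stand-ins rather than DataFusion types; the point is only the control flow: a join with a filter hands the decision back to the next provider instead of reporting the Cartesian product.

```rust
/// Hypothetical stand-in for a nested-loop join node (not `NestedLoopJoinExec`).
struct NlJoin {
    left_rows: Option<u64>,
    right_rows: Option<u64>,
    has_filter: bool,
}

/// Hypothetical provider result: a row estimate, or "let the next provider decide".
#[derive(Debug, PartialEq)]
enum Outcome {
    Rows(u64),
    Delegate,
}

fn nl_join_rows(join: &NlJoin) -> Outcome {
    // An arbitrary join filter can be very selective, so do not guess.
    if join.has_filter {
        return Outcome::Delegate;
    }
    match (join.left_rows, join.right_rows) {
        // Unfiltered nested-loop join: the output is the Cartesian product.
        (Some(l), Some(r)) => Outcome::Rows(l.saturating_mul(r)),
        // Missing input cardinality: nothing reliable to report.
        _ => Outcome::Delegate,
    }
}
```

Under this sketch a filtered join keeps whatever estimate the fallback produces, so enabling the registry would not inflate its row count.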
        }
    }

    /// Get statistics for a plan node, using the registry if available.
One thing that stood out to me here is that get_stats() appears to call reg.compute(plan) from scratch for each lookup, while JoinSelection asks for stats multiple times on overlapping subtrees during transform_up.
That makes the registry-enabled path look potentially quadratic on larger join trees. It might be worth adding a small per-pass cache or precomputed stats map here so the feature does not add avoidable optimizer overhead.
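A per-pass cache of the kind mentioned here might be sketched as follows. Everything in the block is hypothetical: `Node` is a toy plan node with a caller-assigned `id`, the "statistic" is just the sum of row counts in a subtree, and `StatsCache` memoizes it so overlapping subtrees are computed once per pass. How real plan nodes would be keyed is left open.

```rust
use std::collections::HashMap;

/// Toy plan node (hypothetical; not DataFusion's `ExecutionPlan`).
struct Node {
    id: usize,
    own_rows: u64,
    children: Vec<Node>,
}

/// Per-pass memo: each node's statistic is computed at most once.
struct StatsCache {
    memo: HashMap<usize, u64>,
    /// Counts real computations, to make cache hits observable.
    computations: usize,
}

impl StatsCache {
    fn new() -> Self {
        Self { memo: HashMap::new(), computations: 0 }
    }

    /// Returns the cached value if present; otherwise computes bottom-up and stores it.
    fn subtree_rows(&mut self, node: &Node) -> u64 {
        if let Some(&cached) = self.memo.get(&node.id) {
            return cached;
        }
        let mut total = node.own_rows;
        for child in &node.children {
            total += self.subtree_rows(child);
        }
        self.computations += 1;
        self.memo.insert(node.id, total);
        total
    }
}
```

After one call on the root of a three-node tree, later lookups on the root or on either child are served from the map and `computations` stays at 3.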
    // are lost here because partition_statistics(None) re-fetches raw child
    // stats internally. Once #20184 lands, pass enhanced child_stats so the
    // operator's built-in column mapping uses them instead.
    let mut base = Arc::unwrap_or_clone(plan.partition_statistics(None)?);
Would it make sense to move this partition_statistics(None) plus rescale_byte_size(...) sequence into a helper?
The same pattern now shows up in multiple providers, and centralizing it would make the base-stats preservation logic easier to audit and less likely to drift as more providers get added.
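One possible shape for such a helper is sketched below. The PR's `rescale_byte_size(...)` signature is not shown in this thread, so the block uses a hypothetical `BaseStats` struct and assumes the rescaling is proportional to the change in row count; both are illustrative guesses rather than the PR's behavior.

```rust
/// Hypothetical, simplified stand-in for an operator's base statistics.
#[derive(Debug, Clone, PartialEq)]
struct BaseStats {
    num_rows: Option<u64>,
    total_byte_size: Option<u64>,
}

/// Replaces the row count with `new_rows` and scales the byte size by
/// new_rows / old_rows (assumed semantics). If the old row count or byte size
/// is unknown, or the old row count is zero, the byte size becomes unknown.
fn with_estimated_rows(base: BaseStats, new_rows: u64) -> BaseStats {
    let total_byte_size = match (base.num_rows, base.total_byte_size) {
        (Some(old_rows), Some(bytes)) if old_rows > 0 => {
            // Widen to u128 so the intermediate product cannot overflow.
            Some(((bytes as u128 * new_rows as u128) / old_rows as u128) as u64)
        }
        _ => None,
    };
    BaseStats { num_rows: Some(new_rows), total_byte_size }
}
```

Each provider would then fetch the base statistics once and call the helper with its own row estimate, keeping the preservation logic in a single place.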
Which issue does this PR close?
Rationale for this change
DataFusion's built-in statistics propagation has no extension point: downstream projects cannot inject external catalog stats, override built-in estimation, or plug in custom strategies without forking.
This PR introduces `StatisticsRegistry`, a pluggable chain-of-responsibility for operator-level statistics following the same pattern as `RelationPlanner` for SQL parsing and `ExpressionAnalyzer` (#21120) for expression-level stats. See #21443 for full motivation and design context.
What changes are included in this PR?
- Framework (`operator_statistics/mod.rs`): `StatisticsProvider` trait, `StatisticsRegistry` (chain-of-responsibility), `ExtendedStatistics` (Statistics + type-erased extension map), `DefaultStatisticsProvider`. `PhysicalOptimizerContext` trait with `optimize_with_context` dispatch. `SessionState` integration.
- Built-in providers for Filter, Projection, Passthrough (sort/repartition/etc), Aggregate, Join (hash/sort-merge/nested-loop/cross), Limit, and Union. NDV utilities: `num_distinct_vals`, `ndv_after_selectivity`.
- `ClosureStatisticsProvider`: closure-based provider for test injection and cardinality feedback.
- JoinSelection integration: `use_statistics_registry` config flag (default false), registry-aware `optimize_with_context`, SLT test demonstrating plan difference on skewed data.
Are these changes tested?
- `statistics_registry.slt`: three-table join on skewed data (8:1:1 customer_id distribution) where the built-in NDV formula estimates 33 rows (wrong; actual=66) and the registry conservatively estimates 100, producing the correct build-side swap
Are there any user-facing changes?
New public API (purely additive, non-breaking):
- `StatisticsProvider` trait and `StatisticsRegistry` in `datafusion-physical-plan`
- `ExtendedStatistics`, `StatisticsResult` types; built-in provider structs; `num_distinct_vals`, `ndv_after_selectivity` utilities
- `PhysicalOptimizerContext` trait and `ConfigOnlyContext` in `datafusion-physical-optimizer`
- `SessionState::statistics_registry()`, `SessionStateBuilder::with_statistics_registry()`
- `datafusion.optimizer.use_statistics_registry` (default false)

Default behavior is unchanged. The registry is only consulted when the flag is explicitly enabled.
Known limitations:
- … `partition_statistics(None)` internally, re-fetching raw child stats and discarding registry enrichment. 4 TODO comments mark the affected call sites; #20184 ("Let partition_statistics accept pre-computed children statistics") would close this gap.
- No `ExpressionAnalyzer` integration yet (#21122, "Add ExpressionAnalyzer for pluggable expression-level statistics estimation").

Disclaimer: I used AI to assist in the code generation. I have manually reviewed the output, and it matches my intention and understanding.