Enable pushing aggregate past Join by default #6538
Does this PR need anything else? It seems that resolving the performance issue only requires enabling the rule. Is there any concern that keeps this as a draft PR?
Running with set hive.transpose.aggr.join=true (default is false) enables Hive's Calcite rule HiveAggregateJoinTransposeRule, which pushes an aggregation below a join: aggregate first, then join, instead of join then aggregate. Concrete example (q4): the query joins store_sales/catalog_sales/web_sales to customer and then does SUM(...) GROUP BY c_customer_id, ….
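To make the rewrite concrete, here is a minimal sketch of the two plan shapes, written as plain SQL and run through Python's built-in sqlite3 rather than Hive. The two tiny tables, their rows, and the choice of ss_net_paid as the summed column are assumptions for illustration only. This is a simplified stand-in for q4, not the actual query, and it does not invoke HiveAggregateJoinTransposeRule itself.

```python
import sqlite3

# In-memory toy data (hypothetical rows, not TPC-DS data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (c_customer_sk INTEGER PRIMARY KEY, c_customer_id TEXT);
CREATE TABLE store_sales (ss_customer_sk INTEGER, ss_net_paid INTEGER);
INSERT INTO customer VALUES (1, 'A'), (2, 'B');
INSERT INTO store_sales VALUES (1, 10), (1, 5), (2, 7), (2, 3), (2, 1);
""")

# Shape 1: join first, then aggregate (the plan without the transpose).
JOIN_THEN_AGG = """
SELECT c.c_customer_id, SUM(ss.ss_net_paid) AS total
FROM store_sales ss
JOIN customer c ON ss.ss_customer_sk = c.c_customer_sk
GROUP BY c.c_customer_id
ORDER BY c.c_customer_id
"""

# Shape 2: pre-aggregate the fact table on the join key, join the smaller
# result, then finish the aggregation on the original grouping column.
AGG_THEN_JOIN = """
SELECT c.c_customer_id, SUM(pre.partial_total) AS total
FROM (SELECT ss_customer_sk, SUM(ss_net_paid) AS partial_total
      FROM store_sales
      GROUP BY ss_customer_sk) pre
JOIN customer c ON pre.ss_customer_sk = c.c_customer_sk
GROUP BY c.c_customer_id
ORDER BY c.c_customer_id
"""

join_then_agg = conn.execute(JOIN_THEN_AGG).fetchall()
agg_then_join = conn.execute(AGG_THEN_JOIN).fetchall()

# Both shapes return the same totals per customer.
print(join_then_agg)
print(agg_then_join)
```

Both queries return the same rows. The difference is only in how many rows reach the join: five sales rows in the first shape versus two pre-aggregated rows in the second. Whether that saving outweighs the extra aggregation step depends on the data, which is the trade-off the rule has to judge.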
TPCDS results:
The root cause of the regressions: Hive's default cost model is cardinality-only (cpu=io=0), so the rule fires on tiny rowcount differences and can't tell a beneficial transpose from a harmful one. That's why "best-of per-query" beats turning it globally ON or OFF. Turning on cc @kasakrisz
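A toy illustration of the cardinality-only problem follows. ToyCost, is_cheaper, and every number below are hypothetical and are not Hive's actual cost classes or estimates. The sketch only shows how a comparison that sees cpu=io=0 falls back to row count alone, so a ten-row difference is enough to prefer the transposed plan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyCost:
    """Hypothetical plan cost; not Hive's real cost class."""
    rows: float
    cpu: float = 0.0
    io: float = 0.0


def is_cheaper(a: ToyCost, b: ToyCost) -> bool:
    """Toy comparison: cpu + io first, row count only as a tie-breaker."""
    return (a.cpu + a.io, a.rows) < (b.cpu + b.io, b.rows)


# Made-up estimates: the transpose saves only 10 rows out of a million,
# but the extra aggregation step costs additional CPU.
original_rows = 1_000_000
transposed_rows = 999_990
base_cpu = 2_000_000.0        # hypothetical CPU shared by both plans
extra_agg_cpu = 5_000_000.0   # hypothetical CPU of the pushed-down aggregate

# Cardinality-only model: cpu and io stay at zero, so only rows matter.
card_only_picks_transpose = is_cheaper(
    ToyCost(rows=transposed_rows),
    ToyCost(rows=original_rows),
)

# Model that also accounts for CPU: the extra aggregation dominates.
cpu_aware_picks_transpose = is_cheaper(
    ToyCost(rows=transposed_rows, cpu=base_cpu + extra_agg_cpu),
    ToyCost(rows=original_rows, cpu=base_cpu),
)

print(card_only_picks_transpose)
print(cpu_aware_picks_transpose)
```

Under these assumed numbers the cardinality-only comparison prefers the transposed plan while the CPU-aware one rejects it, which matches the kind of regression described above.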



What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/HIVE-10785?focusedCommentId=14906571&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14906571
Why are the changes needed?
Perf optimization
Does this PR introduce any user-facing change?
How was this patch tested?