
Sorted Merge: Eliminate coordinator Sort node for multi-shard ORDER BY queries#8529

Open
neildsh wants to merge 11 commits into citusdata:main from neildsh:sortedMerge

Conversation


@neildsh neildsh commented Mar 23, 2026

Sorted Merge: Eliminate coordinator Sort node for multi-shard ORDER BY queries

Summary

When a multi-shard SELECT ... ORDER BY query goes through the logical planner, Citus currently collects all worker results into a single tuplestore and relies on a PostgreSQL Sort node above the Custom Scan to produce the final ordering. This PR adds an alternative path: push ORDER BY to workers, k-way merge the pre-sorted worker results on the coordinator using a binary heap, and declare pathkeys on the CustomPath so PostgreSQL eliminates the coordinator Sort node entirely.

The feature is gated behind a hidden GUC citus.enable_sorted_merge (default off, PGC_SUSET, GUC_NO_SHOW_ALL). All eligibility decisions are made at planning time and baked into the serialized DistributedPlan, so cached/prepared plans remain correct regardless of later GUC changes.

The sorted merge implementation is based on the Postgres MergeAppend node.

How it works

Workers (each sorted locally)
  └─> per-task tuple stores (routed by taskId, no Task mutation)
        └─> coordinator k-way merge (binaryheap + SortSupport)
              └─> final scanState->tuplestorestate
                    └─> existing ReturnTupleFromTuplestore()
                          └─> quals / projection / backward scan / rescan (unchanged)
  1. Worker sort pushdown (multi_logical_optimizer.c): WorkerSortClauseList() gains an early-return path that pushes ORDER BY to workers even without LIMIT, when the GUC is on and the sort is safe (no aggregates in ORDER BY, no non-pushable window functions, GROUP BY on distribution column or absent).

  2. Planner eligibility (multi_logical_optimizer.c, multi_physical_planner.c): WorkerExtendedOpNode() tags the worker MultiExtendedOp with sortedMergeEligible = true when the worker sort clause semantically matches the original. SetSortedMergeFields() in the physical planner builds SortedMergeKey metadata (attno, sortop, collation, nullsFirst) and sets useSortedMerge on the DistributedPlan.

  3. PathKeys (combine_query_planner.c): CreateCitusCustomScanPath() sets path->pathkeys = root->sort_pathkeys when the plan has useSortedMerge = true, causing PostgreSQL's create_ordered_paths() to skip adding a Sort node.

  4. Per-task stores + k-way merge (sorted_merge.c, adaptive_executor.c): A new PerTaskDispatchTupleDest routes worker tuples to per-task tuplestores by taskId hash lookup (no Task fields mutated). After all tasks complete, MergePerTaskStoresIntoFinalStore() performs a k-way merge using PostgreSQL's public binaryheap and SortSupport APIs, writing sorted output into the existing scanState->tuplestorestate. The existing CitusExecScan()/ReturnTupleFromTuplestore() path is completely unchanged.

Follow-up changes

Because the final result set still flows through the default tuplestore in addition to the per-task stores, memory consumption increases when this change is enabled. Follow-up work is to stop using the default tuplestore and reduce that memory overhead.

Safety properties

  • Plan-time only: The GUC is consulted only during planning. The executor reads only distributedPlan->useSortedMerge. Cached plans are safe.
  • No Task mutation: Per-task dispatch state lives on DistributedExecution (execution-local), not on reusable Task nodes. Only task->totalReceivedTupleData is updated (execution-time reporting field, reset each execution).
  • Scan contract preserved: The merge writes into the existing final tuplestore. CitusExecScan, CitusEndScan, CitusReScan, ReturnTupleFromTuplestore are all unchanged. Quals, projection, backward scan, and cursor support work exactly as before.
  • Shared intermediate-result accounting: All per-task destinations share a single TupleDestinationStats object, preserving citus.max_intermediate_result_size enforcement semantics. EnsureIntermediateSizeLimitNotExceeded() is now exported from tuple_destination.c for use by the dispatch destination.
  • Aggregate ORDER BY exclusion: Queries with ORDER BY on aggregates are excluded from sorted merge eligibility via HasOrderByAggregate().

Files changed

File                                            Change
sorted_merge.h / sorted_merge.c                 NEW — CreatePerTaskDispatchDest, MergePerTaskStoresIntoFinalStore, MergeHeapComparator
multi_logical_optimizer.c                       Worker sort pushdown + eligibility check + SortClauseListsMatch()
multi_physical_planner.c                        SetSortedMergeFields() + BuildSortedMergeKeys()
combine_query_planner.c                         Set pathkeys on CustomPath
adaptive_executor.c                             Per-task store routing + post-merge into final tuplestore
multi_physical_planner.h                        SortedMergeKey struct + fields on DistributedPlan and MultiExtendedOp
tuple_destination.h / .c                        Export EnsureIntermediateSizeLimitNotExceeded()
shared_library_init.c / multi_executor.c / .h   GUC registration
citus_outfuncs.c / citus_copyfuncs.c            Serialization of new plan fields
multi_orderby_pushdown.sql / .out               NEW — 60+ regression tests

Test coverage

The new multi_orderby_pushdown regression test covers:

  • Eligibility: 10 EXPLAIN tests verifying worker Sort is pushed for eligible queries (simple ORDER BY, DESC, NULLS ordering, multi-column, mixed directions, GROUP BY dist_col, WHERE+ORDER BY, expressions, LIMIT)
  • Ineligibility: 4 EXPLAIN tests verifying Sort is NOT pushed for ineligible queries (ORDER BY aggregate, GROUP BY non-dist col)
  • Correctness: 8 GUC off/on result-comparison pairs (ASC, DESC, multi-column, non-dist col, GROUP BY dist_col, mixed directions, WHERE, aggregates in SELECT)
  • Complex queries: Subquery, CTE, co-located JOIN, UNION ALL, DISTINCT, DISTINCT ON, EXISTS, IN subquery, multiple aggregates, CASE, NULL ordering, OFFSET, ordinal ORDER BY
  • Sort elision: EXPLAIN verification that coordinator Sort node is absent with GUC on
  • Plan cache: PREPARE/EXECUTE with GUC toggling (plan-time decision baked in)
  • Cursor: FETCH FORWARD + FETCH BACKWARD over sorted merge results
  • EXPLAIN ANALYZE: Falls back to non-merge path
  • Memory pressure: Small work_mem (64kB) with 32 shards
  • Intermediate result limits: max_intermediate_result_size with CTE subplan
  • Subplan interactions: 7 tests for CTE/subquery patterns with sorted merge (multiple CTEs, cross-joins, nested subplans, correctness comparison)
  • Subplan EXPLAIN: Query plans for all subplan patterns

Validated with citus.enable_sorted_merge globally enabled: 0 crashes across check-multi (192 tests) and check-multi-1 (210 tests). All failures are expected plan-shape diffs (Sort node elision in EXPLAIN output).


codecov bot commented Mar 24, 2026

Codecov Report

❌ Patch coverage is 12.29947% with 164 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.02%. Comparing base (029f381) to head (b11c90c).

❌ Your patch check has failed because the patch coverage (12.29%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage.
❌ Your project check has failed because the head coverage (79.02%) is below the target coverage (87.50%). You can increase the head coverage or adjust the target coverage.

❗ There is a different number of reports uploaded between BASE (029f381) and HEAD (b11c90c): HEAD has 88 fewer uploads than BASE.
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8529      +/-   ##
==========================================
- Coverage   88.91%   79.02%   -9.89%     
==========================================
  Files         286      287       +1     
  Lines       63198    61754    -1444     
  Branches     7933     7596     -337     
==========================================
- Hits        56190    48801    -7389     
- Misses       4734    10092    +5358     
- Partials     2274     2861     +587     

neildsh added 10 commits March 27, 2026 06:42
Phase 1 of the sorted-merge feature. This commit adds the data structures and GUC needed by later phases, with zero behavioral changes:
- SortedMergeKey typedef in multi_physical_planner.h describing one
  sort key for the coordinator k-way merge
- useSortedMerge, sortedMergeKeys[], sortedMergeKeyCount fields on
  DistributedPlan (plan-time decision, never checked at runtime via GUC)
- sortedMergeEligible field on MultiExtendedOp (logical optimizer tag
  read by the physical planner)
- Hidden GUC citus.enable_sorted_merge (PGC_SUSET, default off,
  GUC_NO_SHOW_ALL) consulted only during planning
- Serialization in citus_outfuncs.c and deep-copy in citus_copyfuncs.c
  for all new fields

All new fields default to false/0/NULL. Existing regression tests are
unaffected.

Co-authored-by: Copilot
Phase 2 of the sorted-merge feature. Workers now sort their results
when citus.enable_sorted_merge is enabled at planning time, even for
queries without LIMIT. The plan metadata is populated so later phases
can execute the merge and set pathkeys.

Logical optimizer changes (multi_logical_optimizer.c):
- WorkerSortClauseList() gains an early-return path that pushes the
  sort clause to workers when the GUC is on and the sort is safe
  (no aggregates in ORDER BY, no non-pushable window functions,
  and either no GROUP BY or GROUP BY on partition column).
- WorkerExtendedOpNode() sets sortedMergeEligible = true when the
  worker sort clause semantically matches the original sort clause,
  using the new SortClauseListsMatch() helper.
- SortClauseListsMatch() compares tleSortGroupRef, sortop,
  nulls_first, and eqop for each pair.

Physical planner changes (multi_physical_planner.c):
- CreatePhysicalDistributedPlan() finds the worker MultiExtendedOp
  with sortedMergeEligible = true, builds SortedMergeKey metadata
  from the worker job query, and sets useSortedMerge on the plan.
- BuildSortedMergeKeys() constructs the key array from the worker
  query's SortGroupClause list and target list.

The coordinator Sort node is still present above the CustomScan
(pathkeys not set yet — that is Phase 4). Results are correct
because the redundant Sort re-sorts already-sorted data.

Co-authored-by: Copilot
Phase 3 of the sorted-merge feature. When distributedPlan->useSortedMerge
is true (set at planning time by Phase 2), the adaptive executor now:
1. Routes worker results into per-task tuple stores via a new
   PerTaskDispatchTupleDest that dispatches by task->taskId hash lookup.
   No Task fields are mutated — all state lives on DistributedExecution.
2. After all tasks complete, performs a k-way merge of the per-task stores
   into the final scanState->tuplestorestate using PostgreSQL's public
   binaryheap and SortSupport APIs.
3. Frees per-task stores after the merge.

The existing CitusExecScan/ReturnTupleFromTuplestore/CitusEndScan/
CitusReScan code paths are completely unchanged — they read from
the final tuplestore exactly as before.

New files:
- sorted_merge.h: CreatePerTaskDispatchDest, MergePerTaskStoresIntoFinalStore
- sorted_merge.c: PerTaskDispatchTupleDest with taskId->index hash routing,
  MergePerTaskStoresIntoFinalStore with binaryheap merge, MergeHeapComparator
  modeled after PG's heap_compare_slots in nodeMergeAppend.c
Modified:
- adaptive_executor.c: DistributedExecution gains useSortedMerge/perTaskStores/
  perTaskStoreCount fields. AdaptiveExecutor() branches on useSortedMerge to
  create per-task stores, then merges post-execution. EXPLAIN ANALYZE falls
  back to existing single-tuplestore path.
Safety:
- Shared TupleDestinationStats preserves citus.max_intermediate_result_size
- Per-task stores allocated in AdaptiveExecutor local memory context
  (auto-cleanup on error via PG memory context teardown)
- task->totalReceivedTupleData tracking preserved

The coordinator Sort node is still present above the CustomScan (pathkeys
not set until Phase 4). Results are correct because the redundant Sort
re-sorts already-sorted data.
Co-authored-by: Copilot
test: custom_aggregate_support aggregate_support tdigest_aggregate_support
test: multi_average_expression multi_working_columns multi_having_pushdown having_subquery
test: multi_array_agg multi_limit_clause multi_orderby_limit_pushdown
test: multi_orderby_pushdown

@colm-mchugh colm-mchugh Apr 3, 2026


Can you try the test in multi_mx_schedule ? That exercises query from any node (any node can act as coordinator for select and DML queries). It doesn't need to be committed, just verify that the test passes there.

-- Cleanup
-- =================================================================

SET citus.enable_sorted_merge TO off;


I'm slightly concerned that the GUC is off by default. The commit message states that, when enabled, the only failing regress tests fail because of expected plan changes. Why not enable it by default?

SELECT id, val, num FROM sorted_merge_test ORDER BY id LIMIT 20
)
SELECT * FROM cte WHERE num > 10 ORDER BY id LIMIT 5');



Can there be a category of tests for distributed transactions? To verify correctness in scenarios like:

BEGIN;
  INSERT INTO t ..
  UPDATE t ..
  SELECT ... FROM t ORDER BY c1, c2, c3;
END;

*/
if (EnableSortedMerge && sortClauseList != NIL &&
	orderByLimitReference.onlyPushableWindowFunctions &&
	!orderByLimitReference.hasOrderByAggregate)


May need to also check that the clauses do not contain expressions that need to be evaluated on the coordinator?

Group Key: sorted_merge_test.id
-> Seq Scan on public.sorted_merge_test_960000 sorted_merge_test (actual rows=N loops=N)
Output: id, val, num, ts
(19 rows)

@colm-mchugh colm-mchugh Apr 3, 2026


It would be useful to include examples that avoid a sort on the workers, for example by taking advantage of an index.

Also curious whether you have tested the performance of this and what the observed gains are?

Output: remote_scan.id, remote_scan.val
Task Count: N
Tuple data received from nodes: N bytes
Tasks Shown: One of N


Some indication of the N-way merge would be helpful here.

PREPARE merge_off_stmt AS SELECT id, val FROM sorted_merge_test ORDER BY id LIMIT 10;
EXECUTE merge_off_stmt;
SET citus.enable_sorted_merge TO on;
EXECUTE merge_off_stmt;

@colm-mchugh colm-mchugh Apr 5, 2026


Include an EXPLAIN to verify line 322? Also, there may need to be 6+ EXECUTE statements before the plan is cached (the plancache typically considers switching to a generic plan only after five executions).
