Skip to content

[X-2935] Cherry-pick missing dedup with_new_children block from upstream #21807#61

Merged
zhuqi-lucas merged 1 commit into
branch-53from
qizhu/x-2935-cherry-pick-21807-fix
Jun 15, 2026
Merged

[X-2935] Cherry-pick missing dedup with_new_children block from upstream #21807#61
zhuqi-lucas merged 1 commit into
branch-53from
qizhu/x-2935-cherry-pick-21807-fix

Conversation

@zhuqi-lucas

Copy link
Copy Markdown
Collaborator

Summary

When this fork originally ported apache#21807 (commit 47f263d), the cache-hit with_new_children block in `DeduplicatingDeserializer::proto_to_physical_expr` was accidentally omitted. The simpler `return Ok(Arc::clone(cached))` we landed is correct only when every occurrence of a given Inner has the same outer wrapper shape -- which doesn't hold in practice.

Why this is broken in production

FilterPushdown clones a SortExec's `DynamicFilterPhysicalExpr` and rewrites its children's column refs to match the downstream FileScan's file schema. Both outer wrappers share the same `Arc` (same `expression_id`, same wire `expr_id`) but DIFFERENT children. On decode the simpler cache hit returns the first decode's outer wrapper as-is, silently discarding the second occurrence's children. `prune_by_statistics` then resolves column refs against the wrong positions and pruning becomes a no-op.

Observed end-to-end on ny2 staging 2026-06-15 after walker removal:

  • `row_groups_pruned_statistics=0 total`
  • `bytes_scanned=91 MB`
  • `time_elapsed_processing=31 s`
  • (was 1.99K pruned / 136 KB / ~100ms with the walker still active)

What this PR does

Cherry-picks the missing block from upstream `apache/datafusion@948cd09` (#21807\):

let parsed = parse_physical_expr_with_converter(...)?;
let mut cache = self.cache.borrow_mut();
if let Some(cached) = cache.get(&id) {
    let children: Vec<_> = parsed.children().into_iter().cloned().collect();
    return Arc::clone(cached).with_new_children(children);
}
cache.insert(id, Arc::clone(&parsed));
Ok(parsed)

This keeps the cached Inner (so TopK heap-max updates propagate end-to-end) but installs the proto body's children on a fresh outer wrapper (so each occurrence keeps its own column refs).

Tests

  • New regression test `dynamic_filter_dedup_distinct_children_via_with_new_children`: builds two outer Arcs sharing the same Inner but with different children (`Column("a", 0)` vs `Column("a", 1)`), runs them through the full DeduplicatingSerializer + DeduplicatingDeserializer pair, and asserts that each decoded Arc keeps its OCCURRENCE's children. Without this fix the test fails with left: 0, right: 1, exactly matching the prod symptom.
  • Existing tests `dynamic_filter_dedup_with_deduplicating_codec` and `dynamic_filter_dedup_distinct_outer_arcs_same_inner` relaxed their over-strict `Arc::ptr_eq` assertions to the actual invariant (shared Inner via `expression_id` equality), since the cache hit path now rewraps via `with_new_children` and returns a distinct outer Arc.

Affects

  • atlas internal build: required for X-2935 walker removal to actually work end-to-end. The atlas-side PR in rust-app-atlas can drop the workaround revert and let the pruner consume the shared-Inner-but-distinct-children invariant.
  • Branch-53 fork API: no API change. Behavior change is purely in `DeduplicatingDeserializer::proto_to_physical_expr` cache hit, and exactly matches upstream main.
  • Wire format: unchanged.
  • Staging / prod blast radius: none until the atlas PR bumps its DF pin to this SHA.

Copilot AI review requested due to automatic review settings June 15, 2026 06:31
@github-actions github-actions Bot added the proto label Jun 15, 2026

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a correctness issue in DeduplicatingDeserializer::proto_to_physical_expr where cache hits previously returned the cached outer expression Arc unchanged, potentially discarding occurrence-specific children (e.g., after FilterPushdown rewrites column refs). The change aligns behavior with upstream by rewrapping cached expressions using with_new_children(...) populated from the current proto body.

Changes:

  • Update DeduplicatingDeserializer::proto_to_physical_expr cache-hit logic to rewrap cached expressions via with_new_children(parsed.children()) while preserving shared inner state.
  • Relax existing tests that assumed outer-Arc pointer equality and add a regression test for “same Inner, distinct children” roundtrips.
  • Expand assertions around dynamic filter deduplication invariants in roundtrip tests.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
datafusion/proto/src/physical_plan/mod.rs Rewrap cached physical expressions with occurrence-specific children on cache hits to avoid child clobbering.
datafusion/proto/tests/cases/roundtrip_physical_plan.rs Adjust existing dynamic-filter dedup tests and add a regression test for distinct-children occurrences.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread datafusion/proto/tests/cases/roundtrip_physical_plan.rs
Comment thread datafusion/proto/tests/cases/roundtrip_physical_plan.rs Outdated
Comment thread datafusion/proto/tests/cases/roundtrip_physical_plan.rs Outdated
Comment thread datafusion/proto/tests/cases/roundtrip_physical_plan.rs Outdated
@zhuqi-lucas zhuqi-lucas force-pushed the qizhu/x-2935-cherry-pick-21807-fix branch from 722eae2 to 5ff5d36 Compare June 15, 2026 06:37
…e#21807

When this fork ported apache#21807 in commit 47f263d, the
'with_new_children on cache hit' block in DeduplicatingDeserializer
was accidentally omitted. The simpler 'return Ok(Arc::clone(cached))'
version we landed is correct only when every occurrence of a given
Inner has the same outer wrapper shape.

In production this fails: FilterPushdown clones a SortExec's
DynamicFilterPhysicalExpr and rewrites its children's column refs to
match the downstream FileScan's file schema. Both outer wrappers
share the same Inner Arc (same expression_id, same wire expr_id) but
DIFFERENT children. On decode the simpler cache hit returns the
first decode's outer wrapper, silently discarding the second
occurrence's children. prune_by_statistics then resolves column refs
against the wrong positions and pruning becomes a no-op.

Observed end-to-end on ny2 staging 2026-06-15 after walker removal:
  row_groups_pruned_statistics=0 total
  bytes_scanned=91 MB
  time_elapsed_processing=31 s
  (was 1.99K pruned / 136 KB / ~100ms with the walker still active)

Cherry-picks the missing block: parse the proto body first, then on
cache hit return Arc::clone(cached).with_new_children(parsed.children()).
This keeps the cached Inner (so TopK heap-max updates propagate) but
installs the proto body's children on a fresh outer wrapper (so each
occurrence keeps its own column refs).

Adds a regression test that fails without the fix and passes with it,
with assertion messages pointing at the exact root cause.
@zhuqi-lucas zhuqi-lucas force-pushed the qizhu/x-2935-cherry-pick-21807-fix branch from 5ff5d36 to 766a24c Compare June 15, 2026 06:45
@zhuqi-lucas zhuqi-lucas merged commit eccc612 into branch-53 Jun 15, 2026
54 checks passed
@zhuqi-lucas zhuqi-lucas deleted the qizhu/x-2935-cherry-pick-21807-fix branch June 15, 2026 07:21
zhuqi-lucas added a commit to massive-com/datafusion-materialized-views that referenced this pull request Jun 15, 2026
Picks up the dedup with_new_children fix
(massive-com/arrow-datafusion#61) on branch-53.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants