Skip to content

[VL] Subdivide SplitRV CPU/wall timer into 4 sub-stages#12108

Open
luis4a0 wants to merge 5 commits into
apache:mainfrom
luis4a0:lpenaranda/pr-splitrv-substages
Open

[VL] Subdivide SplitRV CPU/wall timer into 4 sub-stages#12108
luis4a0 wants to merge 5 commits into
apache:mainfrom
luis4a0:lpenaranda/pr-splitrv-substages

Conversation

@luis4a0
Copy link
Copy Markdown
Contributor

@luis4a0 luis4a0 commented May 18, 2026

What changes are proposed in this pull request?

Stacked on top of #12107. Until that PR merges this PR's diff will appear larger (it carries both #12107's commits and this PR's commits). Once #12107 lands in main, the diff here will narrow to just the SplitRV sub-stage changes (2 commits, ~232 lines). Reviewers may wait for #12107 to merge if they prefer to look at this PR's contribution in isolation.

The existing CpuWallTimingSplitRV bucket wraps the entire splitRowVector() call, so when it dominates wall time in a VELOX_SHUFFLE_WRITER_LOG_FLAG=1 profile, there's no way to tell which of the four sub-paths is responsible: fixed-width scatter, validity copy, binary split, or complex (Presto-serialized) split.

This PR subdivides SplitRV into four new buckets, one per helper inside splitRowVector():

New enum Helper
CpuWallTimingSplitFixedWidth splitFixedWidthValueBuffer()
CpuWallTimingSplitValidity splitValidityBuffer()
CpuWallTimingSplitBinary splitBinaryArray()
CpuWallTimingSplitComplex splitComplexType()

The outer SplitRV bucket is unchanged: it continues to count once per call, and its wall/cpu measurement now happens to overlap the four inner buckets (the outer timer is still ticking while each inner one is). The outer minus the sum-of-inners is therefore the time spent in splitRowVector() outside the four sub-paths (mostly the post-split partitionBufferBase_ update loop). This is the same parent/child pattern most profiling tools use for nested timers, and it's documented in a comment next to the enum.

To make the new sub-stage data accessible to anything other than the existing compile-time log path (tests, future ShuffleWriterMetrics exposure, profilers), three things are hoisted from protected to public in VeloxShuffleWriter:

  • The CpuWallTimingType enum
  • The CpuWallTimingName() helper (already de-facto public, since its output is printed by the VELOX_SHUFFLE_WRITER_LOG_FLAG log)
  • A new single-value accessor cpuWallTiming(CpuWallTimingType)

The underlying cpuWallTimingList_ storage stays protected.

Why now

This is the natural follow-up to #12083 (which made the per-stage timers actually record real numbers) and to #12107 (which adds a per-batch input-encoding counter). Together, the three changes let a reviewer answer "what shape did the writer see, where did the time go, and what inner path drove it" from a single VELOX_SHUFFLE_WRITER_LOG_FLAG run, instead of being stuck at "SplitRV was 65 % of wall time, somewhere in there".

A per-query TPC-DS/TPC-H SF10 profile sweep on a closely-related internal branch shows SplitRV is the top-1 or top-2 bucket on 10/12 queries surveyed (46-71% of wall), with substantially different sub-stage breakdowns: bigint-heavy queries (q17, q14a) are overwhelmingly splitFixedWidthValueBuffer, while VARCHAR-heavy q67 shifts toward splitBinaryArray. Without the sub-stage timers, those breakdowns are invisible.

How was this patch tested?

This PR adds cpp/velox/tests/VeloxHashShuffleWriterSplitRvSubStagesTest.cc (second commit) with five gtest cases:

  • enumNames: validates the strings emitted under VELOX_SHUFFLE_WRITER_LOG_FLAG for the 4 new enum entries (these strings are stable observable surface).
  • fixedWidthBumpsItsBucket: single fixed-width batch; asserts each of the 4 inner buckets has count == 1 (every helper inside splitRowVector() is called once per batch regardless of column-type mix, since the new sub-stage SCOPED_TIMER calls live at function entry).
  • binaryBumpsItsBucket: VARCHAR batch; asserts the binary sub-stage's wallNanos > 0, confirming real work is being measured (not just function-entry overhead).
  • complexBumpsItsBucket: ARRAY batch; same wallNanos > 0 assertion against the complex sub-stage (Presto-serialized round-trip per partition).
  • outerSplitRvStillCounted: sanity: two batches in, outer SplitRV bucket has count == 2. Confirms the new inner timers refine, not replace, the existing outer timer.

luis4a0 and others added 2 commits May 18, 2026 13:39
Companion to apache#12083 (`SCOPED_TIMER` fix). With the per-stage timers
in `cpuWallTimingList_` now recording real numbers, the natural
complement is a counter that tells the reader what shape the writer
actually sees — i.e. for each `write(cb, ...)` call, how many of the
input children arrive as FLAT / DICTIONARY / CONSTANT / LAZY / other
encoding (the values of `facebook::velox::VectorEncoding::Simple`).

This information is useful for at least three things:

- Verifying whether a workload would benefit from writer-side
  dictionary / constant pass-through optimizations *before* writing the
  pass-through code (does the writer ever actually see those encodings?).
- Spotting unexpected upstream operator changes that collapse / preserve
  encodings differently across Velox versions (e.g., a change in
  HashAggregate that starts emitting CONSTANT children when it didn't
  before).
- Correlating a wall-time spike in a particular stage with the encoding
  mix that drove it.

The counter is always-on (one branch + one increment per input child,
swamped by the surrounding shuffle work). Output is gated through the
existing `VELOX_SHUFFLE_WRITER_LOG_FLAG` log channel — same line shape
as the existing `cpuWallTimingList_` lines — so behavior is unchanged
when the flag is off (default).

Scope is deliberately narrow: only the 5-bucket encoding mix at the
hash-shuffle writer entry, surfaced via the existing log channel. A
per-type breakdown (BIGINT / DECIMAL / VARCHAR / ...) and exposure
through `ShuffleWriterMetrics` to the JVM SparkListener are both
plausible follow-ups but kept out of this PR.

Generated-by: GitHub Copilot CLI (Claude Opus 4.7 1M context)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Three test cases covering the typical encoding mixes a hash shuffle
writer may see at `write()` entry:

- `allFlat` — every child arrives FLAT (the common case at HEAD
  on TPC-H / TPC-DS, where upstream operators flatten
  dict/const inputs before shuffle). Asserts `FLAT` bucket
  accumulates across two batches and the other 4 buckets stay 0.
- `mixedFlatDictConst` — one batch with FLAT + DICTIONARY +
  CONSTANT children, asserts the three respective buckets each
  increment.
- `lazy` — wraps a child in a `LazyVector` and asserts the
  `LAZY` bucket increments (the encoding is captured BEFORE
  any `loadedVector()` call inside the writer).

The test is a standalone `add_velox_test(...)` entry rather than a
new `TEST_P` case in `VeloxShuffleWriterTest.cc`, so it isolates
the counter behavior from the parameterised round-trip matrix —
which keeps the failure mode obvious if a future change quietly
breaks one of the buckets.

Generated-by: GitHub Copilot CLI (Claude Opus 4.7 1M context)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@luis4a0 luis4a0 marked this pull request as ready for review May 18, 2026 14:08
@github-actions github-actions Bot added the VELOX label May 18, 2026
luis4a0 and others added 3 commits May 18, 2026 16:30
…ount

Two review comments from @yaooqinn on apache#12107:

(1) apache#12107 (comment) —
`VectorEncoding::Simple` has 11 values; the prior switch handled 4 and
funnelled the remaining 7 (BIASED, SEQUENCE, ROW, MAP, FLAT_MAP, ARRAY,
FUNCTION) into "Other". ROW / MAP / ARRAY are first-class struct /
map / array column types exercised by the sibling `VeloxShuffleWriterTest`
via `makeArrayVector` / `makeMapVector`, so a reader interpreting
"Other=80%" in the log would assume rare/unknown encodings rather than
"I have a struct column". Split into a new `kInputEncodingComplex`
bucket for ROW / MAP / FLAT_MAP / ARRAY; `kInputEncodingOther` now
only catches BIASED / SEQUENCE / FUNCTION (and future additions to
`VectorEncoding::Simple`).

(2) apache#12107 (comment) —
`accumulateInputEncodingCounts` early-returns on non-velox batches, so
the InputEncoding log line's denominator (sum of buckets) is over a
different population than the `count` field on the
`cpuWallTimingList_` lines emitted above. Added an
`inputEncodingSkippedBatches_` counter that is incremented on every
early-return path and printed at the end of the InputEncoding line as
`SkippedNonVeloxBatches=N`, so the two log blocks have comparable
denominators.

Test file updated: new `complex` test case exercises ARRAY + MAP
children landing in `kInputEncodingComplex` (not `Other`); all
existing cases also assert the new `Complex` bucket stays 0 and
`inputEncodingSkippedBatches() == 0`.

Generated-by: GitHub Copilot CLI (Claude Opus 4.7 1M context)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The existing `CpuWallTimingSplitRV` bucket wraps the entire
`splitRowVector()` call, so when it dominates wall time in a
`VELOX_SHUFFLE_WRITER_LOG_FLAG=1` profile, there's no way to tell
which of the four sub-paths is responsible: fixed-width scatter,
validity copy, binary split, or complex (Presto-serialized) split.

This change subdivides `SplitRV` into four new buckets, one per
helper inside `splitRowVector()`:

  - `CpuWallTimingSplitFixedWidth` — `splitFixedWidthValueBuffer`
  - `CpuWallTimingSplitValidity`   — `splitValidityBuffer`
  - `CpuWallTimingSplitBinary`     — `splitBinaryArray`
  - `CpuWallTimingSplitComplex`    — `splitComplexType`

The outer `SplitRV` bucket is unchanged: it continues to count
once per call, and its wall/cpu measurement now happens to overlap
the four inner buckets (the outer timer is still ticking while
each inner is). The outer minus the sum-of-inners is therefore the
time spent in `splitRowVector()` outside the four sub-paths
(mostly the post-split `partitionBufferBase_` update loop).

To make the new sub-stage data accessible to anything other than
the existing compile-time log path (tests, future
`ShuffleWriterMetrics` exposure, profilers), the
`CpuWallTimingType` enum, the `CpuWallTimingName()` helper, and a
new single-value accessor `cpuWallTiming(CpuWallTimingType)` are
hoisted from `protected` to `public`. The underlying
`cpuWallTimingList_` storage stays `protected`.

This is the natural follow-up to apache#12083 (which made the per-stage
timers actually record real numbers) and to the input-encoding
counter in the preceding commit on this branch (which is itself
PR apache#12107). Together, the three changes let a reviewer answer
"what shape did the writer see, where did the time go, and what
inner path drove it" from a single `VELOX_SHUFFLE_WRITER_LOG_FLAG`
run.

Generated-by: GitHub Copilot CLI (Claude Opus 4.7 1M context)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Five gtest cases exercise the new sub-stage timers introduced in
the previous commit:

  - `enumNames` — validates the strings emitted under
    `VELOX_SHUFFLE_WRITER_LOG_FLAG` for the 4 new enum entries
    (these strings are stable observable surface, since they're
    what shows up in the per-stage log line).
  - `fixedWidthBumpsItsBucket` — single fixed-width batch;
    asserts each of the 4 inner buckets has `count == 1` (every
    helper inside `splitRowVector()` is called regardless of
    column-type mix).
  - `binaryBumpsItsBucket` — VARCHAR batch; asserts the binary
    sub-stage's wallNanos > 0, confirming real work is being
    measured (not just function-entry overhead).
  - `complexBumpsItsBucket` — ARRAY batch; same wallNanos > 0
    assertion against the complex sub-stage (Presto serializer
    round-trip per partition).
  - `outerSplitRvStillCounted` — sanity: two batches in, outer
    `SplitRV` bucket has `count == 2`. Confirms the new inner
    timers refine, not replace, the existing outer timer.

The new test is a standalone `add_velox_test(...)` entry so each
sub-stage's behavior is asserted independently of the parameterised
round-trip matrix in `VeloxShuffleWriterTest.cc`.

Generated-by: GitHub Copilot CLI (Claude Opus 4.7 1M context)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@luis4a0 luis4a0 force-pushed the lpenaranda/pr-splitrv-substages branch from ee523eb to fb4ac15 Compare May 18, 2026 16:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant