[VL] Subdivide SplitRV CPU/wall timer into 4 sub-stages#12108
Open
luis4a0 wants to merge 5 commits into
Open
Conversation
Companion to apache#12083 (`SCOPED_TIMER` fix). With the per-stage timers in `cpuWallTimingList_` now recording real numbers, the natural complement is a counter that tells the reader what shape the writer actually sees — i.e. for each `write(cb, ...)` call, how many of the input children arrive as FLAT / DICTIONARY / CONSTANT / LAZY / other encoding (the values of `facebook::velox::VectorEncoding::Simple`). This information is useful for at least three things: - Verifying whether a workload would benefit from writer-side dictionary / constant pass-through optimizations *before* writing the pass-through code (does the writer ever actually see those encodings?). - Spotting unexpected upstream operator changes that collapse / preserve encodings differently across Velox versions (e.g., a change in HashAggregate that starts emitting CONSTANT children when it didn't before). - Correlating a wall-time spike in a particular stage with the encoding mix that drove it. The counter is always-on (one branch + one increment per input child, swamped by the surrounding shuffle work). Output is gated through the existing `VELOX_SHUFFLE_WRITER_LOG_FLAG` log channel — same line shape as the existing `cpuWallTimingList_` lines — so behavior is unchanged when the flag is off (default). Scope is deliberately narrow: only the 5-bucket encoding mix at the hash-shuffle writer entry, surfaced via the existing log channel. A per-type breakdown (BIGINT / DECIMAL / VARCHAR / ...) and exposure through `ShuffleWriterMetrics` to the JVM SparkListener are both plausible follow-ups but kept out of this PR. Generated-by: GitHub Copilot CLI (Claude Opus 4.7 1M context) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Three test cases covering the typical encoding mixes a hash shuffle writer may see at `write()` entry: - `allFlat` — every child arrives FLAT (the common case at HEAD on TPC-H / TPC-DS, where upstream operators flatten dict/const inputs before shuffle). Asserts `FLAT` bucket accumulates across two batches and the other 4 buckets stay 0. - `mixedFlatDictConst` — one batch with FLAT + DICTIONARY + CONSTANT children, asserts the three respective buckets each increment. - `lazy` — wraps a child in a `LazyVector` and asserts the `LAZY` bucket increments (the encoding is captured BEFORE any `loadedVector()` call inside the writer). The test is a standalone `add_velox_test(...)` entry rather than a new `TEST_P` case in `VeloxShuffleWriterTest.cc`, so it isolates the counter behavior from the parameterised round-trip matrix — which keeps the failure mode obvious if a future change quietly breaks one of the buckets. Generated-by: GitHub Copilot CLI (Claude Opus 4.7 1M context) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ount Two review comments from @yaooqinn on apache#12107: (1) apache#12107 (comment) — `VectorEncoding::Simple` has 11 values; the prior switch handled 4 and funnelled the remaining 7 (BIASED, SEQUENCE, ROW, MAP, FLAT_MAP, ARRAY, FUNCTION) into "Other". ROW / MAP / ARRAY are first-class struct / map / array column types exercised by the sibling `VeloxShuffleWriterTest` via `makeArrayVector` / `makeMapVector`, so a reader interpreting "Other=80%" in the log would assume rare/unknown encodings rather than "I have a struct column". Split into a new `kInputEncodingComplex` bucket for ROW / MAP / FLAT_MAP / ARRAY; `kInputEncodingOther` now only catches BIASED / SEQUENCE / FUNCTION (and future additions to `VectorEncoding::Simple`). (2) apache#12107 (comment) — `accumulateInputEncodingCounts` early-returns on non-velox batches, so the InputEncoding log line's denominator (sum of buckets) is over a different population than the `count` field on the `cpuWallTimingList_` lines emitted above. Added an `inputEncodingSkippedBatches_` counter that is incremented on every early-return path and printed at the end of the InputEncoding line as `SkippedNonVeloxBatches=N`, so the two log blocks have comparable denominators. Test file updated: new `complex` test case exercises ARRAY + MAP children landing in `kInputEncodingComplex` (not `Other`); all existing cases also assert the new `Complex` bucket stays 0 and `inputEncodingSkippedBatches() == 0`. Generated-by: GitHub Copilot CLI (Claude Opus 4.7 1M context) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The existing `CpuWallTimingSplitRV` bucket wraps the entire `splitRowVector()` call, so when it dominates wall time in a `VELOX_SHUFFLE_WRITER_LOG_FLAG=1` profile, there's no way to tell which of the four sub-paths is responsible: fixed-width scatter, validity copy, binary split, or complex (Presto-serialized) split. This change subdivides `SplitRV` into four new buckets, one per helper inside `splitRowVector()`: - `CpuWallTimingSplitFixedWidth` — `splitFixedWidthValueBuffer` - `CpuWallTimingSplitValidity` — `splitValidityBuffer` - `CpuWallTimingSplitBinary` — `splitBinaryArray` - `CpuWallTimingSplitComplex` — `splitComplexType` The outer `SplitRV` bucket is unchanged: it continues to count once per call, and its wall/cpu measurement now happens to overlap the four inner buckets (the outer timer is still ticking while each inner is). The outer minus the sum-of-inners is therefore the time spent in `splitRowVector()` outside the four sub-paths (mostly the post-split `partitionBufferBase_` update loop). To make the new sub-stage data accessible to anything other than the existing compile-time log path (tests, future `ShuffleWriterMetrics` exposure, profilers), the `CpuWallTimingType` enum, the `CpuWallTimingName()` helper, and a new single-value accessor `cpuWallTiming(CpuWallTimingType)` are hoisted from `protected` to `public`. The underlying `cpuWallTimingList_` storage stays `protected`. This is the natural follow-up to apache#12083 (which made the per-stage timers actually record real numbers) and to the input-encoding counter in the preceding commit on this branch (which is itself PR apache#12107). Together, the three changes let a reviewer answer "what shape did the writer see, where did the time go, and what inner path drove it" from a single `VELOX_SHUFFLE_WRITER_LOG_FLAG` run. Generated-by: GitHub Copilot CLI (Claude Opus 4.7 1M context) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Five gtest cases exercise the new sub-stage timers introduced in
the previous commit:
- `enumNames` — validates the strings emitted under
`VELOX_SHUFFLE_WRITER_LOG_FLAG` for the 4 new enum entries
(these strings are stable observable surface, since they're
what shows up in the per-stage log line).
- `fixedWidthBumpsItsBucket` — single fixed-width batch;
asserts each of the 4 inner buckets has `count == 1` (every
helper inside `splitRowVector()` is called regardless of
column-type mix).
- `binaryBumpsItsBucket` — VARCHAR batch; asserts the binary
sub-stage's wallNanos > 0, confirming real work is being
measured (not just function-entry overhead).
- `complexBumpsItsBucket` — ARRAY batch; same wallNanos > 0
assertion against the complex sub-stage (Presto serializer
round-trip per partition).
- `outerSplitRvStillCounted` — sanity: two batches in, outer
`SplitRV` bucket has `count == 2`. Confirms the new inner
timers refine, not replace, the existing outer timer.
The new test is a standalone `add_velox_test(...)` entry so each
sub-stage's behavior is asserted independently of the parameterised
round-trip matrix in `VeloxShuffleWriterTest.cc`.
Generated-by: GitHub Copilot CLI (Claude Opus 4.7 1M context)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ee523eb to
fb4ac15
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changes are proposed in this pull request?
The existing
CpuWallTimingSplitRVbucket wraps the entiresplitRowVector()call, so when it dominates wall time in aVELOX_SHUFFLE_WRITER_LOG_FLAG=1profile, there's no way to tell which of the four sub-paths is responsible: fixed-width scatter, validity copy, binary split, or complex (Presto-serialized) split.This PR subdivides
SplitRVinto four new buckets, one per helper insidesplitRowVector():CpuWallTimingSplitFixedWidthsplitFixedWidthValueBuffer()CpuWallTimingSplitValiditysplitValidityBuffer()CpuWallTimingSplitBinarysplitBinaryArray()CpuWallTimingSplitComplexsplitComplexType()The outer
SplitRVbucket is unchanged: it continues to count once per call, and its wall/cpu measurement now happens to overlap the four inner buckets (the outer timer is still ticking while each inner one is). The outer minus the sum-of-inners is therefore the time spent insplitRowVector()outside the four sub-paths (mostly the post-splitpartitionBufferBase_update loop). This is the same parent/child pattern most profiling tools use for nested timers, and it's documented in a comment next to the enum.To make the new sub-stage data accessible to anything other than the existing compile-time log path (tests, future
ShuffleWriterMetricsexposure, profilers), three things are hoisted fromprotectedtopublicinVeloxShuffleWriter:CpuWallTimingTypeenumCpuWallTimingName()helper (already de-facto public, since its output is printed by theVELOX_SHUFFLE_WRITER_LOG_FLAGlog)cpuWallTiming(CpuWallTimingType)The underlying
cpuWallTimingList_storage staysprotected.Why now
This is the natural follow-up to #12083 (which made the per-stage timers actually record real numbers) and to #12107 (which adds a per-batch input-encoding counter). Together, the three changes let a reviewer answer "what shape did the writer see, where did the time go, and what inner path drove it" from a single
VELOX_SHUFFLE_WRITER_LOG_FLAGrun, instead of being stuck at "SplitRVwas 65 % of wall time, somewhere in there".A per-query TPC-DS/TPC-H SF10 profile sweep on a closely-related internal branch shows
SplitRVis the top-1 or top-2 bucket on 10/12 queries surveyed (46-71% of wall), with substantially different sub-stage breakdowns: bigint-heavy queries (q17, q14a) are overwhelminglysplitFixedWidthValueBuffer, while VARCHAR-heavy q67 shifts towardsplitBinaryArray. Without the sub-stage timers, those breakdowns are invisible.How was this patch tested?
This PR adds
cpp/velox/tests/VeloxHashShuffleWriterSplitRvSubStagesTest.cc(second commit) with five gtest cases:enumNames: validates the strings emitted underVELOX_SHUFFLE_WRITER_LOG_FLAGfor the 4 new enum entries (these strings are stable observable surface).fixedWidthBumpsItsBucket: single fixed-width batch; asserts each of the 4 inner buckets hascount == 1(every helper insidesplitRowVector()is called once per batch regardless of column-type mix, since the new sub-stageSCOPED_TIMERcalls live at function entry).binaryBumpsItsBucket: VARCHAR batch; asserts the binary sub-stage'swallNanos > 0, confirming real work is being measured (not just function-entry overhead).complexBumpsItsBucket: ARRAY batch; samewallNanos > 0assertion against the complex sub-stage (Presto-serialized round-trip per partition).outerSplitRvStillCounted: sanity: two batches in, outerSplitRVbucket hascount == 2. Confirms the new inner timers refine, not replace, the existing outer timer.