perf: Introduce OptimizedHashPartitionFunction by yingsu00 · Pull Request #2016 · IBM/velox

yingsu00 · 2026-05-09T22:04:28Z

This PR contains 2 commits:

perf: Add AVX512 support
Introduce OptimizedHashPartitionFunction
Introduce OptimizedHashPartitionFunction as a faster drop-in replacement
for HashPartitionFunction, gated behind a new query config flag
optimized_hash_partition_function_enabled (default false). partition()
becomes faster by anywhere from 50% to over 200x.

Add HashPartitionFunctionBase as a common base exposing numPartitions(),
and createHashPartitionFunction() factories that select the
implementation based on the flag. Thread QueryConfig* through
PartitionFunctionSpec::create() and update callsites (LocalPartition,
PartitionedOutput, MarkDistinct, RowNumber, Window,
SubPartitionedSortWindowBuild, HiveConnector) to construct partition
functions via the factory.

Register CMake targets for the new test and benchmark binaries.
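The flag-based selection described above could look roughly like the following sketch. The types here are simplified stand-ins in a `sketch` namespace, not the Velox classes: the single-field `QueryConfig`, the `isOptimized()` probe, and the two-argument factory signature are assumptions made for illustration only.

```cpp
#include <cassert>
#include <memory>

namespace sketch {

// Stand-in for core::QueryConfig; the real class exposes many more settings.
struct QueryConfig {
  bool optimizedHashPartitionFunctionEnabled = false;
};

// Common base that exposes numPartitions(), as in the PR description.
class HashPartitionFunctionBase {
 public:
  explicit HashPartitionFunctionBase(int numPartitions)
      : numPartitions_(numPartitions) {}
  virtual ~HashPartitionFunctionBase() = default;

  int numPartitions() const {
    return numPartitions_;
  }

  // Hypothetical probe so callers of this sketch can tell which
  // implementation the factory picked.
  virtual bool isOptimized() const = 0;

 private:
  const int numPartitions_;
};

class HashPartitionFunction : public HashPartitionFunctionBase {
 public:
  using HashPartitionFunctionBase::HashPartitionFunctionBase;
  bool isOptimized() const override {
    return false;
  }
};

class OptimizedHashPartitionFunction : public HashPartitionFunctionBase {
 public:
  using HashPartitionFunctionBase::HashPartitionFunctionBase;
  bool isOptimized() const override {
    return true;
  }
};

// Factory: a null config or a disabled flag falls back to the existing
// implementation, so the default behavior is unchanged.
inline std::unique_ptr<HashPartitionFunctionBase> createHashPartitionFunction(
    int numPartitions,
    const QueryConfig* queryConfig) {
  if (queryConfig != nullptr &&
      queryConfig->optimizedHashPartitionFunctionEnabled) {
    return std::make_unique<OptimizedHashPartitionFunction>(numPartitions);
  }
  return std::make_unique<HashPartitionFunction>(numPartitions);
}

} // namespace sketch
```

Treating a null `QueryConfig*` as "flag off" is one way to keep callers that have no query context working without changes.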

xin-zhang2 · 2026-05-13T13:31:02Z

      int numPartitions,
-      bool localExchange) const override;
+      bool localExchange,
+      const core::QueryConfig* queryConfig) const override;


HivePartitionFunctionTest doesn't compile due to no default value provided for queryConfig. We need to either add the default value here or pass nullptr in HivePartitionFunctionTest.

xin-zhang2 · 2026-05-13T14:46:30Z

+  }
+}
+
+bool hasConstantKeys(


The name hasConstantKeys is a bit confusing. This function returns true only when all partition keys are constant or backed by a constant-encoded input vector. It might be clearer to rename it to allConstantKeys.

xin-zhang2 · 2026-05-14T13:07:25Z

+    return 0u;
+  }
+
+  const auto size = input.size();


Return early if size is 0.

xin-zhang2 · 2026-05-14T13:15:47Z

The return value needs to be handled since it may return a single partition now.

Replaced this with

std::optional<uint32_t> singlePartition =
    subPartitioningFunction_->partition(*input, subPartitionIdsBuffer_);
if (singlePartition.has_value()) {
  simd::simdFill<uint32_t>(
      subPartitionIdsBuffer_.data(), singlePartition.value(), input->size());
}
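For readers outside the codebase, here is a minimal standalone sketch of the same caller-side pattern. `partitionSketch` and `materializePartitions` are hypothetical names, the "single partition" condition is simplified to `numPartitions == 1`, and `std::fill` stands in for `simd::simdFill`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical partition function. When every row maps to the same
// partition it returns that ID and leaves `partitions` untouched;
// otherwise it writes one ID per row and returns std::nullopt.
inline std::optional<uint32_t> partitionSketch(
    const std::vector<uint64_t>& hashes,
    uint32_t numPartitions,
    std::vector<uint32_t>& partitions) {
  if (numPartitions == 1) {
    return 0u; // Single-partition fast path: the buffer is not written.
  }
  for (size_t i = 0; i < hashes.size(); ++i) {
    partitions[i] = static_cast<uint32_t>(hashes[i] % numPartitions);
  }
  return std::nullopt;
}

// Caller that always needs a per-row buffer: it must expand the
// single-partition result itself, otherwise the buffer keeps stale values.
inline std::vector<uint32_t> materializePartitions(
    const std::vector<uint64_t>& hashes,
    uint32_t numPartitions) {
  // UINT32_MAX marks entries that were never written.
  std::vector<uint32_t> partitions(hashes.size(), UINT32_MAX);
  const auto single = partitionSketch(hashes, numPartitions, partitions);
  if (single.has_value()) {
    std::fill(partitions.begin(), partitions.end(), single.value());
  }
  return partitions;
}
```

Without the `has_value()` branch, the single-partition case would hand back a buffer full of the sentinel value, which is the failure mode the comment above warns about.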

xin-zhang2

@yingsu00 I left a few comments, please take a look when you have a chance. Thanks.

xin-zhang2 · 2026-05-18T16:00:59Z

+      {{core::QueryConfig::kOptimizedHashPartitionFunctionEnabled, "true"}});
+
+  auto optimizedFunction =
+      spec.create(8, /*localExchange=*/false, &optimizedConfig);


The third argument is bool, so we can directly pass true to it, or use optimizedConfig.optimizedPartitionedOutputEnabled().

yingsu00 · 2026-05-19T01:26:34Z

@xin-zhang2 Thanks for reviewing. I have updated the PR.

xin-zhang2 · 2026-05-20T14:21:26Z

+  }
+}
+
+std::optional<uint32_t> OptimizedHashPartitionFunction::partition(


OptimizedHashPartitionFunction is intended to be a drop-in replacement for HashPartitionFunction, so users may expect it to produce the same partition assignments as HashPartitionFunction. However, since it uses a different hash algorithm for non-hashBitRange cases, the resulting partition assignments might differ from those of HashPartitionFunction. Do you think we should add a comment to clarify this?

different hash algorithm for non-hashBitRange cases

No, the hash algorithms are the same. Both are using Folly's hash functions:

template <bool typeProvidesCustomComparison, TypeKind Kind>
uint64_t hashOne(const DecodedVector& decoded, vector_size_t index) {
  if constexpr (
      Kind == TypeKind::ROW || Kind == TypeKind::ARRAY ||
      Kind == TypeKind::MAP) {
    return decoded.base()->hashValueAt(decoded.index(index));
  } else {
    using T = typename KindToFlatVector<Kind>::HashRowType;
    const T value = decoded.valueAt<T>(index);
    if constexpr (typeProvidesCustomComparison) {
      return static_cast<const CanProvideCustomComparisonType<Kind>*>(
                 decoded.base()->type().get())
          ->hash(value);
    } else if constexpr (std::is_floating_point_v<T>) {
      return util::floating_point::NaNAwareHash<T>()(value);
    } else {
      return folly::hasher<T>()(value);
    }
  }
}

It's just the loops that were optimized.

Maybe I didn't make this clear in my previous comment. I'm not referring to the algorithm used to compute the hash value, but rather how the partition is determined from the hash value.
In HashPartitionFunction, the partition is calculated as

partitions[i] = hashes_[i] % numPartitions_

whereas in OptimizedHashPartitionFunction, it uses

partitions[index] = reduceRange(hashes[index], numPartitions);

where

FOLLY_ALWAYS_INLINE uint32_t mixedHash(uint64_t hash) {
  return static_cast<uint32_t>(hash) ^ static_cast<uint32_t>(hash >> 32);
}

FOLLY_ALWAYS_INLINE uint32_t reduceRange(uint64_t hash, uint32_t numPartitions) {
  return (static_cast<uint64_t>(mixedHash(hash)) * numPartitions) >> 32;
}

These two approaches are not equivalent, so for the same input hash value, they may produce different partition assignments.
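A small standalone check makes the difference concrete. The two helpers are copied from the snippet above with `FOLLY_ALWAYS_INLINE` replaced by plain `inline` so the sketch compiles without Folly; `moduloPartition` is a hypothetical name for the `hashes_[i] % numPartitions_` expression.

```cpp
#include <cassert>
#include <cstdint>

// Fold the 64-bit hash into 32 bits by XOR-ing its two halves.
inline uint32_t mixedHash(uint64_t hash) {
  return static_cast<uint32_t>(hash) ^ static_cast<uint32_t>(hash >> 32);
}

// Multiply-shift reduction: maps the 32-bit value onto [0, numPartitions)
// by scaling, so the result is driven by the high bits of mixedHash().
inline uint32_t reduceRange(uint64_t hash, uint32_t numPartitions) {
  return (static_cast<uint64_t>(mixedHash(hash)) * numPartitions) >> 32;
}

// Modulo reduction: the result is the remainder, which for power-of-two
// partition counts is simply the low bits of the hash.
inline uint32_t moduloPartition(uint64_t hash, uint32_t numPartitions) {
  return static_cast<uint32_t>(hash % numPartitions);
}
```

With 8 partitions, a hash of 5 lands in partition 5 under modulo but in partition 0 under `reduceRange` (5 * 8 = 40, and 40 >> 32 is 0). A hash of 0x80000000 goes the other way: modulo gives 0, while `reduceRange` gives 4. Both results always stay within [0, numPartitions), so each scheme is valid on its own; they just cannot be mixed between a producer and a consumer that expect the same row-to-partition mapping.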

xin-zhang2 · 2026-05-20T14:40:27Z

      int numPartitions,
-      bool localExchange = false) const = 0;
+      bool localExchange = false,
+      bool useOptimizedPartitionFunction = false) const = 0;


PartitionFunctionSpec::create() is intended to be a generic virtual interface and should be ideally implementation-agnostic. While useOptimizedPartitionFunction is specific to HashPartitionFunctionSpec, it exposes implementation details at the interface level. Would there be a cleaner approach here?

PartitionFunctionSpec::create() is intended to be a generic virtual interface and should be ideally implementation-agnostic. While useOptimizedPartitionFunction is specific to HashPartitionFunctionSpec, it exposes implementation details at the interface level. Would there be a cleaner approach here?

I will make optimized versions for other PartitionFunctions in near future too, so this is generic for all PartitionFunctions. Otherwise, do you have other suggestions?

I added a comment in PlanNode.h:

/// TODO: useOptimizedPartitionFunction = true is only supported in
/// HashPartitionFunction now. Will extend the optimization to other
/// PartitionFunctions soon.

xin-zhang2 · 2026-05-20T14:58:20Z

+    return std::nullopt;
+  }
+
+  uint64_t hash;


It would be better to add a VELOX_DCHECK_GT(decoded_.size(), 0) check before calling isNullAt(0).

xin-zhang2 · 2026-05-21T16:02:03Z

+  suspender.dismiss();
+
+  for (uint32_t iteration = 0; iteration < iterations; ++iteration) {
+    partitionFunction->partition(*input, partitions);


We need logic to handle the case when it returns a single value.

There is another call site in PartitionedVectorTestBase::partitionRowVectors that doesn't handle this. That function is not currently used, but it would be better to either remove the function or address it as well.

perf: Add AVX512 support

b452805

yingsu00 requested a review from xin-zhang2 May 9, 2026 22:04

yingsu00 self-assigned this May 9, 2026

yingsu00 added the OptimizedPartitioning label May 9, 2026

yingsu00 requested a review from majetideepak as a code owner May 9, 2026 22:04

xin-zhang2 reviewed May 13, 2026

xin-zhang2 reviewed May 14, 2026

yingsu00 force-pushed the OptimizedHashPartitionFunction_1.0 branch from 68f3d0a to aca5238 Compare May 16, 2026 00:00

yingsu00 removed the request for review from majetideepak May 16, 2026 00:04

yingsu00 force-pushed the OptimizedHashPartitionFunction_1.0 branch from aca5238 to 799d1f6 Compare May 16, 2026 01:02

xin-zhang2 reviewed May 18, 2026

yingsu00 force-pushed the OptimizedHashPartitionFunction_1.0 branch from 799d1f6 to d7e5827 Compare May 19, 2026 01:26

xin-zhang2 reviewed May 20, 2026

yingsu00 force-pushed the OptimizedHashPartitionFunction_1.0 branch from d7e5827 to 15b6e1a Compare May 20, 2026 23:53

yingsu00 force-pushed the OptimizedHashPartitionFunction_1.0 branch from 15b6e1a to 24bd9b8 Compare May 21, 2026 00:30

yingsu00 mentioned this pull request May 21, 2026

perf: Improve ConstantVector performance in OptimizedHashPartitionFunction #2014

Open

xin-zhang2 reviewed May 21, 2026
