perf: Introduce OptimizedHashPartitionFunction #2016

Open
yingsu00 wants to merge 2 commits into IBM:optimized_partitionedoutput from yingsu00:OptimizedHashPartitionFunction_1.0

Conversation

@yingsu00
Collaborator

@yingsu00 yingsu00 commented May 9, 2026

This PR contains 2 commits:

  1. perf: Add AVX512 support

  2. Introduce OptimizedHashPartitionFunction
    Introduce OptimizedHashPartitionFunction as a faster drop-in replacement
    for HashPartitionFunction, gated behind a new query config flag
    optimized_hash_partition_function_enabled (default false). partition()
    speedups range from 50% to over 200x.

    Add HashPartitionFunctionBase as a common base exposing numPartitions(),
    and createHashPartitionFunction() factories that select the
    implementation based on the flag. Thread QueryConfig* through
    PartitionFunctionSpec::create() and update callsites (LocalPartition,
    PartitionedOutput, MarkDistinct, RowNumber, Window,
    SubPartitionedSortWindowBuild, HiveConnector) to construct partition
    functions via the factory.

    Register CMake targets for the new test and benchmark binaries.
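
As a rough illustration of the flag-gated factory described above, here is a minimal sketch. Every name ending in `Sketch` is a hypothetical stand-in rather than the Velox class, and treating a null config as "use the baseline" is an assumption made only for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>

// Hypothetical stand-in for the query config; only models the new flag.
struct QueryConfigSketch {
  bool optimizedHashPartitionFunctionEnabled{false};
};

// Hypothetical common base exposing numPartitions(), as the description says
// HashPartitionFunctionBase does.
class HashPartitionFunctionBaseSketch {
 public:
  explicit HashPartitionFunctionBaseSketch(uint32_t numPartitions)
      : numPartitions_(numPartitions) {
    assert(numPartitions > 0);
  }
  virtual ~HashPartitionFunctionBaseSketch() = default;

  uint32_t numPartitions() const {
    return numPartitions_;
  }

  // Only used here to tell the two implementations apart.
  virtual std::string name() const = 0;

 private:
  const uint32_t numPartitions_;
};

class BaselineSketch final : public HashPartitionFunctionBaseSketch {
 public:
  using HashPartitionFunctionBaseSketch::HashPartitionFunctionBaseSketch;
  std::string name() const override {
    return "baseline";
  }
};

class OptimizedSketch final : public HashPartitionFunctionBaseSketch {
 public:
  using HashPartitionFunctionBaseSketch::HashPartitionFunctionBaseSketch;
  std::string name() const override {
    return "optimized";
  }
};

// Picks the implementation from the flag. Assumption: a null config falls
// back to the baseline implementation.
inline std::unique_ptr<HashPartitionFunctionBaseSketch>
createHashPartitionFunctionSketch(
    uint32_t numPartitions,
    const QueryConfigSketch* config) {
  if (config != nullptr && config->optimizedHashPartitionFunctionEnabled) {
    return std::make_unique<OptimizedSketch>(numPartitions);
  }
  return std::make_unique<BaselineSketch>(numPartitions);
}
```

With this shape, call sites only depend on the base type, so switching implementations is a config change rather than a code change.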

@yingsu00 yingsu00 requested a review from xin-zhang2 May 9, 2026 22:04
@yingsu00 yingsu00 self-assigned this May 9, 2026
@yingsu00 yingsu00 requested a review from majetideepak as a code owner May 9, 2026 22:04
Comment thread velox/connectors/hive/HiveConnector.h Outdated
  int numPartitions,
- bool localExchange) const override;
+ bool localExchange,
+ const core::QueryConfig* queryConfig) const override;
Member

HivePartitionFunctionTest doesn't compile because queryConfig has no default value. We need to either add a default value here or pass nullptr in HivePartitionFunctionTest.

}
}

bool hasConstantKeys(
Member

The name hasConstantKeys is a bit confusing. This function returns true only when all partition keys are constant or backed by a constant-encoded input vector. It might be clearer to rename it to allConstantKeys.

return 0u;
}

const auto size = input.size();
Member

Return early if size is 0.

Member

The return value needs to be handled since it may return a single partition now.

Collaborator Author

Replaced this with:

  std::optional<uint32_t> singlePartition =
      subPartitioningFunction_->partition(*input, subPartitionIdsBuffer_);
  if (singlePartition.has_value()) {
    simd::simdFill<uint32_t>(
        subPartitionIdsBuffer_.data(), singlePartition.value(), input->size());
  }

Member

@xin-zhang2 xin-zhang2 left a comment

@yingsu00 I left a few comments, please take a look when you have a chance. Thanks.

@yingsu00 yingsu00 force-pushed the OptimizedHashPartitionFunction_1.0 branch from 68f3d0a to aca5238 Compare May 16, 2026 00:00
@yingsu00 yingsu00 removed the request for review from majetideepak May 16, 2026 00:04
@yingsu00 yingsu00 force-pushed the OptimizedHashPartitionFunction_1.0 branch from aca5238 to 799d1f6 Compare May 16, 2026 01:02
{{core::QueryConfig::kOptimizedHashPartitionFunctionEnabled, "true"}});

auto optimizedFunction =
spec.create(8, /*localExchange=*/false, &optimizedConfig);
Member

The third argument is a bool, so we can pass true directly or use optimizedConfig.optimizedPartitionedOutputEnabled().

@yingsu00 yingsu00 force-pushed the OptimizedHashPartitionFunction_1.0 branch from 799d1f6 to d7e5827 Compare May 19, 2026 01:26
@yingsu00
Collaborator Author

@xin-zhang2 Thanks for reviewing. I have updated the PR.

}
}

std::optional<uint32_t> OptimizedHashPartitionFunction::partition(
Member

OptimizedHashPartitionFunction is intended to be a drop-in replacement for HashPartitionFunction, so it might be expected to produce the same partition assignments. However, since it uses a different hash algorithm for non-hashBitRange cases, the resulting assignments might differ from those of HashPartitionFunction. Do you think we should add a comment to clarify this?

Collaborator Author

> different hash algorithm for non-hashBitRange cases

No, the hash algorithms are the same. Both are using Folly's hash functions:

template <bool typeProvidesCustomComparison, TypeKind Kind>
uint64_t hashOne(const DecodedVector& decoded, vector_size_t index) {
  if constexpr (
      Kind == TypeKind::ROW || Kind == TypeKind::ARRAY ||
      Kind == TypeKind::MAP) {
    return decoded.base()->hashValueAt(decoded.index(index));
  } else {
    using T = typename KindToFlatVector<Kind>::HashRowType;
    const T value = decoded.valueAt<T>(index);

    if constexpr (typeProvidesCustomComparison) {
      return static_cast<const CanProvideCustomComparisonType<Kind>*>(
                 decoded.base()->type().get())
          ->hash(value);
    } else if constexpr (std::is_floating_point_v<T>) {
      return util::floating_point::NaNAwareHash<T>()(value);
    } else {
      return folly::hasher<T>()(value);
    }
  }
}

It's just the loops that were optimized.

Member

Maybe I didn't make this clear in my previous comment. I'm not referring to the algorithm used to compute the hash value, but rather how the partition is determined from the hash value.
In HashPartitionFunction, the partition is calculated as

partitions[i] = hashes_[i] % numPartitions_

whereas OptimizedHashPartitionFunction uses

partitions[index] = reduceRange(hashes[index], numPartitions);

where

FOLLY_ALWAYS_INLINE uint32_t mixedHash(uint64_t hash) {
  return static_cast<uint32_t>(hash) ^ static_cast<uint32_t>(hash >> 32);
}

FOLLY_ALWAYS_INLINE uint32_t
reduceRange(uint64_t hash, uint32_t numPartitions) {
  return (static_cast<uint64_t>(mixedHash(hash)) * numPartitions) >> 32;
}

These two approaches are not equivalent, so for the same input hash value, they may produce different partition assignments.
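
To make the difference concrete, here is a minimal, self-contained sketch of the two reductions quoted above. `moduloPartition` is a hypothetical name for the `hashes_[i] % numPartitions_` expression; `mixedHash` and `reduceRange` follow the snippets in this comment, with `constexpr` substituted for `FOLLY_ALWAYS_INLINE` so the sketch builds without Folly and the results can be checked at compile time.

```cpp
#include <cstdint>

// Modulo reduction, as quoted for HashPartitionFunction.
// (moduloPartition is a hypothetical name used only in this sketch.)
constexpr uint32_t moduloPartition(uint64_t hash, uint32_t numPartitions) {
  return static_cast<uint32_t>(hash % numPartitions);
}

// Folds a 64-bit hash into 32 bits by XOR-ing its two halves.
constexpr uint32_t mixedHash(uint64_t hash) {
  return static_cast<uint32_t>(hash) ^ static_cast<uint32_t>(hash >> 32);
}

// Multiply-shift reduction: scales the 32-bit value into [0, numPartitions)
// without a division.
constexpr uint32_t reduceRange(uint64_t hash, uint32_t numPartitions) {
  return static_cast<uint32_t>(
      (static_cast<uint64_t>(mixedHash(hash)) * numPartitions) >> 32);
}

// Same hash, same partition count, different assignments:
// 5 % 8 == 5, but (5 * 8) >> 32 == 0.
static_assert(moduloPartition(5, 8) == 5, "modulo keeps the low bits");
static_assert(reduceRange(5, 8) == 0, "multiply-shift uses the high bits");

// 2^31 % 8 == 0, but (2^31 * 8) >> 32 == 4.
static_assert(moduloPartition(0x80000000ULL, 8) == 0, "");
static_assert(reduceRange(0x80000000ULL, 8) == 4, "");
```

Both reductions map into the same range, but modulo is driven by the low bits of the hash while multiply-shift is driven by the high bits of the folded value, so the row-to-partition mapping is not interchangeable between the two.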

Comment thread velox/core/PlanNode.h
  int numPartitions,
- bool localExchange = false) const = 0;
+ bool localExchange = false,
+ bool useOptimizedPartitionFunction = false) const = 0;
Member

PartitionFunctionSpec::create() is intended to be a generic virtual interface and should be ideally implementation-agnostic. While useOptimizedPartitionFunction is specific to HashPartitionFunctionSpec, it exposes implementation details at the interface level. Would there be a cleaner approach here?

Collaborator Author

> PartitionFunctionSpec::create() is intended to be a generic virtual interface and should be ideally implementation-agnostic. While useOptimizedPartitionFunction is specific to HashPartitionFunctionSpec, it exposes implementation details at the interface level. Would there be a cleaner approach here?

I will add optimized versions of the other PartitionFunctions in the near future too, so this flag is generic across all PartitionFunctions. Otherwise, do you have other suggestions?

Collaborator Author

I added a comment in PlanNode.h:

  /// TODO: useOptimizedPartitionFunction = true is only supported in
  /// HashPartitionFunction now. Will extend the optimization to other
  /// PartitionFunctions soon.

return std::nullopt;
}

uint64_t hash;
Member

It would be better to add a VELOX_DCHECK_GT(decoded_.size(), 0) check before calling isNullAt(0).

Collaborator Author

Added.

@yingsu00 yingsu00 force-pushed the OptimizedHashPartitionFunction_1.0 branch from d7e5827 to 15b6e1a Compare May 20, 2026 23:53
suspender.dismiss();

for (uint32_t iteration = 0; iteration < iterations; ++iteration) {
partitionFunction->partition(*input, partitions);
Member

@xin-zhang2 xin-zhang2 May 21, 2026

We need logic to handle the case when it returns a single value.

There is another call site in PartitionedVectorTestBase::partitionRowVectors that doesn't handle this. That function is not currently used, but it would be better to either remove the function or address it as well.
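
One possible shape for that handling, sketched without Velox dependencies: `fillIfSinglePartition` is a hypothetical helper, and `std::vector::assign` stands in for the `simd::simdFill` call used in the earlier fix. It assumes that, in the non-single case, `partition()` has already written one id per row into the buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical helper for call sites that always need per-row partition ids.
// If partition() reported that every row goes to one partition, expand that
// value into the buffer; otherwise leave the per-row ids untouched.
inline void fillIfSinglePartition(
    std::optional<uint32_t> singlePartition,
    std::size_t numRows,
    std::vector<uint32_t>& partitions) {
  if (singlePartition.has_value()) {
    partitions.assign(numRows, singlePartition.value());
    return;
  }
  // Assumption: partition() filled one id per row in this case.
  assert(partitions.size() >= numRows);
}
```

A call site would pass the return value of `partition()` straight into the helper, so both the benchmark loop and test utilities see a fully populated buffer regardless of which path was taken.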
