HIVE-29647: Parallelize Parquet split generation directory listing by deniskuzZ · Pull Request #6526 · apache/hive

deniskuzZ · 2026-06-04T15:42:16Z

What changes were proposed in this pull request?

Override MapredParquetInputFormat.listStatus and configure mapreduce.input.fileinputformat.list-status.num-threads to enable parallel listing of input directories.

Why are the changes needed?

Parquet split generation lists each input directory (typically one per partition) serially, which dominates planning time on object stores where every listing is a network round trip.
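To make the cost concrete, here is a minimal sketch of serial versus parallel directory listing. It is not the code from this patch: it uses the local filesystem through `java.nio.file` as a stand-in for an object store, and the class and method names (`ParallelListingSketch`, `listSerial`, `listParallel`) are hypothetical. On a real object store each `listOne` call would be a network round trip, which is where overlapping the calls should pay off.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ParallelListingSketch {

  // Lists the regular files directly under one directory (one "round trip").
  // Results are sorted because Files.list does not guarantee an order.
  public static List<Path> listOne(Path dir) {
    try (Stream<Path> entries = Files.list(dir)) {
      return entries.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Serial baseline: one directory after another.
  public static List<Path> listSerial(List<Path> dirs) {
    List<Path> out = new ArrayList<>();
    for (Path dir : dirs) {
      out.addAll(listOne(dir));
    }
    return out;
  }

  // Parallel variant: submit one listing task per directory, then collect the
  // results in submission order so the output matches the serial baseline.
  public static List<Path> listParallel(List<Path> dirs, int numThreads)
      throws InterruptedException, ExecutionException {
    ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, numThreads));
    try {
      List<Future<List<Path>>> futures = new ArrayList<>();
      for (Path dir : dirs) {
        futures.add(pool.submit(() -> listOne(dir)));
      }
      List<Path> out = new ArrayList<>();
      for (Future<List<Path>> future : futures) {
        out.addAll(future.get());
      }
      return out;
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    // Build four partition-like directories with two files each.
    Path root = Files.createTempDirectory("partitions");
    List<Path> dirs = new ArrayList<>();
    for (int p = 0; p < 4; p++) {
      Path dir = Files.createDirectory(root.resolve("part=" + p));
      Files.createFile(dir.resolve("data-0.parquet"));
      Files.createFile(dir.resolve("data-1.parquet"));
      dirs.add(dir);
    }
    List<Path> serial = listSerial(dirs);
    List<Path> parallel = listParallel(dirs, 4);
    System.out.println(serial.size() + " files, same result: " + serial.equals(parallel));
  }
}
```

With one directory per partition, the serial loop pays the listing latency once per partition, while the pooled version pays roughly that latency divided by the thread count, assuming the store can serve the requests concurrently.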

Does this PR introduce any user-facing change?

No

How was this patch tested?

Cluster (10 workers), TPC-DS at 1 TB scale, external Parquet tables.
query2: 150 s (before) / 70 s (after), measured without the semijoins (see #6525).

Copilot

Pull request overview

This PR (HIVE-29647) speeds up Parquet split generation on object/blob stores by overriding MapredParquetInputFormat.listStatus to list multiple recursive input directories concurrently, falling back to the default FileInputFormat listing for other scenarios.

Changes:

Added a listStatus(JobConf) override that parallelizes recursive directory listing when multiple input dirs are present on blob storage.
Introduced a dedicated worker thread pool and completion-based result collection for per-directory listings.

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.

Aggarwal-Raghav · 2026-06-07T18:00:44Z

LGTM +1

abstractdog

thanks for working on this huge performance improvement so far, left 2 comments

abstractdog · 2026-06-08T14:14:34Z

+    if (dirs.length <= 1
+        || !job.getBoolean(FileInputFormat.INPUT_DIR_RECURSIVE, false)
+        || !BlobStorageUtils.isBlobStorageFileSystem(job, dirs[0].getFileSystem(job))) {
+      return super.listStatus(job);


In super.listStatus, I can see that the same parallel listing is already supported:
https://github.com/apache/hadoop/blob/ea0cb52c9a5c44b34e1e829892e439c14ced0b04/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/FileInputFormat.java#L240

Couldn't we simply reuse it by setting org.apache.hadoop.mapreduce.lib.input.FileInputFormat.LIST_STATUS_NUM_THREADS to HIVE_COMPUTE_SPLITS_NUM_THREADS?

There is a good chance that LocatedFileStatusFetcher is already battle-tested.

deniskuzZ · 2026-06-08

Fixed: mapreduce.input.fileinputformat.list-status.num-threads was not set in jobConf, so it defaulted to a single thread.

abstractdog · 2026-06-09

I'm glad to see this is working.
LGTM, assuming that cluster testing has been done and the same performance improvement has been achieved.

deniskuzZ · 2026-06-09

Yes, I got similar results on a test cluster. Thanks for the suggestion!
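The fix discussed in this thread can be pictured with the small sketch below. It does not use the Hadoop or Hive APIs: a plain `Map<String, String>` stands in for the `JobConf`, and `ListStatusThreadsSketch`, `effectiveThreads`, and `configureListing` are hypothetical names. Only the key string and the single-thread default come from the thread itself. How the merged patch reads the Hive thread count and whether it overrides a value that is already set are not shown here, so treat those parts as assumptions.

```java
import java.util.HashMap;
import java.util.Map;

public class ListStatusThreadsSketch {

  // Key named in the PR description.
  public static final String LIST_STATUS_NUM_THREADS =
      "mapreduce.input.fileinputformat.list-status.num-threads";

  // Per the thread, an unset key falls back to a single listing thread.
  public static final int DEFAULT_LIST_STATUS_NUM_THREADS = 1;

  // Number of threads the listing would use for a given configuration.
  public static int effectiveThreads(Map<String, String> conf) {
    String value = conf.get(LIST_STATUS_NUM_THREADS);
    return value == null ? DEFAULT_LIST_STATUS_NUM_THREADS : Integer.parseInt(value);
  }

  // Assumed shape of the fix: copy the split-computation thread count into
  // the listing key before delegating to the default listing logic.
  public static void configureListing(Map<String, String> jobConf, int splitThreads) {
    jobConf.put(LIST_STATUS_NUM_THREADS, Integer.toString(Math.max(1, splitThreads)));
  }

  public static void main(String[] args) {
    Map<String, String> jobConf = new HashMap<>();
    int before = effectiveThreads(jobConf);
    configureListing(jobConf, 10); // 10 is an arbitrary illustrative value
    int after = effectiveThreads(jobConf);
    System.out.println("before: " + before + ", after: " + after);
    // prints "before: 1, after: 10"
  }
}
```

The point of the sketch is that no custom thread pool is needed: once the key is set, the existing listing path can fan out on its own.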

sonarqubecloud · 2026-06-08T19:26:01Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

HIVE-29647: Parallelize Parquet split generation directory listing on blob storage

5feb893

asf-ci-hive added the tests pending label Jun 4, 2026

deniskuzZ requested review from abstractdog and Copilot June 4, 2026 15:47

Copilot started reviewing on behalf of deniskuzZ June 4, 2026 15:48

Copilot AI reviewed Jun 4, 2026

Aggarwal-Raghav reviewed Jun 4, 2026

Comment thread ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetInputFormat.java Outdated

asf-ci-hive added tests passed tests pending and removed tests pending tests passed labels Jun 4, 2026

review comments #1

09b3b9f

deniskuzZ force-pushed the HIVE-29647 branch from ab9115b to 09b3b9f Compare June 5, 2026 08:17

asf-ci-hive added tests pending tests passed and removed tests passed tests pending labels Jun 5, 2026

deniskuzZ requested a review from Copilot June 7, 2026 11:43

Copilot started reviewing on behalf of deniskuzZ June 7, 2026 11:43

Copilot AI reviewed Jun 7, 2026

Aggarwal-Raghav approved these changes Jun 7, 2026

abstractdog requested changes Jun 8, 2026

review comments #2

552b42f

asf-ci-hive added tests pending and removed tests passed labels Jun 8, 2026

deniskuzZ requested a review from abstractdog June 8, 2026 18:25

asf-ci-hive added tests passed and removed tests pending labels Jun 8, 2026

abstractdog approved these changes Jun 9, 2026

deniskuzZ merged commit 8c6f824 into apache:master Jun 9, 2026
4 checks passed

deniskuzZ changed the title ~~HIVE-29647: Parallelize Parquet split generation directory listing on blob storage~~ HIVE-29647: Parallelize Parquet split generation directory listing Jun 9, 2026

5 participants
