HIVE-29647: Parallelize Parquet split generation directory listing #6526

Merged: deniskuzZ merged 3 commits into apache:master from deniskuzZ:HIVE-29647 on Jun 9, 2026

@deniskuzZ (Member) commented Jun 4, 2026

What changes were proposed in this pull request?

Override MapredParquetInputFormat.listStatus and configure mapreduce.input.fileinputformat.list-status.num-threads to enable parallel listing of input directories.

Why are the changes needed?

Parquet split generation lists each input directory (typically one per partition) serially, which dominates planning time on object stores where every listing is a network round trip.
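The cost difference between serial and pooled listing can be illustrated with a small, self-contained sketch. It does not use Hadoop or Hive classes: `fakeRemoteList` is a hypothetical stand-in that sleeps to mimic one object-store round trip, and the class and method names are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ParallelListingSketch {

    /** Lists directories one after another; total time is roughly the sum of all round trips. */
    static List<String> listSerial(List<String> dirs, Function<String, List<String>> lister) {
        List<String> files = new ArrayList<>();
        for (String dir : dirs) {
            files.addAll(lister.apply(dir));
        }
        return files;
    }

    /** Lists directories on a fixed-size pool; results are collected in input order. */
    static List<String> listParallel(List<String> dirs,
                                     Function<String, List<String>> lister,
                                     int numThreads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (String dir : dirs) {
                Callable<List<String>> task = () -> lister.apply(dir);
                futures.add(pool.submit(task));
            }
            List<String> files = new ArrayList<>();
            for (Future<List<String>> future : futures) {
                files.addAll(future.get());
            }
            return files;
        } finally {
            pool.shutdown();
        }
    }

    /** Hypothetical stand-in for a remote listing call: sleeps to mimic one round trip. */
    static List<String> fakeRemoteList(String dir) {
        try {
            Thread.sleep(50);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return List.of(dir + "/part-00000.parquet", dir + "/part-00001.parquet");
    }

    public static void main(String[] args) throws Exception {
        // One directory per partition, as in the description above (paths are made up).
        List<String> dirs = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            dirs.add("s3a://bucket/table/part=" + i);
        }
        long t0 = System.nanoTime();
        List<String> serial = listSerial(dirs, ParallelListingSketch::fakeRemoteList);
        long t1 = System.nanoTime();
        List<String> parallel = listParallel(dirs, ParallelListingSketch::fakeRemoteList, 10);
        long t2 = System.nanoTime();
        System.out.println("same files: " + serial.equals(parallel));
        System.out.println("serial ms: " + (t1 - t0) / 1_000_000
            + ", parallel ms: " + (t2 - t1) / 1_000_000);
    }
}
```

With a fixed per-directory latency, the serial path pays that latency once per directory, while the pooled path pays it roughly once per batch of `numThreads` directories. Real listings vary in latency and size, so the actual gain depends on the store and the partition layout.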

Does this PR introduce any user-facing change?

No

How was this patch tested?

Cluster (10 workers), TPC-DS at 1 TB scale, external Parquet tables
query2: 150 s (before) / 70 s (after), measured without the semijoins (see #6525)

Copilot AI left a comment

Pull request overview

This PR (HIVE-29647) speeds up Parquet split generation on object/blob stores by overriding MapredParquetInputFormat.listStatus to list multiple recursive input directories concurrently, falling back to the default FileInputFormat listing for other scenarios.

Changes:

  • Added a listStatus(JobConf) override that parallelizes recursive directory listing when multiple input dirs are present on blob storage.
  • Introduced a dedicated worker thread pool and completion-based result collection for per-directory listings.

6 comment threads on ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetInputFormat.java (all outdated)

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.

3 comment threads on ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetInputFormat.java (all outdated)
@Aggarwal-Raghav (Contributor)

LGTM +1

@abstractdog (Contributor) left a comment

thanks for working on this huge performance improvement so far, left 2 comments

Comment thread ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetInputFormat.java Outdated
if (dirs.length <= 1
    || !job.getBoolean(FileInputFormat.INPUT_DIR_RECURSIVE, false)
    || !BlobStorageUtils.isBlobStorageFileSystem(job, dirs[0].getFileSystem(job))) {
  return super.listStatus(job);

Contributor

in super.listStatus, I can see the same parallel thing is supported:
https://github.com/apache/hadoop/blob/ea0cb52c9a5c44b34e1e829892e439c14ced0b04/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/FileInputFormat.java#L240

couldn't we simply reuse it by setting org.apache.hadoop.mapreduce.lib.input.FileInputFormat.LIST_STATUS_NUM_THREADS to HIVE_COMPUTE_SPLITS_NUM_THREADS?

there is a chance that LocatedFileStatusFetcher is already a battle-tested one

@deniskuzZ (Member, Author) Jun 8, 2026

Fixed: mapreduce.input.fileinputformat.list-status.num-threads was not set in jobConf and defaulted to a single thread.
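The effect of the missing setting can be sketched without Hadoop on the classpath. In the sketch below a `java.util.Properties` object stands in for the JobConf, the default of 1 matches the single-thread default mentioned above, and `configureListingThreads` is a hypothetical helper name. Whether the merged patch leaves an existing user value untouched is an assumption of this sketch, not something stated in the thread.

```java
import java.util.Properties;

public class ListStatusThreadsSketch {

    // Key quoted in the comment above; a missing value falls back to one thread.
    static final String LIST_STATUS_NUM_THREADS =
        "mapreduce.input.fileinputformat.list-status.num-threads";
    static final int DEFAULT_LIST_STATUS_NUM_THREADS = 1;

    /** Reads the effective thread count the way a getInt(key, default) lookup would. */
    static int effectiveThreads(Properties conf) {
        String value = conf.getProperty(LIST_STATUS_NUM_THREADS);
        return value == null ? DEFAULT_LIST_STATUS_NUM_THREADS : Integer.parseInt(value);
    }

    /**
     * Hypothetical helper: sets the listing thread count from a split-computation
     * thread count, but only when the key is not already set (an assumption).
     */
    static void configureListingThreads(Properties conf, int splitThreads) {
        if (conf.getProperty(LIST_STATUS_NUM_THREADS) == null && splitThreads > 1) {
            conf.setProperty(LIST_STATUS_NUM_THREADS, Integer.toString(splitThreads));
        }
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println("before: " + effectiveThreads(conf)); // prints "before: 1"
        configureListingThreads(conf, 10);
        System.out.println("after: " + effectiveThreads(conf));  // prints "after: 10"
    }
}
```

In a real job the same idea would be applied to the JobConf before `super.listStatus(job)` runs, so that the stock Hadoop listing code picks up the higher thread count.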

Contributor

I'm glad to see this is working
LGTM, assuming that cluster testing has been done and the same performance improvement has been achieved

@deniskuzZ (Member, Author) Jun 9, 2026

yes, got similar results on a test cluster. thanks for the suggestion!

@deniskuzZ merged commit 8c6f824 into apache:master on Jun 9, 2026
4 checks passed
@deniskuzZ changed the title from "HIVE-29647: Parallelize Parquet split generation directory listing on blob storage" to "HIVE-29647: Parallelize Parquet split generation directory listing" on Jun 9, 2026