HIVE-29647: Parallelize Parquet split generation directory listing #6526
Pull request overview
This PR (HIVE-29647) speeds up Parquet split generation on object/blob stores by overriding MapredParquetInputFormat.listStatus to list multiple recursive input directories concurrently, falling back to the default FileInputFormat listing for other scenarios.
Changes:
- Added a listStatus(JobConf) override that parallelizes recursive directory listing when multiple input dirs are present on blob storage.
- Introduced a dedicated worker thread pool and completion-based result collection for per-directory listings.
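The thread-pool approach summarized above can be sketched with the JDK alone. This is a minimal illustration, not the PR's code: it lists local directories through java.nio.file as a stand-in for Hadoop's FileSystem, and the class and method names (ParallelListingSketch, listRecursively, listAll) are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch: local directories stand in for blob-store input dirs.
public class ParallelListingSketch {

    /** Recursively lists the regular files under one input directory. */
    static List<Path> listRecursively(Path dir) throws IOException {
        try (Stream<Path> walk = Files.walk(dir)) {
            return walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }
    }

    /** Submits one listing task per directory and gathers results as they complete. */
    static List<Path> listAll(List<Path> dirs, int numThreads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, numThreads));
        try {
            CompletionService<List<Path>> completion = new ExecutorCompletionService<>(pool);
            for (Path dir : dirs) {
                completion.submit(() -> listRecursively(dir));
            }
            List<Path> files = new ArrayList<>();
            // take() blocks until the next task finishes, so results arrive
            // in completion order rather than in submission order.
            for (int i = 0; i < dirs.size(); i++) {
                files.addAll(completion.take().get());
            }
            return files;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

A call such as `ParallelListingSketch.listAll(partitionDirs, 8)` returns the files of all directories in one list. A failure in any single listing surfaces as an ExecutionException from `get()`, and the pool is shut down either way.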
LGTM +1
abstractdog left a comment
thanks for working on this huge performance improvement so far, left 2 comments
if (dirs.length <= 1
    || !job.getBoolean(FileInputFormat.INPUT_DIR_RECURSIVE, false)
    || !BlobStorageUtils.isBlobStorageFileSystem(job, dirs[0].getFileSystem(job))) {
  return super.listStatus(job);
There was a problem hiding this comment.
in super.listStatus, I can see the same parallel thing is supported:
https://github.com/apache/hadoop/blob/ea0cb52c9a5c44b34e1e829892e439c14ced0b04/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/FileInputFormat.java#L240
couldn't we simply reuse it by setting org.apache.hadoop.mapreduce.lib.input.FileInputFormat.LIST_STATUS_NUM_THREADS to HIVE_COMPUTE_SPLITS_NUM_THREADS?
there is a chance that LocatedFileStatusFetcher is already a battle-tested one
There was a problem hiding this comment.
Fixed: mapreduce.input.fileinputformat.list-status.num-threads was not set in jobConf and defaulted to a single thread.
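The fix described in this reply can be sketched as follows. This is an illustration under stated assumptions rather than the actual patch: java.util.Properties stands in for JobConf, the key hive.compute.splits.num.threads is assumed to be what HIVE_COMPUTE_SPLITS_NUM_THREADS maps to, and leaving an explicitly configured value untouched is a choice of this sketch that the thread does not confirm.

```java
import java.util.Properties;

// Hypothetical sketch: Properties stands in for Hadoop's JobConf.
public class ListStatusThreadsSketch {

    // Key quoted in the review thread.
    static final String LIST_STATUS_NUM_THREADS =
        "mapreduce.input.fileinputformat.list-status.num-threads";

    // Assumed key behind HIVE_COMPUTE_SPLITS_NUM_THREADS.
    static final String HIVE_SPLITS_NUM_THREADS = "hive.compute.splits.num.threads";

    /** Thread count the listing would use; a single thread when the key is unset. */
    static int effectiveListingThreads(Properties conf) {
        return Integer.parseInt(conf.getProperty(LIST_STATUS_NUM_THREADS, "1"));
    }

    /** Copies the Hive split thread count into the listing key if that key is unset. */
    static void configureListingThreads(Properties conf) {
        String hiveThreads = conf.getProperty(HIVE_SPLITS_NUM_THREADS);
        if (conf.getProperty(LIST_STATUS_NUM_THREADS) == null && hiveThreads != null) {
            conf.setProperty(LIST_STATUS_NUM_THREADS, hiveThreads);
        }
    }
}
```

With the key populated before delegating to the default listing, the existing parallel path in FileInputFormat is used and no separate thread pool is needed in the input format itself.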
There was a problem hiding this comment.
I'm glad to see this is working
LGTM, assuming that cluster testing has been done and the same performance improvement has been achieved
There was a problem hiding this comment.
yes, got similar results on a test cluster. thanks for the suggestion!

What changes were proposed in this pull request?
Override MapredParquetInputFormat.listStatus and configure mapreduce.input.fileinputformat.list-status.num-threads to enable parallel listing of input directories.
Why are the changes needed?
Parquet split generation lists each input directory (typically one per partition) serially, which dominates planning time on object stores where every listing is a network round trip.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Cluster (10 workers), TPC-DS at 1 TB scale, external Parquet tables
query2: 150 sec (before) / 70 sec (after), measured without the semijoins (see #6525)