Support Hadoop SequenceFiles Scan #14061
Conversation
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
This reverts commit 2ab5557.
Pull request overview
This PR adds support for reading Hadoop SequenceFiles in the RAPIDS Accelerator for Apache Spark. It registers a new file format sequencefilebinary that reads SequenceFile key/value pairs as raw BinaryType columns on the GPU.
Key Changes
- Introduces `SequenceFileBinaryFileFormat` as a new DataSource that reads SequenceFiles and exposes key/value as BinaryType columns
- Implements GPU-accelerated reading via `GpuReadSequenceFileBinaryFormat` and associated partition readers
- Integrates the new format into `GpuFileSourceScanExec` for GPU execution path routing
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
| sql-plugin/src/main/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormat.scala | CPU-side FileFormat implementation with row-based reader for SequenceFiles |
| sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuReadSequenceFileBinaryFormat.scala | GPU-enabled FileFormat wrapper with metadata support and multi-file reader factory |
| sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala | Core GPU partition readers with host-side buffering and device column materialization |
| sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuFileSourceScanExec.scala | Integration point registering SequenceFileBinary format in GPU scan execution |
| tests/src/test/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormatSuite.scala | Test suite in tests module for wildcard discovery |
| sql-plugin/src/test/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormatSuite.scala | Duplicate test suite in sql-plugin module |
@Greptile full review
Greptile Summary

This PR introduces GPU-accelerated reading of Hadoop SequenceFiles via a physical plan replacement strategy: when Spark's plan contains a `SerializeFromObjectExec` over an `ExternalRDDScanExec` backed by a simple SequenceFile RDD, the plugin replaces it with `GpuSequenceFileSerializeFromObjectExec`.

Key points from the review:
Confidence Score: 4/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant Driver
    participant Meta as GpuSequenceFileSerializeFromObjectExecMeta
    participant Exec as GpuSequenceFileSerializeFromObjectExec
    participant Factory as GpuSequenceFileMultiFilePartitionReaderFactory
    participant Reader as MultiFileCloudSequenceFilePartitionReader
    participant Hadoop as SequenceFile.Reader (CPU)
    participant GPU as cuDF / GPU
    Driver->>Meta: tagPlanForGpu()
    Meta->>Meta: isSimpleSequenceFileRDD() via reflection
    Meta->>Meta: collectInputPaths() via NewHadoopRDD.getConf
    Meta->>Meta: hasCompressedInput() — sample up to 5 files
    Meta-->>Driver: convertToGpu() → GpuSequenceFileSerializeFromObjectExec
    Driver->>Exec: internalDoExecuteColumnar()
    Exec->>Exec: filePartitions (lazy: list + bin-pack files)
    Exec->>Factory: build readerFactory (lazy)
    Exec->>Driver: GpuDataSourceRDD[ColumnarBatch]
    Note over Reader,GPU: Per-partition execution on executors
    Reader->>Reader: getBatchRunner() → ReadBatchRunner (per file)
    Reader->>Hadoop: SequenceFile.Reader.nextRaw() loop
    Hadoop-->>Reader: key/value bytes (CPU decode)
    Reader->>Reader: HostBinaryListBufferer → SequenceFileChunk (pinned memory)
    Reader->>Reader: combineHMBs() — zero-copy chunk collection
    Reader->>GPU: buildDeviceColumnFromChunks() → cuDF LIST<UINT8>
    GPU-->>Reader: ColumnarBatch (key col + value col)
    Reader-->>Exec: ColumnarBatch
```
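The "list + bin-pack files" step in the diagram groups input files into partitions before readers are built. A minimal sketch of such greedy packing (illustrative names and a simplified byte-budget policy; not the plugin's actual code):

```java
import java.util.ArrayList;
import java.util.List;

public class BinPack {
    // Greedy first-fit packing: start a new partition once adding the next
    // file would push the current partition past maxBytes. A file larger
    // than maxBytes still gets its own partition.
    public static List<List<Long>> pack(long[] fileSizes, long maxBytes) {
        List<List<Long>> partitions = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            if (!current.isEmpty() && currentBytes + size > maxBytes) {
                partitions.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            partitions.add(current);
        }
        return partitions;
    }
}
```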
Additional Comments (4)
- sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala, lines 125-128 — logic: check for potential INT32 overflow before it happens.
  The check happens after `dataLocation` has already grown beyond `Int.MaxValue`, which could cause issues during the buffer growth operations in `addBytes` or `addValueBytes`. Move the overflow check earlier in those methods, before updating `dataLocation`.
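The fix the comment asks for can be sketched as follows (a hedged illustration with stand-in names, not the plugin's actual buffer code): guard before the position grows, so it can never exceed what an int-indexed offset can address.

```java
public class OverflowGuard {
    // Stand-in for the buffer write position discussed above,
    // tracked as a long so the comparison itself cannot overflow.
    private long dataLocation = 0;

    // Check BEFORE updating dataLocation, not after.
    public void addBytes(int len) {
        if (dataLocation + len > Integer.MAX_VALUE) {
            throw new IllegalStateException(
                "chunk would exceed INT32 range: " + (dataLocation + len));
        }
        dataLocation += len;
        // ... copy len bytes into the host buffer at the old offset ...
    }

    public long position() {
        return dataLocation;
    }
}
```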
- sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala, lines 73-83 — logic: potential integer overflow in row capacity calculation.
  `rowsAllocated * 2` can overflow when `rowsAllocated` is close to `Int.MaxValue / 2`, causing the allocation to wrap to a negative or small value.
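An overflow-safe version of such capacity doubling (a sketch with illustrative names): widen to `long` before multiplying so the product cannot wrap, then clamp at `Integer.MAX_VALUE`.

```java
public class GrowCapacity {
    // Returns the new row capacity: at least rowsNeeded, normally double
    // the current allocation, never wrapping past Integer.MAX_VALUE.
    public static int grow(int rowsAllocated, int rowsNeeded) {
        long doubled = (long) rowsAllocated * 2L;   // widen first
        long target = Math.max(doubled, (long) rowsNeeded);
        return (int) Math.min(target, (long) Integer.MAX_VALUE);
    }
}
```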
- tests/src/test/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormatSuite.scala, lines 32-136 — style: missing test coverage for key scenarios.
  Add tests for:
  - compressed SequenceFiles (should throw `UnsupportedOperationException`)
  - multi-file reads to verify the multi-file reader path
  - large batches that exceed `maxRowsPerBatch` or `maxBytesPerBatch`
  - partition columns
  - reading only key or only value (not both)
  - empty files
- tests/src/test/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormatSuite.scala, lines 102-136 — style: tests only verify the CPU reader path, not the GPU path.
  These tests use `SequenceFileBinaryFileFormat`, which is the CPU fallback. To test the GPU path with `GpuReadSequenceFileBinaryFormat`, the RAPIDS plugin configuration would need to be enabled and GPU execution verified.
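For reference, exercising the GPU path in a test would require the RAPIDS plugin to be active. A sketch of the relevant Spark settings (these are the standard plugin-enable configs, assumed here rather than taken from this PR's test code):

```properties
spark.plugins=com.nvidia.spark.SQLPlugin
spark.rapids.sql.enabled=true
```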
5 files reviewed, 4 comments
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 12 comments.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 13 comments.
@greptileai full review
@greptile review
…t overflow
- Support tuple RDD toDF("value","key") on GPU via positional read schema;
reject plain tuple case that would silently produce wrong data
- Restore internal-name check for single-col case; use effective-name
resolution from parent ProjectExec for key-only reads
- Fix double-close of SequenceFileChunk in combine mode via _chunksTransferred flag
- Fix Int overflow in totalRows accumulation (use Long)
- Fix ExecChecks parameter order in GpuOverrides (GPU=BINARY, CPU=all)
- Improve isNewApiSequenceFileRDD reflection: name-first with type-based fallback
- Replace findAnyFile with lazy findSampleFiles for compression detection
- Add @tailrec to collectInputPaths
- Add tests for key-only reads and swapped column names
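The "name-first with type-based fallback" reflection improvement mentioned above can be sketched like this (class and method names are illustrative stand-ins, not the plugin's actual code):

```java
public class RddClassCheck {
    static final String NEW_HADOOP_RDD = "org.apache.spark.rdd.NewHadoopRDD";

    // Name-first: a cheap string comparison on the runtime class name.
    // Type-based fallback: isInstance also matches subclasses, but only
    // works when the Spark class is actually on the classpath.
    public static boolean isNewApiSequenceFileRdd(Object rdd) {
        if (rdd.getClass().getName().equals(NEW_HADOOP_RDD)) {
            return true;
        }
        try {
            Class<?> target = Class.forName(NEW_HADOOP_RDD);
            return target.isInstance(rdd);
        } catch (ClassNotFoundException e) {
            return false; // Spark not on the classpath; treat as no match
        }
    }
}
```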
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
NOTE: release/26.04 has been created from main. Please retarget your PR to release/26.04 if it should be included in the release.
Closes #14065
Description
This PR adds GPU-accelerated reading of Hadoop SequenceFiles in the RAPIDS Accelerator for Apache Spark. SequenceFile is commonly used as the storage format for serialized protobuf payloads (via `from_protobuf`). Previously, reading SequenceFiles was only possible through Spark's CPU-based RDD API (`sc.newAPIHadoopFile`), which becomes the I/O bottleneck when downstream decode (e.g., protobuf) runs on the GPU.

The implementation introduces a physical plan replacement strategy: when Spark's plan contains a `SerializeFromObjectExec` over an `ExternalRDDScanExec` backed by a simple SequenceFile RDD, the plugin replaces it with `GpuSequenceFileSerializeFromObjectExec`, which bypasses the original RDD and reads files directly using a multi-threaded reader with combine mode.

Key Design Decisions

- Records are decoded with `SequenceFile.Reader` on the CPU (the format is not amenable to GPU parsing). The decoded binary payloads are buffered in pinned host memory and transferred to the GPU as `LIST<UINT8>` columns. This means `COALESCING` and `PERFILE` reader modes are not supported (no benefit), and `MULTITHREADED` is the default.
- Reflection on `NewHadoopRDD`/`HadoopRDD` is used to confirm that the RDD lineage is a simple SequenceFile scan with `BytesWritable` key/value. Complex lineages (e.g., filtered/mapped RDDs) automatically fall back to CPU.
- Both the `sc.newAPIHadoopFile()` and `sc.hadoopFile()` paths are supported.
- Wildcard input paths are supported (e.g., `data/year=2024/*`).
- Honors the `spark.sql.files.ignoreMissingFiles` and `spark.sql.files.ignoreCorruptFiles` configurations.

New Files
- `GpuSequenceFileSerializeFromObjectExecMeta.scala`
- `GpuSequenceFileSerializeFromObjectExec.scala`
- `sequencefile/GpuSequenceFileReaders.scala` — `HostBinaryListBufferer`, `SequenceFileChunk`, `MultiFileCloudSequenceFilePartitionReader`, factory

New Configs
| Config | Default | Notes |
|---|---|---|
| `spark.rapids.sql.format.sequencefile.reader.type` | `MULTITHREADED` | only `MULTITHREADED` and `AUTO` supported |
| `spark.rapids.sql.format.sequencefile.multiThreadedRead.maxNumFilesParallel` | `Integer.MAX_VALUE` | |
| `spark.rapids.sql.format.sequencefile.rddScan.physicalReplace.enabled` | `true` | |

Performance tests
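For illustration, the defaults above could be set explicitly (e.g., in `spark-defaults.conf`; `2147483647` is `Integer.MAX_VALUE`):

```properties
spark.rapids.sql.format.sequencefile.reader.type=MULTITHREADED
spark.rapids.sql.format.sequencefile.multiThreadedRead.maxNumFilesParallel=2147483647
spark.rapids.sql.format.sequencefile.rddScan.physicalReplace.enabled=true
```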
```scala
val NUM_FILES = 200
val RECORDS_PER_FILE = 50000
val VALUE_SIZE = 1024
val ITERATIONS = 5
```
I ran performance tests on 200 files with 50,000 records per file and a 1,024-byte value size.
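A back-of-envelope check of the benchmark's input size from the parameters above (value bytes only, ignoring keys and SequenceFile framing):

```java
public class BenchSize {
    // Total value bytes = files * records/file * bytes/value, widened to
    // long so the product does not overflow int.
    public static long totalValueBytes(int files, int recordsPerFile, int valueSize) {
        return (long) files * recordsPerFile * valueSize;
    }

    public static void main(String[] args) {
        long bytes = totalValueBytes(200, 50_000, 1024);
        System.out.println(bytes + " bytes, about " + (bytes / (1L << 30)) + " GiB");
    }
}
```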
script
Since the decode happens on the CPU, the GPU file format shows performance similar to the CPU file format (with the added cost of copying data to the GPU), but it is about 2x faster than the CPU RDD scan.
Test Coverage
Scala unit tests (`SequenceFilePhysicalReplaceSuite`, `SequenceFileBinaryFileFormatSuite`):
- old Hadoop API (`hadoopRDD`) support
- `ignoreMissingFiles` behavior

Python integration tests (`sequencefile_test.py`):

Checklists
(Please explain in the PR description how the new code paths are tested, such as names of the new/existing tests that cover them.)