
Support Hadoop SequenceFiles Scan #14061

Draft
thirtiseven wants to merge 64 commits into NVIDIA:main from thirtiseven:seq_file_reader

Conversation

@thirtiseven
Collaborator

@thirtiseven thirtiseven commented Dec 23, 2025

Closes #14065

Description

This PR adds GPU-accelerated reading of Hadoop SequenceFiles in the RAPIDS Accelerator for Apache Spark. SequenceFile is commonly used as the storage format for serialized protobuf payloads (via from_protobuf). Previously, reading SequenceFiles was only possible through Spark's CPU-based RDD API (sc.newAPIHadoopFile), which becomes the I/O bottleneck when downstream decode (e.g., protobuf) runs on the GPU.

The implementation introduces a physical plan replacement strategy: when Spark's plan contains a SerializeFromObjectExec over an ExternalRDDScanExec backed by a simple SequenceFile RDD, the plugin replaces it with GpuSequenceFileSerializeFromObjectExec, which bypasses the original RDD and reads files directly using a multi-threaded reader with combine mode.
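The replacement rule can be sketched with stub plan nodes. This is illustrative only: the stubs stand in for Spark's actual SerializeFromObjectExec and ExternalRDDScanExec classes, and the real planning meta additionally validates the RDD lineage via reflection before replacing.

```scala
// Illustrative sketch only: stub plan nodes standing in for Spark's
// SerializeFromObjectExec and ExternalRDDScanExec. The real planning meta
// additionally inspects the RDD lineage via reflection before replacing.
sealed trait Plan
final case class ExternalRDDScan(isSimpleSequenceFileRdd: Boolean) extends Plan
final case class SerializeFromObject(child: Plan) extends Plan
final case class GpuSequenceFileSerializeFromObject(child: Plan) extends Plan

// Replace the CPU pair with the GPU exec only when the scan is a simple
// SequenceFile RDD; anything else is left untouched (CPU fallback).
def replacePlan(plan: Plan): Plan = plan match {
  case SerializeFromObject(scan @ ExternalRDDScan(true)) =>
    GpuSequenceFileSerializeFromObject(scan)
  case other => other
}
```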

Key Design Decisions

  • CPU decode, GPU transfer: SequenceFile decoding uses Hadoop's SequenceFile.Reader on the CPU (the format is not amenable to GPU parsing). The decoded binary payloads are buffered in pinned host memory and transferred to the GPU as LIST<UINT8> columns. This means COALESCING and PERFILE reader modes are not supported (no benefit), and MULTITHREADED is the default.
  • RDD lineage reflection: The meta layer uses reflection on NewHadoopRDD / HadoopRDD to confirm that the RDD lineage is a simple SequenceFile scan with BytesWritable key/value. Complex lineages (e.g., filtered/mapped RDDs) automatically fall back to CPU.
  • Compression: Compressed SequenceFiles (record or block compression) are detected at planning time and fall back to CPU. Only uncompressed SequenceFiles are GPU-accelerated.
  • Combine mode: Small files are merged into larger batches via zero-copy GPU concatenation, improving throughput when reading many small SequenceFiles.
  • Old and New Hadoop API: Both sc.newAPIHadoopFile() and sc.hadoopFile() paths are supported.
  • Glob patterns: Input paths support Hadoop-style glob patterns (e.g., data/year=2024/*).
  • Error handling: Respects spark.sql.files.ignoreMissingFiles and spark.sql.files.ignoreCorruptFiles configurations.
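The compression check above is possible at planning time because the SequenceFile header itself carries the compression flags. A minimal, illustrative parser of the documented version-6 header layout (SEQ magic, version byte, VInt-prefixed key/value class names, then record- and block-compression booleans); it handles only single-byte VInt lengths and is not the plugin's actual detection code:

```scala
// Illustrative parser of the SequenceFile version-6 header layout:
//   bytes 0-2: "SEQ" magic; byte 3: version;
//   key/value class names, each a VInt length followed by UTF-8 bytes;
//   then two boolean bytes: record compression and block compression.
// This sketch handles only the single-byte VInt case (length <= 127),
// which covers typical class-name lengths; it is NOT the plugin's code.
final case class SeqHeader(version: Int, keyClass: String, valueClass: String,
                           compressed: Boolean, blockCompressed: Boolean)

def parseSeqHeader(bytes: Array[Byte]): SeqHeader = {
  require(bytes.length > 4 &&
    bytes(0) == 'S'.toByte && bytes(1) == 'E'.toByte && bytes(2) == 'Q'.toByte,
    "not a SequenceFile: missing SEQ magic")
  var pos = 4
  def readClassName(): String = {
    val len = bytes(pos).toInt; pos += 1     // single-byte VInt only
    require(len >= 0, "multi-byte VInt lengths not handled by this sketch")
    val s = new String(bytes, pos, len, "UTF-8"); pos += len
    s
  }
  val key = readClassName()
  val value = readClassName()
  val compressed = bytes(pos) != 0
  val blockCompressed = bytes(pos + 1) != 0
  SeqHeader(bytes(3) & 0xFF, key, value, compressed, blockCompressed)
}
```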

New Files

| File | Description |
|---|---|
| GpuSequenceFileSerializeFromObjectExecMeta.scala | Planning meta: detects SequenceFile RDD lineage via reflection, validates schema/compression, collects input paths |
| GpuSequenceFileSerializeFromObjectExec.scala | GPU exec node: file listing, partition creation, multi-threaded reader orchestration, CPU fallback path |
| sequencefile/GpuSequenceFileReaders.scala | Core reader: HostBinaryListBufferer, SequenceFileChunk, MultiFileCloudSequenceFilePartitionReader, factory |
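To illustrate the host-side buffering idea behind HostBinaryListBufferer (this is a simplified stand-in, not the plugin's code): a batch of binary payloads is laid out like a cuDF LIST&lt;UINT8&gt; column, i.e. one flat data buffer plus an int32 offsets buffer of length rows + 1.

```scala
// Illustration of the host-side layout (not the actual HostBinaryListBufferer):
// a batch of binary payloads as a cuDF-style LIST<UINT8> column, i.e. one flat
// data buffer plus an int32 offsets buffer of length rows + 1.
def buildBinaryList(payloads: Seq[Array[Byte]]): (Array[Int], Array[Byte]) = {
  val offsets = new Array[Int](payloads.length + 1)
  var total = 0L
  payloads.zipWithIndex.foreach { case (p, i) =>
    total += p.length
    // cuDF offsets are int32, so one batch cannot exceed 2 GiB of data;
    // a real reader would split the batch here instead of failing.
    require(total <= Int.MaxValue, "batch exceeds int32 offsets; split the batch")
    offsets(i + 1) = total.toInt
  }
  val data = new Array[Byte](total.toInt)
  var pos = 0
  payloads.foreach { p =>
    System.arraycopy(p, 0, data, pos, p.length); pos += p.length
  }
  (offsets, data)
}
```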

New Configs

| Config | Default | Description |
|---|---|---|
| spark.rapids.sql.format.sequencefile.reader.type | MULTITHREADED | Reader type (only MULTITHREADED and AUTO supported) |
| spark.rapids.sql.format.sequencefile.multiThreadedRead.maxNumFilesParallel | Integer.MAX_VALUE | Max files read in parallel per task |
| spark.rapids.sql.format.sequencefile.rddScan.physicalReplace.enabled | true | Enable/disable physical plan replacement |
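Hypothetical usage sketch, assuming a live SparkSession named `spark` (these are the defaults, shown explicitly):

```scala
// Hypothetical usage in a Spark shell (assumes a live SparkSession `spark`);
// these values are the defaults, set explicitly for illustration.
spark.conf.set("spark.rapids.sql.format.sequencefile.reader.type", "MULTITHREADED")
spark.conf.set("spark.rapids.sql.format.sequencefile.rddScan.physicalReplace.enabled", "true")
```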

Performance tests

val NUM_FILES = 200
val RECORDS_PER_FILE = 50000
val VALUE_SIZE = 1024
val ITERATIONS = 5

I ran performance tests on 200 files with 50,000 records per file and 1 KB (1024 bytes) per value.

script

Because decoding happens on the CPU, the GPU path performs about the same as the CPU FileFormat path (and additionally copies the data to the GPU), but both are roughly 2x faster than the CPU RDD scan baseline.

| Variant | Time | Speedup vs RDD scan |
|---|---|---|
| CPU value-only | 1.24 s | 2.00x |
| GPU MT-8 threads | 1.25 s | 1.98x |
| GPU MT-4 threads | 1.27 s | 1.95x |
| GPU MT-2 threads | 1.28 s | 1.94x |
| GPU MULTITHREADED | 1.28 s | 1.94x |
| CPU FileFormat | 1.35 s | 1.84x |
| GPU AUTO | 1.37 s | 1.82x |
| GPU PERFILE | 1.37 s | 1.81x |
| GPU value-only | 1.38 s | 1.80x |
| GPU COALESCING | 1.41 s | 1.76x |
| RDD Scan | 2.49 s | baseline |

Test Coverage

Scala unit tests (SequenceFilePhysicalReplaceSuite, SequenceFileBinaryFileFormatSuite):

  • Physical plan replacement for uncompressed SequenceFile
  • CPU fallback for compressed input, complex RDD lineage, unsupported reader types
  • Glob patterns (simple and recursive)
  • Old Hadoop API (hadoopRDD) support
  • ignoreMissingFiles behavior
  • Corrupt file handling
  • Column name / schema semantics

Python integration tests (sequencefile_test.py):

  • Basic key/value read (GPU vs CPU correctness)
  • Value-only read (protobuf use case)
  • Empty files, large batches, large records
  • Combine mode correctness with many small files
  • Binary data patterns
  • Missing/corrupt file handling
  • Config-disabled fallback path

Checklists

  • This PR has added documentation for new or modified features or behaviors.
  • This PR has added new tests or modified existing tests to cover new code paths.
    (Please explain in the PR description how the new code paths are tested, such as names of the new/existing tests that cover them.)
  • Performance testing has been performed and its results are added in the PR description. Or, an issue has been filed with a link in the PR description.

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Contributor

Copilot AI left a comment


Pull request overview

This PR adds support for reading Hadoop SequenceFiles in the RAPIDS Accelerator for Apache Spark. It registers a new file format sequencefilebinary that reads SequenceFile key/value pairs as raw BinaryType columns on the GPU.

Key Changes

  • Introduces SequenceFileBinaryFileFormat as a new DataSource that reads SequenceFiles and exposes key/value as BinaryType columns
  • Implements GPU-accelerated reading via GpuReadSequenceFileBinaryFormat and associated partition readers
  • Integrates the new format into GpuFileSourceScanExec for GPU execution path routing

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 14 comments.

| File | Description |
|---|---|
| sql-plugin/src/main/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormat.scala | CPU-side FileFormat implementation with row-based reader for SequenceFiles |
| sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuReadSequenceFileBinaryFormat.scala | GPU-enabled FileFormat wrapper with metadata support and multi-file reader factory |
| sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala | Core GPU partition readers with host-side buffering and device column materialization |
| sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuFileSourceScanExec.scala | Integration point registering SequenceFileBinary format in GPU scan execution |
| tests/src/test/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormatSuite.scala | Test suite in tests module for wildcard discovery |
| sql-plugin/src/test/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormatSuite.scala | Duplicate test suite in sql-plugin module |


@thirtiseven
Collaborator Author

@Greptile full review

@greptile-apps
Contributor

greptile-apps Bot commented Dec 23, 2025

Greptile Summary

This PR introduces GPU-accelerated reading of Hadoop SequenceFiles via a physical plan replacement strategy: when Spark's plan contains a SerializeFromObjectExec over a SequenceFile-backed ExternalRDDScanExec, the plugin replaces it with GpuSequenceFileSerializeFromObjectExec, which reads files directly using a multi-threaded reader with zero-copy GPU-concat combine mode. SequenceFile decoding itself remains on the CPU (using Hadoop's SequenceFile.Reader); the gain comes from parallelised I/O and batched host→device transfer of binary payloads into LIST<UINT8> columns. Performance benchmarks show ~2× throughput improvement over the original CPU RDD scan path.

Key points from the review:

  • P1 — reflection issue in isNewApiSequenceFileRDD: Class.getField("inputFormatClass") only finds public fields. In Spark 3.x, NewHadoopRDD's inputFormatClass constructor parameter has no val modifier, making it a private JVM field. Both the primary getField call and the getFields.find(...) fallback miss private fields, so the method always returns false for real NewHadoopRDD instances. This silently causes the new-API path (sc.newAPIHadoopFile) to fall back to CPU with no warning. Using getDeclaredField + setAccessible(true) would fix this.
  • P2 — duplicate Scaladoc block: GpuSequenceFileSerializeFromObjectExec has two consecutive /** ... */ doc blocks before the class declaration; Scala/Java discards the first, so the class-level description is lost in generated documentation.
  • The GpuMultiFileReader changes (wrapping combineHMBs in try-catch to close partially-collected host buffers) are a solid resource-safety improvement.
  • Resource ownership in combine mode (markChunksTransferred flag) is carefully managed but complex — the pattern is correctly applied.
  • The GpuMultiFileReader exception handler for the combine path could theoretically attempt to close emptyBuffers that doCombineHmbs already closed internally, but safeClose is tolerant of this and it is a pre-existing design pattern in the codebase.

Confidence Score: 4/5

  • Safe to merge after resolving the getField vs getDeclaredField reflection issue, which could silently prevent GPU acceleration for the sc.newAPIHadoopFile path.
  • The PR is a well-designed, thoroughly tested feature with good documentation and broad test coverage (unit + integration + performance). One P1 issue exists: isNewApiSequenceFileRDD uses Class.getField() (public fields only), which silently returns false for NewHadoopRDD in Spark versions where inputFormatClass is a private constructor parameter — causing the new API path to never get GPU-accelerated. If the author can confirm this works on the tested Spark versions (e.g., inputFormatClass happens to be public in those compiled JARs), the score would rise to 5. The P2 duplicate Scaladoc is cosmetic and doesn't affect runtime behavior.
  • GpuSequenceFileSerializeFromObjectExecMeta.scala — specifically the isNewApiSequenceFileRDD reflection logic at line 215.

Important Files Changed

| Filename | Overview |
|---|---|
| sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala | New core reader: HostBinaryListBufferer (pinned-memory list column builder), SequenceFileChunk, SequenceFileHostBuffersWithMetaData, and MultiFileCloudSequenceFilePartitionReader with GPU-concat combine mode. Resource management is careful (closeOnExcept, chunk ownership transfer via markChunksTransferred). The split-boundary logic in doRead correctly mirrors Hadoop's SequenceFileRecordReader. |
| sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuSequenceFileSerializeFromObjectExecMeta.scala | Planning meta for physical plan replacement. Contains a P1 issue in isNewApiSequenceFileRDD: uses Class.getField() (public-only) to find inputFormatClass, but in Spark 3.x that is a private constructor parameter, so detection silently returns false for all NewHadoopRDD instances, preventing GPU acceleration of the new-API path. Old-API detection (HadoopRDD) correctly uses a public method and is unaffected. |
| sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuSequenceFileSerializeFromObjectExec.scala | GPU exec node that replaces SerializeFromObjectExec. File listing and readerFactory are @transient lazy val evaluated on the driver. Contains a minor duplicate Scaladoc issue (two consecutive /**...*/ blocks — the first is discarded). CPU fallback via doExecute() uses the original child RDD correctly. |
| sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuMultiFileReader.scala | Bug fix: wraps combineHMBs calls in try-catch in both getNextBatchFromSingleFileTask and getNextBatchFromCombineTask to close partially-collected host buffers on exception, preventing memory leaks during early task teardown. Also extracts collectCompletedReaderBuffersForCleanup helper for cleaner close logic. |
| sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala | Adds three new SequenceFile-specific config entries: SEQUENCEFILE_READER_TYPE (default MULTITHREADED), SEQUENCEFILE_MULTITHREAD_READ_MAX_NUM_FILES_PARALLEL, and SEQUENCEFILE_RDD_PHYSICAL_REPLACE_ENABLED. Correctly validates and rejects COALESCING/PERFILE modes with descriptive error messages. |
| tests/src/test/scala/com/nvidia/spark/rapids/SequenceFilePhysicalReplaceSuite.scala | Unit tests for physical plan replacement. Covers the GPU plan detection, ignoreMissingFiles behavior, and correct data round-trip. Tests rely on queryExecution.executedPlan plan inspection, which doesn't require a physical GPU. |
| tests/src/test/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormatSuite.scala | Comprehensive unit tests for glob patterns (simple and recursive), old Hadoop API (hadoopRDD), corrupt file handling, and column name/schema semantics. Good coverage of edge cases. |
| integration_tests/src/main/python/sequencefile_test.py | Python integration tests covering key/value reads, value-only reads, empty files, large batches, combine mode, binary data patterns, and missing/corrupt file handling. Well-structured with GPU vs CPU correctness comparisons. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant Driver
    participant Meta as GpuSequenceFileSerializeFromObjectExecMeta
    participant Exec as GpuSequenceFileSerializeFromObjectExec
    participant Factory as GpuSequenceFileMultiFilePartitionReaderFactory
    participant Reader as MultiFileCloudSequenceFilePartitionReader
    participant Hadoop as SequenceFile.Reader (CPU)
    participant GPU as cuDF / GPU

    Driver->>Meta: tagPlanForGpu()
    Meta->>Meta: isSimpleSequenceFileRDD() via reflection
    Meta->>Meta: collectInputPaths() via NewHadoopRDD.getConf
    Meta->>Meta: hasCompressedInput() — sample up to 5 files
    Meta-->>Driver: convertToGpu() → GpuSequenceFileSerializeFromObjectExec

    Driver->>Exec: internalDoExecuteColumnar()
    Exec->>Exec: filePartitions (lazy: list + bin-pack files)
    Exec->>Factory: build readerFactory (lazy)
    Exec->>Driver: GpuDataSourceRDD[ColumnarBatch]

    Note over Reader,GPU: Per-partition execution on executors

    Reader->>Reader: getBatchRunner() → ReadBatchRunner (per file)
    Reader->>Hadoop: SequenceFile.Reader.nextRaw() loop
    Hadoop-->>Reader: key/value bytes (CPU decode)
    Reader->>Reader: HostBinaryListBufferer → SequenceFileChunk (pinned memory)
    Reader->>Reader: combineHMBs() — zero-copy chunk collection
    Reader->>GPU: buildDeviceColumnFromChunks() → cuDF LIST<UINT8>
    GPU-->>Reader: ColumnarBatch (key col + value col)
    Reader-->>Exec: ColumnarBatch
```

Comments Outside Diff (1)

  1. sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuSequenceFileSerializeFromObjectExecMeta.scala, line 214-222 (link)

    getField only finds public fields — private inputFormatClass would silently fail

    Class.getField() returns only public members (per the Java docs). In Spark 3.x, NewHadoopRDD's inputFormatClass constructor parameter does not carry a val modifier, making it a private field in the compiled bytecode. As a result, clazz.getField("inputFormatClass") always throws NoSuchFieldException, and the fallback clazz.getFields.find(...) also only searches public fields — so ifc ends up null and the method always returns false.

    When that happens, isSimpleSequenceFileRDD returns false for every NewHadoopRDD, tagPlanForGpu calls willNotWorkOnGpu("RDD lineage is not a simple SequenceFile scan"), and the new-API path (sc.newAPIHadoopFile) is silently never GPU-accelerated.

    The fix is to use getDeclaredField with setAccessible(true), which works for private fields:
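A minimal sketch of that suggestion (not the PR's code): `Sample` stands in for NewHadoopRDD and its private `inputFormatClass` constructor parameter, and the helper walks the class hierarchy with getDeclaredField + setAccessible(true).

```scala
// Sketch of the suggested fix: getDeclaredField + setAccessible(true) reaches
// private fields, where getField only sees public ones. `Sample` is a stand-in
// for NewHadoopRDD; `inputFormatClass` mimics its private constructor field.
class Sample(private val inputFormatClass: Class[_]) {
  override def toString: String = s"Sample(${inputFormatClass.getName})"
}

// Walk the class hierarchy so fields declared on a superclass are also found.
def readPrivateField(obj: AnyRef, name: String): AnyRef = {
  var clazz: Class[_] = obj.getClass
  while (clazz != null) {
    try {
      val f = clazz.getDeclaredField(name)
      f.setAccessible(true)
      return f.get(obj)
    } catch {
      case _: NoSuchFieldException => clazz = clazz.getSuperclass
    }
  }
  null
}
```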


Contributor

@greptile-apps greptile-apps Bot left a comment


Additional Comments (4)

  1. sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala, line 125-128 (link)

    logic: check for potential INT32 overflow before it happens

    the check happens after dataLocation has already grown beyond Int.MaxValue, which could cause issues during the buffer growth operations in addBytes or addValueBytes. move the overflow check earlier in those methods before updating dataLocation.

  2. sql-plugin/src/main/scala/com/nvidia/spark/rapids/sequencefile/GpuSequenceFileReaders.scala, line 73-83 (link)

    logic: potential integer overflow in row capacity calculation

    rowsAllocated * 2 can overflow when rowsAllocated is close to Int.MaxValue / 2. this causes the allocation to wrap to negative or small values.

  3. tests/src/test/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormatSuite.scala, line 32-136 (link)

    style: missing test coverage for key scenarios

    add tests for:

    • compressed SequenceFiles (should throw UnsupportedOperationException)
    • multi-file reads to verify the multi-file reader path
    • large batches that exceed maxRowsPerBatch or maxBytesPerBatch
    • partition columns
    • reading only key or only value (not both)
    • empty files

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

  4. tests/src/test/scala/com/nvidia/spark/rapids/SequenceFileBinaryFileFormatSuite.scala, line 102-136 (link)

    style: tests only verify the CPU reader path, not GPU

    this test uses SequenceFileBinaryFileFormat which is the CPU fallback. to test the GPU path with GpuReadSequenceFileBinaryFormat, you'd need to enable the Rapids plugin configuration and verify GPU execution.

5 files reviewed, 4 comments
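The capacity-doubling overflow flagged in comment 2 (rowsAllocated * 2 wrapping negative near Int.MaxValue / 2) can be avoided by clamping the doubled value. A hedged sketch — `grownCapacity` is a hypothetical helper, not code from this PR:

```scala
// Overflow-safe capacity growth: double in Long arithmetic and clamp at
// Int.MaxValue instead of letting Int multiplication wrap negative.
def grownCapacity(current: Int, needed: Int): Int = {
  require(needed >= 0, "needed capacity must be non-negative")
  val doubled = math.min(current.toLong * 2, Int.MaxValue.toLong).toInt
  math.max(doubled, needed)
}
```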


@sameerz sameerz requested a review from mythrocks December 23, 2025 21:57
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 12 comments.



Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 13 comments.



@thirtiseven
Collaborator Author

@greptileai full review

@thirtiseven
Collaborator Author

@greptile review

@thirtiseven thirtiseven self-assigned this Mar 16, 2026
@thirtiseven thirtiseven marked this pull request as ready for review March 16, 2026 07:20
thirtiseven and others added 3 commits March 23, 2026 18:03
…t overflow

- Support tuple RDD toDF("value","key") on GPU via positional read schema;
  reject plain tuple case that would silently produce wrong data
- Restore internal-name check for single-col case; use effective-name
  resolution from parent ProjectExec for key-only reads
- Fix double-close of SequenceFileChunk in combine mode via _chunksTransferred flag
- Fix Int overflow in totalRows accumulation (use Long)
- Fix ExecChecks parameter order in GpuOverrides (GPU=BINARY, CPU=all)
- Improve isNewApiSequenceFileRDD reflection: name-first with type-based fallback
- Replace findAnyFile with lazy findSampleFiles for compression detection
- Add @tailrec to collectInputPaths
- Add tests for key-only reads and swapped column names

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@nvauto
Collaborator

nvauto commented Mar 30, 2026

NOTE: release/26.04 has been created from main. Please retarget your PR to release/26.04 if it should be included in the release.

@thirtiseven thirtiseven marked this pull request as draft April 7, 2026 03:05


Development

Successfully merging this pull request may close these issues.

[FEA] Support reading Hadoop SequenceFiles on GPU

4 participants