[SPARK-57135][SQL] Support reading CSV files inside tar archives #56193
Open
akshatshenoi-eng wants to merge 2 commits into
Member
Don't we already support compression codecs in CSV, JSON, and text? I think we should add an option there rather than introduce a new data source.
Member
In addition to gzip tarballs, can this be extended to support other codecs? At a minimum, I think zstd should be supported; a similar request was raised on the Hadoop dev list recently: https://lists.apache.org/thread/ntlx40h3vn6k7q3y5qf22vm815nw8lkz
Address review feedback: move the per-entry tar-archive streaming/parsing from CSVFileFormat.buildReader into the CSVDataSource.readFile overrides via a shared readArchive helper (archiveLines moves to TextInputCSVDataSource with brief @param/@return docs), and update CSVPartitionReaderFactory to the new readFile signature (archiveReadEnabled = false; the V2 reader does not read archives).
What changes were proposed in this pull request?
Adds support for reading CSV files packaged in tar archives (.tar, .tar.gz, .tgz) directly through the CSV data source, by streaming each archive entry through the CSV parser without unpacking it to disk. Gated behind a new config spark.sql.files.archive.enabled (default false).

- ArchiveReader (new): a small streaming core. readEntries(path, conf)(parseEntry) opens the tar once, hands each non-skipped entry to parseEntry as a bounded, non-closing InputStream, and concatenates the per-entry results into a single iterator. It advances to the next entry only after the current one is fully consumed, so at most one entry is in flight and memory stays bounded regardless of archive size. Directories and dot-prefixed entries (macOS ._*, .DS_Store, …) are skipped; the stream is closed on exhaustion, on close(), and on task completion. .tar.gz is auto-decompressed by Hadoop's codec factory; .tgz (not a registered codec extension) is unwrapped with GZIPInputStream.
- CSVFileFormat: archives are non-splittable (isSplitable returns false), so each archive is read as a single split; buildReader streams every entry through UnivocityParser (parseStream for multiLine, otherwise parseIterator over a LineReader-backed line iterator). Each entry is treated as the start of its own file, so headers are handled exactly as for standalone CSV files.
- CSVDataSource: schema inference streams archive entries through the same CSVInferSchema path used for a multi-file CSV read (first entry's header, per-entry header drop), so inferring from an archive matches inferring from the same entries as separate files.

This supersedes the earlier revision of this PR, which used a format-agnostic layer that materialized each entry to a local temp file. The streaming approach avoids local disk entirely; the trade-off is that it only supports formats parseable from a sequential stream, so this PR scopes the feature to CSV. Formats needing random access within a file (Parquet/ORC footers) cannot stream from a tar and are out of scope.
Why are the changes needed?
A common ingestion pattern packs many small CSV files into tar archives to reduce file/namespace pressure on object stores and HDFS. Today these can't be read without unpacking them externally first. This lets users point the CSV reader directly at a tar archive. Streaming (vs. materializing entries to disk) keeps the read bounded in memory and adds no local-disk requirement.
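As a rough illustration of what pointing the CSV reader at an archive might look like from PySpark, the sketch below wraps the two calls involved. It assumes a Spark build that includes this patch and assumes the config can be set on a running session; read_csv_archive and the example path are hypothetical, while the config name comes from this PR's description.

```python
def read_csv_archive(spark, path, header=True):
    """Read the CSV entries of a tar archive through the regular CSV data source.

    Assumes a Spark build containing this patch; `spark` is an existing SparkSession.
    """
    # The feature is gated and defaults to false, so opt in first.
    # Assumption: the config is settable at runtime on the session.
    spark.conf.set("spark.sql.files.archive.enabled", "true")
    return spark.read.option("header", header).csv(path)


# Usage (hypothetical path):
# df = read_csv_archive(spark, "/data/ingest/batch-0001.tar.gz")
```

With the config left at its default, the same read call would behave as it does today.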
Does this PR introduce any user-facing change?
Yes. A new config spark.sql.files.archive.enabled (default false) is added. When enabled, the CSV data source reads .tar/.tar.gz/.tgz paths by streaming their entries during both schema inference and scan. With the default false, behavior is unchanged.
How was this patch tested?
New tests:
- ArchiveReaderSuite (unit): isArchivePath dispatch and readEntries, covering entry ordering, gzip handling (.tar.gz and .tgz), directory/dotfile skipping, lazy one-entry-at-a-time advance, the non-closing entry stream, idempotent close(), and TaskContext cleanup.
- ArchiveReadSuite (end-to-end): reading .tar/.tar.gz/.tgz of CSV through the data source; parity with directory reads (headers, headerless, custom delimiter, multiline quoted fields, column pruning, mixed archive/non-archive partitioned layout); single-partition splittability; and schema-inference parity with a directory of the same files.
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Opus 4.8)