[SPARK-57135][SQL] Support reading CSV files inside tar archives by akshatshenoi-eng · Pull Request #56193 · apache/spark

akshatshenoi-eng · 2026-05-28T21:29:56Z

What changes were proposed in this pull request?

Adds support for reading CSV files packaged in tar archives (.tar, .tar.gz, .tgz) directly through the CSV data source, by streaming each archive entry through the CSV parser without unpacking it to disk. Gated behind a new config spark.sql.files.archive.enabled (default false).

ArchiveReader (new): a small streaming core. readEntries(path, conf)(parseEntry) opens the tar once, hands each non-skipped entry to parseEntry as a bounded, non-closing InputStream, and concatenates the per-entry results into a single iterator. It advances to the next entry only after the current one is fully consumed, so at most one entry is in flight and memory stays bounded regardless of archive size. Directories and dot-prefixed entries (macOS ._*, .DS_Store, …) are skipped; the stream is closed on exhaustion, on close(), and on task completion. .tar.gz is auto-decompressed by Hadoop's codec factory; .tgz (not a registered codec extension) is unwrapped with GZIPInputStream.
CSVFileFormat: archives are non-splittable (isSplitable returns false), so each archive is read as a single split; buildReader streams every entry through UnivocityParser (parseStream for multiLine, otherwise parseIterator over a LineReader-backed line iterator). Each entry is treated as the start of its own file, so headers are handled exactly as for standalone CSV files.
CSVDataSource: schema inference streams archive entries through the same CSVInferSchema path used for a multi-file CSV read (first entry's header, per-entry header drop), so inferring from an archive matches inferring from the same entries as separate files.

This supersedes the earlier revision of this PR, which used a format-agnostic layer that materialized each entry to a local temp file. The streaming approach avoids local disk entirely; the trade-off is that it only supports formats parseable from a sequential stream, so this PR scopes the feature to CSV. Formats needing random access within a file (Parquet/ORC footers) cannot stream from a tar and are out of scope.

Why are the changes needed?

A common ingestion pattern packs many small CSV files into tar archives to reduce file/namespace pressure on object stores and HDFS. Today these can't be read without unpacking them externally first. This lets users point the CSV reader directly at a tar archive. Streaming (vs. materializing entries to disk) keeps the read bounded in memory and adds no local-disk requirement.

Does this PR introduce any user-facing change?

Yes. A new config spark.sql.files.archive.enabled (default false) is added. When enabled, the CSV data source reads .tar/.tar.gz/.tgz paths by streaming their entries during both schema inference and scan. With the default false, behavior is unchanged.

How was this patch tested?

New tests:

ArchiveReaderSuite (unit): isArchivePath dispatch and readEntries — entry ordering, gzip handling (.tar.gz and .tgz), directory/dotfile skipping, lazy one-entry-at-a-time advance, the non-closing entry stream, idempotent close(), and TaskContext cleanup.
ArchiveReadSuite (end-to-end): reading .tar/.tar.gz/.tgz of CSV through the data source; parity with directory reads (headers, headerless, custom delimiter, multiline quoted fields, column pruning, mixed archive/non-archive partitioned layout); single-partition splittability; and schema-inference parity with a directory of the same files.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.8)

HyukjinKwon · 2026-05-28T22:00:18Z

Don't we already support compression codecs in CSV, JSON, and text? I think we should add an option there rather than introduce a new data source.

pan3793 · 2026-05-29T04:58:06Z

In addition to gzip tarballs, can this be extended to support other codecs? At the least, I think zstd should be supported; a similar request was raised on the Hadoop dev list recently:

https://lists.apache.org/thread/ntlx40h3vn6k7q3y5qf22vm815nw8lkz

Address review feedback: move the per-entry tar-archive streaming/parsing from CSVFileFormat.buildReader into the CSVDataSource.readFile overrides via a shared readArchive helper (archiveLines moves to TextInputCSVDataSource with brief @param/@return docs), and update CSVPartitionReaderFactory to the new readFile signature (archiveReadEnabled = false; the V2 reader does not read archives).

akshatshenoi-eng force-pushed the archive-format branch from 3f8d192 to e31d86a Compare May 29, 2026 19:00

akshatshenoi-eng changed the title ~~[SPARK-57135][SQL] Add ArchiveFormat for reading .tar/.tar.gz/.tgz archives as files~~ [SPARK-57135][SQL] Support reading CSV files inside tar archives May 29, 2026

akshatshenoi-eng force-pushed the archive-format branch from e31d86a to 99b7166 Compare May 29, 2026 19:07

[SPARK-57135][SQL] Support reading CSV files inside tar archives

670e233

akshatshenoi-eng force-pushed the archive-format branch from 99b7166 to 670e233 Compare May 29, 2026 21:27

akshatshenoi-eng wants to merge 2 commits into apache:master from akshatshenoi-eng:archive-format