
[SPARK-56171][SQL] Enable V2 file write path for non-partitioned DataFrame API writes and delete FallBackFileSourceV2 #54998

Open

LuciferYang wants to merge 1 commit into apache:master from LuciferYang:SPARK-56171-combined

Conversation

@LuciferYang
Contributor

What changes were proposed in this pull request?

This PR enables the V2 file write path for non-partitioned df.write.mode("append"/"overwrite").save(path) operations across all built-in file formats (Parquet, ORC, JSON, CSV, Text, Avro), and deletes the FallBackFileSourceV2 analysis rule which is now redundant.

Key changes

V2 write foundation (FileTable, FileWrite, *Write, *Table)

  • FileTable.createFileWriteBuilder: new infrastructure for creating WriteBuilder with SupportsTruncate and SupportsDynamicOverwrite capabilities
  • FileWrite: partition schema support, truncation (overwrite), dynamic partition overwrite, schema validation (nested column name duplication, data type, collation in map keys)
  • All 6 format-specific Write case classes (ParquetWrite, OrcWrite, JsonWrite, CSVWrite, TextWrite, AvroWrite) accept new parameters from createFileWriteBuilder
  • All 6 format-specific Table classes implement newWriteBuilder via createFileWriteBuilder
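One of the validations listed above, nested column name duplication, can be sketched as a pure function. This is an illustrative stand-in only: `Field` and `findDuplicateNames` are hypothetical names, not Spark's actual types, and the real check in `FileWrite` works on Spark's `StructType`.

```scala
// Illustrative model of a nested duplicate-column-name check.
// `Field` is a hypothetical stand-in for a schema field with children.
case class Field(name: String, children: Seq[Field] = Nil)

// Collect fully qualified names that appear more than once at any nesting
// level, comparing case-insensitively (as Spark does by default).
def findDuplicateNames(fields: Seq[Field], prefix: String = ""): Seq[String] = {
  val dupsHere = fields
    .groupBy(_.name.toLowerCase)
    .collect { case (n, fs) if fs.size > 1 => prefix + n }
    .toSeq
  dupsHere ++ fields.flatMap(f => findDuplicateNames(f.children, prefix + f.name + "."))
}
```

A write would be rejected when `findDuplicateNames` returns a non-empty result.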

Delete FallBackFileSourceV2

  • Remove the analysis rule and its registrations in BaseSessionStateBuilder and HiveSessionStateBuilder
  • This rule was redundant: USE_V1_SOURCE_LIST (defaulting to all built-in formats) already prevents V2 file tables from being created, and DataSourceV2Utils.getTableProvider gates catalog table loading
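The `USE_V1_SOURCE_LIST` gating described above amounts to a simple membership test. The sketch below models it as a pure predicate; the helper name is ours, and the real logic lives inside `DataSource` / `DataSourceV2Utils`:

```scala
// Hedged sketch: a format falls back to V1 whenever it appears in the
// comma-separated spark.sql.sources.useV1SourceList value.
def shouldFallBackToV1(format: String, useV1SourceList: String): Boolean =
  useV1SourceList.split(",").map(_.trim.toLowerCase).contains(format.trim.toLowerCase)
```

With the default list (`"avro,csv,json,kafka,orc,parquet,text"`), every built-in file format is gated to V1, which is why the fallback rule never fired.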

Cache invalidation (DataSourceV2Strategy)

  • refreshCache: use recacheByPath (instead of recacheByPlan) for FileTable writes, with fileIndex.refresh() to update the cached file listing

Error handling (FileFormatDataWriter)

  • Override writeAll to wrap errors with TASK_WRITE_FAILED, consistent with the per-row write() method
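The wrap-and-rethrow shape of that override can be sketched as follows. This is a simplified stand-in: `TaskWriteFailedException` here is a local class, whereas Spark raises `TASK_WRITE_FAILED` through its error-class framework:

```scala
// Illustrative stand-in for Spark's TASK_WRITE_FAILED error class.
class TaskWriteFailedException(cause: Throwable)
  extends RuntimeException("TASK_WRITE_FAILED", cause)

// Sketch of a writeAll that wraps any per-row failure, so batch writes
// surface the same error as the per-row write() method.
def writeAll[A](rows: Iterator[A])(writeRow: A => Unit): Unit =
  try rows.foreach(writeRow)
  catch { case e: Exception => throw new TaskWriteFailedException(e) }
```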

V2 write gating (DataFrameWriter, DataSourceV2Utils)

  • DataFrameWriter.lookupV2Provider: allow V2 for FileDataSourceV2 only when mode is Append or Overwrite AND no partitionBy is specified; fall back to V1 for ErrorIfExists/Ignore (TODO: SPARK-56174) and partitioned writes (TODO: SPARK-56185)
  • DataFrameWriter.saveAsTable / insertInto: always fall back to V1 for FileDataSourceV2 (TODO: SPARK-56185)
  • DataSourceV2Utils.getTableProvider: return None for FileDataSourceV2 to prevent V2 catalog table loading until stats, partition management, and data type validation gaps are addressed (TODO: SPARK-56185)
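The gating in `lookupV2Provider` reduces to a small decision over save mode and partitioning. The sketch below is a hypothetical model of that decision, with a local `SaveMode` standing in for `org.apache.spark.sql.SaveMode`:

```scala
// Local stand-in for org.apache.spark.sql.SaveMode.
sealed trait SaveMode
case object Append extends SaveMode
case object Overwrite extends SaveMode
case object ErrorIfExists extends SaveMode
case object Ignore extends SaveMode

// V2 file write only for Append/Overwrite with no partitionBy; everything
// else falls back to V1 (SPARK-56174 and SPARK-56185 track lifting this).
def useV2FileWrite(mode: SaveMode, partitionBy: Seq[String]): Boolean =
  (mode == Append || mode == Overwrite) && partitionBy.isEmpty
```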

Data type validation (V2SessionCatalog)

  • Add V1 FileFormat.supportDataType validation in the createTable fallback branch, ensuring CREATE TABLE with unsupported types (e.g., Variant in CSV) is rejected consistently
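A minimal sketch of that kind of validation, assuming plain string type names rather than Spark's `DataType` classes; the CSV rule shown (rejecting Variant and nested types) mirrors the example in the text but is not an exhaustive reproduction of `FileFormat.supportDataType`:

```scala
// Illustrative only: CSV rejects Variant and nested types in this model.
def csvSupportsDataType(typeName: String): Boolean =
  typeName.toLowerCase match {
    case "variant" | "struct" | "array" | "map" => false
    case _ => true
  }

// A createTable-style guard might then reject unsupported schemas up front.
def validateCsvSchema(fieldTypes: Seq[String]): Either[String, Unit] =
  fieldTypes.find(t => !csvSupportsDataType(t)) match {
    case Some(bad) => Left(s"CSV does not support data type: $bad")
    case None      => Right(())
  }
```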

Avro Table

  • Fix formatName from "AVRO" to "Avro" to match V1's AvroFileFormat.toString

Why are the changes needed?

The V2 Data Source API provides a cleaner, more extensible write path than V1's InsertIntoHadoopFsRelationCommand. Enabling V2 writes for built-in file formats is a step toward fully migrating file sources to V2, which will simplify the codebase and enable future optimizations.

FallBackFileSourceV2 was an analysis rule that converted V2 file InsertIntoStatement back to V1. It is no longer needed because:

  1. USE_V1_SOURCE_LIST (default: all built-in formats) prevents V2 file tables from being created for reads or writes
  2. DataSourceV2Utils.getTableProvider gates V2 catalog table loading
  3. The DataFrame API path uses AppendData/OverwriteByExpression, not InsertIntoStatement

Does this PR introduce any user-facing change?

No. With default configuration (spark.sql.sources.useV1SourceList = "avro,csv,json,kafka,orc,parquet,text"), all file writes continue to use the V1 path. The V2 write path is only activated when a user explicitly clears USE_V1_SOURCE_LIST and uses df.write.mode("append"/"overwrite").save(path) without partitionBy.

How was this patch tested?

  • Pass GitHub Actions
  • New test suite FileDataSourceV2WriteSuite (23 tests) covering:
    • V1 fallback behavior for all save modes
    • V2 non-partitioned append and overwrite across parquet, orc, json, csv
    • V2 cache invalidation on append and overwrite
    • Partitioned writes, dynamic partition overwrite, multi-level partitioning (via V1)
    • Catalog table INSERT INTO and CTAS (via V1)

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code 4.6

@LuciferYang
Contributor Author

LuciferYang commented Mar 25, 2026

Follow-up tickets

Ticket        Description
SPARK-56174   V2 file write ErrorIfExists/Ignore modes
SPARK-56175   FileTable implements SupportsPartitionManagement and V2 catalog table loading
SPARK-56176   V2-native ANALYZE TABLE and ANALYZE COLUMN for file tables

The overall plan is recorded at: https://issues.apache.org/jira/browse/SPARK-56170

@dongjoon-hyun
Member

Wow, this and many TODO IDs. Thank you for working on this area, @LuciferYang .

[SPARK-56171][SQL] Enable V2 file write path for non-partitioned DataFrame API writes and delete FallBackFileSourceV2

Key changes:
- FileWrite: added partitionSchema, customPartitionLocations,
  dynamicPartitionOverwrite, isTruncate; path creation and truncate
  logic; dynamic partition overwrite via FileCommitProtocol
- FileTable: createFileWriteBuilder with SupportsDynamicOverwrite
  and SupportsTruncate; capabilities now include TRUNCATE and
  OVERWRITE_DYNAMIC; fileIndex skips file existence checks when
  userSpecifiedSchema is provided (write path)
- All file format writes (Parquet, ORC, CSV, JSON, Text, Avro) use
  createFileWriteBuilder with partition/truncate/overwrite support
- DataFrameWriter.lookupV2Provider: enabled FileDataSourceV2 for
  non-partitioned Append and Overwrite via df.write.save(path)
- DataFrameWriter.insertInto: V1 fallback for file sources
  (TODO: SPARK-56175)
- DataFrameWriter.saveAsTable: V1 fallback for file sources
  (TODO: SPARK-56230, needs StagingTableCatalog)
- DataSourceV2Utils.getTableProvider: V1 fallback for file sources
  (TODO: SPARK-56175)
- Removed FallBackFileSourceV2 rule
- V2SessionCatalog.createTable: V1 FileFormat data type validation
