
[SPARK-56171][SQL] Enable V2 file write path for non-partitioned DataFrame API writes and delete FallBackFileSourceV2 #54998

Open

LuciferYang wants to merge 1 commit into apache:master from LuciferYang:SPARK-56171-combined

Conversation

@LuciferYang
Contributor

What changes were proposed in this pull request?

This PR enables the V2 file write path for non-partitioned df.write.mode("append"/"overwrite").save(path) operations across all built-in file formats (Parquet, ORC, JSON, CSV, Text, Avro), and deletes the FallBackFileSourceV2 analysis rule which is now redundant.

Key changes

V2 write foundation (FileTable, FileWrite, *Write, *Table)

  • FileTable.createFileWriteBuilder: new infrastructure for creating WriteBuilder with SupportsTruncate and SupportsDynamicOverwrite capabilities
  • FileWrite: partition schema support, truncation (overwrite), dynamic partition overwrite, schema validation (nested column name duplication, data type, collation in map keys)
  • All 6 format-specific Write case classes (ParquetWrite, OrcWrite, JsonWrite, CSVWrite, TextWrite, AvroWrite) accept new parameters from createFileWriteBuilder
  • All 6 format-specific Table classes implement newWriteBuilder via createFileWriteBuilder
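One of the validations listed above, nested column name duplication, can be sketched as a pure function. This is an illustrative stand-in only: `Field` and `findDuplicateNames` are hypothetical names, not Spark's actual types, and the real check in `FileWrite` works on Spark's `StructType`.

```scala
// Illustrative model of a nested duplicate-column-name check.
// `Field` is a hypothetical stand-in for a schema field with children.
case class Field(name: String, children: Seq[Field] = Nil)

// Collect fully qualified names that appear more than once at any nesting
// level, comparing case-insensitively (as Spark does by default).
def findDuplicateNames(fields: Seq[Field], prefix: String = ""): Seq[String] = {
  val dupsHere = fields
    .groupBy(_.name.toLowerCase)
    .collect { case (n, fs) if fs.size > 1 => prefix + n }
    .toSeq
  dupsHere ++ fields.flatMap(f => findDuplicateNames(f.children, prefix + f.name + "."))
}
```

A write would be rejected when `findDuplicateNames` returns a non-empty result.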

Delete FallBackFileSourceV2

  • Remove the analysis rule and its registrations in BaseSessionStateBuilder and HiveSessionStateBuilder
  • This rule was redundant: USE_V1_SOURCE_LIST (defaulting to all built-in formats) already prevents V2 file tables from being created, and DataSourceV2Utils.getTableProvider gates catalog table loading
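The `USE_V1_SOURCE_LIST` gating described above amounts to a simple membership test. The sketch below models it as a pure predicate; the helper name is ours, and the real logic lives inside `DataSource` / `DataSourceV2Utils`:

```scala
// Hedged sketch: a format falls back to V1 whenever it appears in the
// comma-separated spark.sql.sources.useV1SourceList value.
def shouldFallBackToV1(format: String, useV1SourceList: String): Boolean =
  useV1SourceList.split(",").map(_.trim.toLowerCase).contains(format.trim.toLowerCase)
```

With the default list (`"avro,csv,json,kafka,orc,parquet,text"`), every built-in file format is gated to V1, which is why the fallback rule never fired.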

Cache invalidation (DataSourceV2Strategy)

  • refreshCache: use recacheByPath (instead of recacheByPlan) for FileTable writes, with fileIndex.refresh() to update the cached file listing

Error handling (FileFormatDataWriter)

  • Override writeAll to wrap errors with TASK_WRITE_FAILED, consistent with the per-row write() method
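The wrap-and-rethrow shape of that override can be sketched as follows. This is a simplified stand-in: `TaskWriteFailedException` here is a local class, whereas Spark raises `TASK_WRITE_FAILED` through its error-class framework:

```scala
// Illustrative stand-in for Spark's TASK_WRITE_FAILED error class.
class TaskWriteFailedException(cause: Throwable)
  extends RuntimeException("TASK_WRITE_FAILED", cause)

// Sketch of a writeAll that wraps any per-row failure, so batch writes
// surface the same error as the per-row write() method.
def writeAll[A](rows: Iterator[A])(writeRow: A => Unit): Unit =
  try rows.foreach(writeRow)
  catch { case e: Exception => throw new TaskWriteFailedException(e) }
```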

V2 write gating (DataFrameWriter, DataSourceV2Utils)

  • DataFrameWriter.lookupV2Provider: allow V2 for FileDataSourceV2 only when mode is Append or Overwrite AND no partitionBy is specified; fall back to V1 for ErrorIfExists/Ignore (TODO: SPARK-56174) and partitioned writes (TODO: SPARK-56185)
  • DataFrameWriter.saveAsTable / insertInto: always fall back to V1 for FileDataSourceV2 (TODO: SPARK-56185)
  • DataSourceV2Utils.getTableProvider: return None for FileDataSourceV2 to prevent V2 catalog table loading until stats, partition management, and data type validation gaps are addressed (TODO: SPARK-56185)
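The gating in `lookupV2Provider` reduces to a small decision over save mode and partitioning. The sketch below is a hypothetical model of that decision, with a local `SaveMode` standing in for `org.apache.spark.sql.SaveMode`:

```scala
// Local stand-in for org.apache.spark.sql.SaveMode.
sealed trait SaveMode
case object Append extends SaveMode
case object Overwrite extends SaveMode
case object ErrorIfExists extends SaveMode
case object Ignore extends SaveMode

// V2 file write only for Append/Overwrite with no partitionBy; everything
// else falls back to V1 (SPARK-56174 and SPARK-56185 track lifting this).
def useV2FileWrite(mode: SaveMode, partitionBy: Seq[String]): Boolean =
  (mode == Append || mode == Overwrite) && partitionBy.isEmpty
```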

Data type validation (V2SessionCatalog)

  • Add V1 FileFormat.supportDataType validation in the createTable fallback branch, ensuring CREATE TABLE with unsupported types (e.g., Variant in CSV) is rejected consistently
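A minimal sketch of that kind of validation, assuming plain string type names rather than Spark's `DataType` classes; the CSV rule shown (rejecting Variant and nested types) mirrors the example in the text but is not an exhaustive reproduction of `FileFormat.supportDataType`:

```scala
// Illustrative only: CSV rejects Variant and nested types in this model.
def csvSupportsDataType(typeName: String): Boolean =
  typeName.toLowerCase match {
    case "variant" | "struct" | "array" | "map" => false
    case _ => true
  }

// A createTable-style guard might then reject unsupported schemas up front.
def validateCsvSchema(fieldTypes: Seq[String]): Either[String, Unit] =
  fieldTypes.find(t => !csvSupportsDataType(t)) match {
    case Some(bad) => Left(s"CSV does not support data type: $bad")
    case None      => Right(())
  }
```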

Avro Table

  • Fix formatName from "AVRO" to "Avro" to match V1's AvroFileFormat.toString

Why are the changes needed?

The V2 Data Source API provides a cleaner, more extensible write path than V1's InsertIntoHadoopFsRelationCommand. Enabling V2 writes for built-in file formats is a step toward fully migrating file sources to V2, which will simplify the codebase and enable future optimizations.

FallBackFileSourceV2 was an analysis rule that converted V2 file InsertIntoStatement back to V1. It is no longer needed because:

  1. USE_V1_SOURCE_LIST (default: all built-in formats) prevents V2 file tables from being created for reads or writes
  2. DataSourceV2Utils.getTableProvider gates V2 catalog table loading
  3. The DataFrame API path uses AppendData/OverwriteByExpression, not InsertIntoStatement

Does this PR introduce any user-facing change?

No. With default configuration (spark.sql.sources.useV1SourceList = "avro,csv,json,kafka,orc,parquet,text"), all file writes continue to use the V1 path. The V2 write path is only activated when a user explicitly clears USE_V1_SOURCE_LIST and uses df.write.mode("append"/"overwrite").save(path) without partitionBy.

How was this patch tested?

  • Pass GitHub Actions
  • New test suite FileDataSourceV2WriteSuite (23 tests) covering:
    • V1 fallback behavior for all save modes
    • V2 non-partitioned append and overwrite across parquet, orc, json, csv
    • V2 cache invalidation on append and overwrite
    • Partitioned writes, dynamic partition overwrite, multi-level partitioning (via V1)
    • Catalog table INSERT INTO and CTAS (via V1)

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code 4.6

@LuciferYang
Contributor Author

LuciferYang commented Mar 25, 2026

Follow-up tickets

Ticket        Description
SPARK-56174   V2 file write ErrorIfExists/Ignore modes
SPARK-56175   FileTable implements SupportsPartitionManagement and V2 catalog table loading
SPARK-56176   V2-native ANALYZE TABLE and ANALYZE COLUMN for file tables

The overall plan is recorded at: https://issues.apache.org/jira/browse/SPARK-56170

@dongjoon-hyun
Member

Wow, this and many TODO IDs. Thank you for working on this area, @LuciferYang .

[SPARK-56171][SQL] Enable V2 file write path for non-partitioned DataFrame API writes and delete FallBackFileSourceV2

Key changes:
- FileWrite: added partitionSchema, customPartitionLocations,
  dynamicPartitionOverwrite, isTruncate; path creation and truncate
  logic; dynamic partition overwrite via FileCommitProtocol
- FileTable: createFileWriteBuilder with SupportsDynamicOverwrite
  and SupportsTruncate; capabilities now include TRUNCATE and
  OVERWRITE_DYNAMIC; fileIndex skips file existence checks when
  userSpecifiedSchema is provided (write path)
- All file format writes (Parquet, ORC, CSV, JSON, Text, Avro) use
  createFileWriteBuilder with partition/truncate/overwrite support
- DataFrameWriter.lookupV2Provider: enabled FileDataSourceV2 for
  non-partitioned Append and Overwrite via df.write.save(path)
- DataFrameWriter.insertInto: V1 fallback for file sources
  (TODO: SPARK-56175)
- DataFrameWriter.saveAsTable: V1 fallback for file sources
  (TODO: SPARK-56230, needs StagingTableCatalog)
- DataSourceV2Utils.getTableProvider: V1 fallback for file sources
  (TODO: SPARK-56175)
- Removed FallBackFileSourceV2 rule
- V2SessionCatalog.createTable: V1 FileFormat data type validation
