Skip to content

[SPARK-52812][CONNECT] Preserve spark.sql.sources.default for eager createTable(tableName, path)#56211

Open
haoyangeng-db wants to merge 2 commits into
apache:masterfrom
haoyangeng-db:spark-52812-followup-createtable-default-source
Open

[SPARK-52812][CONNECT] Preserve spark.sql.sources.default for eager createTable(tableName, path)#56211
haoyangeng-db wants to merge 2 commits into
apache:masterfrom
haoyangeng-db:spark-52812-followup-createtable-default-source

Conversation

@haoyangeng-db
Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

SPARK-52812 (#56064) made Spark Connect Catalog.createTable eager by re-routing the two-argument createTable(tableName, path) overload through createTable(tableName, path, "parquet"). That hardcodes the parquet provider and drops the spark.sql.sources.default fallback that the overload previously relied on.

This PR restores the original behavior: the two-argument overload again leaves the source unset so the server resolves spark.sql.sources.default, while keeping the eager execution introduced by SPARK-52812. A regression test is added to CatalogSuite.

Why are the changes needed?

The two-argument createTable(tableName, path) overload is documented as "It will use the default data source configured by spark.sql.sources.default." After SPARK-52812 it always used parquet regardless of that configuration, contradicting its own contract and the classic Catalog behavior.

Does this PR introduce any user-facing change?

Yes, within the unreleased master branch. spark.catalog.createTable(tableName, path) on Spark Connect once again honors spark.sql.sources.default instead of always creating a parquet table. The eager-execution behavior from SPARK-52812 is preserved.

How was this patch tested?

Added a regression test in CatalogSuite that sets spark.sql.sources.default to json, writes JSON data, creates the table via the two-argument overload, and asserts the resulting table uses the json provider and is readable. The test fails on the previous hardcoded-parquet behavior.

Was this patch authored or co-authored using generative AI tooling?

Co-authored with Claude Code.

…reateTable(tableName, path)

### What changes were proposed in this pull request?
SPARK-52812 (apache#56064) made Spark Connect `Catalog.createTable` eager by re-routing the two-argument `createTable(tableName, path)` overload through `createTable(tableName, path, "parquet")`. That hardcodes the parquet provider and drops the `spark.sql.sources.default` fallback that the overload previously relied on.

This PR restores the original behavior: the two-argument overload again leaves the source unset so the server resolves `spark.sql.sources.default`, while keeping the eager execution introduced by SPARK-52812. A regression test is added to `CatalogSuite`.

### Why are the changes needed?
The two-argument `createTable(tableName, path)` overload is documented as "It will use the default data source configured by spark.sql.sources.default." After SPARK-52812 it always used parquet regardless of that configuration, contradicting its own contract and the classic Catalog behavior.

### Does this PR introduce _any_ user-facing change?
Yes, within the unreleased master branch. `spark.catalog.createTable(tableName, path)` on Spark Connect once again honors `spark.sql.sources.default` instead of always creating a parquet table. The eager-execution behavior from SPARK-52812 is preserved.

### How was this patch tested?
Added a regression test in `CatalogSuite` that sets `spark.sql.sources.default` to `json`, writes JSON data, creates the table via the two-argument overload, and asserts the resulting table uses the json provider and is readable. The test fails on the previous hardcoded-parquet behavior.

### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Opus 4.8)
Copy link
Copy Markdown
Contributor

@cloud-fan cloud-fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 blocking, 1 non-blocking, 0 nits.
Clean, correct, well-tested follow-up restoring the spark.sql.sources.default fallback while keeping eager execution.

Design / architecture (1)

  • Catalog.scala:398: 2-arg overload inlines the proto build, bypassing the delegation chain all other overloads use and duplicating the 5-arg execute{}+table() block. Correct and the setSource omission is necessary, but a private source: Option[String] impl shared by both would remove the duplication. Non-blocking. — see inline

// Leave the source unset so the server resolves spark.sql.sources.default, as documented
// above. Routing through createTable(tableName, path, "parquet") would hardcode the provider
// and ignore that configuration.
sparkSession.execute { builder =>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The two-argument overload inlines the CreateTable proto build here, bypassing the delegation chain every other createTable overload uses (each delegates down to the single 5-arg general overload) and duplicating that method's execute{...} + table() block, minus setSource.

The behavior is correct and omitting setSource is necessary — source is optional in the proto, so the server only resolves spark.sql.sources.default when hasSource() is false, and delegating to the 5-arg overload would call setSource("") and pin the source to a literal empty string. So this is non-blocking.

That said, the duplication can be removed cleanly by extracting a private impl taking source: Option[String] that both the 2-arg (None) and 5-arg (Some(source)) overloads route through, calling setSource only when defined. That keeps a single place building the proto while preserving the unset-source semantics. Your call.

… Option[String]

Address review feedback: route the two-argument createTable(tableName, path)
overload (source unset) and the five-argument overload (source set) through a
single private impl that calls setSource only when the source is defined,
removing the duplicated proto-build block while preserving the unset-source
semantics that lets the server resolve spark.sql.sources.default.

Co-authored-by: Isaac
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants