
[SPARK-56221][SQL][PYTHON] Feature parity between spark.catalog.* vs DDL commands#55025

Closed
HyukjinKwon wants to merge 1 commit into apache:master from HyukjinKwon:SPARK-56221

Conversation


@HyukjinKwon HyukjinKwon commented Mar 26, 2026

What changes were proposed in this pull request?

SQL

  • SHOW CACHED TABLES: lists relations cached with an explicit name (CACHE TABLE, catalog.cacheTable, etc.); unnamed Dataset.cache() entries are not listed.

Catalog API (Scala / Java / PySpark)

  • listCachedTables(): same information as SHOW CACHED TABLES (CachedTable: name, storage level).
  • dropTable / dropView: drop persistent table or view (with ifExists, purge where applicable).
  • createDatabase / dropDatabase: create or drop a namespace (with options: ifNotExists / ifExists, cascade, properties map).
  • listPartitions: partition strings for a table (aligned with SHOW PARTITIONS).
  • listViews: list views in the current or given namespace; optional name pattern.
  • getTableProperties: all table properties (aligned with SHOW TBLPROPERTIES).
  • getCreateTableString: DDL from SHOW CREATE TABLE (optional asSerde).
  • truncateTable: remove all table data (not for views).
  • analyzeTable: ANALYZE TABLE ... COMPUTE STATISTICS (optional noScan).
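
The naming rule above (only explicitly named caches are listed; anonymous `Dataset.cache()` entries are not) can be sketched with a small in-memory model. This is not Spark's implementation; the class and function names here are illustrative only.

```python
# Minimal model of the listCachedTables / SHOW CACHED TABLES contract
# described above: an entry appears only if the cache was registered
# under an explicit name. Names below are illustrative, not Spark's.
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class CacheEntry:
    name: Optional[str]   # None models an anonymous Dataset.cache()
    storage_level: str

def list_cached_tables(entries: List[CacheEntry]) -> List[Tuple[str, str]]:
    """Return (tableName, storageLevel) rows, mirroring the CachedTable schema."""
    return [(e.name, e.storage_level) for e in entries if e.name is not None]

entries = [
    CacheEntry("spark_catalog.default.demo_cached", "MEMORY_AND_DISK"),
    CacheEntry("my_table", "MEMORY_AND_DISK"),
    CacheEntry(None, "MEMORY_AND_DISK"),  # Dataset.cache(): unnamed, not listed
]
rows = list_cached_tables(entries)
```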

Why are the changes needed?

Gives stable programmatic ways to do what users already do with SQL (SHOW CACHED TABLES, SHOW PARTITIONS, etc.), without routing everything through raw SQL.

Does this PR introduce any user-facing change?

Yes. A new SQL command and new Catalog API methods.

How was this patch tested?

Unit tests were added.

Was this patch authored or co-authored using generative AI tooling?

No.

// [SQL] SafeJsonSerializer.safeMapToJValue: second parameter widened from Function1 to
// Function2 so the key is passed to the value serializer (progress.scala). Binary-incompatible
// vs spark-sql-api 4.0.0; not part of the public supported API (private[streaming] package).
ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.sql.streaming.SafeJsonSerializer.safeMapToJValue"),
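
The signature change the exclusion above covers (value serializer widened from Function1 to Function2 so it also receives the key) can be illustrated with a small sketch. The Python names here are illustrative, not Spark's actual code.

```python
# Sketch of the Function1 -> Function2 widening: the value serializer
# now receives the key too, so serialization can depend on which entry
# the value belongs to. Names are illustrative only.
from typing import Any, Callable, Dict, Optional

def safe_map_to_jvalue(
    m: Optional[Dict[str, Any]],
    value_to_json: Callable[[str, Any], Any],  # before: Callable[[Any], Any]
) -> Optional[Dict[str, Any]]:
    if not m:
        return None
    return {k: value_to_json(k, v) for k, v in m.items()}

out = safe_map_to_jvalue(
    {"numRows": 3, "note": "ok"},
    lambda k, v: float(v) if k == "numRows" else str(v),
)
```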
HyukjinKwon (Member Author):

This is from 72fc87b

@dongjoon-hyun dongjoon-hyun (Member) left a comment:

+1, LGTM.

@HyukjinKwon HyukjinKwon force-pushed the SPARK-56221 branch 3 times, most recently from 6e0d342 to 45ceac2 Compare March 26, 2026 06:00
@HyukjinKwon HyukjinKwon force-pushed the SPARK-56221 branch 4 times, most recently from f8e7eec to d04360a Compare March 26, 2026 11:38
Comment on lines +41 to +44
+----------+----------------+
| tableName|    storageLevel|
+----------+----------------+
|  my_table| MEMORY_AND_DISK|
+----------+----------------+
Contributor:

Let's expose the fully qualified name here.
Ideally as three columns.

Contributor:

Actually... how do you refer to this table? How would it qualify? Is it like a temp table? What's the namespace?

Member Author:

Yeah. SHOW CACHED TABLES shows the string under which the cache was registered, not a normalized three-part name. CACHE TABLE my_table AS SELECT ... creates a session-local temp view named my_table and caches under that name.

So you refer to it like any other temp view in the session (my_table). There is no separate metastore catalog.database.table for that entry; only that registration string appears in SHOW CACHED TABLES (e.g. my_table).

CACHE TABLE spark_catalog.default.t (or CACHE TABLE on an existing table) uses the resolved multipart identifier, serialized as a single string (multipartIdentifier.quoted in the executor), so you typically see something like spark_catalog.default.t.

CREATE OR REPLACE TEMPORARY VIEW src AS SELECT 1 AS id;
CACHE TABLE my_table AS SELECT * FROM src;
CREATE TABLE default.demo_cached (id INT) USING parquet;
INSERT INTO demo_cached VALUES (1), (2);
CACHE TABLE spark_catalog.default.demo_cached;
SHOW CACHED TABLES;
+---------------------------------+--------------------------------------+
|tableName                        |storageLevel                          |
+---------------------------------+--------------------------------------+
|spark_catalog.default.demo_cached|Disk Memory Deserialized 1x Replicated|
|my_table                         |Disk Memory Deserialized 1x Replicated|
+---------------------------------+--------------------------------------+
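
The single-string name shown above comes from joining the resolved multipart identifier. A rough Python approximation of that quoting behavior (a sketch of the convention, not Spark's actual `quoted` implementation):

```python
# Approximate how a resolved multipart identifier might be rendered as
# one display string: plain parts are joined with '.', and parts with
# special characters are backtick-quoted, doubling embedded backticks.
# This models the convention described above, not Spark's exact code.
from typing import List

def quote_if_needed(part: str) -> str:
    if part and all(c.isalnum() or c == "_" for c in part):
        return part
    return "`" + part.replace("`", "``") + "`"

def quoted(parts: List[str]) -> str:
    return ".".join(quote_if_needed(p) for p in parts)

name = quoted(["spark_catalog", "default", "demo_cached"])
```

Under this model, `quoted(["ns", "a b"])` would yield ``ns.`a b` ``.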

@HyukjinKwon (Member Author):

Merged to master.
