Skip to content

feat: support size() for MapType input#4580

Open
marvelshan wants to merge 14 commits into
apache:mainfrom
marvelshan:feat/size-map-support
Open

feat: support size() for MapType input#4580
marvelshan wants to merge 14 commits into
apache:mainfrom
marvelshan:feat/size-map-support

Conversation

@marvelshan

@marvelshan marvelshan commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Closes #4472

Rationale for this change

Spark's Size expression accepts both ArrayType and MapType inputs, but Comet previously marked MapType as Unsupported, causing size() on maps to always fall back to Spark. The native Rust implementation already had DataType::Map handling in spark_size_array; the only missing pieces were the ScalarValue::Map arm in spark_size_scalar and the serde gate that blocked MapType from reaching native execution.

What changes are included in this PR?

  • Added ScalarValue::Map arm to spark_size_scalar in native/spark-expr/src/array_funcs/size.rs so the scalar code path also handles map inputs
  • Changed CometSize.getSupportLevel from Unsupported to Compatible() for MapType in spark/src/main/scala/org/apache/comet/serde/arrays.scala
  • Removed getUnsupportedReasons override (MapType is no longer unsupported) and the outdated reason string
  • Enabled test_spark_size_map_array Rust test (removed #[ignore], fixed MapArray::try_new API usage)
  • Changed size.sql SQL test from query spark_answer_only to query so size(m) runs through Comet; also added ConfigMatrix: spark.sql.legacy.sizeOfNull=true,false per review feedback (consolidating the separate size_legacy_off.sql into size.sql)
  • Added literal map, cardinality alias, and cast(NULL as map) queries to size.sql
  • Updated posexplode.sql fallback reason to match the actual serde reason from Explode/GenerateExec
  • Added test("fallback for size with map input") Scala test using checkSparkAnswerAndFallbackReason with reason "map is not supported", since map(_8, _9) produces a CreateMap expression that has no Comet serde yet
  • Added a separate test("size with map input - v2 reader") for v2 reader coverage with checkSparkAnswerAndOperator, where map data comes from Parquet columns (no CreateMap involved)
  • Cleared the "MapType input falls back" notes for both size and cardinality in docs/source/user-guide/latest/expressions.md

How are these changes tested?

  • Rust unit test: test_spark_size_map_array verifies map size calculation (2 entries, 1 entry, empty map, null map returning -1)
cd native && cargo test --package datafusion-comet-spark-expr test_spark_size_map_array
  • SQL file test: size.sql now exercises size(m) through Comet native path with both sizeOfNull settings
./mvnw test -Dsuites="org.apache.comet.CometSqlFileTestSuite size" -Dtest=none
  • ScalaTest: CometMapExpressionSuite "fallback for size with map input" uses checkSparkAnswerAndFallbackReason to verify the fallback reason for CreateMap; "size with map input - v2 reader" uses checkSparkAnswerAndOperator to verify native execution on map columns from Parquet
./mvnw test -Dtest=none -Dsuites="org.apache.comet.CometMapExpressionSuite"

@marvelshan

Copy link
Copy Markdown
Contributor Author

@andygrove

@marvelshan marvelshan force-pushed the feat/size-map-support branch from afd4792 to 58d254f Compare June 3, 2026 09:56

@comphead comphead left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @marvelshan
Before running the CI, please include .sql tests to for the maptype size

Comment thread native/spark-expr/src/array_funcs/size.rs
Comment thread native/spark-expr/src/array_funcs/size.rs Outdated

query spark_answer_only
query
SELECT size(arr), size(m) FROM test_size

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also add literal map cases below the existing literal-args query? Something like:

query
SELECT size(map('a', 1, 'b', 2)), size(map()), size(cast(NULL as map<string,int>))

That way the literal path is covered for both shapes. While you're here, a cardinality(m) query (Spark registers it as an alias for size) and a query with spark.sql.legacy.sizeOfNull=false would lock down the alias and the non-legacy null branch.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think literal map is not supported yet

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — all three items added:

  1. Literal map cases: size(map(...)) queries added (0b3c451). Those going through CreateMap use spark_answer_only since Comet has no CreateMap serde yet; cast(NULL as map) uses default mode as CometLiteral supports null MapType.
  2. Cardinality alias: SELECT cardinality(arr), cardinality(m) FROM test_size added (0b3c451) — Spark registers it as Size in FunctionRegistry, so it goes through CometSize automatically.
  3. sizeOfNull=false: New size_legacy_off.sql with -- Config: spark.sql.legacy.sizeOfNull=false covering column-based, literal null, and cardinality queries under non-legacy semantics (0b3c451).

ScalarValue::Map unit tests also added per review (ff01bbc), and stale docs notes cleared (b1d177a).

Should I file a follow-up issue for CreateMap/MapType literal support so the literal map queries can be promoted to native execution later?

@andygrove

Copy link
Copy Markdown
Member

Thanks for tackling this. The fix is small and focused, and the cleanup on posexplode.sql (catching that the old fallback reason was unrelated to posexplode) was a nice find.

One docs gap to flag, since the file isn't in the diff: docs/source/user-guide/latest/expressions.md still says "MapType input falls back" for both size (line 189) and cardinality (line 186). After this PR that caveat is no longer accurate. cardinality is registered as an alias for Size in Spark (FunctionRegistry.scala), so this PR transparently enables it too. Could you clear those notes in the same PR?

Inline comments cover smaller test-coverage suggestions.

@marvelshan marvelshan force-pushed the feat/size-map-support branch 2 times, most recently from d36788f to d4d53d8 Compare June 4, 2026 03:04
@marvelshan marvelshan force-pushed the feat/size-map-support branch 2 times, most recently from ecb7b5e to 8591ee8 Compare June 4, 2026 03:55
@marvelshan marvelshan force-pushed the feat/size-map-support branch from 8591ee8 to b1d177a Compare June 4, 2026 06:31
@marvelshan

marvelshan commented Jun 4, 2026

Copy link
Copy Markdown
Contributor Author

Done, cleared the "MapType input falls back" notes for both size and cardinality in expressions.md

@marvelshan marvelshan changed the title feat: support size() for MapType input (#4472) feat: support size() for MapType input Jun 4, 2026
@marvelshan marvelshan force-pushed the feat/size-map-support branch from b1d177a to 92789ac Compare June 4, 2026 13:23
Comment thread docs/source/user-guide/latest/expressions.md Outdated
@andygrove

Copy link
Copy Markdown
Member

Thanks @marvelshan. This LGTM. Can you fix the expressions.md formatting, then I will approve the CI workflows.

@marvelshan

Copy link
Copy Markdown
Contributor Author

Done

@marvelshan marvelshan requested review from andygrove and comphead June 5, 2026 01:21
@marvelshan

Copy link
Copy Markdown
Contributor Author

Hi @andygrove @comphead,

I have addressed all the review comments and fixed the formatting in expressions.md.
The PR has been ready for a few days. Could you please approve the CI workflows so the tests can run?

Thanks!

Comment thread spark/src/test/resources/sql-tests/expressions/array/size_legacy_off.sql Outdated
}

// fails with "map is not supported"
ignore("size with map input") {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this test still available?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it has been promoted from ignore to test("size with map input") using checkSparkAnswerAndOperator, and the original test using makeParquetFileAllPrimitiveTypes is also restored with checkSparkAnswerAndOperator instead of checkSparkAnswerAndFallbackReason. A separate test("size with map input - v2 reader") covers the v2 reader path.

}
ScalarValue::Map(array) => {
if array.is_null(0) {
Ok(ScalarValue::Int32(Some(-1)))

@hsiang-c hsiang-c Jun 9, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the spark.sql.legacy.sizeOfNull config change the return value here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spark.sql.legacy.sizeOfNull doesn't affect the native return value. The serde layer already wraps size(child) into CASE WHEN isNotNull(child) THEN size(child) ELSE (-1|null) END, so the null branch is handled by CaseWhen's ELSE literal and native size() only receives non-null inputs. The ConfigMatrix in size.sql already covers both behaviors.

marvelshan added a commit to marvelshan/datafusion-comet that referenced this pull request Jun 10, 2026
@marvelshan marvelshan force-pushed the feat/size-map-support branch from d6a8374 to a7f7529 Compare June 10, 2026 01:09
@marvelshan marvelshan force-pushed the feat/size-map-support branch from 4d65d68 to 89870f0 Compare June 10, 2026 06:14
@marvelshan marvelshan force-pushed the feat/size-map-support branch from 89870f0 to d5ecc47 Compare June 10, 2026 06:38
marvelshan and others added 2 commits June 11, 2026 04:47
map(_8, _9) in SQL produces a CreateMap expression which has no
Comet serde, causing the entire size() expression to fall back to
Spark. checkSparkAnswerAndOperator would fail because it requires
Comet native execution. Switch to checkSparkAnswerAndFallbackReason
to correctly assert that the fallback reason is CreateMap not being
supported, and rename the test to make the fallback expectation
explicit. Native execution of size() on map columns from Parquet is
already tested in 'size with map input - v2 reader'.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature] support size() for MapType inputs

4 participants