Describe the bug
When the native_datafusion scan adapter rejects an incompatible Parquet column read, the resulting SparkError::ParquetSchemaConvert carries an empty file_path. The JVM shim translates this to a SparkException whose message reads:
Parquet column cannot be converted in file . Column: [a], Expected: int, Found: BINARY.
(Note the empty path between "in file" and the period that follows.) Spark's vectorized reader populates this path via FileScanRDD's catch block (currentFile.urlEncodedPath), so its message reads, e.g., "... in file file:/tmp/.../part-00000.parquet. Column: ...".
This blocks several Spark SQL tests that extract the path from the message and re-open the file (e.g. ParquetSchemaSuite > schema mismatch failure error message for parquet vectorized reader).
Where the gap is
SparkPhysicalExprAdapter::replace_with_spark_cast and the deferred RejectOnNonEmpty expression build the error with file_path: String::new() because PhysicalExprAdapterFactory::create does not receive the file path. Fixing this likely requires either:
- Capturing the file path when the per-file adapter is created (would need a DataFusion API extension), or
- Catching ParquetSchemaConvert at a higher layer with file context (e.g. the parquet ScanExec/FileOpener wrapper) and re-raising with the path filled in.
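To make the second option concrete, here is a minimal sketch of the "catch and re-raise with the path" idea. It is not Comet's actual code: the SparkError enum below is a simplified stand-in, the field names other than file_path are assumptions, and with_file_path is a hypothetical helper. Only the message format is taken from the output quoted above.

```rust
use std::fmt;

// Simplified stand-in for Comet's SparkError (assumption: the real enum
// has more variants and may name these fields differently).
#[derive(Debug, Clone, PartialEq)]
enum SparkError {
    ParquetSchemaConvert {
        file_path: String,
        column: String,
        expected: String,
        found: String,
    },
    Other(String),
}

impl fmt::Display for SparkError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SparkError::ParquetSchemaConvert { file_path, column, expected, found } => write!(
                f,
                "Parquet column cannot be converted in file {}. Column: {}, Expected: {}, Found: {}.",
                file_path, column, expected, found
            ),
            SparkError::Other(msg) => write!(f, "{}", msg),
        }
    }
}

// Hypothetical helper: a layer that knows which file is being read
// (e.g. a FileOpener wrapper) would pass every error through this.
// It fills in the path only when the lower layer left it empty and
// returns all other errors unchanged.
fn with_file_path(err: SparkError, path: &str) -> SparkError {
    match err {
        SparkError::ParquetSchemaConvert { file_path, column, expected, found }
            if file_path.is_empty() =>
        {
            SparkError::ParquetSchemaConvert {
                file_path: path.to_string(),
                column,
                expected,
                found,
            }
        }
        other => other,
    }
}
```

In a wrapper, this would be applied with something like result.map_err(|e| with_file_path(e, current_path)). Filling the path only when it is empty keeps any path set by a lower layer, so the helper would remain harmless if the adapter later learns the path itself.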
Repro
./dev/diffs/3.4.3.diff has the test currently tagged with IgnoreCometNativeDataFusion pointing at this issue. Drop the tag and run:
ENABLE_COMET=true ENABLE_COMET_ONHEAP=true build/sbt "sql/testOnly *ParquetSchemaSuite -- -z 'schema mismatch failure error message for parquet vectorized reader'"