Describe
Iceberg scans in Auron currently fall back when the projected schema contains Iceberg metadata columns, even for cases where file-level metadata such as _file could be provided natively.
This prevents queries like select _file from iceberg_table or select data_col, _file from iceberg_table from using the native Iceberg scan path.
Describe the solution you'd like
Add native support for Iceberg metadata columns that can be represented as file-level constant values, starting with _file.
A possible approach is:
- stop treating all Iceberg metadata columns as unsupported by default
- distinguish between:
  - file-backed data columns
  - metadata columns that can be materialized outside the file payload
- extend IcebergScanPlan to carry both the file schema and the metadata/extra column schema
- pass supported metadata values through the native file-scan path, for example via per-file partition/constant values
- project both normal data columns and supported metadata columns from the native Iceberg scan
- continue to fall back for unsupported row-level metadata columns such as _pos
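
To make the approach above concrete, here is a minimal, language-neutral sketch in Python. It is not Auron code: ScanPlanSketch, split_projection, materialize_rows, and the METADATA_COLUMNS set are hypothetical names used only for illustration. It assumes rows from one data file arrive as dicts and that _file is the only per-file constant available.

```python
from dataclasses import dataclass

# Assumption: names this sketch treats as Iceberg metadata columns.
METADATA_COLUMNS = frozenset({"_file", "_pos", "_deleted", "_spec_id", "_partition"})


@dataclass
class ScanPlanSketch:
    """Hypothetical stand-in for an extended IcebergScanPlan."""
    file_schema: list      # columns decoded from the data file payload
    metadata_schema: list  # columns filled in outside the file payload
    output_order: list     # projection order requested by the query


def split_projection(projected):
    """Split the projected columns into file-backed and metadata columns."""
    file_cols = [c for c in projected if c not in METADATA_COLUMNS]
    meta_cols = [c for c in projected if c in METADATA_COLUMNS]
    return ScanPlanSketch(file_cols, meta_cols, list(projected))


def materialize_rows(plan, file_path, file_rows):
    """Attach per-file constant values to the rows decoded from one file.

    file_rows is a list of dicts keyed by the names in plan.file_schema.
    Only _file is available as a constant here; a plan that asks for any
    other metadata column is expected to have fallen back before this point.
    """
    constants = {"_file": file_path}
    out = []
    for row in file_rows:
        merged = dict(row)
        for name in plan.metadata_schema:
            merged[name] = constants[name]
        out.append(tuple(merged[name] for name in plan.output_order))
    return out
```

For example, split_projection(["id", "_file"]) yields a file schema of ["id"] and a metadata schema of ["_file"], and materialize_rows then appends the same path to every row read from that file. A projection of only _file leaves the file schema empty, so the scan still needs the row count of each file even though no payload columns are decoded.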
Additional context
This should be implemented conservatively.
A good initial scope is:
- support _file
- keep _pos and other row-level metadata columns on the fallback path
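
The conservative scope can be expressed as an allow-list check, sketched below in Python purely for illustration. The function names and both column sets are assumptions made for this example, not Auron or Iceberg APIs. Any known metadata column that is not explicitly allowed sends the whole scan to the fallback path.

```python
# Assumption: names this sketch treats as Iceberg metadata columns.
KNOWN_METADATA_COLUMNS = frozenset(
    {"_file", "_pos", "_deleted", "_spec_id", "_partition"}
)

# Initial scope from the proposal: only _file is handled natively.
NATIVE_METADATA_COLUMNS = frozenset({"_file"})


def unsupported_metadata_columns(projected):
    """Return the projected metadata columns that have no native support."""
    return [
        c for c in projected
        if c in KNOWN_METADATA_COLUMNS and c not in NATIVE_METADATA_COLUMNS
    ]


def choose_scan_path(projected):
    """Pick 'native' only when every metadata column is on the allow-list."""
    return "fallback" if unsupported_metadata_columns(projected) else "native"
```

Widening support later would then mean adding a name to the allow-list once the native path can produce its values, without changing the default for everything else.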
Suggested regression coverage:
- select _file from iceberg_table uses NativeIcebergTableScan
- select id, _file from iceberg_table uses NativeIcebergTableScan
- select _pos from iceberg_table still falls back
- native results match vanilla Spark results