Apache Iceberg version
None
Please describe the bug
I ran into what looks like an oversight while monkey-patching pyiceberg so that it calls HDFS natively, which comes down to (a) selecting FsspecFileIO and (b) binding to hdfs_native. The first part fails in my setup; see below.
StaticTable.from_metadata calls load_file_io three times. Two of the calls pass location, so load_file_io can infer the FileIO from the location's scheme. The third call, which creates the io that the returned StaticTable carries and uses for all subsequent reads (manifests, data files), omits it:
https://github.com/apache/iceberg-python/blob/main/pyiceberg/table/__init__.py#L1691
io = load_file_io(properties=properties, location=metadata_location) # has location
...
io=load_file_io({**properties, **metadata.properties}), # no location
Without a location, load_file_io skips scheme inference (_infer_file_io_from_scheme) and falls through to the PyArrow default, regardless of the location's scheme and of what SCHEMA_TO_FILE_IO prefers for that scheme.
Consequences:
- For schemes whose preferred FileIO is not PyArrow (e.g. abfs and hf, which prefer FsspecFileIO, or hdfs in my monkey-patched setup), the returned table's io is nonetheless PyArrowFileIO. That is inconsistent with the FileIO used to read the metadata file moments earlier in the same method.
- Users must set py-io-impl explicitly to get the scheme-appropriate FileIO, even though the metadata was just read successfully via a correctly-inferred one.
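To make the selection order concrete, here is a minimal toy model of the behavior described above. It is a sketch and not pyiceberg's actual code: pick_file_io, SCHEME_TO_IO, and DEFAULT_IO are hypothetical names, the scheme table contains only the schemes mentioned in this issue, and the example location is made up.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical, trimmed-down stand-in for the scheme preference table.
SCHEME_TO_IO = {"abfs": "FsspecFileIO", "hf": "FsspecFileIO"}
DEFAULT_IO = "PyArrowFileIO"


def pick_file_io(properties: dict, location: Optional[str] = None) -> str:
    """Toy model of the FileIO selection order described in this issue."""
    # 1. An explicit py-io-impl property always wins.
    impl = properties.get("py-io-impl")
    if impl:
        return impl
    # 2. Scheme inference only happens when a location is supplied.
    if location:
        scheme = urlparse(location).scheme
        if scheme in SCHEME_TO_IO:
            return SCHEME_TO_IO[scheme]
    # 3. Otherwise fall through to the PyArrow default.
    return DEFAULT_IO


# Made-up metadata location, for illustration only.
metadata_location = "abfs://container@account.dfs.core.windows.net/tbl/metadata/v1.metadata.json"

with_location = pick_file_io({}, location=metadata_location)  # how the metadata file is read
without_location = pick_file_io({})                           # how the returned table's io is built
```

In this model the two calls disagree for the same table (FsspecFileIO versus PyArrowFileIO), which mirrors the inconsistency reported here. Passing the location to the second call, or setting py-io-impl in the properties, makes them agree.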
Expected behavior
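from_metadata should pass location in the third load_file_io call as well, so that the returned table's io is inferred from the same scheme as the one used to read the metadata file.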
Proposed PR is ready for review.
Willingness to contribute