StaticTable.from_metadata forces PyArrowFileIO by not passing location for FileIO #3591

Description

@kdn36

Apache Iceberg version

None

Please describe the bug 🐞

I ran into what looks like an oversight while monkey-patching pyiceberg so that it calls HDFS natively, which comes down to (a) selecting FsspecFileIO and (b) binding it to hdfs_native. The first part fails in my setup; see below.

StaticTable.from_metadata calls load_file_io three times. Two of the calls pass location, so load_file_io can infer the FileIO from the location's scheme. The third, which builds the io that the returned StaticTable carries and uses for all subsequent reads (manifests, data files), omits it:

https://github.com/apache/iceberg-python/blob/main/pyiceberg/table/__init__.py#L1691

io = load_file_io(properties=properties, location=metadata_location)   # has location
...
io=load_file_io({**properties, **metadata.properties}),                 # no location

Without a location, load_file_io skips scheme inference (_infer_file_io_from_scheme) and falls through to the PyArrow default, regardless of the location's scheme and of what SCHEMA_TO_FILE_IO prefers for that scheme.
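The fall-through described above can be illustrated with a small, self-contained model. This is only a sketch and not pyiceberg's actual code: `SCHEME_TO_IO`, `infer_io_from_scheme`, and `load_io` are hypothetical stand-ins for `SCHEMA_TO_FILE_IO`, `_infer_file_io_from_scheme`, and `load_file_io`, the scheme table is trimmed to a few illustrative entries, and FileIO implementations are represented by plain strings. Any other resolution steps the real function performs are left out.

```python
from typing import Dict, List, Optional
from urllib.parse import urlparse

# Hypothetical, trimmed-down scheme table (illustrative entries only).
SCHEME_TO_IO: Dict[str, List[str]] = {
    "s3": ["PyArrowFileIO", "FsspecFileIO"],
    "abfs": ["FsspecFileIO"],
    "file": ["PyArrowFileIO"],
}
DEFAULT_IO = "PyArrowFileIO"


def infer_io_from_scheme(location: str) -> Optional[str]:
    """Return the first preferred FileIO name for the location's scheme, if any."""
    scheme = urlparse(location).scheme
    candidates = SCHEME_TO_IO.get(scheme, [])
    return candidates[0] if candidates else None


def load_io(properties: Dict[str, str], location: Optional[str] = None) -> str:
    """Simplified model: explicit property, then scheme inference, then default."""
    # An explicit py-io-impl always wins.
    if "py-io-impl" in properties:
        return properties["py-io-impl"]
    # Scheme inference only happens when a location is supplied.
    if location is not None:
        inferred = infer_io_from_scheme(location)
        if inferred is not None:
            return inferred
    # No location (or unknown scheme): fall through to the default.
    return DEFAULT_IO


metadata_location = "abfs://container@account/tbl/metadata/v1.metadata.json"
io_for_metadata = load_io({}, location=metadata_location)  # scheme is consulted
io_for_table = load_io({})                                 # scheme is never consulted
```

In this model the two calls disagree for the same table: the first resolves through the scheme table, the second never looks at it.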

Consequences:

  • For schemes whose preferred FileIO is not PyArrow (e.g. abfs/hf prefer FsspecFileIO, or hdfs), the returned table's io is nonetheless PyArrowFileIO, inconsistent with the FileIO used to read the metadata file moments earlier in the same method.
  • Users must set py-io-impl explicitly to get the scheme-appropriate FileIO, even though the metadata was just read successfully via a correctly-inferred one.
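Until this is fixed, the workaround from the second bullet might look like the sketch below. The helper names `with_explicit_io` and `load_static_table` are hypothetical; the sketch assumes that `pyiceberg.io.fsspec.FsspecFileIO` is the implementation wanted and that `StaticTable.from_metadata` accepts a `properties` argument. Because from_metadata merges `metadata.properties` over the caller's properties, a `py-io-impl` stored in the table metadata would still take precedence.

```python
from typing import Dict, Optional

# Fully qualified class name passed via the py-io-impl property.
FSSPEC_IO = "pyiceberg.io.fsspec.FsspecFileIO"


def with_explicit_io(
    properties: Optional[Dict[str, str]] = None, io_impl: str = FSSPEC_IO
) -> Dict[str, str]:
    """Return a copy of properties with py-io-impl set, unless the caller already set it."""
    merged = dict(properties or {})
    merged.setdefault("py-io-impl", io_impl)
    return merged


def load_static_table(metadata_location: str, properties: Optional[Dict[str, str]] = None):
    """Load a StaticTable while pinning the FileIO explicitly (hypothetical helper)."""
    # Imported lazily so the helper above can be used without pyiceberg installed.
    from pyiceberg.table import StaticTable

    return StaticTable.from_metadata(
        metadata_location, properties=with_explicit_io(properties)
    )
```

Pinning the implementation this way sidesteps scheme inference entirely, which is exactly the extra step the issue argues users should not need.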

Expected behavior

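from_metadata should pass location to the third load_file_io call as well, so that the returned table's io is inferred from the scheme in the same way as the FileIO used to read the metadata file.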

A proposed PR is ready for review.

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
