[python] Add schema short-circuit to SplitRead and FileScanner read paths #8217

Open
MgjLLL wants to merge 3 commits into apache:master from MgjLLL:python-fix-stats-evolutions-eager-schema-read

Conversation

MgjLLL commented Jun 12, 2026

Purpose

Fix redundant filesystem I/O in SplitRead and FileScanner when reading schema.

SplitRead has 3 call sites that unconditionally call schema_manager.get_schema(schema_id), even when schema_id == table.table_schema.id and the schema is therefore already in memory. In the common case (no schema evolution), each of these calls triggers an unnecessary filesystem read.

The Java equivalent (RawFileSplitRead.createFileReader()) short-circuits with:

schemaId == schema.id() ? schema : schemaManager.schema(schemaId)

Changes

  • split_read.py: Add a _resolve_schema() method that returns the in-memory schema when the id matches. It replaces the 3 direct get_schema() calls in raw_reader_supplier, _get_fields_and_predicate, and _file_read_fields
  • file_scanner.py: Add a _schema_fields() method that applies the same short-circuit pattern when supplying schema fields to SimpleStatsEvolutions
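
The short-circuit described in the changes above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: TableSchema, CountingSchemaManager, and SchemaResolver are hypothetical stand-ins, and the real _resolve_schema() in split_read.py may be structured differently. The counter only simulates a filesystem read so the saved call is visible.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TableSchema:
    """Hypothetical stand-in for the real table schema object."""
    id: int
    fields: List[str] = field(default_factory=list)


class CountingSchemaManager:
    """Hypothetical schema manager that counts simulated filesystem reads."""

    def __init__(self, schemas: Dict[int, TableSchema]):
        self._schemas = schemas
        self.reads = 0

    def get_schema(self, schema_id: int) -> TableSchema:
        self.reads += 1  # stands in for one filesystem read
        return self._schemas[schema_id]


class SchemaResolver:
    """Hypothetical holder for the short-circuit logic."""

    def __init__(self, table_schema: TableSchema, schema_manager: CountingSchemaManager):
        self.table_schema = table_schema
        self.schema_manager = schema_manager

    def resolve_schema(self, schema_id: int) -> TableSchema:
        # Return the in-memory schema when the ids match; otherwise delegate.
        if schema_id == self.table_schema.id:
            return self.table_schema
        return self.schema_manager.get_schema(schema_id)


old = TableSchema(id=0, fields=["a"])
current = TableSchema(id=1, fields=["a", "b"])
manager = CountingSchemaManager({0: old, 1: current})
resolver = SchemaResolver(current, manager)

same = resolver.resolve_schema(1)   # served from memory, no read
older = resolver.resolve_schema(0)  # delegated to the manager, one read
```

In this sketch only the lookup for the older schema id reaches the manager, which mirrors the Java ternary quoted in the Purpose section.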

Tests

  • Added file_scanner_schema_fields_test.py with 3 test cases covering the short-circuit, delegation, and the zero-id edge case
  • All existing tests pass (106 passed)
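
As a rough illustration of what the three cases might look like, the sketch below tests a hypothetical free function schema_fields() rather than the real FileScanner._schema_fields(). The helper make_table and the mocked attribute layout are assumptions. The zero-id case is read here as "schema id 0 must still match by equality rather than truthiness", which is one plausible interpretation and not confirmed by the PR.

```python
from unittest.mock import Mock


def schema_fields(table, schema_id):
    # Hypothetical free-function version of the helper described above.
    if schema_id == table.table_schema.id:
        return table.table_schema.fields
    return table.schema_manager.get_schema(schema_id).fields


def make_table(current_id, fields):
    # Hypothetical fixture: a table whose current schema has the given id.
    table = Mock()
    table.table_schema = Mock(id=current_id, fields=fields)
    table.schema_manager.get_schema.return_value = Mock(fields=["from_manager"])
    return table


def test_short_circuit():
    table = make_table(3, ["a", "b"])
    assert schema_fields(table, 3) == ["a", "b"]
    table.schema_manager.get_schema.assert_not_called()


def test_delegation():
    table = make_table(3, ["a", "b"])
    assert schema_fields(table, 2) == ["from_manager"]
    table.schema_manager.get_schema.assert_called_once_with(2)


def test_zero_id():
    # Id 0 is falsy, so the comparison must use equality, not truthiness.
    table = make_table(0, ["a"])
    assert schema_fields(table, 0) == ["a"]
    table.schema_manager.get_schema.assert_not_called()


for test in (test_short_circuit, test_delegation, test_zero_id):
    test()
```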

This closes #8216

MgjLLL added 3 commits June 12, 2026 11:13
The schema short-circuit in FileScanner._schema_fields() returns
table.table_schema.fields when schema_id matches the current schema id.
The test fixture only mocked Mock(id=0) without .fields, causing the
short-circuit path to return a Mock auto-attribute that is not iterable
when used by SimpleStatsEvolutions._create_index_cast_mapping.
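
The failure mode in this commit message can be reproduced in isolation. The sketch below uses only unittest.mock and shows that an attribute auto-created on a plain Mock is itself a Mock and cannot be iterated. Setting fields explicitly is one possible fix, assumed here for illustration; the actual fixture change is not shown on this page.

```python
from unittest.mock import Mock

# Fixture as described: only id is set, so .fields is an auto-created Mock.
bare = Mock(id=0)
try:
    list(bare.fields)
    bare_iterable = True
except TypeError:
    # A plain Mock does not support iteration.
    bare_iterable = False

# Assumed fix: give the mocked schema a real, iterable fields attribute.
fixed = Mock(id=0, fields=[])
fixed_fields = list(fixed.fields)
```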
Development

Successfully merging this pull request may close these issues.

[Bug][python] SplitRead redundantly reads schema from filesystem when current schema is already in memory