Fix Iceberg read optimization returning NULLs for stats-less manifests #1991
Open
zvonand wants to merge 1 commit into
When an Iceberg manifest carries no per-column statistics, the parsed `DataFileMetaInfo::columns_info` is empty. The read optimization in `StorageObjectStorageSource::createReader` misread this as "every requested column is absent from the file": it replaced each nullable column with a constant `NULL` and set `need_only_count`, so the reader returned correct row counts but all-NULL values: silent data loss.

Gate the absent-column-to-NULL loop on a non-empty `columns_info` so that stats-less manifests fall through to the regular reader, which reads present columns normally and resolves schema-evolution-absent columns via `IcebergMetadata::getInitialSchemaByPath`.

Affects `icebergLocal`, `icebergS3`, `icebergAzure`, `icebergHDFS` and their `*Cluster` variants. Antalya-only, introduced by #1069.

Add stateless test `04302_iceberg_read_optimization_no_column_stats` with a checked-in stats-less Iceberg fixture and a `generate.py` that reproduces it.

C++ change taken from #1895
Closes #1545
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
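The gating described above can be illustrated with a simplified model. This is a minimal sketch, not the actual ClickHouse code: `ColumnInfo`, `RequestedColumn`, `columnsToReplaceWithNull`, and the `gate_on_non_empty_stats` flag are hypothetical stand-ins, and only the name `DataFileMetaInfo::columns_info` comes from the description. It assumes that a column missing from `columns_info` is what the optimization treats as "absent from the file".

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; not the real ClickHouse types.
struct ColumnInfo
{
    long nulls_count = 0;
};

struct DataFileMetaInfo
{
    // Empty when the manifest carries no per-column statistics.
    std::map<std::string, ColumnInfo> columns_info;
};

struct RequestedColumn
{
    std::string name;
    bool is_nullable;
};

// Returns the names of requested columns that the optimization would
// replace with a constant NULL instead of reading them from the file.
// With gate_on_non_empty_stats == false this models the old behavior;
// with true it models the fix described in this PR.
std::vector<std::string> columnsToReplaceWithNull(
    const DataFileMetaInfo & meta,
    const std::vector<RequestedColumn> & requested,
    bool gate_on_non_empty_stats)
{
    std::vector<std::string> result;

    // The fix: empty statistics mean "unknown", not "every column is absent",
    // so skip the optimization and let the regular reader handle the file.
    if (gate_on_non_empty_stats && meta.columns_info.empty())
        return result;

    for (const auto & column : requested)
    {
        // Only nullable columns that have no entry in the statistics
        // are candidates for the constant-NULL replacement.
        if (column.is_nullable && meta.columns_info.count(column.name) == 0)
            result.push_back(column.name);
    }
    return result;
}
```

In this model, a stats-less manifest with the gate disabled marks every nullable requested column for NULL replacement, which mirrors the reported symptom. With the gate enabled the result is empty, so all columns would be read normally.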
The C++ change is similar to #1895, without the unneeded check.
Tests added.
Closes #1545
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Fix Iceberg read optimization returning NULL for every column when reading manifests without per-file column statistics (e.g. pyiceberg with default settings). Affects `icebergLocal`, `icebergS3`, `icebergAzure`, `icebergHDFS`, and all `*Cluster` variants.
CI/CD Options
Exclude tests:
Regression jobs to run: