Skip to content

Fix Iceberg read optimization returning NULLs for stats-less manifests#1991

Open
zvonand wants to merge 1 commit into
antalya-26.3from
fix/antalya-26.3/stateless-manifests
Open

Fix Iceberg read optimization returning NULLs for stats-less manifests#1991
zvonand wants to merge 1 commit into
antalya-26.3from
fix/antalya-26.3/stateless-manifests

Conversation

@zvonand

@zvonand zvonand commented Jun 30, 2026

Copy link
Copy Markdown
Member

C++ change similar to #1895 (without unneeded check)

tests added

Closes #1545

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Fix Iceberg read optimization returning NULL for every column when reading manifests without per-file column statistics (e.g. pyiceberg with default settings). Affects icebergLocal, icebergS3, icebergAzure, icebergHDFS, and all *Cluster variants.

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • Aarch64 tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • OAuth (5m)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • S3 Export (2h)
  • Swarms (30m)
  • Tiered Storage (2h)

When an Iceberg manifest carries no per-column statistics, the parsed
`DataFileMetaInfo::columns_info` is empty. The read optimization in
`StorageObjectStorageSource::createReader` misread this as "every requested
column is absent from the file": it replaced each nullable column with a
constant `NULL` and set `need_only_count`, so the reader returned correct row
counts but all-NULL values — silent data loss.

Gate the absent-column-to-NULL loop on a non-empty `columns_info` so that
stats-less manifests fall through to the regular reader, which reads present
columns normally and resolves schema-evolution-absent columns via
`IcebergMetadata::getInitialSchemaByPath`.

Affects `icebergLocal`, `icebergS3`, `icebergAzure`, `icebergHDFS` and their
`*Cluster` variants. Antalya-only, introduced by #1069.

Add stateless test `04302_iceberg_read_optimization_no_column_stats` with a
checked-in stats-less Iceberg fixture and a `generate.py` that reproduces it.

C++ change taken from #1895
Closes #1545

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown

Workflow [PR], commit [1eccfc1]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Read using IcebergLocal returns Nulls instead of actual rows

1 participant