GH-50156: [C++][Parquet] Ignore min/max for unknown column order #50157
wgtmac wants to merge 5 commits into
Pull request overview
This PR updates the Parquet C++ reader to treat unsupported/unknown Parquet ColumnOrder as unsafe for interpreting min/max statistics, preventing incorrect filtering when the ordering semantics are unknown.
Changes:
- Add an internal `ColumnOrder::UNKNOWN` state and propagate it when file metadata contains an unsupported column order.
- Gate chunk-level statistics min/max and page-index min/max usage based on `(column_order, sort_order)` trustability.
- Add regression tests covering unsupported column order, missing column order (legacy behavior), and page-index trust guards.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| cpp/src/parquet/types.h | Extend ColumnOrder enum with UNKNOWN and add unknown_ singleton. |
| cpp/src/parquet/types.cc | Define ColumnOrder::unknown_; minor macro formatting. |
| cpp/src/parquet/statistics_test.cc | Add regression assertion that unknown sort order yields no min/max. |
| cpp/src/parquet/page_index.cc | Add guard to return nullptr for page indexes when min/max can’t be trusted. |
| cpp/src/parquet/page_index_test.cc | Add tests asserting page indexes are ignored for unknown/undefined column order and unknown sort order. |
| cpp/src/parquet/metadata.cc | Introduce min/max “mode” selection (normal/legacy/discard) and apply it when decoding stats; map unsupported column order to UNKNOWN. |
| cpp/src/parquet/metadata_test.cc | Add tests for unknown column order ignoring min/max and missing column order using legacy min/max. |
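
The mode selection described for `metadata.cc` can be pictured with a small standalone sketch. The enums below are hypothetical stand-ins, not the real Arrow declarations; only `kNormal` appears in the diff context, so the other enumerator names and the exact policy are assumptions based on the "normal/legacy/discard" wording. The actual conditions in the PR may differ, for example legacy values may only be trusted for some sort orders.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration only; these are not the actual
// Arrow/Parquet declarations.
enum class ColumnOrderKind { kUndefined, kTypeDefined, kUnknown };
enum class SortOrderKind { kSigned, kUnsigned, kUnknown };
enum class StatsMinMaxMode { kDiscard, kLegacy, kNormal };

// Assumed policy: anything unknown discards min/max, a missing (undefined)
// column order keeps the legacy behavior, and a type-defined order is normal.
constexpr StatsMinMaxMode SelectMinMaxMode(ColumnOrderKind column_order,
                                           SortOrderKind sort_order) {
  if (column_order == ColumnOrderKind::kUnknown ||
      sort_order == SortOrderKind::kUnknown) {
    return StatsMinMaxMode::kDiscard;
  }
  if (column_order == ColumnOrderKind::kUndefined) {
    return StatsMinMaxMode::kLegacy;
  }
  return StatsMinMaxMode::kNormal;
}

// Min/max is usable whenever the selected mode is not "discard".
constexpr bool CanTrustMinMax(ColumnOrderKind column_order,
                              SortOrderKind sort_order) {
  return SelectMinMaxMode(column_order, sort_order) != StatsMinMaxMode::kDiscard;
}

// Usage example: an unknown column order is never trusted, while a missing
// one falls back to the legacy mode.
inline void CheckModeSelection() {
  assert(!CanTrustMinMax(ColumnOrderKind::kUnknown, SortOrderKind::kSigned));
  assert(SelectMinMaxMode(ColumnOrderKind::kUndefined, SortOrderKind::kSigned) ==
         StatsMinMaxMode::kLegacy);
}
```

Keeping the decision in one function means the chunk-statistics path and the page-index path can share the same rule instead of each re-deriving it.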

      kNormal,
    };

    inline StatsMinMaxMode GetStatsMinMaxMode(const ColumnDescriptor& descr) {

I think `inline` is meaningless here, but there are many `inline` functions in this file. We can remove them in a future PR.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

      if (!CanTrustPageIndexMinMax(descr)) {
        return nullptr;
      }

    }  // namespace

    static EncodedStatistics EncodedStatisticsFromThrift(const format::Statistics& statistics,

Just put this in the anonymous namespace above instead of marking it static.

        throw ParquetException("Invalid ColumnIndex boundary_order");
      }
      if (!CanTrustPageIndexMinMax(descr)) {
        return nullptr;

It doesn't seem right for a factory function to be allowed to return NULL. It's also an API change.
Note we're raising an exception if we encounter an unknown boundary order above.
But I think the bottom line is that the ColumnIndex object only gives access to raw encoded statistics anyway. It's up to the caller to decide if they know how to handle them. So why check this here?
Instead, perhaps ColumnDescriptor can expose a method can_use_statistics or something similar.
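
To make the suggestion above concrete, a caller-side check might look roughly like the following sketch. Every type here is a hypothetical simplification, not the real `ColumnDescriptor` or `ColumnIndex` API, and `can_use_statistics` is only the name floated in the comment. Byte-wise `std::string` comparison stands in for whatever ordering the column actually defines.

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified descriptor; the real ColumnDescriptor differs.
struct ColumnDescriptorSketch {
  bool column_order_known;
  bool sort_order_known;

  // Name taken from the review suggestion; the semantics are assumed.
  bool can_use_statistics() const {
    return column_order_known && sort_order_known;
  }
};

// Raw encoded min/max for one page, as a column index might expose them.
struct PageStats {
  std::string encoded_min;
  std::string encoded_max;
};

// The index always hands out raw values; the caller decides whether to use
// them. Returns true only when the page can be skipped safely for `value`.
inline bool CanSkipPage(const ColumnDescriptorSketch& descr,
                        const PageStats& page, const std::string& value) {
  if (!descr.can_use_statistics()) {
    return false;  // never prune on statistics we cannot interpret
  }
  // Stand-in ordering: std::string's lexicographic comparison.
  return value < page.encoded_min || value > page.encoded_max;
}

// Usage example: the same page is skippable only when the order is known.
inline void CheckCallerSideGuard() {
  const PageStats page{"b", "d"};
  assert(CanSkipPage(ColumnDescriptorSketch{true, true}, page, "z"));
  assert(!CanSkipPage(ColumnDescriptorSketch{false, true}, page, "z"));
}
```

With this shape the factory never needs to return null, and readers that do not understand the ordering simply fall back to scanning the page.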

    }  // namespace

    static EncodedStatistics EncodedStatisticsFromThrift(const format::Statistics& statistics,
                                                         StatsMinMaxMode min_max) {

Also, can we reconcile this with the existing function in thrift_internal.h?
Rationale for this change
Parquet column order defines how min/max statistics should be interpreted. If a reader sees an unsupported ColumnOrder, it cannot safely use chunk statistics or page index min/max values for that column.
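
As a small illustration of why the ordering matters, the sketch below computes min/max over the same raw bytes under signed and unsigned comparison. It is a standalone toy example with made-up helper names, not code from this PR or from the Parquet library.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Sample raw byte values: 0x80 is 128 unsigned but -128 as a signed byte.
inline std::vector<uint8_t> SampleBytes() { return {0x01, 0x7F, 0x80}; }

// {min, max} under unsigned comparison. Requires a non-empty input.
inline std::pair<uint8_t, uint8_t> MinMaxUnsigned(const std::vector<uint8_t>& v) {
  auto it = std::minmax_element(v.begin(), v.end());
  return {*it.first, *it.second};
}

// {min, max} under signed comparison. Requires a non-empty input.
inline std::pair<uint8_t, uint8_t> MinMaxSigned(const std::vector<uint8_t>& v) {
  auto less = [](uint8_t a, uint8_t b) {
    return static_cast<int8_t>(a) < static_cast<int8_t>(b);
  };
  auto it = std::minmax_element(v.begin(), v.end(), less);
  return {*it.first, *it.second};
}

// Range check as a reader assuming unsigned order would perform it.
inline bool MayContainUnsigned(uint8_t min, uint8_t max, uint8_t value) {
  return min <= value && value <= max;
}

// Usage example: statistics written under signed order, then read back as if
// they were unsigned, wrongly exclude a value that is present in the data.
inline void CheckOrderMismatch() {
  const auto written = MinMaxSigned(SampleBytes());  // {0x80, 0x7F}
  assert(!MayContainUnsigned(written.first, written.second, 0x01));
}
```

Here the value `0x01` is in the data, yet a reader applying the wrong order concludes it cannot be, which is the kind of incorrect filtering that ignoring untrusted min/max avoids.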
What changes are included in this PR?
- Add a `ColumnOrder::UNKNOWN` state for unsupported Thrift column order.
- Keep a missing column order as `UNDEFINED`, preserving legacy min/max behavior.

Are these changes tested?
Added regression tests for unsupported column order, missing column order, and page index guards.
Are there any user-facing changes?
No.