Skip to content

GH-50156: [C++][Parquet] Ignore min/max for unknown column order#50157

Open
wgtmac wants to merge 5 commits into
apache:mainfrom
wgtmac:unknown_order
Open

GH-50156: [C++][Parquet] Ignore min/max for unknown column order#50157
wgtmac wants to merge 5 commits into
apache:mainfrom
wgtmac:unknown_order

Conversation

@wgtmac

@wgtmac wgtmac commented Jun 11, 2026

Copy link
Copy Markdown
Member

Rationale for this change

Parquet column order defines how min/max statistics should be interpreted. If a reader sees an unsupported ColumnOrder, it cannot safely use chunk statistics or page index min/max values for that column.

What changes are included in this PR?

  • Added an internal ColumnOrder::UNKNOWN state for unsupported thrift column order.
  • Kept missing ColumnOrder as UNDEFINED, preserving legacy min/max behavior.
  • Ignored chunk-level min/max statistics when column order or sort order is unknown.
  • Ignored page indexes when column order is missing/unsupported or sort order is unknown.

Are these changes tested?

Added regression tests for unsupported column order, missing column order, and page index guards.

Are there any user-facing changes?

No.

@wgtmac wgtmac requested a review from pitrou as a code owner June 11, 2026 07:50
Copilot AI review requested due to automatic review settings June 11, 2026 07:50
@github-actions github-actions Bot added the awaiting review Awaiting review label Jun 11, 2026
@github-actions

Copy link
Copy Markdown

⚠️ GitHub issue #50156 has been automatically assigned in GitHub to PR creator.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the Parquet C++ reader to treat unsupported/unknown Parquet ColumnOrder as unsafe for interpreting min/max statistics, preventing incorrect filtering when the ordering semantics are unknown.

Changes:

  • Add an internal ColumnOrder::UNKNOWN state and propagate it when file metadata contains an unsupported column order.
  • Gate chunk-level statistics min/max and page-index min/max usage based on (column_order, sort_order) trustability.
  • Add regression tests covering unsupported column order, missing column order (legacy behavior), and page-index trust guards.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
cpp/src/parquet/types.h Extend ColumnOrder enum with UNKNOWN and add unknown_ singleton.
cpp/src/parquet/types.cc Define ColumnOrder::unknown_; minor macro formatting.
cpp/src/parquet/statistics_test.cc Add regression assertion that unknown sort order yields no min/max.
cpp/src/parquet/page_index.cc Add guard to return nullptr for page indexes when min/max can’t be trusted.
cpp/src/parquet/page_index_test.cc Add tests asserting page indexes are ignored for unknown/undefined column order and unknown sort order.
cpp/src/parquet/metadata.cc Introduce min/max “mode” selection (normal/legacy/discard) and apply it when decoding stats; map unsupported column order to UNKNOWN.
cpp/src/parquet/metadata_test.cc Add tests for unknown column order ignoring min/max and missing column order using legacy min/max.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread cpp/src/parquet/metadata_test.cc
Comment thread cpp/src/parquet/metadata.cc Outdated
kNormal,
};

inline StatsMinMaxMode GetStatsMinMaxMode(const ColumnDescriptor& descr) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think inline is meaningless here. But there are many inline in this file. We can remove them in a new PR in the future.

@github-actions github-actions Bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jun 11, 2026
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 11, 2026 08:01

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.

Comment thread cpp/src/parquet/metadata_test.cc Outdated
Copilot AI review requested due to automatic review settings June 11, 2026 08:16
Comment thread cpp/src/parquet/metadata.cc Outdated
Comment thread cpp/src/parquet/page_index.cc Outdated

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Comment on lines +982 to +984
if (!CanTrustPageIndexMinMax(descr)) {
return nullptr;
}

} // namespace

static EncodedStatistics EncodedStatisticsFromThrift(const format::Statistics& statistics,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just put this in the anonymous namespace above instead of marking it static.

throw ParquetException("Invalid ColumnIndex boundary_order");
}
if (!CanTrustPageIndexMinMax(descr)) {
return nullptr;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't seem right for a factory function to be allowed to return NULL. It's also an API change.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note we're raising an exception if we encounter an unknown boundary order above.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I think the bottom line is that the ColumnIndex object only gives access to raw encoded statistics anyway. It's up to the caller to decide if they know how to handle them. So why check this here?

Instead, perhaps ColumnDescriptor can expose a method can_use_statistics or something similar.

} // namespace

static EncodedStatistics EncodedStatisticsFromThrift(const format::Statistics& statistics,
StatsMinMaxMode min_max) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, can we reconcile this with the existing function in thrift_internal.h?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants