Add a tool to check if row groups `.min` / `.max` are strictly increasing within a parquet file #40

damian0815 · 2025-10-29T13:24:22Z

Add a tool to check whether the row group `.min` / `.max` statistics for a particular column (e.g. `url_surtkey`) are strictly increasing within a particular parquet file or collection of parquet files. See the README for more information and limitations. In particular, this does not check whether the rows are sorted, only that the row group min/max values within a single parquet file are strictly increasing. The tool is intended to help check for #12.

Initial implementation
Unit tests
GitHub workflow
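As a rough illustration of the check described above, here is a minimal sketch that works on plain `(min, max)` pairs rather than on a real parquet file. The function name `first_overlap` and the sample SURT keys are hypothetical; in the tool itself the values come from the row group metadata (`pf.metadata.row_group(i).column(j).statistics`).

```python
from typing import List, Optional, Tuple

# Hypothetical stand-in for row-group statistics: one (min, max) pair per row group.
RowGroupStats = Tuple[str, str]


def first_overlap(stats: List[RowGroupStats]) -> Optional[int]:
    """Return the index of the first row group whose min lies below the
    previous row group's max, or None if no such row group exists."""
    prev_max = None
    for index, (group_min, group_max) in enumerate(stats):
        if prev_max is not None and prev_max > group_min:
            return index
        prev_max = group_max
    return None


# Made-up SURT-like keys, for illustration only.
ordered = [("com,a)/", "com,f)/"), ("com,g)/", "com,m)/"), ("com,n)/", "com,z)/")]
overlapping = [("com,a)/", "com,m)/"), ("com,g)/", "com,z)/")]

print(first_overlap(ordered))      # None: no row group overlaps its predecessor
print(first_overlap(overlapping))  # 1: the second row group starts below the first one's max
```

Only the boundaries between consecutive row groups are compared, so the order of rows inside a row group is never inspected.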

damian0815 · 2025-10-29T14:50:14Z

TBD: is it expected that urls are sorted in between the parquet files, ie should max in part-00001-....gz.parquet always be <= min in part-00002-....gz.parquet?

sebastian-nagel

Thanks, @damian0815!

Would you mind adding some context to the description of the PR? Namely, that it checks for #12, and a short description of how the tool works. The latter could also go in the README or the command-line help of the tool.

Since the tool only checks the row group metadata to see whether the min/max values of a single column overlap, its name is_table_sorted.py is not quite precise and may raise expectations it cannot deliver. Maybe the name and corresponding function names can be adjusted?

I've successfully tested the tool on data from CC-MAIN-2022-05 (#12 not yet fixed) and CC-MAIN-2022-21 (#12 fixed):

It failed to detect that the column url_surtkey is not properly sorted on some input files of the first crawl. This is certain to happen if a file has only a single row group, which is not unlikely for the robots.txt partition, e.g. this file.
But if run over more or all files, the test works.
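The single-row-group limitation can be reproduced without any parquet file. The sketch below simulates row groups by chunking a list of keys; the helper names and the keys are hypothetical and are not taken from the tool.

```python
from typing import List, Tuple


def row_group_stats(keys: List[str], rows_per_group: int) -> List[Tuple[str, str]]:
    """Simulate row-group statistics by chunking keys and taking min/max per chunk."""
    chunks = (keys[i:i + rows_per_group] for i in range(0, len(keys), rows_per_group))
    return [(min(chunk), max(chunk)) for chunk in chunks]


def min_max_increasing(stats: List[Tuple[str, str]]) -> bool:
    """True if no row group starts below the max of the row group before it."""
    return all(prev[1] <= cur[0] for prev, cur in zip(stats, stats[1:]))


unsorted_keys = ["org,d)/", "org,a)/", "org,c)/", "org,b)/"]

# One row group: there is no boundary to compare, so the check passes vacuously.
print(min_max_increasing(row_group_stats(unsorted_keys, 4)))  # True

# Two row groups: the overlap between them becomes visible.
print(min_max_increasing(row_group_stats(unsorted_keys, 2)))  # False

# Rows unsorted only inside each row group still pass the check.
locally_unsorted = ["org,b)/", "org,a)/", "org,d)/", "org,c)/"]
print(min_max_increasing(row_group_stats(locally_unsorted, 2)))  # True
```

This is why a clean result on one file is weak evidence, while running over many files makes it likely that at least one boundary exposes unsorted data.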

src/util/is_table_sorted.py


damian0815 · 2025-10-31T14:48:11Z

I have updated the title and description to better correspond with what the tool does.

damian0815 · 2025-10-31T14:52:47Z

TBD: is it expected that urls are sorted in between the parquet files, ie should max in part-00001-....gz.parquet always be <= min in part-00002-....gz.parquet?

Determined: this is not intended, i.e. part-00001.max may be out of order w.r.t. part-00002.min

jenenglish · 2025-11-12T18:42:54Z

@damian0815 This is waiting on @sebastian-nagel to re-review with your changes, correct?

damian0815 · 2025-11-13T22:31:32Z

@jenenglish that's correct yes

sebastian-nagel

Hi @damian0815, thanks! Please, see the inline comments...

sebastian-nagel · 2025-11-17T16:41:52Z

src/util/are_part_min_max_increasing.py

+    for row_group_index in range(pf.num_row_groups):
+        row_group = pf.metadata.row_group(row_group_index)
+        column = row_group.column(sort_column_index)
+        if prev_max is not None and prev_max > column.statistics.min:


After thinking about this strict condition: values in url_surtkey are not unique, and it may happen (although the probability is low) that two rows with the same SURT key end up in two different row groups. Then the tool reports an error, although the column might be perfectly sorted.

I think this is fine as-is? In the case where prev_max == column.statistics.min, this condition is false and no error is reported

Added a unit test to validate that this case is indeed supported
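The boundary case discussed here comes down to `>` versus `>=`. A small sketch, with hypothetical function names and a made-up key, compares the comparison from the reviewed hunk with a stricter variant that the tool does not use:

```python
def reports_error(prev_max: str, group_min: str) -> bool:
    """Comparison as in the reviewed hunk: only a real overlap is an error."""
    return prev_max > group_min


def reports_error_inclusive(prev_max: str, group_min: str) -> bool:
    """Hypothetical stricter variant that would also flag equal boundary keys."""
    return prev_max >= group_min


# The same SURT key ends one row group and starts the next one.
shared_key = "com,example)/page"

print(reports_error(shared_key, shared_key))            # False: equal keys are accepted
print(reports_error_inclusive(shared_key, shared_key))  # True: this variant would complain
print(reports_error("com,example)/z", "com,example)/a"))  # True: a real overlap
```

With `>`, a duplicate key that spans two row groups does not trigger a report, which matches the behaviour described in the reply above.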

sebastian-nagel · 2025-11-17T16:46:46Z

src/util/are_part_min_max_increasing.py

+    for row_group_index in range(pf.num_row_groups):
+        row_group = pf.metadata.row_group(row_group_index)
+        column = row_group.column(sort_column_index)
+        if prev_max is not None and prev_max > column.statistics.min:


The tool must be prepared for row groups that have no min/max statistics, which can happen because of overlong URLs / SURT keys. Cf. PARQUET-1685:

    if prev_max is not None and prev_max > column.statistics.min:
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: '>' not supported between instances of 'str' and 'NoneType'

Seen on https://data.commoncrawl.org/cc-index/table/cc-main/warc/crawl=CC-MAIN-2022-33/subset=crawldiagnostics/part-00272-d466b69e-be2b-4525-ac34-1b10d57329da.c000.gz.parquet

ok, i can reproduce - what's the expected behaviour if column.statistics.min is None?

I added logic to skip the row.

👍 If there are no statistics it should not fail. Skipping and reporting the row is fine.
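One way the agreed behaviour (skip and report, do not fail) could look is sketched below. It is not the code from the PR: the function name is hypothetical, `None` stands in for a row group without min/max statistics, and carrying the previous max across a skipped row group is an assumption of this sketch.

```python
from typing import List, Optional, Tuple

# None models a row group for which no min/max statistics are available.
Stats = Optional[Tuple[str, str]]


def check_row_groups(stats: List[Stats]) -> Tuple[List[int], List[int]]:
    """Return (indices of overlapping row groups, indices skipped for missing statistics)."""
    errors: List[int] = []
    skipped: List[int] = []
    prev_max = None
    for index, entry in enumerate(stats):
        if entry is None:
            # No statistics: report the row group as skipped instead of raising TypeError.
            skipped.append(index)
            continue
        group_min, group_max = entry
        if prev_max is not None and prev_max > group_min:
            errors.append(index)
        # Assumption: the comparison resumes from the last row group that had statistics.
        prev_max = group_max
    return errors, skipped


sample = [("a", "c"), None, ("d", "f"), ("e", "g")]
errors, skipped = check_row_groups(sample)
print(errors)   # [3]: "e" is below the previous max "f"
print(skipped)  # [1]: the row group without statistics
```

Returning the skipped indices separately lets a caller print them as warnings while still exiting successfully when `errors` is empty.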

wumpus · 2025-11-20T17:44:38Z

This is a little late, but, if you want to support local files, s3, and https, please use the smart_open package. Don't roll your own.

wumpus · 2025-12-11T18:39:29Z

... and to contradict myself, turns out that fsspec is a better choice than smart_open. @damian0815 I think this is almost ready to ship if you make these few minor changes.

wumpus · 2025-12-22T04:21:15Z

@sebastian-nagel thank you for the example of overly-long values causing problems! For this particular situation I'm happy to ignore the lack of statistics, as long as it's rare.

wumpus · 2025-12-22T04:22:13Z

@damian0815 this PR is ready for a revision


sebastian-nagel

Thanks, @damian0815! Looks good to me.

sebastian-nagel · 2025-12-22T14:42:44Z

src/util/are_part_min_max_increasing.py

+    for row_group_index in range(pf.num_row_groups):
+        row_group = pf.metadata.row_group(row_group_index)
+        column = row_group.column(sort_column_index)
+        if prev_max is not None and prev_max > column.statistics.min:


👍 If there are no statistics it should not fail. Skipping and reporting the row is fine.


damian0815 and others added 3 commits October 29, 2025 14:23

is_table_sorted initial implementation

4d4038c

wip tests

0b229d9

tests and fixes

2914f42

damian0815 force-pushed the damian/feat/is_table_sorted branch from fec818b to 2914f42 Compare October 29, 2025 14:31

Damian Stewart added 2 commits October 29, 2025 15:33

reorganise

9285ec6

file-level unit tests

6d63937

Damian Stewart added 3 commits October 29, 2025 15:52

don't fail if not filewise sorted

b120571

github workflow for python unit tests

aeffeec

fix github action

a3484ac

damian0815 marked this pull request as ready for review October 29, 2025 14:58

damian0815 requested a review from wumpus October 30, 2025 14:32

sebastian-nagel requested changes Oct 31, 2025


add README details; clarify min/max row group checking vs full file checking

Signed-off-by: Damian Stewart <ot@damianstewart.com>

62f7a9a

damian0815 changed the title ~~Check if tables are sorted~~ Add a tool to check if row groups .min / .max are strictly increasing within a parquet file Oct 31, 2025

damian0815 requested a review from sebastian-nagel October 31, 2025 14:51

sebastian-nagel requested changes Nov 17, 2025


wip: add test for case where group min == prev group max; handle null statistics

c94b835

damian0815 requested a review from sebastian-nagel December 22, 2025 14:09

sebastian-nagel approved these changes Dec 22, 2025


doc: clarify boundary conditions for utils/are_part_min_max_increasing.py

eea7ed9

damian0815 merged commit 940a084 into main Dec 22, 2025
7 checks passed

damian0815 deleted the damian/feat/is_table_sorted branch December 22, 2025 15:21


5 participants
