@damian0815 damian0815 commented Oct 29, 2025

Add a tool to check whether the row group .min / .max statistics for a particular column (e.g. url_surtkey) are strictly increasing within a particular parquet file or collection of parquet files. See the README for more information and limitations; in particular, this does not check if the rows are sorted, only that the row groups' min/max within a single parquet file are strictly increasing. The tool is intended to help check for #12.

  • Initial implementation
  • Unit tests
  • GitHub workflow
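
For readers who want the idea in a few lines, here is a minimal sketch of the check described above. It is not the code from this PR: the function names `first_out_of_order` and `read_bounds` are hypothetical, and `read_bounds` assumes every row group carries min/max statistics for the column. It reads only the Parquet footer via pyarrow, which is imported lazily so the comparison logic can be tried without pyarrow installed.

```python
from typing import List, Optional, Sequence, Tuple


def first_out_of_order(bounds: Sequence[Tuple[str, str]]) -> Optional[int]:
    """Return the index of the first row group whose min is below the
    previous row group's max, or None if there is no such row group."""
    prev_max = None
    for index, (group_min, group_max) in enumerate(bounds):
        if prev_max is not None and prev_max > group_min:
            return index
        prev_max = group_max
    return None


def read_bounds(path: str, column_name: str) -> List[Tuple[str, str]]:
    """Collect (min, max) per row group from the Parquet footer only.

    Assumption: statistics with min/max exist for every row group.
    """
    import pyarrow.parquet as pq  # lazy import; only needed for real files

    metadata = pq.ParquetFile(path).metadata
    bounds = []
    for i in range(metadata.num_row_groups):
        row_group = metadata.row_group(i)
        for j in range(row_group.num_columns):
            column = row_group.column(j)
            if column.path_in_schema == column_name:
                bounds.append((column.statistics.min, column.statistics.max))
                break
    return bounds
```

Usage would look like `first_out_of_order(read_bounds("some-file.parquet", "url_surtkey"))`, where the file name is a placeholder.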

@damian0815 damian0815 force-pushed the damian/feat/is_table_sorted branch from fec818b to 2914f42 Compare October 29, 2025 14:31
@damian0815
Contributor Author

TBD: is it expected that URLs are sorted across the parquet files, i.e. should max in part-00001-....gz.parquet always be <= min in part-00002-....gz.parquet?

@damian0815 damian0815 marked this pull request as ready for review October 29, 2025 14:58
@damian0815 damian0815 requested a review from wumpus October 30, 2025 14:32
Contributor

@sebastian-nagel sebastian-nagel left a comment

Thanks, @damian0815!

Would you mind adding some context to the description of the PR? Namely, that it checks for #12, and a short description of how the tool works. The latter could also go in the README or the command-line help of the tool.

Since the tool only checks the row group metadata for whether the min/max values of a single column overlap, the name is_table_sorted.py is not quite precise and may raise expectations it cannot deliver. Maybe the name and the corresponding function names can be adjusted?

I've successfully tested the tool on data from CC-MAIN-2022-05 (#12 not yet fixed) and CC-MAIN-2022-21 (#12 fixed):

  • it failed to detect that the column url_surtkey is not properly sorted on some input files of the first crawl. This is necessarily the case when a file has only a single row group, which is not unlikely for the robots.txt partition, e.g. this file.
  • but if run over more or all files the test works.
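
The single-row-group limitation can be reproduced without any Parquet file. The sketch below is an illustration only: `row_group_bounds` and `bounds_in_order` are hypothetical helpers and the SURT keys are made up. It mimics the min/max statistics a writer would store and shows that unsorted rows pass the metadata check unless the row groups are small enough to expose the disorder.

```python
from typing import List, Sequence, Tuple


def row_group_bounds(rows: Sequence[str], rows_per_group: int) -> List[Tuple[str, str]]:
    """Split rows into consecutive row groups and return (min, max) per group."""
    bounds = []
    for start in range(0, len(rows), rows_per_group):
        group = rows[start:start + rows_per_group]
        bounds.append((min(group), max(group)))
    return bounds


def bounds_in_order(bounds: Sequence[Tuple[str, str]]) -> bool:
    """True if no row group starts below the previous row group's max."""
    return all(prev[1] <= cur[0] for prev, cur in zip(bounds, bounds[1:]))


# Made-up keys, deliberately unsorted within each pair.
rows = ["com,example)/b", "com,example)/a", "org,example)/d", "org,example)/c"]

single_group = bounds_in_order(row_group_bounds(rows, 4))  # one row group
two_groups = bounds_in_order(row_group_bounds(rows, 2))    # disorder hidden inside groups
one_per_row = bounds_in_order(row_group_bounds(rows, 1))   # disorder visible
```

With one or two row groups the check passes even though `rows` is not sorted; only with one row per group does it fail.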

…hecking

Signed-off-by: Damian Stewart <ot@damianstewart.com>
@damian0815 damian0815 changed the title Check if tables are sorted Add a tool to check if row groups .min / .max are strictly increasing within a parquet file Oct 31, 2025
@damian0815
Contributor Author

I have updated the title and description to better correspond with what the tool does.

@damian0815
Contributor Author

> TBD: is it expected that URLs are sorted across the parquet files, i.e. should max in part-00001-....gz.parquet always be <= min in part-00002-....gz.parquet?

Determined: this is not intended, i.e. part-00001.max may be out of order w.r.t. part-00002.min.

@jenenglish

@damian0815 This is waiting on @sebastian-nagel to re-review with your changes, correct?

@damian0815
Contributor Author

@jenenglish that's correct yes

Contributor

@sebastian-nagel sebastian-nagel left a comment

Hi @damian0815, thanks! Please, see the inline comments...

    for row_group_index in range(pf.num_row_groups):
        row_group = pf.metadata.row_group(row_group_index)
        column = row_group.column(sort_column_index)
        if prev_max is not None and prev_max > column.statistics.min:
Contributor

After thinking about this strict condition: values in url_surtkey are not unique, and it may happen (although the probability is low) that two rows with the same SURT key end up in two different row groups. Then the tool reports an error, although the column might be perfectly sorted.

Contributor Author

@damian0815 damian0815 Dec 22, 2025

I think this is fine as-is? In the case where prev_max == column.statistics.min, this condition evaluates to False and no error is reported.

Contributor Author

Added a unit test to validate that this case is indeed supported
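
To make the boundary case in this thread concrete, here is a small sketch (not the unit test added in the PR; `count_violations` and the sample keys are hypothetical). It runs the same loop with two different comparisons: `>` as in the diff, and `>=` as a truly strict variant. Only the strict variant flags a SURT key that straddles two row groups.

```python
import operator
from typing import Callable, Sequence, Tuple


def count_violations(
    bounds: Sequence[Tuple[str, str]],
    is_violation: Callable[[str, str], bool],
) -> int:
    """Count row groups where is_violation(prev_max, current_min) holds."""
    prev_max = None
    violations = 0
    for group_min, group_max in bounds:
        if prev_max is not None and is_violation(prev_max, group_min):
            violations += 1
        prev_max = group_max
    return violations


# The same key ends one row group and starts the next.
bounds = [
    ("com,example)/a", "com,example)/m"),
    ("com,example)/m", "com,example)/z"),
]

with_gt = count_violations(bounds, operator.gt)  # comparison used in the diff
with_ge = count_violations(bounds, operator.ge)  # stricter alternative
```

Here `with_gt` is 0 and `with_ge` is 1, which matches the observation that equal boundary values are accepted.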

    for row_group_index in range(pf.num_row_groups):
        row_group = pf.metadata.row_group(row_group_index)
        column = row_group.column(sort_column_index)
        if prev_max is not None and prev_max > column.statistics.min:
Contributor

The tool must be prepared for row groups that have no min/max statistics, which happens with overlong URLs / SURT keys. Cf. PARQUET-1685:

    if prev_max is not None and prev_max > column.statistics.min:
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: '>' not supported between instances of 'str' and 'NoneType'

Seen on https://data.commoncrawl.org/cc-index/table/cc-main/warc/crawl=CC-MAIN-2022-33/subset=crawldiagnostics/part-00272-d466b69e-be2b-4525-ac34-1b10d57329da.c000.gz.parquet

Contributor Author

ok, i can reproduce - what's the expected behaviour if column.statistics.min is None?

Contributor Author

I added logic to skip the row.

Contributor

👍 If there are no statistics it should not fail. Skipping and reporting the row is fine.
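
One way the "skip and report" behaviour agreed on in this thread could look is sketched below. This is an illustration, not the merged code: `check_with_missing_stats` is a hypothetical name, and it assumes that the max of the last row group with statistics carries over a skipped row group. Missing statistics are modelled as `None`, matching the `NoneType` in the traceback above.

```python
from typing import List, Optional, Sequence, Tuple

Bounds = Tuple[Optional[str], Optional[str]]


def check_with_missing_stats(bounds: Sequence[Bounds]) -> Tuple[List[int], List[int]]:
    """Return (violations, skipped) as lists of row group indices.

    Row groups without min/max are skipped and reported, not treated as errors.
    """
    prev_max = None
    violations: List[int] = []
    skipped: List[int] = []
    for index, (group_min, group_max) in enumerate(bounds):
        if group_min is None or group_max is None:
            skipped.append(index)
            continue  # assumption: prev_max carries over the skipped group
        if prev_max is not None and prev_max > group_min:
            violations.append(index)
        prev_max = group_max
    return violations, skipped


ok = check_with_missing_stats([("a", "c"), (None, None), ("d", "f")])
bad = check_with_missing_stats([("a", "c"), (None, None), ("b", "d")])
```

Carrying `prev_max` over the gap means an out-of-order row group after a skipped one is still caught, as in the `bad` example.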

@wumpus
Member

wumpus commented Nov 20, 2025

This is a little late, but, if you want to support local files, s3, and https, please use the smart_open package. Don't roll your own.

@wumpus
Member

wumpus commented Dec 11, 2025

... and to contradict myself, turns out that fsspec is a better choice than smart_open. @damian0815 I think this is almost ready to ship if you make these few minor changes.

@wumpus
Member

wumpus commented Dec 22, 2025

@sebastian-nagel thank you for the example of overly-long values causing problems! For this particular situation I'm happy to ignore the lack of statistics, as long as it's rare.

@wumpus
Member

wumpus commented Dec 22, 2025

@damian0815 this PR is ready for a revision

Contributor

@sebastian-nagel sebastian-nagel left a comment

Thanks, @damian0815! Looks good to me.

@damian0815 damian0815 merged commit 940a084 into main Dec 22, 2025
7 checks passed
@damian0815 damian0815 deleted the damian/feat/is_table_sorted branch December 22, 2025 15:21