-
Notifications
You must be signed in to change notification settings - Fork 14
Add a tool to check if row groups .min / .max are strictly increasing within a parquet file
#40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
fec818b to
2914f42
Compare
|
TBD: is it expected that urls are sorted in between the parquet files, ie should |
sebastian-nagel
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @damian0815!
Would you mind adding some context to the description of the PR? Namely checking for #12 and a short description how the tool works. The latter could be also in the README or the command-line help of the tool.
Since the tools checks only the row group metadata whether the min/max values of a single column overlap, its name is_table_sorted.py is not quite precise resp. may raise undeliverable expectations. Maybe the name and corresponding function names can be adjusted?
I've successfully tested the tool on data from CC-MAIN-2022-05 (#12 not yet fixed) and CC-MAIN-2022-21 (#12 fixed):
- it failed to detect that the column url_surtkey is not properly sorted on some input files of the first crawl. Definitely, if there is only a single row group. That's not unlikely for the robots.txt partition, e.g. this file.
- but if run over more or all files the test works.
…hecking Signed-off-by: Damian Stewart <ot@damianstewart.com>
.min / .max are strictly increasing within a parquet file
|
I have updated the title and description to better correspond with what the tool does. |
Determined: this is not intended, ie part-00001.max may be out of order w.r.t part-00002.min |
|
@damian0815 This is waiting on @sebastian-nagel to re-review with your changes, correct? |
|
@jenenglish that's correct yes |
sebastian-nagel
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @damian0815, thanks! Please, see the inline comments...
| for row_group_index in range(pf.num_row_groups): | ||
| row_group = pf.metadata.row_group(row_group_index) | ||
| column = row_group.column(sort_column_index) | ||
| if prev_max is not None and prev_max > column.statistics.min: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After thinking about this strict condition: values in url_surtkey are not unique and it may happen (although the probability is low) that two rows with the same SURT key end up in two row groups. Than the tool reports an error, although the column might be perfectly sorted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thnk this is fine as-is? in the case where prev_max == column.statistics.min this condition fails and no error is reported
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a unit test to validate that this case is indeed supported
| for row_group_index in range(pf.num_row_groups): | ||
| row_group = pf.metadata.row_group(row_group_index) | ||
| column = row_group.column(sort_column_index) | ||
| if prev_max is not None and prev_max > column.statistics.min: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Must prepared that no min/max statistics are available in a row group because of overlong URLs / SURT keys. Cf. PARQUET-1685:
if prev_max is not None and prev_max > column.statistics.min:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: '>' not supported between instances of 'str' and 'NoneType'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, i can reproduce - what's the expected behaviour if column.statistics.min is None?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added logic to skip the row.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 If there are no statistics it should not fail. Skipping and reporting the row is fine.
|
This is a little late, but, if you want to support local files, s3, and https, please use the smart_open package. Don't roll your own. |
|
... and to contradict myself, turns out that fsspec is a better choice than smart_open. @damian0815 I think this is almost ready to ship if you make these few minor changes. |
|
@sebastian-nagel thank you for the example of overly-long values causing problems! For this particular situation I'm happy to ignore the lack of statistics, as long as it's rare. |
|
@damian0815 this PR is ready for a revision |
sebastian-nagel
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @damian0815! Looks good to me.
| for row_group_index in range(pf.num_row_groups): | ||
| row_group = pf.metadata.row_group(row_group_index) | ||
| column = row_group.column(sort_column_index) | ||
| if prev_max is not None and prev_max > column.statistics.min: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 If there are no statistics it should not fail. Skipping and reporting the row is fine.
Add a tool to check if row groups
.min/.maxfor a particular column (egurl_surtkey) are strictly increasing within a particular parquet file or collection of parquet files; see README for more information and limitations - in particular, this does not check of the rows are sorted, just that the row groups min/max within a single parquet file are strictly increasing. The tool is intended to help check for #12.