Fix row index miscalculation in ParquetLoader #810

Open
vini-fda wants to merge 2 commits into Lightning-AI:main from vini-fda:fix/parquet-loader-outofbounds

vini-fda commented Apr 24, 2026

Before submitting
  • Was this discussed/agreed via a GitHub issue? (not needed for typo fixes and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

Fixes #809.

This replaces the uniform-size arithmetic with a per-chunk prefix sum of row-group sizes, computed once when the ParquetFile is first opened. The loader then uses bisect.bisect_right(offsets, row_index) - 1 to locate the row group and row_index - offsets[group] to get the row's offset inside it.
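
As an illustration of the lookup described above, here is a minimal standalone sketch. It is not the code from this PR: build_offsets, locate_row, and the example sizes are hypothetical names and values, and the divmod line only assumes what the old uniform-size arithmetic roughly looked like.

```python
import bisect
from itertools import accumulate


def build_offsets(row_group_sizes):
    """Prefix sums of row-group sizes: offsets[g] is the first global row of group g."""
    return [0, *accumulate(row_group_sizes)]


def locate_row(offsets, row_index):
    """Map a global row index to (row_group, row_within_group)."""
    if not 0 <= row_index < offsets[-1]:
        raise IndexError(f"row {row_index} out of range for {offsets[-1]} rows")
    # bisect_right moves past equal offsets, so empty groups are never selected.
    group = bisect.bisect_right(offsets, row_index) - 1
    return group, row_index - offsets[group]


# Hypothetical chunk with uneven row groups of 3, 5 and 2 rows.
sizes = [3, 5, 2]
offsets = build_offsets(sizes)  # [0, 3, 8, 10]

# Uniform arithmetic (assumed shape of the old logic), using the first group's size:
uniform = divmod(9, sizes[0])   # (3, 0): group 3 does not exist
fixed = locate_row(offsets, 9)  # (2, 1): second row of the last group
```

With uneven groups, the uniform version points at a row group past the end of the file, which is consistent with the out-of-bounds error in #809, while the prefix-sum lookup stays within the real group boundaries.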

The same num_rows_per_row_group value is also used in the cache-eviction check:

if read_count >= num_rows_per_row_group:

That check needs the actual size of the current group (offsets[g+1] - offsets[g]). Otherwise, with uneven groups, memory either leaks (the read count never reaches the threshold) or is freed too early (forcing re-reads).
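
To make the eviction point concrete, here is a rough sketch of a read counter that compares against each group's real size. RowGroupCache and record_read are hypothetical names for illustration only, and the cached value is a placeholder rather than real row-group data.

```python
class RowGroupCache:
    """Toy cache that evicts a row group once all of its rows have been read."""

    def __init__(self, offsets):
        self.offsets = offsets  # prefix sums of row-group sizes
        self.cached = {}        # group -> placeholder for loaded data
        self.read_counts = {}   # group -> number of rows read so far

    def record_read(self, group):
        """Count one row read from `group`; return True if the group was evicted."""
        self.cached.setdefault(group, object())
        self.read_counts[group] = self.read_counts.get(group, 0) + 1
        # Actual size of this group, not a uniform num_rows_per_row_group.
        size = self.offsets[group + 1] - self.offsets[group]
        if self.read_counts[group] >= size:
            del self.cached[group]
            del self.read_counts[group]
            return True
        return False


cache = RowGroupCache([0, 3, 8, 10])  # groups of 3, 5 and 2 rows
evictions = [cache.record_read(1) for _ in range(5)]
```

In this sketch the 5-row group is evicted on its fifth read. A fixed threshold of 3 would have dropped it after three reads, and a fixed threshold of 5 would never release the 2-row group.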

codecov Bot commented Apr 24, 2026

Codecov Report

❌ Patch coverage is 89.47368% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 81%. Comparing base (1fdfad7) to head (3bdc6b5).

Additional details and impacted files
@@         Coverage Diff         @@
##           main   #810   +/-   ##
===================================
  Coverage    81%    81%           
===================================
  Files        54     54           
  Lines      7617   7630   +13     
===================================
+ Hits       6143   6157   +14     
+ Misses     1474   1473    -1     

Development

Successfully merging this pull request may close these issues.

polars.exceptions.OutOfBoundsError in ParquetLoader with low_memory=True
