Fix row index miscalculation in ParquetLoader by vini-fda · Pull Request #810 · Lightning-AI/litData

vini-fda · 2026-04-24T03:32:34Z

Before submitting

Was this discussed/agreed via a GitHub issue? (no need for typos and docs improvements)
Did you read the contributor guideline, Pull Request section?
Did you make sure to update the docs?
Did you write any new necessary tests?

What does this PR do?

Fixes #809.

This replaces the uniform-size arithmetic with a per-chunk prefix sum of row-group sizes, computed once when the ParquetFile is first opened. The loader then uses bisect.bisect_right(offsets, row_index) - 1 to locate the group and row_index - offsets[group] for the offset inside it.
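The lookup described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the code from the PR: the helper names `build_offsets` and `locate` are hypothetical, and the row-group sizes are hard-coded rather than read from a file. The "uniform" division shown for contrast is an assumption about what the old arithmetic looked like.

```python
import bisect
from itertools import accumulate


def build_offsets(row_group_sizes):
    """Prefix sum of row-group sizes: offsets[g] is the first row of group g.

    With pyarrow, the sizes could be collected from
    ParquetFile.metadata.row_group(i).num_rows for each row group i.
    """
    return [0, *accumulate(row_group_sizes)]


def locate(offsets, row_index):
    """Map a chunk-level row index to (row_group, offset_inside_group)."""
    if not 0 <= row_index < offsets[-1]:
        raise IndexError(f"row_index {row_index} out of range")
    # bisect_right finds the first offset strictly greater than row_index,
    # so the group containing row_index is the one just before it.
    group = bisect.bisect_right(offsets, row_index) - 1
    return group, row_index - offsets[group]


# Hypothetical chunk with three uneven row groups (10 rows in total).
sizes = [3, 5, 2]
offsets = build_offsets(sizes)  # [0, 3, 8, 10]

# Row 9 is the second row of the last group.
group, inner = locate(offsets, 9)

# Assumed old behaviour: dividing by the first group's size yields a
# group index of 3, which does not exist in a three-group file.
uniform_group = 9 // sizes[0]
```

Using `bisect_right` rather than `bisect_left` matters at group boundaries: a row index equal to an offset belongs to the group that starts there, and empty row groups (repeated offsets) are skipped automatically.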

The same num_rows_per_row_group value is also used in the cache-eviction check at litData/src/litdata/streaming/item_loader.py, line 749 in 1fdfad7:

if read_count >= num_rows_per_row_group:

But that check needs the actual size of the current group (offsets[g+1] - offsets[g]). Otherwise, with uneven groups, memory either leaks (the read count never reaches the threshold) or is freed too early (forcing re-reads).
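To make the two failure modes concrete, here is a small sketch of an eviction check driven by the per-group size. The function names `group_size` and `should_evict` are hypothetical and do not come from litdata; the single uniform threshold used for comparison is assumed to be the size of the first row group.

```python
from itertools import accumulate


def group_size(offsets, g):
    """Actual number of rows in row group g, from the prefix-sum offsets."""
    return offsets[g + 1] - offsets[g]


def should_evict(read_count, offsets, g):
    """Evict a cached row group only once all of its rows have been read."""
    return read_count >= group_size(offsets, g)


# Hypothetical chunk with uneven row groups.
sizes = [3, 5, 2]
offsets = [0, *accumulate(sizes)]  # [0, 3, 8, 10]

# Assumed uniform threshold: the size of the first row group.
uniform = sizes[0]

# Group 1 holds 5 rows: the uniform check fires after 3 reads (too early),
# while the per-group check waits for all 5.
freed_too_early = 3 >= uniform and not should_evict(3, offsets, 1)

# Group 2 holds 2 rows: the uniform check never fires, because the read
# count stops at 2, while the per-group check fires as soon as both are read.
never_freed = not (2 >= uniform) and should_evict(2, offsets, 2)
```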


codecov · 2026-04-24T09:48:21Z

Codecov Report

❌ Patch coverage is 89.47368% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 81%. Comparing base (1fdfad7) to head (3bdc6b5).

Additional details and impacted files

@@         Coverage Diff         @@
##           main   #810   +/-   ##
===================================
  Coverage    81%    81%           
===================================
  Files        54     54           
  Lines      7617   7630   +13     
===================================
+ Hits       6143   6157   +14     
+ Misses     1474   1473    -1


fix bug where parquet tables with non-uniform row group sizes get their index miscalculated (b1068e7)

vini-fda requested review from justusschock and tchaton as code owners April 24, 2026 03:32

remove test with shuffle due to ruff lint (3bdc6b5)

Borda approved these changes Apr 24, 2026

vini-fda wants to merge 2 commits into Lightning-AI:main from vini-fda:fix/parquet-loader-outofbounds
