fix(arrow): drop-row at batch boundary in RowsToRecord by benben · Pull Request #685 · PostHog/duckgres

benben · 2026-06-05T10:27:37Z

Summary

RowsToRecord lost one row at every Arrow batch transition (batchSize=1024) for unbounded SELECTs.
The loop condition rows.Next() && count < batchSize advanced the cursor once after the last scan; that row was never Scan'd and silently skipped by the next batch call.
Swap to count < batchSize && rows.Next() so Next() is not called when the batch is already full. One-line fix.

Repro (prod)

Reported by Marce Coll. On dbt_marce.credit_purchase_events:

SELECT count(*) FROM dbt_marce.credit_purchase_events  -- 12617
SELECT credit_pool_id, customer_id FROM dbt_marce.credit_purchase_events  -- 12605
SELECT credit_pool_id FROM ... WHERE credit_pool_id = '2807'  -- 1 row, exists

Probing with LIMIT:

LIMIT 1024 → 1024  (no transition, no loss)
LIMIT 1025 → 1024  (1 transition, lose 1)
LIMIT 2048 → 2047  (1 transition, lose 1)
LIMIT 3072 → 3070  (2 transitions, lose 2)
LIMIT 12617 → 12605 (12 transitions, lose 12)

Pattern: for every batch transition after the first, lose row at index k * batchSize. The loop body never executed for that row because count < batchSize failed after rows.Next() already advanced the driver cursor.

Why nobody noticed earlier

COUNT(*) reads parquet/DuckLake metadata row counts → unaffected.
WHERE col = X returns a single row that fits in one batch → no transition → no loss.
Loss rate ~0.1% of result size; small ad-hoc queries hide it.
Hits production every multi-row SELECT > 1024 rows. dbt models incrementally built on SELECT (not aggregations) of large tables may have silently lost rows for as long as this code has been live.

Test

New TestRowsToRecordNoRowsLostAtBatchBoundary covers:

single batch exact (1024, no transition)
one row over boundary (1025)
two batches exact (2048)
three batches + partial (3500)
production case (12617)

Walks the seen-set, reports first dropped id. On the broken code: delivered 12605 rows, expected 12617 (first dropped id = 1024). On the fix: all cases pass.

Test plan

New regression test fails before the fix, passes after.
Full duckdbservice package tests pass.
Reviewer to confirm no other call sites of RowsToRecord depend on the old (buggy) cursor-advance behavior.
After merge: tell Marce + dbt owners — any model incrementally built off a multi-row SELECT of a >1024-row table needs a full rebuild.

The loop condition `rows.Next() && count < batchSize` advanced the underlying cursor once after the final scan; that row was silently dropped when the caller asked for the next batch. Unbounded SELECTs crossing the 1024-row batch boundary lost one row per transition (deterministic, ORDER BY-stable). COUNT(*) returned the parquet metadata count, hiding the discrepancy from aggregation queries. Reported by Marce Coll on dbt_marce.credit_purchase_events (COUNT(*) = 12617, SELECT = 12605) and dbt.usage_allocation (2,689,942 vs 2,687,758). WHERE filters still found the rows. Swap the order: `count < batchSize && rows.Next()` so Next() is not called when the batch is already full. Add regression test covering the production case.

mcoll-posthog approved these changes Jun 5, 2026

View reviewed changes

benben merged commit 663706f into main Jun 5, 2026
24 checks passed

benben deleted the ben/fix-rows-to-record-skip branch June 5, 2026 10:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(arrow): drop-row at batch boundary in RowsToRecord#685

fix(arrow): drop-row at batch boundary in RowsToRecord#685
benben merged 1 commit into
mainfrom
ben/fix-rows-to-record-skip

benben commented Jun 5, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

benben commented Jun 5, 2026

Summary

Repro (prod)

Why nobody noticed earlier

Test

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants