
Commit 24c938c

cpsievert and claude committed
docs: clarify origin of MAX_ARROW_BATCH_ROWS constant
Document where STANDARD_VECTOR_SIZE comes from in DuckDB's C++ source, explain the failure mode, and note that the chunking is conservative (not harmful) if the upstream constant changes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 2d85c53 commit 24c938c

1 file changed

Lines changed: 17 additions & 4 deletions

src/reader/duckdb.rs

@@ -516,10 +516,23 @@ impl Reader for DuckDBReader {
 )));
 }

-// DuckDB's Arrow virtual table function (in duckdb-rs) writes an entire
-// RecordBatch into a single DataChunk whose vectors have a fixed capacity
-// of STANDARD_VECTOR_SIZE (2048). Passing a RecordBatch with more rows
-// causes a panic. Work around this by chunking large DataFrames.
+// Workaround for a duckdb-rs limitation (not a DuckDB limitation).
+//
+// duckdb-rs's `ArrowVTab` writes each RecordBatch into a single DuckDB
+// `DataChunk`, which has a fixed capacity of `STANDARD_VECTOR_SIZE`.
+// That constant is defined in DuckDB's C++ source at
+// `src/include/duckdb/common/constants.hpp` and is currently 2048.
+// When a RecordBatch exceeds this, `FlatVector::copy` panics with
+// `assertion failed: data.len() <= self.capacity()`.
+//
+// We chunk large DataFrames to stay within this limit. The first chunk
+// creates the table (letting DuckDB infer the schema from Arrow), and
+// subsequent chunks INSERT into it.
+//
+// If `STANDARD_VECTOR_SIZE` increases upstream, this limit becomes
+// unnecessarily conservative (more SQL round-trips than needed).
+// If it decreases, the panic would resurface — but that constant
+// has been 2048 since at least DuckDB 0.8 and is unlikely to shrink.
 const MAX_ARROW_BATCH_ROWS: usize = 2048;
 let total_rows = df.height();
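
The chunking that the new comment describes could look roughly like the sketch below. This is not the code from `src/reader/duckdb.rs`: `ChunkAction` and `plan_chunks` are hypothetical names introduced only for illustration, and the handling of an empty DataFrame (one zero-length create step) is an assumption, not something the commit states. The sketch only computes which row ranges would be sent and whether each one creates the table or inserts into it.

```rust
/// Mirrors the constant in the diff; it must not exceed DuckDB's
/// STANDARD_VECTOR_SIZE (2048 according to the commit comment).
const MAX_ARROW_BATCH_ROWS: usize = 2048;

/// Hypothetical description of what to do with one chunk of rows.
#[derive(Debug, PartialEq)]
enum ChunkAction {
    /// First chunk: create the table so the schema is inferred from Arrow.
    CreateTable { offset: usize, len: usize },
    /// Later chunks: INSERT into the table created by the first chunk.
    InsertInto { offset: usize, len: usize },
}

/// Split `total_rows` into consecutive ranges of at most `max_rows` rows.
fn plan_chunks(total_rows: usize, max_rows: usize) -> Vec<ChunkAction> {
    assert!(max_rows > 0, "max_rows must be positive");

    // Assumption: an empty frame still needs its table created,
    // so emit a single zero-length create step.
    if total_rows == 0 {
        return vec![ChunkAction::CreateTable { offset: 0, len: 0 }];
    }

    (0..total_rows)
        .step_by(max_rows)
        .enumerate()
        .map(|(i, offset)| {
            // The last chunk may be shorter than max_rows.
            let len = max_rows.min(total_rows - offset);
            if i == 0 {
                ChunkAction::CreateTable { offset, len }
            } else {
                ChunkAction::InsertInto { offset, len }
            }
        })
        .collect()
}
```

With `plan_chunks(5000, MAX_ARROW_BATCH_ROWS)`, the plan has three steps: a create covering rows 0..2048, then inserts covering 2048..4096 and 4096..5000. No step exceeds 2048 rows, which is the property the workaround relies on.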
