docs: clarify origin of MAX_ARROW_BATCH_ROWS constant

cpsievert · claude · cpsievert · commit 24c938c2cd36 · 2026-02-17T18:54:25.000-06:00
Document where STANDARD_VECTOR_SIZE is defined in DuckDB's C++ source,
explain the failure mode, and note that the chunking is merely
conservative (not harmful) if the upstream constant increases; a
decrease would bring the panic back.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
diff --git a/src/reader/duckdb.rs b/src/reader/duckdb.rs
@@ -516,10 +516,23 @@ impl Reader for DuckDBReader {
             )));
         }
 
-        // DuckDB's Arrow virtual table function (in duckdb-rs) writes an entire
-        // RecordBatch into a single DataChunk whose vectors have a fixed capacity
-        // of STANDARD_VECTOR_SIZE (2048). Passing a RecordBatch with more rows
-        // causes a panic. Work around this by chunking large DataFrames.
+        // Workaround for a duckdb-rs limitation (not a DuckDB limitation).
+        //
+        // duckdb-rs's `ArrowVTab` writes each RecordBatch into a single DuckDB
+        // `DataChunk`, which has a fixed capacity of `STANDARD_VECTOR_SIZE`.
+        // That constant is defined in DuckDB's C++ source at
+        // `src/include/duckdb/common/constants.hpp` and is currently 2048.
+        // When a RecordBatch exceeds this, `FlatVector::copy` panics with
+        // `assertion failed: data.len() <= self.capacity()`.
+        //
+        // We chunk large DataFrames to stay within this limit. The first chunk
+        // creates the table (letting DuckDB infer the schema from Arrow), and
+        // subsequent chunks INSERT into it.
+        //
+        // If `STANDARD_VECTOR_SIZE` increases upstream, this limit becomes
+        // unnecessarily conservative (more SQL round-trips than needed).
+        // If it decreases, the panic would resurface — but that constant
+        // has been 2048 since at least DuckDB 0.8 and is unlikely to shrink.
         const MAX_ARROW_BATCH_ROWS: usize = 2048;
         let total_rows = df.height();
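
The chunking arithmetic described in the comment can be sketched on its own, outside the reader. This is a minimal illustration and not the code in `src/reader/duckdb.rs`: `chunk_ranges` is a hypothetical helper, and it only computes the `(offset, len)` slices. Under the scheme the comment describes, the first slice would be used to create the table and each later slice would be inserted into it.

```rust
/// Upper bound on rows per batch, mirroring the constant in the diff above.
const MAX_ARROW_BATCH_ROWS: usize = 2048;

/// Hypothetical helper (not part of the commit): split `total_rows` into
/// `(offset, len)` slices of at most `max_rows` rows each.
/// An input of zero rows yields no slices.
fn chunk_ranges(total_rows: usize, max_rows: usize) -> Vec<(usize, usize)> {
    assert!(max_rows > 0, "max_rows must be positive");
    (0..total_rows)
        .step_by(max_rows)
        // The last slice may be shorter than `max_rows`.
        .map(|offset| (offset, max_rows.min(total_rows - offset)))
        .collect()
}
```

For a 5000-row frame, `chunk_ranges(5000, MAX_ARROW_BATCH_ROWS)` gives three slices: two full ones of 2048 rows and a final one of 904 rows. A frame with 2048 rows or fewer gives a single slice, so no follow-up INSERT is needed.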