## Motivation
Buckaroo currently uses Parquet as the binary transport between Python and JS:
- Python: `fastparquet`/`pyarrow` → Parquet bytes
- JS: `hyparquet` (~10 KB) decodes in the browser
This works, but fastparquet only operates on pandas DataFrames and relies on pandas internal APIs (`pandas._libs.tslibs`, `pandas._libs.json`). This is the #1 blocker for making pandas an optional dependency (#533-adjacent).
Switching the transport format to Arrow IPC with flechette as the JS decoder would:
- Make the serialization layer backend-agnostic (pyarrow works with `pa.Table` directly; no pandas needed)
- Drop `fastparquet` from core dependencies
- Enable zero-copy typed arrays on the JS side
- Potentially improve deserialization performance (7-11x faster row extraction vs apache-arrow JS)
## Research summary
Full writeup: `docs/flechette-arrow-ipc-research.md`
### flechette (`@uwdata/flechette`)
Built by the UW Interactive Data Lab (Heer et al., the D3/Vega/Mosaic group). Used in production by Mosaic, Arquero v7, and vega-loader-arrow.
| | hyparquet (current) | flechette | apache-arrow JS |
|---|---|---|---|
| Size (min+gz) | 10 KB | 14 KB | 43 KB |
| Dependencies | 0 | 0 | Multiple |
| Reads Parquet | Yes | No | No |
| Reads Arrow IPC | No | Yes | Yes |
| Zero-copy | No | Yes | Yes |
| Tree-shaking | Works | Works | Broken |
| Row extraction speed | Baseline | 7-11x vs apache-arrow | Baseline |
apache-arrow JS is not recommended: 3x the bundle size, tree-shaking that has been broken for years (open JIRA issue), and it still can't read Parquet.
## What the Python side looks like
pyarrow is already a core dependency. Both pandas and polars produce pyarrow Tables:
```python
import pyarrow as pa
import pyarrow.ipc as ipc

# From pandas
table = pa.Table.from_pandas(df)

# From polars
table = polars_df.to_arrow()

# Serialize to IPC bytes
sink = pa.BufferOutputStream()
writer = ipc.new_stream(sink, table.schema)
writer.write_table(table)
writer.close()
raw_bytes = sink.getvalue().to_pybytes()
```

This replaces `to_parquet()` (fastparquet, pandas-only) with a backend-agnostic path.
## What the JS side looks like
```js
import { tableFromIPC } from '@uwdata/flechette';

// Infinite scroll (binary buffer from model.send)
const scrollTable = tableFromIPC(buffer);
const scrollRows = scrollTable.toArray(); // [{col1: val, col2: val}, ...]

// Summary stats (base64 payload)
const statsTable = tableFromIPC(b64ToArrayBuffer(payload.data));
const statsRows = statsTable.toArray();
```

### Mixed-type columns (summary stats)
Current approach: pre-JSON-encode each cell into a string column, then `JSON.parse` each value on the JS side. The same strategy works identically with Arrow IPC.
## Proposed plan
### Phase 1 (minimal, unblocks optional pandas)
Swap the Python serialization from fastparquet to pyarrow. Keep hyparquet on the JS side: pyarrow-produced Parquet is readable by hyparquet, so this requires zero JS changes.
### Phase 2 (performance + clean architecture)
Switch transport from Parquet to Arrow IPC. Replace hyparquet with flechette (+4 KB bundle). Enables zero-copy typed arrays and faster deserialization.
## Bundle impact
+4 KB gzipped (Phase 2 only). Phase 1 has zero JS impact.
## Files affected
Python (both phases):
- `buckaroo/serialization_utils.py`: main serialization logic
- `buckaroo/buckaroo_widget.py`: calls `to_parquet()` for infinite scroll
- `pyproject.toml`: move fastparquet to the `[pandas]` extra
JS (Phase 2 only):
- `packages/buckaroo-js-core/src/components/DFViewerParts/resolveDFData.ts`
- `packages/js/widget.tsx`: buffer handling in infinite scroll
- `packages/buckaroo-js-core/package.json`: swap hyparquet → flechette