Evaluate Arrow IPC + flechette as transport format #546

@paddymul

Description

Motivation

Buckaroo currently uses Parquet as the binary transport between Python and JS:

  • Python: fastparquet / pyarrow → Parquet bytes
  • JS: hyparquet (~10 KB) decodes the Parquet bytes in the browser

This works, but fastparquet only operates on pandas DataFrames and uses pandas internal APIs (pandas._libs.tslibs, pandas._libs.json). This is the #1 blocker for making pandas an optional dependency (#533-adjacent).

Switching the transport format to Arrow IPC with flechette as the JS decoder would:

  1. Make the serialization layer backend-agnostic (pyarrow works with pa.Table directly — no pandas needed)
  2. Drop fastparquet from core dependencies
  3. Enable zero-copy typed arrays on the JS side
  4. Potentially improve deserialization performance (7-11x faster row extraction vs apache-arrow JS)

Research summary

Full writeup: docs/flechette-arrow-ipc-research.md

flechette (@uwdata/flechette)

Built by the UW Interactive Data Lab (Heer et al — the D3/Vega/Mosaic group). Used by Mosaic, Arquero v7, and vega-loader-arrow in production.

                          hyparquet (current)   flechette                apache-arrow JS
  Size (min+gz)           10 KB                 14 KB                    43 KB
  Dependencies            0                     0                        Multiple
  Reads Parquet           Yes                   No                       No
  Reads Arrow IPC         No                    Yes                      Yes
  Zero-copy               No                    Yes                      Yes
  Tree-shaking            Works                 Works                    Broken
  Row extraction speed    Baseline              7-11x vs apache-arrow    Baseline

apache-arrow JS is not recommended: 3x the bundle size, broken tree-shaking (a JIRA issue that has been open for years), and it still can't read Parquet.

What the Python side looks like

pyarrow is already a core dependency. Both pandas and polars produce pyarrow Tables:

import pyarrow as pa
import pyarrow.ipc as ipc

# From pandas
table = pa.Table.from_pandas(df)
# From polars
table = polars_df.to_arrow()

# Serialize to IPC bytes
sink = pa.BufferOutputStream()
writer = ipc.new_stream(sink, table.schema)
writer.write_table(table)
writer.close()
raw_bytes = sink.getvalue().to_pybytes()

This replaces to_parquet() (fastparquet, pandas-only) with a backend-agnostic path.

What the JS side looks like

import { tableFromIPC } from '@uwdata/flechette';

// Infinite scroll (binary buffer from model.send)
const scrollTable = tableFromIPC(buffer);
const scrollRows = scrollTable.toArray(); // [{col1: val, col2: val}, ...]

// Summary stats (base64 payload)
const statsTable = tableFromIPC(b64ToArrayBuffer(payload.data));
const statsRows = statsTable.toArray();

Mixed-type columns (summary stats)

Current approach: pre-JSON-encode each cell into a string column on the Python side, then JSON.parse each cell on the JS side. The same strategy works unchanged with Arrow IPC, since it only requires a string column.
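The per-cell encoding strategy can be illustrated independently of the transport format. A stdlib-only sketch (the cell values are invented; real summary stats would come from the stats computation):

```python
import json

# Hypothetical summary-stats column mixing ints, strings, lists, and nulls
cells = [1, "mean", [2.5, 3.5], None]

# Python side: encode each cell to a JSON string → homogeneous string column
encoded = [json.dumps(c) for c in cells]

# JS side (mirrored here in Python): JSON.parse each cell back
decoded = [json.loads(s) for s in encoded]
assert decoded == cells
```

Because every cell becomes a plain string, the column is valid in both Parquet and Arrow IPC without any type gymnastics.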

Proposed plan

Phase 1 (minimal, unblocks optional pandas)

Swap the Python serialization from fastparquet to pyarrow, keeping hyparquet on the JS side. Parquet produced by pyarrow is readable by hyparquet, so this phase requires zero JS changes.

Phase 2 (performance + clean architecture)

Switch transport from Parquet to Arrow IPC. Replace hyparquet with flechette (+4 KB bundle). Enables zero-copy typed arrays and faster deserialization.

Bundle impact

+4 KB gzipped (Phase 2 only). Phase 1 has zero JS impact.

Files affected

Python (both phases):

  • buckaroo/serialization_utils.py — main serialization logic
  • buckaroo/buckaroo_widget.py — calls to_parquet() for infinite scroll
  • pyproject.toml — move fastparquet to [pandas] extra

JS (Phase 2 only):

  • packages/buckaroo-js-core/src/components/DFViewerParts/resolveDFData.ts
  • packages/js/widget.tsx — buffer handling in infinite scroll
  • packages/buckaroo-js-core/package.json — swap hyparquet → flechette
