Description
Problem
Summary stats are serialized via `sd_to_parquet_b64()` in `serialization_utils.py`. Because each column's stats dict has mixed types (strings for `dtype`, floats for `mean`, lists for `histogram_bins`), the current approach JSON-encodes every cell value to a string before writing to Parquet:

```python
json_sd[col] = {k: _json_encode_cell(v) for k, v in stats.items()}
```

The JS side then `JSON.parse`s every cell back in `parseParquetRow()`:

```js
parsed[key] = JSON.parse(val);
```

This defeats the purpose of Parquet. We get none of the benefits (type preservation, compact binary encoding for numbers) because every value is a JSON string stuffed inside a Parquet string column. It also creates a fragile JSON-in-binary-in-base64 layer cake, and `parseParquetRow` has to handle the ambiguity of "is this string a real string, or a JSON-encoded number/list?"
This is the root cause of several typing edge cases — BigInt handling, histogram bin arrays, and mixed-type detection all have special-case logic that exists because we threw away type information during encoding.
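A minimal, self-contained demo of the problem described above (using `json.dumps` as a stand-in for the real `_json_encode_cell`, which is assumed to behave similarly):

```python
import json

# One column's summary stats: mixed types (string, float, list).
cells = {"dtype": "int64", "mean": 42.5, "histogram_bins": [1, 2, 3]}

# Current approach: every cell becomes a JSON string before Parquet sees it.
encoded = {k: json.dumps(v) for k, v in cells.items()}

# All original type information is gone; every value is now a string.
assert all(isinstance(v, str) for v in encoded.values())

# The decoder has no choice but to JSON.parse every cell and guess what it was.
decoded = {k: json.loads(v) for k, v in encoded.items()}
assert decoded == cells  # round-trips, but only via per-cell parsing
```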
Proposed fix: one Parquet column per cell
Instead of encoding the summary stats as a DataFrame where columns map to DataFrame columns and rows map to stat names (creating mixed-type columns), transpose the layout:
Current layout (mixed types per column):

```
index          | a        | b         | c
---------------+----------+-----------+----------
dtype          | "int64"  | "float64" | "object"
mean           | 42.5     | 3.14      | null
histogram_bins | [1,2,3]  | [0.1,0.2] | null
```
Proposed layout (one column per cell, homogeneous types):

```
a__dtype:          ["int64"]   → string column
a__mean:           [42.5]      → float column
a__histogram_bins: [[1,2,3]]   → still needs JSON, but only for list/object values
b__dtype:          ["float64"] → string column
b__mean:           [3.14]      → float column
...
```
Each Parquet column has exactly one row and a single type. Scalars (numbers, strings, bools) go through Parquet natively — no JSON encoding needed. Only genuinely complex values (lists, dicts) still need JSON encoding, and those can be explicitly tagged rather than requiring JSON.parse on every cell.
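The encoding step could be sketched like this. Everything here is hypothetical: the function name `flatten_summary_stats`, the `col__stat` key scheme, and the `__is_json` tag column are illustrations of the proposal, not existing code.

```python
import json

def flatten_summary_stats(sd):
    """Hypothetical sketch: transpose {col: {stat: value}} into one wide
    single-row dict keyed "col__stat". Scalars pass through as native
    Parquet values; only lists/dicts fall back to JSON, and those get an
    explicit tag column so the decoder never has to guess."""
    row = {}
    for col, stats in sd.items():
        for stat, value in stats.items():
            key = f"{col}__{stat}"
            if isinstance(value, (list, dict)):
                row[key] = json.dumps(value)
                row[f"{key}__is_json"] = True  # explicit tag, not guessed
            else:
                row[key] = value  # native Parquet scalar, no JSON
    return row

sd = {
    "a": {"dtype": "int64", "mean": 42.5, "histogram_bins": [1, 2, 3]},
    "b": {"dtype": "float64", "mean": 3.14},
}
row = flatten_summary_stats(sd)
# row["a__mean"] is still a float; only histogram_bins became a JSON string
```

This wide dict maps directly onto a one-row Parquet table where each column carries a single, honest type.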
On the JS side, the decoder would pivot the wide single-row DataFrame back into the {col: {stat: value}} structure that the components expect. This is a simple reshape, not a per-cell type-guessing parse.
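For symmetry, the inverse reshape can be sketched as well (in Python here, though the real decoder would live in TypeScript in `resolveDFData.ts`). The name `unflatten_row` and the `__is_json` tag scheme are hypothetical, and this sketch assumes column names do not themselves contain `"__"`.

```python
import json

def unflatten_row(row):
    """Hypothetical inverse reshape: pivot the wide single-row layout back
    into {col: {stat: value}}. Keys split on the first "__"; only values
    explicitly tagged as JSON get parsed, everything else passes through."""
    sd = {}
    json_tags = {k for k, v in row.items() if k.endswith("__is_json") and v}
    for key, value in row.items():
        if key.endswith("__is_json"):
            continue  # tag columns are metadata, not stats
        col, stat = key.split("__", 1)
        if f"{key}__is_json" in json_tags:
            value = json.loads(value)  # the only JSON.parse left
        sd.setdefault(col, {})[stat] = value
    return sd

row = {
    "a__dtype": "int64",
    "a__mean": 42.5,
    "a__histogram_bins": "[1, 2, 3]",
    "a__histogram_bins__is_json": True,
    "b__dtype": "float64",
    "b__mean": 3.14,
}
sd = unflatten_row(row)
```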
Benefits
- Numbers stay as numbers through the entire pipeline; no `JSON.parse("42.5")` → `number` round-trip
- Parquet compresses typed columns better than JSON strings
- Removes the ambiguity in `parseParquetRow` about whether a string is a real string or encoded JSON
- Histogram bins (arrays of floats) could potentially use Parquet's native LIST type instead of JSON strings
- Smaller payload for numeric-heavy summary stats
Files involved
- `buckaroo/serialization_utils.py`: `sd_to_parquet_b64()`, `_json_encode_cell()`
- `packages/buckaroo-js-core/src/components/DFViewerParts/resolveDFData.ts`: `parseParquetRow()`