Description
Problem
Summary stats are serialized via `sd_to_parquet_b64()` in `serialization_utils.py`. Because each column's stats dict has mixed types (strings for `dtype`, floats for `mean`, lists for `histogram_bins`), the current approach JSON-encodes every cell value to a string before writing to Parquet:

```python
json_sd[col] = {k: _json_encode_cell(v) for k, v in stats.items()}
```

The JS side then `JSON.parse`s every cell back in `parseParquetRow()`:

```js
parsed[key] = JSON.parse(val);
```

This defeats the purpose of Parquet. We get none of the benefits (type preservation, compact binary encoding for numbers) because every value is a JSON string stuffed inside a Parquet string column. It also creates a fragile JSON-in-binary-in-base64 layer cake, and `parseParquetRow` has to handle the ambiguity of "is this string a real string, or a JSON-encoded number/list?"
This is the root cause of several typing edge cases — BigInt handling, histogram bin arrays, and mixed-type detection all have special-case logic that exists because we threw away type information during encoding.
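A minimal, self-contained demo of the problem described above (using `json.dumps` as a stand-in for the real `_json_encode_cell`, which is assumed to behave similarly):

```python
import json

# One column's summary stats: mixed types (string, float, list).
cells = {"dtype": "int64", "mean": 42.5, "histogram_bins": [1, 2, 3]}

# Current approach: every cell becomes a JSON string before Parquet sees it.
encoded = {k: json.dumps(v) for k, v in cells.items()}

# All original type information is gone; every value is now a string.
assert all(isinstance(v, str) for v in encoded.values())

# The decoder has no choice but to JSON.parse every cell and guess what it was.
decoded = {k: json.loads(v) for k, v in encoded.items()}
assert decoded == cells  # round-trips, but only via per-cell parsing
```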
Proposed fix: one Parquet column per cell
Instead of encoding the summary stats as a DataFrame where columns map to DataFrame columns and rows map to stat names (creating mixed-type columns), transpose the layout:
Current layout (mixed types per column):

```
index          | a        | b         | c
---------------+----------+-----------+----------
dtype          | "int64"  | "float64" | "object"
mean           | 42.5     | 3.14      | null
histogram_bins | [1,2,3]  | [0.1,0.2] | null
```
Proposed layout (one column per cell, homogeneous types):

```
a__dtype:          ["int64"]   → string column
a__mean:           [42.5]      → float column
a__histogram_bins: [[1,2,3]]   → still needs JSON, but only for list/object values
b__dtype:          ["float64"] → string column
b__mean:           [3.14]      → float column
...
```
Each Parquet column has exactly one row and a single type. Scalars (numbers, strings, bools) go through Parquet natively — no JSON encoding needed. Only genuinely complex values (lists, dicts) still need JSON encoding, and those can be explicitly tagged rather than requiring JSON.parse on every cell.
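The encoding step could be sketched like this. Everything here is hypothetical: the function name `flatten_summary_stats`, the `col__stat` key scheme, and the `__is_json` tag column are illustrations of the proposal, not existing code.

```python
import json

def flatten_summary_stats(sd):
    """Hypothetical sketch: transpose {col: {stat: value}} into one wide
    single-row dict keyed "col__stat". Scalars pass through as native
    Parquet values; only lists/dicts fall back to JSON, and those get an
    explicit tag column so the decoder never has to guess."""
    row = {}
    for col, stats in sd.items():
        for stat, value in stats.items():
            key = f"{col}__{stat}"
            if isinstance(value, (list, dict)):
                row[key] = json.dumps(value)
                row[f"{key}__is_json"] = True  # explicit tag, not guessed
            else:
                row[key] = value  # native Parquet scalar, no JSON
    return row

sd = {
    "a": {"dtype": "int64", "mean": 42.5, "histogram_bins": [1, 2, 3]},
    "b": {"dtype": "float64", "mean": 3.14},
}
row = flatten_summary_stats(sd)
# row["a__mean"] is still a float; only histogram_bins became a JSON string
```

This wide dict maps directly onto a one-row Parquet table where each column carries a single, honest type.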
On the JS side, the decoder would pivot the wide single-row DataFrame back into the {col: {stat: value}} structure that the components expect. This is a simple reshape, not a per-cell type-guessing parse.
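For symmetry, the inverse reshape can be sketched as well (in Python here, though the real decoder would live in TypeScript in `resolveDFData.ts`). The name `unflatten_row` and the `__is_json` tag scheme are hypothetical, and this sketch assumes column names do not themselves contain `"__"`.

```python
import json

def unflatten_row(row):
    """Hypothetical inverse reshape: pivot the wide single-row layout back
    into {col: {stat: value}}. Keys split on the first "__"; only values
    explicitly tagged as JSON get parsed, everything else passes through."""
    sd = {}
    json_tags = {k for k, v in row.items() if k.endswith("__is_json") and v}
    for key, value in row.items():
        if key.endswith("__is_json"):
            continue  # tag columns are metadata, not stats
        col, stat = key.split("__", 1)
        if f"{key}__is_json" in json_tags:
            value = json.loads(value)  # the only JSON.parse left
        sd.setdefault(col, {})[stat] = value
    return sd

row = {
    "a__dtype": "int64",
    "a__mean": 42.5,
    "a__histogram_bins": "[1, 2, 3]",
    "a__histogram_bins__is_json": True,
    "b__dtype": "float64",
    "b__mean": 3.14,
}
sd = unflatten_row(row)
```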
Benefits
- Numbers stay as numbers through the entire pipeline; no `JSON.parse("42.5")` → `number` round-trip
- Parquet compresses typed columns better than JSON strings
- Removes the ambiguity in `parseParquetRow` about whether a string is a real string or encoded JSON
- Histogram bins (arrays of floats) could potentially use Parquet's native LIST type instead of JSON strings
- Smaller payload for numeric-heavy summary stats
Files involved
- `buckaroo/serialization_utils.py`: `sd_to_parquet_b64()`, `_json_encode_cell()`
- `packages/buckaroo-js-core/src/components/DFViewerParts/resolveDFData.ts`: `parseParquetRow()`