feat: wide-column layout for summary stats encoding #648

Open

paddymul wants to merge 11 commits into main from feat/wide-column-summary-stats

Conversation

@paddymul
Collaborator

Summary

  • Replace JSON-in-Parquet summary stats encoding with one-column-per-cell layout (#646)
  • Each parquet column is col__stat (e.g. a__mean, b__histogram) with a single row
  • Cell values are JSON-encoded per column (an earlier revision wrote scalars through parquet natively, but pyarrow's per-column overhead on 215 heterogeneous columns was 18x slower, so per-cell JSON encoding was kept — see commit history below)
  • JS pivotWideSummaryStats() converts wide row back to row-based DFData that all downstream consumers expect
  • NaN → parquet null (via _to_python_native)
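
The wide-to-row pivot described above can be sketched roughly as follows. This is a minimal illustration, not the actual pivotWideSummaryStats implementation: splitting on the last `__` and the try/catch fallback around JSON.parse are assumptions about how the real code handles these cases.

```typescript
// Minimal sketch of the wide→row pivot: one single-row table with
// col__stat keys becomes one row per stat, keyed by an 'index' field.
type DFRow = Record<string, unknown>;

function pivotWideSummaryStats(wideRow: DFRow): DFRow[] {
  const statRows: Record<string, DFRow> = {};
  for (const [key, rawVal] of Object.entries(wideRow)) {
    const sep = key.lastIndexOf('__');
    if (sep === -1) continue; // not a col__stat key
    const col = key.slice(0, sep);
    const stat = key.slice(sep + 2);
    let val: unknown = rawVal;
    if (typeof rawVal === 'string') {
      // lists/dicts (histograms, value_counts) arrive JSON-encoded
      try {
        val = JSON.parse(rawVal);
      } catch {
        // plain string scalar: keep as-is
      }
    }
    if (!statRows[stat]) statRows[stat] = { index: stat };
    statRows[stat][col] = val;
  }
  return Object.values(statRows);
}
```

For example, `{ a__mean: 1.5, a__histogram: '[1,2]', b__mean: 2 }` pivots into two rows: `{ index: 'mean', a: 1.5, b: 2 }` and `{ index: 'histogram', a: [1, 2] }`.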

Test plan

  • Python: 9/9 tests pass (test_sd_to_parquet_b64.py)
  • Python: 57/57 serialization + widget tests pass
  • JS: 12/12 resolveDFData tests pass (wide-format fixture, pivot, async decode)
  • JS: 26/26 gridUtils tests pass (extractSDFT, extractPinnedRows unchanged)
  • Storybook visual check: pinned rows render correctly
  • CI green

Closes #646

🤖 Generated with Claude Code

@github-actions
Contributor

github-actions bot commented Mar 21, 2026

📦 TestPyPI package published

pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.13.2.dev23394190560

or with uv:

uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.13.2.dev23394190560

MCP server for Claude Code

claude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.13.2.dev23394190560" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table

📖 Docs preview

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e9133c54e3


Comment on lines +63 to +69
// JSON-parse string values that are JSON arrays/objects
if (typeof val === 'string') {
  try {
    const parsed = JSON.parse(val);
    if (typeof parsed === 'object' && parsed !== null) {
      statCols[stat][col] = parsed;
      continue;

P2: Preserve scalar strings that happen to be valid JSON

In the wide summary-stats format, Python now writes scalar strings natively, so values like most_freq='{"a":1}' or most_freq='[]' arrive here as ordinary strings. This block unconditionally JSON.parses any string whose parse result is an object/array, which silently turns those legitimate categorical values into objects/arrays on the JS side. That changes the displayed summary stats for string columns whose frequent values look like JSON.

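A later commit in this PR sidesteps this ambiguity by keeping per-cell JSON encoding on the Python side, so every string in the wide payload is JSON-encoded exactly once and a single JSON.parse round-trips scalar strings unchanged. A sketch of that round-trip property — encodeCell/decodeCell are illustrative stand-ins, not Buckaroo's actual helpers:

```typescript
// If every cell is JSON-encoded exactly once on the producer side,
// one JSON.parse on the consumer side round-trips all values,
// including scalar strings that happen to look like JSON.
function encodeCell(v: unknown): string {
  return JSON.stringify(v); // stands in for Python's _json_encode_cell
}

function decodeCell(s: string): unknown {
  return JSON.parse(s); // no "does this look like JSON?" heuristic needed
}

// A categorical value like most_freq='{"a":1}' survives as a string...
const scalar = decodeCell(encodeCell('{"a":1}')); // → the string '{"a":1}'
// ...while a real histogram list decodes back to a list.
const list = decodeCell(encodeCell([1, 2, 3]));   // → [1, 2, 3]
```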

Comment on lines +102 to +105
function isWideFormat(rows: any[]): boolean {
  if (rows.length !== 1) return false;
  const keys = Object.keys(rows[0]);
  return keys.some(k => k.indexOf('__') !== -1);
}


P2: Narrow wide-format detection before pivoting single-row payloads

resolveDFData*() is shared by summary stats and ordinary df_data parquet payloads. With this heuristic, any one-row table whose column names contain __ (for example a user column like user__id) is treated as wide summary stats and passed into pivotWideSummaryStats(), which rewrites the row into {index: 'id', user: ...} and drops the original shape. Small static embeds and other one-row parquet tables will therefore decode incorrectly.

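Later commits in this PR replace the heuristic with an explicit tag (layout='wide' from sd_to_parquet_b64, layout='row' from _df_to_parquet_b64_tagged). A sketch of tag-based detection; the payload interface here is assumed for illustration:

```typescript
// Decide wide vs row layout from an explicit tag instead of guessing
// from "__" in column names. The payload shape is an assumption; the
// real code reads the tag written alongside the parquet bytes.
interface TaggedParquetPayload {
  layout?: 'wide' | 'row'; // written by the Python encoder
  rows: Record<string, unknown>[];
}

function isWideFormat(payload: TaggedParquetPayload): boolean {
  // Only an explicit 'wide' tag triggers the pivot, so a one-row table
  // with a column like user__id is no longer misdetected.
  return payload.layout === 'wide';
}
```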

paddymul and others added 2 commits March 21, 2026 18:05
Replace JSON-in-Parquet summary stats with one-column-per-cell layout.
Each parquet column is named col__stat (e.g. a__mean, b__histogram).
Scalars go through parquet natively; only lists/dicts are JSON-encoded.

JS side pivots the wide single-row back to row-based DFData that all
downstream consumers (extractSDFT, extractPinnedRows, AG-Grid) expect.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- pyarrow already handles numpy scalars (float64, int64, bool_, nan)
- Replace _to_python_native with _is_complex_for_parquet check
- Fix pd.Series.to_dict() crash on unhashable values (fall back to to_list)
- Update _resolve_all_stats test helpers to handle wide-column format

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@paddymul force-pushed the feat/wide-column-summary-stats branch from f55ce9a to f6ba18c on March 21, 2026 at 22:05
- Use pd.DataFrame.to_parquet(index=False) instead of pa.table() —
  avoids per-column pyarrow array overhead that caused perf test failure
- NaN becomes null through pandas parquet path (update tests)
- Fix third copy of _resolve_all_stats in test_widget_weird_types.py

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
paddymul and others added 2 commits March 21, 2026 18:23
The native-type approach (pa.table with 215 heterogeneous columns) was
18x slower than the old JSON-string approach due to pyarrow per-column
overhead. Keep JSON encoding for all cells (same speed as old code) but
in the wide col__stat layout. JS pivotWideSummaryStats JSON-parses all
string values.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Use pa.table() + pq.write_table() instead of pd.DataFrame.to_parquet()
- Skip pre-converting Series/ndarray — let _json_encode_cell handle them
  via default=str (same as old code, no perf regression)
- Fix TS2588: use let instead of const for val in pivot loop
- Regenerate test fixture

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Static embed Playwright tests failed because main DataFrame parquet
(non-wide, multiple rows) was no longer being JSON-parsed. Restored
parseParquetRow for the non-wide code path.

Also use pa.table() + pq.write_table() directly for wide format
(6.8ms vs 56ms through pandas) to avoid perf regression.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Temporary debug commit to understand why static embed Playwright tests
fail (AG-Grid never renders). Logs artifact shape, decode results, and
captures browser console/errors in the first test.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…error

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The async path was skipping parseParquetRow for non-wide data, so
BigInt values from hyparquet survived into the rendered data and
caused "Do not know how to serialize a BigInt" on JSON.stringify.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Python: sd_to_parquet_b64 now returns layout='wide',
  _df_to_parquet_b64_tagged returns layout='row'
- TS: isWideFormat uses the layout tag when present, falls back to
  the __ heuristic for old payloads without the tag
- Fixes false positive where a one-row table with __ in column names
  (e.g. user__id) would be misdetected as wide summary stats
- Fixes misleading docstring that said scalars go through parquet
  natively (they're all JSON-encoded via _json_encode_cell)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
No pre-existing payloads lack the layout field, so the __-based
heuristic was unnecessary complexity and a false-positive risk.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>