feat: wide-column layout for summary stats encoding #648

Open

paddymul wants to merge 11 commits into main from feat/wide-column-summary-stats

Conversation

@paddymul
Collaborator

Summary

  • Replace JSON-in-Parquet summary stats encoding with one-column-per-cell layout (#646)
  • Each parquet column is col__stat (e.g. a__mean, b__histogram) with a single row
  • Cell values are JSON-encoded per column (an earlier revision wrote scalars through parquet natively, but pyarrow's per-column overhead on 215 heterogeneous columns was 18x slower, so per-cell JSON encoding was kept — see commit history below)
  • JS pivotWideSummaryStats() converts wide row back to row-based DFData that all downstream consumers expect
  • NaN → parquet null (via _to_python_native)
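
The wide-to-row pivot described above can be sketched roughly as follows. This is a minimal illustration, not the actual pivotWideSummaryStats implementation: splitting on the last `__` and the try/catch fallback around JSON.parse are assumptions about how the real code handles these cases.

```typescript
// Minimal sketch of the wide→row pivot: one single-row table with
// col__stat keys becomes one row per stat, keyed by an 'index' field.
type DFRow = Record<string, unknown>;

function pivotWideSummaryStats(wideRow: DFRow): DFRow[] {
  const statRows: Record<string, DFRow> = {};
  for (const [key, rawVal] of Object.entries(wideRow)) {
    const sep = key.lastIndexOf('__');
    if (sep === -1) continue; // not a col__stat key
    const col = key.slice(0, sep);
    const stat = key.slice(sep + 2);
    let val: unknown = rawVal;
    if (typeof rawVal === 'string') {
      // lists/dicts (histograms, value_counts) arrive JSON-encoded
      try {
        val = JSON.parse(rawVal);
      } catch {
        // plain string scalar: keep as-is
      }
    }
    if (!statRows[stat]) statRows[stat] = { index: stat };
    statRows[stat][col] = val;
  }
  return Object.values(statRows);
}
```

For example, `{ a__mean: 1.5, a__histogram: '[1,2]', b__mean: 2 }` pivots into two rows: `{ index: 'mean', a: 1.5, b: 2 }` and `{ index: 'histogram', a: [1, 2] }`.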

Test plan

  • Python: 9/9 tests pass (test_sd_to_parquet_b64.py)
  • Python: 57/57 serialization + widget tests pass
  • JS: 12/12 resolveDFData tests pass (wide-format fixture, pivot, async decode)
  • JS: 26/26 gridUtils tests pass (extractSDFT, extractPinnedRows unchanged)
  • Storybook visual check: pinned rows render correctly
  • CI green

Closes #646

🤖 Generated with Claude Code

@github-actions
Contributor

github-actions bot commented Mar 21, 2026

📦 TestPyPI package published

pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.13.2.dev23394190560

or with uv:

uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.13.2.dev23394190560

MCP server for Claude Code

claude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.13.2.dev23394190560" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table

📖 Docs preview

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e9133c54e3


Comment on lines +63 to +69
// JSON-parse string values that are JSON arrays/objects
if (typeof val === 'string') {
  try {
    const parsed = JSON.parse(val);
    if (typeof parsed === 'object' && parsed !== null) {
      statCols[stat][col] = parsed;
      continue;

P2: Preserve scalar strings that happen to be valid JSON

In the wide summary-stats format, Python now writes scalar strings natively, so values like most_freq='{"a":1}' or most_freq='[]' arrive here as ordinary strings. This block unconditionally JSON.parses any string whose parse result is an object/array, which silently turns those legitimate categorical values into objects/arrays on the JS side. That changes the displayed summary stats for string columns whose frequent values look like JSON.

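A later commit in this PR sidesteps this ambiguity by keeping per-cell JSON encoding on the Python side, so every string in the wide payload is JSON-encoded exactly once and a single JSON.parse round-trips scalar strings unchanged. A sketch of that round-trip property — encodeCell/decodeCell are illustrative stand-ins, not Buckaroo's actual helpers:

```typescript
// If every cell is JSON-encoded exactly once on the producer side,
// one JSON.parse on the consumer side round-trips all values,
// including scalar strings that happen to look like JSON.
function encodeCell(v: unknown): string {
  return JSON.stringify(v); // stands in for Python's _json_encode_cell
}

function decodeCell(s: string): unknown {
  return JSON.parse(s); // no "does this look like JSON?" heuristic needed
}

// A categorical value like most_freq='{"a":1}' survives as a string...
const scalar = decodeCell(encodeCell('{"a":1}')); // → the string '{"a":1}'
// ...while a real histogram list decodes back to a list.
const list = decodeCell(encodeCell([1, 2, 3]));   // → [1, 2, 3]
```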

Comment on lines +102 to +105
function isWideFormat(rows: any[]): boolean {
  if (rows.length !== 1) return false;
  const keys = Object.keys(rows[0]);
  return keys.some(k => k.indexOf('__') !== -1);
}


P2: Narrow wide-format detection before pivoting single-row payloads

resolveDFData*() is shared by summary stats and ordinary df_data parquet payloads. With this heuristic, any one-row table whose column names contain __ (for example a user column like user__id) is treated as wide summary stats and passed into pivotWideSummaryStats(), which rewrites the row into {index: 'id', user: ...} and drops the original shape. Small static embeds and other one-row parquet tables will therefore decode incorrectly.

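Later commits in this PR replace the heuristic with an explicit tag (layout='wide' from sd_to_parquet_b64, layout='row' from _df_to_parquet_b64_tagged). A sketch of tag-based detection; the payload interface here is assumed for illustration:

```typescript
// Decide wide vs row layout from an explicit tag instead of guessing
// from "__" in column names. The payload shape is an assumption; the
// real code reads the tag written alongside the parquet bytes.
interface TaggedParquetPayload {
  layout?: 'wide' | 'row'; // written by the Python encoder
  rows: Record<string, unknown>[];
}

function isWideFormat(payload: TaggedParquetPayload): boolean {
  // Only an explicit 'wide' tag triggers the pivot, so a one-row table
  // with a column like user__id is no longer misdetected.
  return payload.layout === 'wide';
}
```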

paddymul and others added 2 commits March 21, 2026 18:05
Replace JSON-in-Parquet summary stats with one-column-per-cell layout.
Each parquet column is named col__stat (e.g. a__mean, b__histogram).
Scalars go through parquet natively; only lists/dicts are JSON-encoded.

JS side pivots the wide single-row back to row-based DFData that all
downstream consumers (extractSDFT, extractPinnedRows, AG-Grid) expect.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- pyarrow already handles numpy scalars (float64, int64, bool_, nan)
- Replace _to_python_native with _is_complex_for_parquet check
- Fix pd.Series.to_dict() crash on unhashable values (fall back to to_list)
- Update _resolve_all_stats test helpers to handle wide-column format

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@paddymul force-pushed the feat/wide-column-summary-stats branch from f55ce9a to f6ba18c on March 21, 2026 at 22:05
- Use pd.DataFrame.to_parquet(index=False) instead of pa.table() —
  avoids per-column pyarrow array overhead that caused perf test failure
- NaN becomes null through pandas parquet path (update tests)
- Fix third copy of _resolve_all_stats in test_widget_weird_types.py

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
paddymul and others added 2 commits March 21, 2026 18:23
The native-type approach (pa.table with 215 heterogeneous columns) was
18x slower than the old JSON-string approach due to pyarrow per-column
overhead. Keep JSON encoding for all cells (same speed as old code) but
in the wide col__stat layout. JS pivotWideSummaryStats JSON-parses all
string values.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Use pa.table() + pq.write_table() instead of pd.DataFrame.to_parquet()
- Skip pre-converting Series/ndarray — let _json_encode_cell handle them
  via default=str (same as old code, no perf regression)
- Fix TS2588: use let instead of const for val in pivot loop
- Regenerate test fixture

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Static embed Playwright tests failed because main DataFrame parquet
(non-wide, multiple rows) was no longer being JSON-parsed. Restored
parseParquetRow for the non-wide code path.

Also use pa.table() + pq.write_table() directly for wide format
(6.8ms vs 56ms through pandas) to avoid perf regression.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Temporary debug commit to understand why static embed Playwright tests
fail (AG-Grid never renders). Logs artifact shape, decode results, and
captures browser console/errors in the first test.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…error

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The async path was skipping parseParquetRow for non-wide data, so
BigInt values from hyparquet survived into the rendered data and
caused "Do not know how to serialize a BigInt" on JSON.stringify.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Python: sd_to_parquet_b64 now returns layout='wide',
  _df_to_parquet_b64_tagged returns layout='row'
- TS: isWideFormat uses the layout tag when present, falls back to
  the __ heuristic for old payloads without the tag
- Fixes false positive where a one-row table with __ in column names
  (e.g. user__id) would be misdetected as wide summary stats
- Fixes misleading docstring that said scalars go through parquet
  natively (they're all JSON-encoded via _json_encode_cell)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
No pre-existing payloads lack the layout field, so the __-based
heuristic was unnecessary complexity and a false-positive risk.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>