feat: wide-column layout for summary stats encoding #648
Conversation
📦 TestPyPI package published

```
pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.13.2.dev23394190560
```

or with uv:

```
uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.13.2.dev23394190560
```

MCP server for Claude Code:

```
claude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.13.2.dev23394190560" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table
```

📖 Docs preview
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e9133c54e3
packages/buckaroo-js-core/src/components/DFViewerParts/resolveDFData.ts
```diff
  // JSON-parse string values that are JSON arrays/objects
  if (typeof val === 'string') {
      try {
-         parsed[key] = JSON.parse(val);
+         const parsed = JSON.parse(val);
+         if (typeof parsed === 'object' && parsed !== null) {
+             statCols[stat][col] = parsed;
+             continue;
```
Preserve scalar strings that happen to be valid JSON

In the wide summary-stats format, Python now writes scalar strings natively, so values like `most_freq='{"a":1}'` or `most_freq='[]'` arrive here as ordinary strings. This block unconditionally `JSON.parse`s any string whose parse result is an object/array, which silently turns those legitimate categorical values into objects/arrays on the JS side. That changes the displayed summary stats for string columns whose frequent values look like JSON.
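The hazard is easy to reproduce outside the browser. Below is a minimal Python analog of the quoted JS block (the function name `pivot_cell` is illustrative, not the actual resolveDFData code):

```python
import json

def pivot_cell(val):
    # Mimics the quoted JS block: JSON-parse any string whose parse
    # result is an object/array, and leave everything else alone.
    if isinstance(val, str):
        try:
            parsed = json.loads(val)
            if isinstance(parsed, (dict, list)):
                return parsed  # legitimate scalar strings are corrupted here
        except json.JSONDecodeError:
            pass
    return val

# A categorical value that merely *looks* like JSON stops being a string:
print(pivot_cell('{"a":1}'))  # {'a': 1} instead of the original string
print(pivot_cell("hello"))    # 'hello' survives untouched
```

Note that scalar-looking JSON such as `'3.5'` also survives, because only object/array parse results are kept; the corruption is specific to strings that parse to containers.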
```ts
function isWideFormat(rows: any[]): boolean {
    if (rows.length !== 1) return false;
    const keys = Object.keys(rows[0]);
    return keys.some(k => k.indexOf('__') !== -1);
}
```
Narrow wide-format detection before pivoting single-row payloads

`resolveDFData*()` is shared by summary stats and ordinary `df_data` parquet payloads. With this heuristic, any one-row table whose column names contain `__` (for example a user column like `user__id`) is treated as wide summary stats and passed into `pivotWideSummaryStats()`, which rewrites the row into `{index: 'id', user: ...}` and drops the original shape. Small static embeds and other one-row parquet tables will therefore decode incorrectly.
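The false positive is mechanical; a hypothetical Python port of the heuristic (not the shipped TS) shows it directly:

```python
def is_wide_format(rows):
    # Hypothetical Python port of the isWideFormat heuristic above:
    # exactly one row, and any column name containing '__'.
    if len(rows) != 1:
        return False
    return any("__" in k for k in rows[0])

# An ordinary one-row payload whose column happens to contain '__':
embed_row = [{"user__id": 7, "name": "ada"}]
print(is_wide_format(embed_row))  # True: misdetected as wide summary stats
```

Any one-row static embed with a double-underscore column name trips the check, which is what motivates the explicit layout tag added later in the PR.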
Replace JSON-in-Parquet summary stats with a one-column-per-cell layout. Each parquet column is named `col__stat` (e.g. `a__mean`, `b__histogram`). Scalars go through parquet natively; only lists/dicts are JSON-encoded. The JS side pivots the wide single row back to row-based `DFData` that all downstream consumers (`extractSDFT`, `extractPinnedRows`, AG-Grid) expect.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
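A rough sketch of the encoding side described here (the name `to_wide_row` is illustrative; the real serializer lives in the Python package):

```python
import json

def to_wide_row(summary_stats):
    # Flatten {col: {stat: value}} into a single row keyed col__stat.
    # Scalars pass through unchanged; only lists/dicts are JSON-encoded,
    # matching the layout described in the commit message.
    row = {}
    for col, stats in summary_stats.items():
        for stat, val in stats.items():
            if isinstance(val, (list, dict)):
                val = json.dumps(val)
            row[f"{col}__{stat}"] = val
    return row

wide = to_wide_row({"a": {"mean": 2.5, "histogram": [1, 2]},
                    "b": {"dtype": "int64"}})
# wide == {'a__mean': 2.5, 'a__histogram': '[1, 2]', 'b__dtype': 'int64'}
```

The single wide row then serializes to parquet with one column per cell, which is what the JS pivot later inverts.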
- pyarrow already handles numpy scalars (float64, int64, bool_, nan)
- Replace `_to_python_native` with an `_is_complex_for_parquet` check
- Fix `pd.Series.to_dict()` crash on unhashable values (fall back to `to_list`)
- Update `_resolve_all_stats` test helpers to handle the wide-column format

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
f55ce9a to f6ba18c
- Use `pd.DataFrame.to_parquet(index=False)` instead of `pa.table()`: avoids the per-column pyarrow array overhead that caused a perf test failure
- NaN becomes null through the pandas parquet path (update tests)
- Fix third copy of `_resolve_all_stats` in test_widget_weird_types.py

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The native-type approach (`pa.table` with 215 heterogeneous columns) was 18x slower than the old JSON-string approach due to pyarrow per-column overhead. Keep JSON encoding for all cells (same speed as the old code) but in the wide `col__stat` layout. The JS `pivotWideSummaryStats` JSON-parses all string values.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Use `pa.table()` + `pq.write_table()` instead of `pd.DataFrame.to_parquet()`
- Skip pre-converting Series/ndarray; let `_json_encode_cell` handle them via `default=str` (same as the old code, no perf regression)
- Fix TS2588: use `let` instead of `const` for `val` in the pivot loop
- Regenerate test fixture

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Static embed Playwright tests failed because the main DataFrame parquet (non-wide, multiple rows) was no longer being JSON-parsed. Restored `parseParquetRow` for the non-wide code path. Also use `pa.table()` + `pq.write_table()` directly for the wide format (6.8ms vs 56ms through pandas) to avoid a perf regression.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Temporary debug commit to understand why the static embed Playwright tests fail (AG-Grid never renders). Logs the artifact shape and decode results, and captures browser console/errors in the first test.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…error

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The async path was skipping `parseParquetRow` for non-wide data, so BigInt values from hyparquet survived into the rendered data and caused "Do not know how to serialize a BigInt" on `JSON.stringify`.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Python: `sd_to_parquet_b64` now returns `layout='wide'`; `_df_to_parquet_b64_tagged` returns `layout='row'`
- TS: `isWideFormat` uses the layout tag when present, falls back to the `__` heuristic for old payloads without the tag
- Fixes the false positive where a one-row table with `__` in column names (e.g. `user__id`) would be misdetected as wide summary stats
- Fixes a misleading docstring that said scalars go through parquet natively (they are all JSON-encoded via `_json_encode_cell`)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
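In Python terms, the tag-plus-fallback detection might look like this (a sketch: the `layout` field names follow the commit message, but the payload shape and function are assumptions, not the shipped TS):

```python
def is_wide_format(payload):
    # Prefer the explicit layout tag written by the Python side;
    # fall back to the '__' heuristic only for untagged legacy payloads.
    layout = payload.get("layout")
    if layout is not None:
        return layout == "wide"
    rows = payload.get("rows", [])
    return len(rows) == 1 and any("__" in k for k in rows[0])

# The user__id false positive is fixed by the tag:
print(is_wide_format({"layout": "row", "rows": [{"user__id": 7}]}))    # False
print(is_wide_format({"layout": "wide", "rows": [{"a__mean": 2.5}]}))  # True
```

An explicit tag set by the producer is strictly more reliable than inferring intent from column names, which is why the heuristic branch was later dropped entirely.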
No pre-existing payloads lack the layout field, so the `__`-based heuristic was unnecessary complexity and a false-positive risk.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- Each parquet column is named `col__stat` (e.g. `a__mean`, `b__histogram`) with a single row
- Lists/dicts go through a `JSON.stringify`/`JSON.parse` round-trip
- `pivotWideSummaryStats()` converts the wide row back to row-based `DFData` that all downstream consumers expect
- Removes `_to_python_native`

Test plan
- Python tests pass (`test_sd_to_parquet_b64.py`)
- `resolveDFData` tests pass (wide-format fixture, pivot, async decode)
- `gridUtils` tests pass (`extractSDFT`, `extractPinnedRows` unchanged)

Closes #646
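For reference, the pivot step at the heart of the PR can be sketched in Python (a rough analog of `pivotWideSummaryStats()`; the `index` key and JSON-decode behavior follow the PR description, the rest is an assumption):

```python
import json

def pivot_wide_summary_stats(row):
    # Split each col__stat key and regroup into one row per stat,
    # keyed by 'index': the row-based DFData shape that downstream
    # consumers like extractSDFT / extractPinnedRows expect.
    out = {}
    for key, val in row.items():
        col, _, stat = key.partition("__")
        if isinstance(val, str):
            try:
                decoded = json.loads(val)
                if isinstance(decoded, (dict, list)):
                    val = decoded
            except json.JSONDecodeError:
                pass
        out.setdefault(stat, {"index": stat})[col] = val
    return list(out.values())

rows = pivot_wide_summary_stats(
    {"a__mean": 2.5, "b__mean": 3.0, "a__histogram": "[1, 2]"})
# rows == [{'index': 'mean', 'a': 2.5, 'b': 3.0},
#          {'index': 'histogram', 'a': [1, 2]}]
```

This is the inverse of the wide encoding: one output row per stat, with one column per original DataFrame column.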
🤖 Generated with Claude Code