-
Notifications
You must be signed in to change notification settings - Fork 16
test: add DOM integration tests for summary stats #647
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
paddymul
wants to merge
1
commit into
main
Choose a base branch
from
test/summary-stats-dom-tests
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,61 @@ | ||
| # Plan: Replace JSON-in-Parquet summary stats with wide-column layout (#646) | ||
|
|
||
| ## Context | ||
|
|
||
| Summary stats serialization (`sd_to_parquet_b64`) JSON-encodes every cell value to a string before writing to parquet, then JS `JSON.parse`'s every cell back. This defeats parquet's type preservation — numbers, bools, and strings all become JSON strings stuffed in string columns. The fix: flatten to one parquet column per cell (`a__mean`, `a__histogram`, etc.) so scalars go through parquet natively, and only lists/dicts still need JSON encoding. | ||
|
|
||
| ## Approach | ||
|
|
||
| Replace `sd_to_parquet_b64` in-place — no legacy code, no backwards compat format tag. Python flattens the stats dict to `{col__stat: [value]}` (single-row, many columns). JS decodes parquet, pivots the wide row back to the row-based `DFData` that all consumers already expect. The format tag stays `parquet_b64` (same as before). | ||
|
|
||
| ## Files to modify | ||
|
|
||
| ### Python: `buckaroo/serialization_utils.py` | ||
| 1. Add `_to_python_native(val)` — convert numpy scalars to Python builtins for pyarrow | ||
| 2. Add `_sd_to_parquet_b64_wide(sd)`: | ||
| - Rename columns to a,b,c via `to_chars()` (reuse existing `old_col_new_col` logic) | ||
| - For each `(col, stats_dict)`, for each `(stat, value)`: | ||
| - Column name = `f"{short_col}__{stat}"` | ||
| - If value is list/dict/tuple → JSON-encode to string | ||
| - If value is numpy scalar → convert to Python native | ||
| - If value is None/NaN → store as `None` (pyarrow handles nulls natively) | ||
| - Otherwise → store as-is | ||
| - Build `pa.table()` directly (single-row, one column per cell) | ||
| - Write parquet, base64 encode | ||
| - Return `{'format': 'parquet_b64_wide', 'data': '...'}` | ||
| 3. Replace `sd_to_parquet_b64` body with the wide-column implementation (no legacy fallback) | ||
|
|
||
| ### TypeScript: `packages/buckaroo-js-core/src/components/DFViewerParts/resolveDFData.ts` | ||
| 1. Replace `parseParquetRow()` with `pivotWideSummaryStats(wideRow)` — splits column names on first `__`, groups by stat name, produces `DFData` rows like `{index: "mean", level_0: "mean", a: 42, b: 33}` | ||
| 2. For complex values (strings that are JSON arrays/objects), JSON.parse them during pivot | ||
| 3. Update `resolveDFData()` and `resolveDFDataAsync()` to call `pivotWideSummaryStats` on the single decoded row | ||
|
|
||
| ### No changes needed in downstream consumers | ||
| - `extractSDFT()`, `extractPinnedRows()`, AG-Grid pinned rows, `Styler.tsx` — all receive `DFData` after the pivot, same shape as before | ||
| - All Python callers of `sd_to_parquet_b64` — same function signature, same tagged return format | ||
|
|
||
| ## Tests | ||
|
|
||
| ### Pre-check: DOM integration test | ||
| Before making any changes, verify an existing DOM/integration test checks that summary stat rows (e.g. mean, dtype) appear in the rendered grid. If no such test exists, add one. Run it green before proceeding. | ||
|
|
||
| ### Python: `tests/unit/test_sd_to_parquet_b64.py` | ||
| - Rewrite tests for the new wide format (single-row, `col__stat` columns) | ||
| - Scalar values are now native types in parquet (not JSON strings) | ||
| - Histogram columns are still JSON strings (complex types) | ||
| - None/NaN values are explicit nulls in parquet | ||
| - Remove old round-trip tests that assert JSON-encoded cells | ||
|
|
||
| ### TypeScript: `resolveDFData.test.ts` | ||
| - Regenerate `test-fixtures/summary_stats_parquet_b64.json` with new wide format | ||
| - Add test for `pivotWideSummaryStats` directly | ||
| - Remove old `parseParquetRow` tests | ||
| - Verify async decode produces correct DFData shape | ||
|
|
||
| ## Verification | ||
|
|
||
| 1. Run DOM integration test green BEFORE changes | ||
| 2. `pytest tests/unit/test_sd_to_parquet_b64.py -vv` | ||
| 3. `cd packages/buckaroo-js-core && pnpm test` | ||
| 4. Run DOM integration test green AFTER changes | ||
| 5. Full test suite: `pytest -vv tests/unit/ && cd packages && pnpm test` | ||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parquet_b64This plan contradicts its own compatibility story: lines 9 and 35 say downstream consumers keep using the existing
parquet_b64tag, but this step changes the returned tag toparquet_b64_wide. That would break every current consumer that dispatches on the tag, including the TS type guard inpackages/buckaroo-js-core/src/components/DFViewerParts/resolveDFData.ts:7-14and Python callers that log/check forparquet_b64(for examplebuckaroo/lazy_infinite_polars_widget.py:721-722).Useful? React with 👍 / 👎.