feat(xorq-stats): compile_batch_expr for portable Phase-1 aggregate expressions #766
Open
paddymul wants to merge 2 commits into
… as portable expressions
Adds `XorqStatPipeline.compile_batch_expr(table) -> (expr, errors)` so callers
can extract the Phase-1 batched-aggregate expression without executing it.
Pass an `xo.table(schema, name=...)` UnboundTable to get a portable, reusable
stat expression that can be cataloged / shipped / rebound to any source later.
The result shape is `(1 row) x (1 + N_batch_results)` with columns
`__total_length__` and `<col>|<stat>` — same naming the internal Phase-1
result reader has always used; now promoted to a class constant
(`TOTAL_LENGTH_KEY`) so external callers don't hard-code it.
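As a rough illustration of the result shape described above, a caller might split the single result row back into per-column stats along these lines. The helper name `split_batch_row` and the sample row are hypothetical, and the sketch works on a plain dict rather than on the pipeline's actual result object.

```python
TOTAL_LENGTH_KEY = "__total_length__"  # value quoted in the description above


def split_batch_row(row: dict) -> tuple[int, dict]:
    """Split a one-row batch result into (total_length, {col: {stat: value}})."""
    total_length = row[TOTAL_LENGTH_KEY]
    per_column: dict = {}
    for key, value in row.items():
        if key == TOTAL_LENGTH_KEY:
            continue
        # Assumption: stat names never contain "|", so split on the last one.
        col, stat = key.rsplit("|", 1)
        per_column.setdefault(col, {})[stat] = value
    return total_length, per_column


# Hypothetical row, shaped like the (1 row) x (1 + N_batch_results) result.
row = {"__total_length__": 3, "price|min": 1.5, "price|max": 9.0, "qty|null_count": 1}
total, stats = split_batch_row(row)
```

Splitting on the last `|` keeps column names that themselves contain `|` intact, at the cost of assuming stat names never do.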
Refactors `process_table`'s Phase-1 build loop into `_build_batch_agg_exprs`,
shared by both the public method and the execution path. No behaviour change
to `process_table`: construction-time failures still land in the per-column
accumulator as `Err`, just routed through `StatError`.
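The collect-instead-of-raise behaviour can be pictured with a small stand-in. This is not the real `_build_batch_agg_exprs`: the function name, the `builders` mapping, and the string "expressions" are all hypothetical, chosen only to show construction failures being gathered per (col, stat) pair while the remaining expressions are still built.

```python
from typing import Any, Callable


def build_batch_exprs(table: Any, builders: dict[tuple[str, str], Callable[[Any], Any]]):
    """Hypothetical stand-in: build every expression, collecting failures."""
    exprs, items, errors = [], [], []
    for (col, stat), build in builders.items():
        try:
            expr = build(table)
        except Exception as exc:
            # A construction-time failure is recorded, not raised.
            errors.append((col, stat, exc))
            continue
        exprs.append(expr)
        items.append((col, stat))
    return exprs, items, errors


def broken(table):
    raise TypeError("min is not defined for this column type")


# Strings stand in for real aggregate expressions here.
builders = {("a", "null_count"): lambda t: f"count_nulls({t}.a)", ("a", "min"): broken}
exprs, items, errors = build_batch_exprs("t", builders)
```

One failing stat leaves the other expressions usable, which is what lets the same loop serve both a compile-only caller and the execution path.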
Histograms are intentionally excluded — they're Phase 2, parameterised on
scalar min/max from Phase 1, so they can't be folded into one expression.
Computed Python stats (`non_null_count`, `nan_per`, `distinct_per`, `_type`,
`typing_stats`) are also out — they're pure Python on resolved scalars.
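To make "pure Python on resolved scalars" concrete, here is a minimal sketch of how such stats could be derived once Phase 1 has returned plain numbers. The formulas are assumptions for illustration (fractions of the total row count), not the pipeline's confirmed definitions, and `computed_stats` is a hypothetical name.

```python
def computed_stats(total_length: int, null_count: int, distinct_count: int) -> dict:
    """Derive follow-on stats from already-resolved Phase-1 scalars."""
    # Assumed definitions, for illustration only.
    if total_length == 0:
        return {"non_null_count": 0, "nan_per": 0.0, "distinct_per": 0.0}
    return {
        "non_null_count": total_length - null_count,
        "nan_per": null_count / total_length,
        "distinct_per": distinct_count / total_length,
    }


stats = computed_stats(total_length=4, null_count=1, distinct_count=2)
```

Because every input is an ordinary Python number, nothing here can be expressed as part of the unexecuted aggregate expression, which is why these stats stay outside it.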
Rebind pattern documented in the docstring:
    unbound = xo.table(schema, name="t")
    expr, _ = pipeline.compile_batch_expr(unbound)
    bound = expr.op().replace({unbound.op(): real_source.op()}).to_expr()
    df = bound.execute()
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
📦 TestPyPI package published

    pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.2.dev26006465187

or with uv:

    uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.2.dev26006465187

MCP server for Claude Code

    claude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.14.2.dev26006465187" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table

📖 Docs preview | 🎨 Storybook preview
- Assert batch funcs provide exactly one stat key. This surfaces the long-standing implicit constraint (the named aggregate column uses ``sf.provides[0].name``; multiple provides would silently drop after the first).
- Clarify in process_table why construction_errors aren't appended to all_errors directly: they reach the caller via resolve_accumulator.
- compile_batch_expr docstring: note the table.aggregate wrapper is already applied, and that the rebind source must be schema-compatible.
- test_returns_unbound_when_given_unbound: walk the op tree via ``op.find(UnboundTable)`` instead of substring-matching repr(expr), so the test survives ibis repr-format changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
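The exactly-one-stat-key constraint from this commit can be sketched as follows. `StatKey`, `BatchStatFunc`, and `batch_column_name` are hypothetical stand-ins for the real stat-function types; only the `provides[0].name` access and the `<col>|<stat>` naming come from the text. The sketch raises `ValueError` where the commit describes an assertion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StatKey:  # hypothetical stand-in for a provided stat key
    name: str


@dataclass(frozen=True)
class BatchStatFunc:  # hypothetical stand-in for a batch stat function
    provides: tuple


def batch_column_name(col: str, sf: BatchStatFunc) -> str:
    """Name the aggregate column, refusing funcs that provide several keys."""
    # Fail loudly instead of silently naming the column after provides[0] only.
    if len(sf.provides) != 1:
        raise ValueError(
            f"batch stat func must provide exactly one key, got {len(sf.provides)}"
        )
    return f"{col}|{sf.provides[0].name}"


name = batch_column_name("price", BatchStatFunc(provides=(StatKey("min"),)))
```

Checking at naming time turns a silent data loss (the second and later keys vanishing) into an immediate, attributable error.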
Summary
Adds `XorqStatPipeline.compile_batch_expr(table) -> (expr, errors)`, a way to extract the Phase-1 batched-aggregate expression without executing it. Pass an `xo.table(schema, name=...)` UnboundTable and you get a portable, reusable summary-stats expression you can catalog, ship across processes, or rebind to any source later.

Motivating use case: xorq/buckaroo users want to save the summary-stats config as an unbound expression so it can live alongside other catalog entries instead of being trapped inside `process_table`'s execute loop.

What's in the expression

Shape: `(1 row) x (1 + N_batch_results)`. Columns:

- `__total_length__`: table-level row count (promoted to `XorqStatPipeline.TOTAL_LENGTH_KEY`)
- `<col>|<stat>` for every batch-phase (col, stat) pair that survived the column filter

Only Phase-1 batched stats are folded in (`null_count`, `min`, `max`, `distinct_count`, `mean`, `std`, `median`). Histograms and pure-Python computed stats (`non_null_count`, `nan_per`, `distinct_per`, `_type`, `typing_stats`) are not: histograms need scalar min/max from Phase 1, and the computed stats are Python on resolved scalars.

Internals

Refactors `process_table`'s Phase-1 build loop into `_build_batch_agg_exprs(table)` returning `(agg_exprs, batch_items, errors)`. Shared by the new public method and the execution path. No behaviour change to `process_table`: construction-time failures still land in the per-column accumulator as `Err`, just routed through `StatError` now.

Usage
Test plan

- `pytest tests/unit/test_xorq_compile_batch_expr.py`: 7 new tests pass (unbound stays unbound, column naming, no histogram, rebind matches `process_table` baseline, real-table input, construction-error surfacing, `process_table` regression)
- `pytest tests/unit/test_xorq_*.py`: 79 existing xorq tests pass

🤖 Generated with Claude Code