feat(xorq-stats): compile_batch_expr for portable Phase-1 aggregate expressions by paddymul · Pull Request #766 · buckaroo-data/buckaroo

paddymul · 2026-05-17T19:00:09Z

Summary

Adds XorqStatPipeline.compile_batch_expr(table) -> (expr, errors) — a way to extract the Phase-1 batched-aggregate expression without executing it. Pass an xo.table(schema, name=...) UnboundTable and you get a portable, reusable summary-stats expression you can catalog, ship across processes, or rebind to any source later.

Motivating use case: xorq/buckaroo users want to save the summary-stats config as an unbound expression so it can live alongside other catalog entries instead of being trapped inside process_table's execute loop.

What's in the expression

Shape: (1 row) x (1 + N_batch_results). Columns:

__total_length__ — table-level row count (promoted to XorqStatPipeline.TOTAL_LENGTH_KEY)
<col>|<stat> for every batch-phase (col, stat) pair that survived the column filter
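
As a minimal sketch of how a caller might unpack that single row: the helper name `split_batch_row` and the sample values are hypothetical, and it assumes stat names never contain `|`, which the PR does not state. In practice the row could come from something like `bound.execute().iloc[0].to_dict()`, since `execute()` returns a DataFrame in the usage example.

```python
from collections import defaultdict

TOTAL_LENGTH_KEY = "__total_length__"  # mirrors XorqStatPipeline.TOTAL_LENGTH_KEY


def split_batch_row(row):
    """Split a one-row result dict into (total_length, {col: {stat: value}})."""
    total_length = row[TOTAL_LENGTH_KEY]
    per_column = defaultdict(dict)
    for key, value in row.items():
        if key == TOTAL_LENGTH_KEY:
            continue
        # Assumption: split on the last "|" so column names may contain "|".
        col, stat = key.rsplit("|", 1)
        per_column[col][stat] = value
    return total_length, dict(per_column)


# Hypothetical values standing in for one executed result row.
row = {"__total_length__": 3, "a|min": 1.0, "a|max": 4.5, "b|null_count": 1}
total, stats = split_batch_row(row)
```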

Only Phase-1 batched stats are folded in (null_count, min, max, distinct_count, mean, std, median). Histograms and pure-Python computed stats (non_null_count, nan_per, distinct_per, _type, typing_stats) are not — histograms need scalar min/max from Phase 1, and the computed stats are Python on resolved scalars.
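
To illustrate why these two groups stay outside the expression, here is a hedged sketch of the work that happens after Phase 1 has resolved to scalars. The formulas and the equal-width bin layout are assumptions inferred from the stat names, not taken from buckaroo's source, and both helper names are hypothetical.

```python
def derive_computed_stats(total_length, null_count, distinct_count):
    """Derive Python-side stats from scalars already resolved by Phase 1."""
    non_null_count = total_length - null_count
    # Guard against empty tables so the ratios stay defined.
    denom = total_length if total_length else 1
    return {
        "non_null_count": non_null_count,
        "nan_per": null_count / denom,
        "distinct_per": distinct_count / denom,
    }


def histogram_edges(lo, hi, bins):
    """Equal-width bin edges; lo and hi must already be concrete numbers."""
    step = (hi - lo) / bins
    return [lo + i * step for i in range(bins + 1)]


stats = derive_computed_stats(total_length=4, null_count=1, distinct_count=2)
edges = histogram_edges(0.0, 10.0, 5)
```

Both helpers take plain numbers, which is the point: neither can be built until the Phase-1 expression has been executed.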

Internals

Refactors process_table's Phase-1 build loop into _build_batch_agg_exprs(table) returning (agg_exprs, batch_items, errors). Shared by the new public method and the execution path. No behaviour change to process_table — construction-time failures still land in the per-column accumulator as Err, just routed through StatError now.
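
The split can be pictured with a simplified, library-free sketch. Everything here is a stand-in: strings replace xorq expressions, and tuple error records replace `StatError`. Only the three-part return value and the `<col>|<stat>` naming come from the description above.

```python
def build_batch_agg_exprs(columns, stat_builders):
    """Return (agg_exprs, batch_items, errors) without executing anything."""
    agg_exprs, batch_items, errors = [], [], []
    for col in columns:
        for stat, builder in stat_builders.items():
            try:
                expr = builder(col)
            except Exception as exc:
                # Construction-time failure: record it, keep building the rest.
                errors.append((col, stat, exc))
                continue
            agg_exprs.append((f"{col}|{stat}", expr))
            batch_items.append((col, stat))
    return agg_exprs, batch_items, errors


def fail_on_b(col):
    """Toy builder that cannot construct a mean for column "b"."""
    if col == "b":
        raise TypeError("mean is undefined for strings")
    return f"mean({col})"


builders = {"null_count": lambda c: f"count_nulls({c})", "mean": fail_on_b}
agg_exprs, batch_items, errors = build_batch_agg_exprs(["a", "b"], builders)
```

In this sketch a public "compile" entry point would return the built expressions and errors, while an "execute" path would additionally use `batch_items` to map result columns back to (col, stat) pairs.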

Usage

import xorq.api as xo
from buckaroo.customizations.xorq_stats_v2 import XORQ_STATS_V2
from buckaroo.pluggable_analysis_framework.xorq_stat_pipeline import XorqStatPipeline

schema = {"a": "float64", "b": "string", "c": "int64"}
unbound = xo.table(schema, name="t")

pipeline = XorqStatPipeline(XORQ_STATS_V2)
expr, errors = pipeline.compile_batch_expr(unbound)

# rebind to any source with a compatible schema and execute;
# real_source is a placeholder for a bound table expression you already have:
bound = expr.op().replace({unbound.op(): real_source.op()}).to_expr()
df = bound.execute()

Test plan

pytest tests/unit/test_xorq_compile_batch_expr.py — 7 new tests pass (unbound stays unbound, column naming, no histogram, rebind matches process_table baseline, real-table input, construction-error surfacing, process_table regression)
pytest tests/unit/test_xorq_*.py — 79 existing xorq tests pass
ruff clean
CI green

🤖 Generated with Claude Code

… as portable expressions

Adds `XorqStatPipeline.compile_batch_expr(table) -> (expr, errors)` so callers can extract the Phase-1 batched-aggregate expression without executing it. Pass an `xo.table(schema, name=...)` UnboundTable to get a portable, reusable stat expression that can be cataloged / shipped / rebound to any source later.

The result shape is `(1 row) x (1 + N_batch_results)` with columns `__total_length__` and `<col>|<stat>`. This is the same naming the internal Phase-1 result reader has always used; it is now promoted to a class constant (`TOTAL_LENGTH_KEY`) so external callers don't hard-code it.

Refactors `process_table`'s Phase-1 build loop into `_build_batch_agg_exprs`, shared by both the public method and the execution path. No behaviour change to `process_table`: construction-time failures still land in the per-column accumulator as `Err`, just routed through `StatError`.

Histograms are intentionally excluded: they're Phase 2, parameterised on scalar min/max from Phase 1, so they can't be folded into one expression. Computed Python stats (`non_null_count`, `nan_per`, `distinct_per`, `_type`, `typing_stats`) are also out because they're pure Python on resolved scalars.

Rebind pattern documented in the docstring:

    unbound = xo.table(schema, name="t")
    expr, _ = pipeline.compile_batch_expr(unbound)
    bound = expr.op().replace({unbound.op(): real_source.op()}).to_expr()
    df = bound.execute()

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-17T19:01:54Z

📦 TestPyPI package published

pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.2.dev26006465187

or with uv:

uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.2.dev26006465187

MCP server for Claude Code

claude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.14.2.dev26006465187" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table

📖 Docs preview

🎨 Storybook preview

- Assert batch funcs provide exactly one stat key. This surfaces the long-standing implicit constraint (the named aggregate column uses ``sf.provides[0].name``; multiple provides would silently drop after the first).
- Clarify in process_table why construction_errors aren't appended to all_errors directly: they reach the caller via resolve_accumulator.
- compile_batch_expr docstring: note the table.aggregate wrapper is already applied, and that the rebind source must be schema-compatible.
- test_returns_unbound_when_given_unbound: walk the op tree via ``op.find(UnboundTable)`` instead of substring-matching repr(expr), so the test survives ibis repr-format changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
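
The guard described in the first bullet could look roughly like the following. `StatKey`, `StatFunc`, and `batch_column_name` are hypothetical stand-ins for buckaroo's real classes; only the `sf.provides[0].name` access pattern is taken from the commit message.

```python
from dataclasses import dataclass, field


@dataclass
class StatKey:
    name: str


@dataclass
class StatFunc:
    provides: list = field(default_factory=list)


def batch_column_name(col, sf):
    """Name the aggregate column, refusing funcs that provide several keys."""
    # Without this check, every key after provides[0] would be silently dropped.
    assert len(sf.provides) == 1, (
        f"batch stat funcs must provide exactly one key, got {len(sf.provides)}"
    )
    return f"{col}|{sf.provides[0].name}"


name = batch_column_name("a", StatFunc([StatKey("min")]))
```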

paddymul temporarily deployed to testpypi May 17, 2026 19:01 — with GitHub Actions Inactive

paddymul temporarily deployed to testpypi May 17, 2026 23:58 — with GitHub Actions Inactive

paddymul wants to merge 2 commits into main from feat/xorq-compile-batch-expr
