
feat(xorq-stats): compile_batch_expr for portable Phase-1 aggregate expressions #766

Open
paddymul wants to merge 2 commits into main from feat/xorq-compile-batch-expr

Conversation

@paddymul
Collaborator

Summary

Adds XorqStatPipeline.compile_batch_expr(table) -> (expr, errors) — a way to extract the Phase-1 batched-aggregate expression without executing it. Pass an xo.table(schema, name=...) UnboundTable and you get a portable, reusable summary-stats expression you can catalog, ship across processes, or rebind to any source later.

Motivating use case: xorq/buckaroo users want to save the summary-stats config as an unbound expression so it can live alongside other catalog entries instead of being trapped inside process_table's execute loop.

What's in the expression

Shape: (1 row) x (1 + N_batch_results). Columns:

  • __total_length__ — table-level row count (promoted to XorqStatPipeline.TOTAL_LENGTH_KEY)
  • <col>|<stat> for every batch-phase (col, stat) pair that survived the column filter
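
As a rough illustration of the naming scheme above, a consumer could split the one-row result back into per-column stats. This is a sketch only: `split_batch_row` is a hypothetical helper, not part of buckaroo, and the row below is hand-written sample data rather than real output.

```python
TOTAL_LENGTH_KEY = "__total_length__"  # mirrors XorqStatPipeline.TOTAL_LENGTH_KEY


def split_batch_row(row):
    """Split a flat one-row result into (total_length, {col: {stat: value}}).

    Hypothetical helper; assumes stat names never contain '|'.
    """
    total_length = row[TOTAL_LENGTH_KEY]
    per_column = {}
    for key, value in row.items():
        if key == TOTAL_LENGTH_KEY:
            continue
        # rpartition keeps any '|' inside a column name with the column
        col, _, stat = key.rpartition("|")
        per_column.setdefault(col, {})[stat] = value
    return total_length, per_column


# Hand-written sample row, not real pipeline output
row = {"__total_length__": 3, "a|min": 1.0, "a|max": 4.5, "b|null_count": 1}
total_length, stats = split_batch_row(row)
```

Splitting on the last `|` is a design choice made here so that column names containing `|` still round-trip; whether buckaroo's own reader does the same is not stated in this PR.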

Only Phase-1 batched stats are folded in (null_count, min, max, distinct_count, mean, std, median). Histograms and pure-Python computed stats (non_null_count, nan_per, distinct_per, _type, typing_stats) are not — histograms need scalar min/max from Phase 1, and the computed stats are Python on resolved scalars.
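
To make that dependency concrete, here is a minimal pure-Python sketch of the work that can only happen once Phase 1 has resolved to scalars. The formulas and the equal-width binning are assumptions chosen for illustration; buckaroo's actual computed stats and histogram logic may differ.

```python
def derived_stats(total_length, null_count, distinct_count):
    """Computed stats: plain Python arithmetic on already-resolved scalars.

    Assumed formulas, for illustration only.
    """
    return {
        "non_null_count": total_length - null_count,
        "nan_per": null_count / total_length if total_length else 0.0,
        "distinct_per": distinct_count / total_length if total_length else 0.0,
    }


def histogram_edges(col_min, col_max, bins=10):
    """Equal-width bin edges; they need concrete min/max, hence a second phase."""
    width = (col_max - col_min) / bins
    return [col_min + i * width for i in range(bins + 1)]


# Stand-in for the scalars a Phase-1 execution would return
phase1 = {"total_length": 4, "null_count": 1, "distinct_count": 2, "min": 0.0, "max": 10.0}
computed = derived_stats(phase1["total_length"], phase1["null_count"], phase1["distinct_count"])
edges = histogram_edges(phase1["min"], phase1["max"], bins=5)
```

Neither function can be expressed inside the single aggregate expression: the first runs on Python values, and the second needs `min` and `max` as inputs before any histogram query can be built.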

Internals

Refactors process_table's Phase-1 build loop into _build_batch_agg_exprs(table) returning (agg_exprs, batch_items, errors). Shared by the new public method and the execution path. No behaviour change to process_table — construction-time failures still land in the per-column accumulator as Err, just routed through StatError now.
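
The shape of that refactor can be sketched without xorq. Everything here is a stand-in: `build_batch_agg_exprs`, the builder callables, and the error tuples are hypothetical, and the real code builds xorq expressions and `StatError` objects rather than strings.

```python
def build_batch_agg_exprs(columns, stat_builders):
    """Return (agg_exprs, batch_items, errors) without executing anything.

    columns: iterable of column names.
    stat_builders: {stat_name: callable(col) -> expression}; a builder may raise.
    """
    agg_exprs, batch_items, errors = [], [], []
    for col in columns:
        for stat, build in stat_builders.items():
            try:
                expr = build(col)
            except Exception as exc:
                # Construction-time failure: record it and keep going
                errors.append((col, stat, exc))
                continue
            agg_exprs.append((f"{col}|{stat}", expr))
            batch_items.append((col, stat))
    return agg_exprs, batch_items, errors


def build_mean(col):
    if col == "b":  # pretend 'b' is a string column
        raise TypeError("mean is undefined for strings")
    return f"mean({col})"


builders = {"null_count": lambda c: f"count_nulls({c})", "mean": build_mean}
agg_exprs, batch_items, errors = build_batch_agg_exprs(["a", "b"], builders)
```

With one builder shared this way, a compile-only caller can stop after construction, while an executing caller runs the same expressions and merges `errors` into its per-column results.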

Usage

import xorq.api as xo
from buckaroo.customizations.xorq_stats_v2 import XORQ_STATS_V2
from buckaroo.pluggable_analysis_framework.xorq_stat_pipeline import XorqStatPipeline

schema = {"a": "float64", "b": "string", "c": "int64"}
unbound = xo.table(schema, name="t")

pipeline = XorqStatPipeline(XORQ_STATS_V2)
expr, errors = pipeline.compile_batch_expr(unbound)

# Rebind to any source with a compatible schema and execute.
# `real_source` is a placeholder for a bound table expression; it is not defined in this snippet.
bound = expr.op().replace({unbound.op(): real_source.op()}).to_expr()
df = bound.execute()

Test plan

  • pytest tests/unit/test_xorq_compile_batch_expr.py — 7 new tests pass (unbound stays unbound, column naming, no histogram, rebind matches process_table baseline, real-table input, construction-error surfacing, process_table regression)
  • pytest tests/unit/test_xorq_*.py — 79 existing xorq tests pass
  • ruff clean
  • CI green

🤖 Generated with Claude Code

… as portable expressions

Adds `XorqStatPipeline.compile_batch_expr(table) -> (expr, errors)` so callers
can extract the Phase-1 batched-aggregate expression without executing it.
Pass an `xo.table(schema, name=...)` UnboundTable to get a portable, reusable
stat expression that can be cataloged / shipped / rebound to any source later.

The result shape is `(1 row) x (1 + N_batch_results)` with columns
`__total_length__` and `<col>|<stat>` — same naming the internal Phase-1
result reader has always used; now promoted to a class constant
(`TOTAL_LENGTH_KEY`) so external callers don't hard-code it.

Refactors `process_table`'s Phase-1 build loop into `_build_batch_agg_exprs`,
shared by both the public method and the execution path. No behaviour change
to `process_table`: construction-time failures still land in the per-column
accumulator as `Err`, just routed through `StatError`.

Histograms are intentionally excluded — they're Phase 2, parameterised on
scalar min/max from Phase 1, so they can't be folded into one expression.
Computed Python stats (`non_null_count`, `nan_per`, `distinct_per`, `_type`,
`typing_stats`) are also out — they're pure Python on resolved scalars.

Rebind pattern documented in the docstring:

    unbound = xo.table(schema, name="t")
    expr, _ = pipeline.compile_batch_expr(unbound)
    bound = expr.op().replace({unbound.op(): real_source.op()}).to_expr()
    df = bound.execute()

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

github-actions Bot commented May 17, 2026

📦 TestPyPI package published

pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.2.dev26006465187

or with uv:

uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.2.dev26006465187

MCP server for Claude Code

claude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.14.2.dev26006465187" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table

📖 Docs preview

🎨 Storybook preview

- Assert batch funcs provide exactly one stat key — surfaces the
  long-standing implicit constraint (the named aggregate column uses
  ``sf.provides[0].name``; multiple provides would silently drop after
  the first).
- Clarify in process_table why construction_errors aren't appended to
  all_errors directly — they reach the caller via resolve_accumulator.
- compile_batch_expr docstring: note the table.aggregate wrapper is
  already applied, and that the rebind source must be schema-compatible.
- test_returns_unbound_when_given_unbound: walk the op tree via
  ``op.find(UnboundTable)`` instead of substring-matching repr(expr),
  so the test survives ibis repr-format changes.
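
The first item above, the exactly-one-key constraint, might look roughly like this. `StatKey`, `StatFunc`, and `batch_stat_name` are simplified stand-ins invented for the sketch; only the `sf.provides[0].name` access comes from the commit message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StatKey:
    name: str


@dataclass(frozen=True)
class StatFunc:
    name: str
    provides: tuple  # tuple of StatKey


def batch_stat_name(sf):
    """Return the single stat name a batch func provides, or fail loudly."""
    assert len(sf.provides) == 1, (
        f"batch stat func {sf.name!r} must provide exactly one key, "
        f"got {[k.name for k in sf.provides]}"
    )
    return sf.provides[0].name


ok = StatFunc("null_count", (StatKey("null_count"),))
bad = StatFunc("min_max", (StatKey("min"), StatKey("max")))
name = batch_stat_name(ok)
```

Calling `batch_stat_name(bad)` raises `AssertionError` instead of silently naming the aggregate column after `min` and dropping `max`.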

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>