Skip to content

feat(polars): lazy-thread interpreter pipeline (one collect per run)#769

Open
paddymul wants to merge 2 commits into
mainfrom
feat/polars-lazy-pipeline
Open

feat(polars): lazy-thread interpreter pipeline (one collect per run)#769
paddymul wants to merge 2 commits into
mainfrom
feat/polars-lazy-pipeline

Conversation

@paddymul
Copy link
Copy Markdown
Collaborator

Summary

The lisp interpreter today re-assigns df after every command — each polars op runs eagerly and materialises a new DataFrame. Switch the polars path to thread a LazyFrame through the pipeline and collect once at exit. Adjacent with_columns / filter calls fuse in polars's optimiser, filters push down, and a multi-op cleaning run pays one materialisation instead of N.

Implementation lives entirely in the autocleaning layer:

  • AutocleaningConfig grows two staticmethod hooks, lazy_enter and lazy_exit, defaulting to identity.
  • NoCleaningConfPl overrides them with df.lazy() on entry and df.collect() if isinstance(df, pl.LazyFrame) else df on exit. The isinstance guard handles GroupBy.transform, which materialises mid-pipeline (lazy().group_by().agg().collect()) — anything downstream of it runs eager, and the exit hook is then a no-op.
  • _run_df_interpreter reads the hooks off the active conf and wraps the interpreter call. The no-op short-circuit short-circuits before lazy_enter, preserving the by-reference identity the traitlets/anywidget path depends on (same load-bearing reason as the existing copy/clone branch).

PolarsAutocleaning.make_origs and PlDfStatsV2 both need a concrete DataFrame, which lazy_exit guarantees. Pandas and xorq confs inherit the identity defaults — pandas is not lazy, xorq is already lazy expr-wise, so neither needs changes.

Same approach extends to xorq's conf when #767 merges (it inherits the no-op defaults — already correct, since ibis exprs are lazy by construction).

Test plan

  • uv run pytest tests/unit/dataflow/autocleaning_pl_test.py -vv — failing assertion expects LazyFrame mid-pipeline; once the fix lands this and the existing 16 tests in the file are green
  • uv run pytest tests/unit/ --ignore=tests/unit/contrib --ignore=tests/unit/file_cache — full unit suite stays green (pandas + polars interpreter paths unaffected)
  • uv run ruff check on touched files — clean

🤖 Generated with Claude Code

Adds a failing assertion that confirms a polars cleaning pipeline
threads a LazyFrame through commands' transforms (so the optimizer
fuses adjacent with_columns/filter calls and materialises once at
exit) instead of re-materialising on each op.

Failing today: probe command sees DataFrame both passes; will pass
once polars conf grows lazy_enter/lazy_exit hooks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 40942a5b20

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

cleaned, _sd, _gen, _ops = ac.handle_ops_and_clean(
df, cleaning_method='', quick_command_args={}, existing_operations=ops)

assert _RecordTypeCommand.seen_types == ['LazyFrame', 'LazyFrame']
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Include the lazy-threading implementation

This assertion does not match the code path exercised by the test: PolarsAutocleaning still inherits _run_df_interpreter from PandasAutocleaning, and configure_buckaroo.buckaroo_transform clones non-pandas inputs with df.clone() rather than converting Polars DataFrame to LazyFrame or collecting once at exit. With Polars installed, _RecordTypeCommand therefore records ['DataFrame', 'DataFrame'], so this new test fails instead of validating the advertised lazy-thread interpreter pipeline.

Useful? React with 👍 / 👎.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 17, 2026

📦 TestPyPI package published

pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.2.dev26001421576

or with uv:

uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.2.dev26001421576

MCP server for Claude Code

claude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.14.2.dev26001421576" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table

📖 Docs preview

🎨 Storybook preview

Polars commands previously ran eager: each lisp op materialised a new
DataFrame, so an N-op cleaning pipeline paid N times for what polars's
query optimiser can fuse into one plan. Switch the polars autoclean conf
to thread a LazyFrame through the interpreter and collect once at exit.

- `AutocleaningConfig` grows two staticmethod hooks, `lazy_enter` and
  `lazy_exit`, defaulting to identity. Pandas inherits unchanged; xorq
  (when #767 lands) inherits the no-op default — ibis exprs are already
  lazy, so the unified pattern fits both dialects.
- `NoCleaningConfPl` overrides with `df.lazy()` on entry and
  `df.collect() if isinstance(df, pl.LazyFrame) else df` on exit. The
  isinstance guard handles `GroupBy.transform`, which materialises
  mid-pipeline; anything downstream of a groupby runs eager and the
  exit becomes a no-op.
- `_run_df_interpreter` wraps the interpreter call with the hooks. The
  no-op short-circuit fires *before* lazy_enter, preserving the
  by-reference identity contract the traitlets/anywidget init path
  depends on.
- `Search.transform` switches to `df.collect_schema()` to avoid polars's
  PerformanceWarning when handed a LazyFrame.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant