feat(polars): lazy-thread interpreter pipeline (one collect per run)#769
feat(polars): lazy-thread interpreter pipeline (one collect per run)#769paddymul wants to merge 2 commits into
Conversation
Adds a failing assertion that confirms a polars cleaning pipeline threads a LazyFrame through commands' transforms (so the optimizer fuses adjacent with_columns/filter calls and materialises once at exit) instead of re-materialising on each op. Failing today: probe command sees DataFrame both passes; will pass once polars conf grows lazy_enter/lazy_exit hooks. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 40942a5b20
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| cleaned, _sd, _gen, _ops = ac.handle_ops_and_clean( | ||
| df, cleaning_method='', quick_command_args={}, existing_operations=ops) | ||
|
|
||
| assert _RecordTypeCommand.seen_types == ['LazyFrame', 'LazyFrame'] |
There was a problem hiding this comment.
Include the lazy-threading implementation
This assertion does not match the code path exercised by the test: PolarsAutocleaning still inherits _run_df_interpreter from PandasAutocleaning, and configure_buckaroo.buckaroo_transform clones non-pandas inputs with df.clone() rather than converting Polars DataFrame to LazyFrame or collecting once at exit. With Polars installed, _RecordTypeCommand therefore records ['DataFrame', 'DataFrame'], so this new test fails instead of validating the advertised lazy-thread interpreter pipeline.
Useful? React with 👍 / 👎.
📦 TestPyPI package publishedpip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.2.dev26001421576or with uv: uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.2.dev26001421576MCP server for Claude Codeclaude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.14.2.dev26001421576" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table📖 Docs preview🎨 Storybook preview |
Polars commands previously ran eager: each lisp op materialised a new DataFrame, so an N-op cleaning pipeline paid N times for what polars's query optimiser can fuse into one plan. Switch the polars autoclean conf to thread a LazyFrame through the interpreter and collect once at exit. - `AutocleaningConfig` grows two staticmethod hooks, `lazy_enter` and `lazy_exit`, defaulting to identity. Pandas inherits unchanged; xorq (when #767 lands) inherits the no-op default — ibis exprs are already lazy, so the unified pattern fits both dialects. - `NoCleaningConfPl` overrides with `df.lazy()` on entry and `df.collect() if isinstance(df, pl.LazyFrame) else df` on exit. The isinstance guard handles `GroupBy.transform`, which materialises mid-pipeline; anything downstream of a groupby runs eager and the exit becomes a no-op. - `_run_df_interpreter` wraps the interpreter call with the hooks. The no-op short-circuit fires *before* lazy_enter, preserving the by-reference identity contract the traitlets/anywidget init path depends on. - `Search.transform` switches to `df.collect_schema()` to avoid polars's PerformanceWarning when handed a LazyFrame. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
The lisp interpreter today re-assigns
dfafter every command — each polars op runs eagerly and materialises a new DataFrame. Switch the polars path to thread a LazyFrame through the pipeline and collect once at exit. Adjacentwith_columns/filtercalls fuse in polars's optimiser, filters push down, and a multi-op cleaning run pays one materialisation instead of N.Implementation lives entirely in the autocleaning layer:
AutocleaningConfiggrows two staticmethod hooks,lazy_enterandlazy_exit, defaulting to identity.NoCleaningConfPloverrides them withdf.lazy()on entry anddf.collect() if isinstance(df, pl.LazyFrame) else dfon exit. The isinstance guard handlesGroupBy.transform, which materialises mid-pipeline (lazy().group_by().agg().collect()) — anything downstream of it runs eager, and the exit hook is then a no-op._run_df_interpreterreads the hooks off the active conf and wraps the interpreter call. The no-op short-circuit short-circuits beforelazy_enter, preserving the by-reference identity the traitlets/anywidget path depends on (same load-bearing reason as the existing copy/clone branch).PolarsAutocleaning.make_origsandPlDfStatsV2both need a concrete DataFrame, whichlazy_exitguarantees. Pandas and xorq confs inherit the identity defaults — pandas is not lazy, xorq is already lazy expr-wise, so neither needs changes.Same approach extends to xorq's conf when #767 merges (it inherits the no-op defaults — already correct, since ibis exprs are lazy by construction).
Test plan
uv run pytest tests/unit/dataflow/autocleaning_pl_test.py -vv— failing assertion expectsLazyFramemid-pipeline; once the fix lands this and the existing 16 tests in the file are greenuv run pytest tests/unit/ --ignore=tests/unit/contrib --ignore=tests/unit/file_cache— full unit suite stays green (pandas + polars interpreter paths unaffected)uv run ruff checkon touched files — clean🤖 Generated with Claude Code