refactor: remove sync engine#767
Conversation
Remove the allow_resize column config field and the sync resize fallback paths now that row-count changes belong at workflow boundaries. Enforce size-preserving batch replacement and pre-batch processors, update custom column validation, and revise docs/tests for workflow chaining migration.
|
Fern preview: https://nvidia-preview-pr-767.docs.buildwithfern.com/nemo/datadesigner
|
Greptile SummaryThis PR removes the legacy sync dataset-builder engine entirely, making the async scheduler the sole execution path. The
|
| Filename | Overview |
|---|---|
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py | Sync batch loop, fan-out helpers, and _run_batch sync path removed; build() now always calls _build_async with ClientConcurrencyMode.ASYNC. |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/dataset_batch_manager.py | File deleted — DatasetBatchManager was the sync-only batch manager, with no remaining consumers in the source tree. |
| packages/data-designer-engine/src/data_designer/engine/flags.py | File deleted — DATA_DESIGNER_ASYNC_ENGINE env-var flag removed; no remaining imports of this module across source or test trees. |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/processor_runner.py | PRE_BATCH row-count guard moved into _run_stage, sync-only run_pre_batch removed; run_pre_batch_on_df now delegates entirely to _run_stage. Guard behavior is preserved. |
| packages/data-designer/src/data_designer/interface/data_designer.py | _resolve_client_concurrency_mode removed; _create_resource_provider defaults to ASYNC; get_models explicitly passes SYNC. Clean and consistent. |
| packages/data-designer-config/src/data_designer/config/fingerprint.py | CONFIG_HASH_VERSION bumped 1→2 to invalidate all pre-existing checkpoints; normalization algorithm itself is unchanged. |
| packages/data-designer-engine/src/data_designer/engine/resources/resource_provider.py | client_concurrency_mode default changed from env-flag lookup to always ASYNC; flags import removed. |
| packages/data-designer-engine/src/data_designer/engine/column_generators/generators/custom.py | Docstrings cleaned up; get_generation_strategy wraps result in GenerationStrategy(…) for safety; return-type annotations narrowed from … |
| packages/data-designer-engine/tests/engine/dataset_builders/test_dataset_builder.py | Sync-only fixtures, _force_sync_engine autouse, and related tests removed; async resume tests updated to drop the flags monkeypatch wrapper. |
| scripts/benchmarks/benchmark_engine_v2.py | Sync vs async comparison mode removed; import path corrected from dataset_builders.artifact_storage to storage.artifact_storage; stale blob_storage=None arg dropped. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[DataDesigner.create / build] --> B[_create_resource_provider\nclient_concurrency_mode=ASYNC]
A --> C[run_readiness_check\nclient_concurrency_mode=ASYNC]
A --> D[DatasetBuilder.build]
D --> E[run_readiness_check\nclient_concurrency_mode=ASYNC]
D --> F[_build_async]
F --> G[AsyncTaskScheduler.run]
G --> H[RowGroupBufferManager]
G --> I[ProcessorRunner\nrun_pre_batch_on_df / run_post_batch]
J[DataDesigner.get_models] --> K[_create_resource_provider\nclient_concurrency_mode=SYNC]
style F fill:#90EE90
style G fill:#90EE90
style D fill:#90EE90
style J fill:#FFFFAA
style K fill:#FFFFAA
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
A[DataDesigner.create / build] --> B[_create_resource_provider\nclient_concurrency_mode=ASYNC]
A --> C[run_readiness_check\nclient_concurrency_mode=ASYNC]
A --> D[DatasetBuilder.build]
D --> E[run_readiness_check\nclient_concurrency_mode=ASYNC]
D --> F[_build_async]
F --> G[AsyncTaskScheduler.run]
G --> H[RowGroupBufferManager]
G --> I[ProcessorRunner\nrun_pre_batch_on_df / run_post_batch]
J[DataDesigner.get_models] --> K[_create_resource_provider\nclient_concurrency_mode=SYNC]
style F fill:#90EE90
style G fill:#90EE90
style D fill:#90EE90
style J fill:#FFFFAA
style K fill:#FFFFAA
Reviews (3): Last reviewed commit: "Merge main into remove sync engine" | Re-trigger Greptile
johnnygreco
left a comment
There was a problem hiding this comment.
Nice work on this one, @andreatgretel — removing the old engine is a big simplification, and the core builder path looks much easier to reason about now.
Summary
This PR removes the legacy sync dataset-builder path and makes generation/check-model readiness use the async scheduler and async model clients by default. The implementation matches the stated intent for generation, but I found one public helper regression around direct model use and one stale public doc reference to the removed opt-out.
Findings
Warnings — Worth addressing
Design issues, missing error handling, test gaps, or violations of project standards that could cause problems later.
packages/data-designer/src/data_designer/interface/data_designer.py:690 — get_models() now returns facades whose sync methods are unusable
- What:
_create_resource_provider()now always passesClientConcurrencyMode.ASYNC, andget_models()calls that helper unchanged at lines 617-619. Those facades are plainModelFacades, not the_AsyncBridgedModelFacadeused inside async custom columns, so a directmodel.generate()call goes throughModelRequestExecutor.completion()to an async-mode HTTP client and hitsSyncClientUnavailableErrorfrom_get_sync_client(). - Why:
get_models()is a public helper specifically documented for testing custom-column functions outside the full pipeline, and the docs still showresult = my_generator({"name": "Alice"}, None, models)where the generator examples callmodels["..."].generate(...). This breaks that workflow even though the sync dataset engine removal itself is intentional. - Suggestion: Could we keep
get_models()on sync-mode clients, e.g. let_create_resource_provider()accept aclient_concurrency_modeoverride and passClientConcurrencyMode.SYNCfromget_models()? If we want direct testing to be async-only instead, we should update the helper/docs together and add coverage for the new expected usage.
README.md:29 — README still advertises the removed sync-engine fallback
- What: The root README still tells users they can set
DATA_DESIGNER_ASYNC_ENGINE=0to fall back to the legacy sync engine, but this PR deletesengine/flags.pyand removes the code paths that read that environment variable. - Why: After this lands in the major release, users following the README will think a rollback path exists when it no longer does; worse, setting the env var is now silently ignored by generation.
- Suggestion: Update this section to say the async engine is the only execution path, and remove the
DATA_DESIGNER_ASYNC_ENGINE=0fallback language.
Suggestions — Take it or leave it
Style improvements, minor simplifications, or optional enhancements that would improve code quality.
scripts/benchmarks/benchmark_engine_v2.py:667 — Benchmark compare mode still toggles the removed env var
- What: The benchmark script still uses
DATA_DESIGNER_ASYNC_ENGINEto label subprocesses as sync vs async, and--engine syncnow just runs the async engine with a different label. - Why: This can produce misleading speedup numbers by comparing async against async, especially because the script name and output still frame it as a dual-engine comparison.
- Suggestion: Consider removing sync compare mode, making
--engine syncfail with a clear message, or repurposing the script as an async-only benchmark.
What Looks Good
- The core
DatasetBuilderpath is much cleaner with the sync branch removed, and the async scheduler wiring is now the obvious path throughbuild()andbuild_preview(). - Resume behavior kept the important crash-window safeguards: filesystem-derived row-group progress, immutable original targets, and terminal handling after
process_after_generation(). - The docs under
architecture/and Fern mostly track the new async-only execution model, and the focused tests around builder/resume/readiness give good confidence in the main path.
Verdict
Needs changes — I’d address the get_models() regression and README fallback reference before merge. The benchmark cleanup is optional but worth doing while this context is fresh.
This review was generated by an AI assistant.
231a656 to
2c2eb90
Compare
|
@johnnygreco thanks for the review. I pushed
Greptile is green on the latest commit and the active checks are passing. |
johnnygreco
left a comment
There was a problem hiding this comment.
Thanks, @andreatgretel. I rechecked the latest commit (2c2eb90) and the requested changes are addressed: get_models() now uses sync clients for direct custom-column testing, the README no longer advertises the removed async-engine opt-out, and the benchmark script is async-only. I also reran the focused checks locally (ruff check/format and 162 relevant tests) and they passed.\n\nOne separate note: GitHub currently reports the branch as conflicting with the base, so that will still need resolving before merge.
|
@johnnygreco thanks for the review. Could you re-approve? GitHub dismissed it after the base changed. |
📋 Summary
Removes the legacy sync dataset-builder engine and the DATA_DESIGNER_ASYNC_ENGINE opt-out so generation always uses the async scheduler. This is stacked on #766.
🔗 Related Issue
Stacked on #766.
🔄 Changes
🧪 Testing
Ran:
✅ Checklist