refactor: remove sync engine by andreatgretel · Pull Request #767 · NVIDIA-NeMo/DataDesigner

andreatgretel · 2026-06-23T20:34:20Z

📋 Summary

Removes the legacy sync dataset-builder engine and the DATA_DESIGNER_ASYNC_ENGINE opt-out so generation always uses the async scheduler. This is stacked on #766.

🔗 Related Issue

Stacked on #766.

🔄 Changes

Removed the sync DatasetBuilder batch loop, fan-out helpers, resume path, feature flag module, and obsolete DatasetBatchManager.
Defaulted DataDesigner and ResourceProvider client concurrency to async for generation/check_models.
Updated tests and docs to describe async-only row-group execution.

🧪 Testing

make test passes
Unit tests added/updated
E2E tests added/updated (if applicable)

Ran:

.venv/bin/ruff check --fix .
.venv/bin/ruff format .
.venv/bin/pytest packages/data-designer-engine/tests/engine/dataset_builders/test_dataset_builder.py packages/data-designer-engine/tests/engine/dataset_builders/test_async_builder_integration.py packages/data-designer-engine/tests/engine/test_readiness.py packages/data-designer/tests/interface/test_data_designer.py
.venv/bin/pytest packages/data-designer-engine/tests/engine/column_generators/generators/test_custom.py packages/data-designer-engine/tests/engine/resources/test_resource_provider.py packages/data-designer-engine/tests/engine/models/clients/test_factory.py
git diff --check

✅ Checklist

Follows commit message conventions
Commits are signed off (DCO)
Architecture docs updated (if applicable)

Remove the allow_resize column config field and the sync resize fallback paths now that row-count changes belong at workflow boundaries. Enforce size-preserving batch replacement and pre-batch processors, update custom column validation, and revise docs/tests for workflow chaining migration.
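As a rough illustration of what "size-preserving" enforcement could look like, here is a minimal sketch. It is not the project's implementation: `RowCountChangedError`, `run_size_preserving`, and the list-of-dicts batch representation are all hypothetical stand-ins chosen to keep the example self-contained.

```python
from typing import Callable

Row = dict
Batch = list[Row]
Processor = Callable[[Batch], Batch]


class RowCountChangedError(ValueError):
    """Hypothetical error raised when a processor changes the batch size."""


def run_size_preserving(batch: Batch, processors: list[Processor]) -> Batch:
    """Apply processors in order, rejecting any that add or drop rows."""
    expected = len(batch)
    for processor in processors:
        batch = processor(batch)
        if len(batch) != expected:
            name = getattr(processor, "__name__", repr(processor))
            raise RowCountChangedError(
                f"{name} returned {len(batch)} rows, expected {expected}"
            )
    return batch


def uppercase_names(batch: Batch) -> Batch:
    # Size-preserving: one output row per input row.
    return [{**row, "name": row["name"].upper()} for row in batch]


def drop_short_names(batch: Batch) -> Batch:
    # Not size-preserving: filters rows, so the guard should reject it.
    return [row for row in batch if len(row["name"]) > 3]


rows = [{"name": "alice"}, {"name": "bob"}]
```

Under this sketch, a filtering step such as `drop_short_names` would have to move to a workflow boundary instead of running inside a batch.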

github-actions · 2026-06-23T20:36:40Z

Fern preview: https://nvidia-preview-pr-767.docs.buildwithfern.com/nemo/datadesigner

Fern previews include the docs-website version archive with PR changes synced into latest. Notebook tutorials are rendered without execution outputs in previews.

greptile-apps · 2026-06-23T20:40:06Z

Greptile Summary

This PR removes the legacy sync dataset-builder engine entirely, making the async scheduler the sole execution path. The DATA_DESIGNER_ASYNC_ENGINE feature flag module (flags.py) and DatasetBatchManager are deleted, and all call sites are updated to hard-code ClientConcurrencyMode.ASYNC — with the single deliberate exception of get_models(), which keeps SYNC for direct model queries.

The CONFIG_HASH_VERSION is bumped from 1 → 2, intentionally invalidating all pre-existing checkpoints (including async-engine ones) so users cannot accidentally resume a sync-era run with the new engine.
processor_runner.py is simplified: the PRE_BATCH row-count guard moves into _run_stage itself, and the now-redundant run_pre_batch (sync-path only) is dropped.
Tests, docs, and the benchmark script are updated to match, removing fixtures and monkeypatches that toggled the feature flag.
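To make the checkpoint-invalidation point concrete, the sketch below shows one way a version constant can be folded into a config hash so that bumping it rejects every older checkpoint. Only the name `CONFIG_HASH_VERSION` and the 1 → 2 bump come from the summary above; `config_fingerprint`, `can_resume`, and the JSON/SHA-256 scheme are assumptions for illustration, not the contents of fingerprint.py.

```python
import hashlib
import json

CONFIG_HASH_VERSION = 2  # value from the PR summary; the rest is hypothetical


def config_fingerprint(config: dict, version: int = CONFIG_HASH_VERSION) -> str:
    """Hash a config together with a version tag (hypothetical helper)."""
    payload = json.dumps({"version": version, "config": config}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def can_resume(stored_fingerprint: str, config: dict) -> bool:
    """A checkpoint is resumable only if its fingerprint matches the current one."""
    return stored_fingerprint == config_fingerprint(config)


config = {"columns": ["name", "age"], "num_records": 100}

# Same config, hashed under the old and new versions.
old_checkpoint = config_fingerprint(config, version=1)
new_checkpoint = config_fingerprint(config, version=2)
```

Because the version is part of the hashed payload, an unchanged config still produces a different fingerprint after the bump, which is what forces a clean break.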

Confidence Score: 5/5

Safe to merge — the sync engine removal is complete and consistent across source, tests, and tooling with no dangling references.

Every consumer of the old flags module, DatasetBatchManager, and the sync batch loop has been updated or removed. The only remaining explicit SYNC client path is get_models(), which is intentional and tested. The CONFIG_HASH_VERSION bump is a deliberate clean break that prevents accidental resume of pre-refactor checkpoints.

No files require special attention. The grep for residual flags, DatasetBatchManager, and batch_manager references across the source tree came back clean.

Important Files Changed

Filename	Overview
packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py	Sync batch loop, fan-out helpers, and _run_batch sync path removed; build() now always calls _build_async with ClientConcurrencyMode.ASYNC.
packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/dataset_batch_manager.py	File deleted — DatasetBatchManager was the sync-only batch manager, with no remaining consumers in the source tree.
packages/data-designer-engine/src/data_designer/engine/flags.py	File deleted — DATA_DESIGNER_ASYNC_ENGINE env-var flag removed; no remaining imports of this module across source or test trees.
packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/processor_runner.py	PRE_BATCH row-count guard moved into _run_stage, sync-only run_pre_batch removed; run_pre_batch_on_df now delegates entirely to _run_stage. Guard behavior is preserved.
packages/data-designer/src/data_designer/interface/data_designer.py	_resolve_client_concurrency_mode removed; _create_resource_provider defaults to ASYNC; get_models explicitly passes SYNC. Clean and consistent.
packages/data-designer-config/src/data_designer/config/fingerprint.py	CONFIG_HASH_VERSION bumped 1→2 to invalidate all pre-existing checkpoints; normalization algorithm itself is unchanged.
packages/data-designer-engine/src/data_designer/engine/resources/resource_provider.py	client_concurrency_mode default changed from env-flag lookup to always ASYNC; flags import removed.
packages/data-designer-engine/src/data_designer/engine/column_generators/generators/custom.py	Docstrings cleaned up; get_generation_strategy wraps result in GenerationStrategy(…) for safety; return-type annotations narrowed from …
packages/data-designer-engine/tests/engine/dataset_builders/test_dataset_builder.py	Sync-only fixtures, _force_sync_engine autouse, and related tests removed; async resume tests updated to drop the flags monkeypatch wrapper.
scripts/benchmarks/benchmark_engine_v2.py	Sync vs async comparison mode removed; import path corrected from dataset_builders.artifact_storage to storage.artifact_storage; stale blob_storage=None arg dropped.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[DataDesigner.create / build] --> B[_create_resource_provider\nclient_concurrency_mode=ASYNC]
    A --> C[run_readiness_check\nclient_concurrency_mode=ASYNC]
    A --> D[DatasetBuilder.build]
    D --> E[run_readiness_check\nclient_concurrency_mode=ASYNC]
    D --> F[_build_async]
    F --> G[AsyncTaskScheduler.run]
    G --> H[RowGroupBufferManager]
    G --> I[ProcessorRunner\nrun_pre_batch_on_df / run_post_batch]

    J[DataDesigner.get_models] --> K[_create_resource_provider\nclient_concurrency_mode=SYNC]

    style F fill:#90EE90
    style G fill:#90EE90
    style D fill:#90EE90
    style J fill:#FFFFAA
    style K fill:#FFFFAA

Reviews (3): Last reviewed commit: "Merge main into remove sync engine"

johnnygreco

Nice work on this one, @andreatgretel — removing the old engine is a big simplification, and the core builder path looks much easier to reason about now.

Summary

This PR removes the legacy sync dataset-builder path and makes generation/check-model readiness use the async scheduler and async model clients by default. The implementation matches the stated intent for generation, but I found one public helper regression around direct model use and one stale public doc reference to the removed opt-out.

Findings

Warnings — Worth addressing

Design issues, missing error handling, test gaps, or violations of project standards that could cause problems later.

packages/data-designer/src/data_designer/interface/data_designer.py:690 — get_models() now returns facades whose sync methods are unusable

What: _create_resource_provider() now always passes ClientConcurrencyMode.ASYNC, and get_models() calls that helper unchanged at lines 617-619. Those facades are plain ModelFacades, not the _AsyncBridgedModelFacade used inside async custom columns, so a direct model.generate() call goes through ModelRequestExecutor.completion() to an async-mode HTTP client and hits SyncClientUnavailableError from _get_sync_client().
Why: get_models() is a public helper specifically documented for testing custom-column functions outside the full pipeline, and the docs still show result = my_generator({"name": "Alice"}, None, models) where the generator examples call models["..."].generate(...). This breaks that workflow even though the sync dataset engine removal itself is intentional.
Suggestion: Could we keep get_models() on sync-mode clients, e.g. let _create_resource_provider() accept a client_concurrency_mode override and pass ClientConcurrencyMode.SYNC from get_models()? If we want direct testing to be async-only instead, we should update the helper/docs together and add coverage for the new expected usage.
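The regression and the suggested override can be sketched in a few lines. Everything below is a simplified stand-in: the class and method names mirror those mentioned in the review, but the enum values, the `ModelFacadeSketch` body, and the echo response are assumptions, not the library's real code.

```python
from dataclasses import dataclass
from enum import Enum


class ClientConcurrencyMode(Enum):
    # Stand-in enum; the real member values are not shown in this thread.
    SYNC = "sync"
    ASYNC = "async"


class SyncClientUnavailableError(RuntimeError):
    """Stand-in for the error named in the review."""


@dataclass
class ModelFacadeSketch:
    mode: ClientConcurrencyMode

    def generate(self, prompt: str) -> str:
        # A sync call needs a sync client; an async-mode facade has none.
        if self.mode is not ClientConcurrencyMode.SYNC:
            raise SyncClientUnavailableError("no sync client in async mode")
        return f"echo: {prompt}"


class DataDesignerSketch:
    def _create_resource_provider(
        self,
        client_concurrency_mode: ClientConcurrencyMode = ClientConcurrencyMode.ASYNC,
    ) -> ModelFacadeSketch:
        # Generation keeps the async default; callers may override it.
        return ModelFacadeSketch(mode=client_concurrency_mode)

    def get_models(self) -> dict[str, ModelFacadeSketch]:
        # Direct testing helper: explicitly request sync-mode clients.
        facade = self._create_resource_provider(ClientConcurrencyMode.SYNC)
        return {"my-model": facade}


designer = DataDesignerSketch()
models = designer.get_models()
reply = models["my-model"].generate("hello")
```

The point of the override is that the default stays async for generation, while the one helper meant for direct calls opts back into sync.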

README.md:29 — README still advertises the removed sync-engine fallback

What: The root README still tells users they can set DATA_DESIGNER_ASYNC_ENGINE=0 to fall back to the legacy sync engine, but this PR deletes engine/flags.py and removes the code paths that read that environment variable.
Why: After this lands in the major release, users following the README will think a rollback path exists when it no longer does; worse, setting the env var is now silently ignored by generation.
Suggestion: Update this section to say the async engine is the only execution path, and remove the DATA_DESIGNER_ASYNC_ENGINE=0 fallback language.
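Separately from the README fix, one way to avoid the "silently ignored" failure mode would be to warn when the removed variable is still set. This is not something the PR does; `warn_if_removed_flag_set` is a hypothetical helper, and only the variable name comes from the thread.

```python
import os
import warnings
from collections.abc import Mapping
from typing import Optional

REMOVED_ENV_VAR = "DATA_DESIGNER_ASYNC_ENGINE"


def warn_if_removed_flag_set(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Emit a warning if the removed opt-out variable is present.

    Returns True when a warning was emitted, False otherwise.
    """
    environ = os.environ if environ is None else environ
    if REMOVED_ENV_VAR in environ:
        warnings.warn(
            f"{REMOVED_ENV_VAR} is no longer read; "
            "the async engine is the only execution path.",
            DeprecationWarning,
            stacklevel=2,
        )
        return True
    return False
```

Passing the environment as an argument keeps the helper easy to test without mutating `os.environ`.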

Suggestions — Take it or leave it

Style improvements, minor simplifications, or optional enhancements that would improve code quality.

scripts/benchmarks/benchmark_engine_v2.py:667 — Benchmark compare mode still toggles the removed env var

What: The benchmark script still uses DATA_DESIGNER_ASYNC_ENGINE to label subprocesses as sync vs async, and --engine sync now just runs the async engine with a different label.
Why: This can produce misleading speedup numbers by comparing async against async, especially because the script name and output still frame it as a dual-engine comparison.
Suggestion: Consider removing sync compare mode, making --engine sync fail with a clear message, or repurposing the script as an async-only benchmark.
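The "fail with a clear message" option could look roughly like the following. This is a sketch, not the benchmark script's actual argument handling: the `--engine` flag name comes from the review, while `build_parser`, `parse_engine`, and the message text are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Async-only engine benchmark (sketch)")
    # Accept any string so a stale "--engine sync" gets a targeted error below
    # instead of argparse's generic "invalid choice" message.
    parser.add_argument("--engine", default="async")
    return parser


def parse_engine(argv: list[str]) -> str:
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.engine != "async":
        # parser.error prints usage plus the message and exits with status 2.
        parser.error(
            f"--engine {args.engine} is not supported: the sync engine was removed, "
            "so only 'async' can be benchmarked."
        )
    return args.engine
```

Failing loudly here avoids the async-versus-async comparison the review warns about.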

What Looks Good

The core DatasetBuilder path is much cleaner with the sync branch removed, and the async scheduler wiring is now the obvious path through build() and build_preview().
Resume behavior kept the important crash-window safeguards: filesystem-derived row-group progress, immutable original targets, and terminal handling after process_after_generation().
The docs under architecture/ and Fern mostly track the new async-only execution model, and the focused tests around builder/resume/readiness give good confidence in the main path.

Verdict

Needs changes — I’d address the get_models() regression and README fallback reference before merge. The benchmark cleanup is optional but worth doing while this context is fresh.

This review was generated by an AI assistant.

andreatgretel · 2026-06-24T12:31:39Z

@johnnygreco thanks for the review. I pushed 2c2eb906 to address the requested changes:

Kept get_models() on sync-mode clients so direct model.generate() usage still works for custom-column testing.
Removed the README fallback text for DATA_DESIGNER_ASYNC_ENGINE=0.
Simplified the benchmark script to async-only behavior so it no longer reports misleading sync-vs-async comparisons.

Greptile is green on the latest commit and the active checks are passing.

johnnygreco

Thanks, @andreatgretel. I rechecked the latest commit (2c2eb90) and the requested changes are addressed: get_models() now uses sync clients for direct custom-column testing, the README no longer advertises the removed async-engine opt-out, and the benchmark script is async-only. I also reran the focused checks locally (ruff check/format and 162 relevant tests) and they passed.

One separate note: GitHub currently reports the branch as conflicting with the base, so that will still need resolving before merge.

The base branch was changed.

andreatgretel · 2026-06-24T23:09:34Z

@johnnygreco thanks for the review. Could you re-approve? GitHub dismissed it after the base changed.

andreatgretel requested a review from a team as a code owner June 23, 2026 20:34

chore: address bot review nits

6cbf1e5

johnnygreco requested changes Jun 23, 2026

andreatgretel added 2 commits June 23, 2026 20:19

refactor: remove sync engine

5dad024

fix: address sync removal review

2c2eb90

andreatgretel force-pushed the andreatgretel/refactor/remove-sync-engine branch from 231a656 to 2c2eb90 Compare June 23, 2026 23:24

johnnygreco previously approved these changes Jun 24, 2026

Merge main into remove sync engine

d3cc8d2

andreatgretel changed the base branch from andreatgretel/fix/remove-allow-resize to main June 24, 2026 22:51

andreatgretel wants to merge 5 commits into main from andreatgretel/refactor/remove-sync-engine
