docs: add plan for workflow chaining #552
Conversation
Proposes replacing the in-place allow_resize mechanism with a Pipeline class that chains multiple generation stages. Each stage gets a fresh fixed-size tracker, and resize becomes a between-stage concern.
…ntrols, edge cases
* feat(models): deprecate implicit default provider routing

  Emit DeprecationWarning whenever the legacy "implicit default provider" path is exercised: `ModelConfig.provider=None`, the registry-level `ModelProviderRegistry.default`, the YAML `default:` key in `~/.data-designer/model_providers.yaml`, and the CLI's "Change default provider" workflow.

  `resolve_model_provider_registry` skips passing `default=` in the single-provider case so the common construction path stays quiet. Multi-provider registries still pass `default` (per `check_implicit_default`) and warn accordingly.

  Update docs, the package README, and test fixtures to specify `provider=` explicitly on every `ModelConfig`. New tests cover each warning entry point and pin the post-deprecation happy paths.

  Refs #589
  Made-with: Cursor

* fix(models): address PR #594 review feedback

  Greptile P1: `ProviderRepository.load` emitted its DeprecationWarning inside a `try/except Exception` block. Under `filterwarnings("error", DeprecationWarning)` the warn would raise, the except would swallow it, and `load()` would silently return None (losing the registry). Move the warn outside the catch-all so the strict-warning path no longer drops valid configs.

  Greptile P2 / johnnygreco: `_warn_on_implicit_provider` and `_warn_on_explicit_default` use `stacklevel=2`, which lands inside pydantic v2's validator dispatch rather than at the user's `ModelConfig(...)` / `ModelProviderRegistry(...)` call. That broke both attribution (the source line was unhelpful) and Python's once-per-location dedup (every call collapsed to the same pydantic-internal key, suppressing all but the first warning). Introduce `data_designer.config.utils.warning_helpers.warn_at_caller`, which walks past the helper, validator, and any pydantic frames to find the user's call site and emits via `warnings.warn_explicit` with the user frame's `__warningregistry__`. Keeps attribution accurate and dedup keyed on the user's (filename, lineno).
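The frame-walking mechanics described above can be sketched roughly as follows. This is a simplified illustration, not the actual `warning_helpers` implementation; only the prefix tuple and the exact-or-dotted-prefix predicate are taken from the commit description.

```python
import sys
import warnings

DEFAULT_INTERNAL_PREFIXES = ("pydantic", "pydantic_core", "data_designer")


def _is_internal(module_name: str, prefixes: tuple) -> bool:
    # Exact-or-dotted-prefix match, so e.g. "pydantic_helpers" is NOT
    # mistaken for a "pydantic" frame.
    return any(module_name == p or module_name.startswith(p + ".") for p in prefixes)


def warn_at_caller(message: str, category=DeprecationWarning,
                   skip_prefixes=DEFAULT_INTERNAL_PREFIXES) -> None:
    # Walk outward from our caller until we leave all internal modules.
    frame = sys._getframe(1)
    while frame is not None and _is_internal(frame.f_globals.get("__name__", ""), skip_prefixes):
        frame = frame.f_back
    if frame is None:
        # Everything on the stack was internal; fall back to a plain warn.
        warnings.warn(message, category, stacklevel=2)
        return
    # Attribute to the user's frame and key dedup on its (filename, lineno).
    warnings.warn_explicit(
        message,
        category,
        filename=frame.f_code.co_filename,
        lineno=frame.f_lineno,
        module=frame.f_globals.get("__name__", "<unknown>"),
        registry=frame.f_globals.setdefault("__warningregistry__", {}),
    )
```

Because attribution lands on a user frame rather than a library frame, the warning survives Python's default `ignore::DeprecationWarning` filter for library code.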
  johnnygreco: align the `provider_repository.py` warning copy with the sibling site in `default_model_settings.py` ("specify provider= explicitly on each ModelConfig instead") so both YAML-default warning sites give the same migration instruction. The previous wording pointed users at "ModelConfig entries" inside `model_providers.yaml`, where ModelConfig entries don't actually live.

  johnnygreco: dedup the cascade in `DataDesigner.__init__`. With `model_providers=None` and a YAML `default:`, the user previously saw two DeprecationWarnings for the same root cause — `get_default_provider_name()` warns about the YAML key, then `resolve_model_provider_registry(...)` re-warns from `_warn_on_explicit_default`. Suppress the registry-level duplicate in the YAML-fallback branch via `warnings.catch_warnings()` so users see exactly one warning per user action.

  johnnygreco: tighten `_warn_on_explicit_default` to fire only when `default is not None`. Passing `default=None` explicitly is semantically equivalent to omitting it (the caller is opting *out* of a registry-level default), and shouldn't trigger the deprecation nudge.

  johnnygreco: add a `model_validate({...})` regression test for `ModelConfig` so the deserialization path (legacy on-disk configs) is pinned alongside the construction path.

  Tests:
  - Update `test_load_exists` and `test_save` to omit `default=` so the roundtrip stops exercising the deprecated YAML-default path unguarded (Greptile note).
  - Wrap `test_resolve_model_provider_registry_with_explicit_default`, `test_get_provider`, and `test_init_user_supplied_providers_preserve_first_wins_over_yaml_default` in `pytest.warns` so the suite stays green under `-W error::DeprecationWarning` (Greptile note).
  - Add `test_explicit_default_none_does_not_emit_deprecation_warning` to pin the tightened predicate.
  - Add `test_init_yaml_default_emits_single_deprecation_warning` to pin the cascade-dedup behavior.
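The cascade-dedup pattern described above can be illustrated with stand-in functions. All names here are hypothetical; the point is only the `warnings.catch_warnings()` suppression around the second warning site.

```python
import warnings


def warn_yaml_default():
    # Stand-in for get_default_provider_name()'s YAML-default warning.
    warnings.warn("YAML default: key is deprecated", DeprecationWarning, stacklevel=2)


def resolve_registry():
    # Stand-in for resolve_model_provider_registry, which re-warns internally
    # about the same root cause.
    warnings.warn("registry-level default is deprecated", DeprecationWarning, stacklevel=2)
    return "registry"


def init_with_yaml_fallback():
    warn_yaml_default()  # the ONE warning the user should see
    with warnings.catch_warnings():
        # Suppress the duplicate emitted from the registry path.
        warnings.simplefilter("ignore", DeprecationWarning)
        return resolve_registry()
```

The inner `catch_warnings()` restores the previous filters on exit, so suppression is scoped to the single duplicate-emitting call.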
  Refs #589
  Made-with: Cursor

* fix(models): make deprecation warnings visible under default filters

  andreatgretel (PR #594): the YAML-default warning in `get_default_provider_name` and the registry-default warning emitted from inside DataDesigner helpers were attributing to data_designer library frames, not user code. Python's default filter chain includes `ignore::DeprecationWarning`, so library-attributed entries are silenced — meaning a normal `DataDesigner()` call with a YAML `default:` set showed nothing, and `resolve_model_provider_registry` warnings were similarly invisible.

  Two related changes:
  1. `warn_at_caller`: extend the default skip-list from `("pydantic",)` to `("pydantic", "pydantic_core", "data_designer")` so the walk escapes both pydantic's validator-dispatch frames and data_designer helper frames before attributing. Also tighten the prefix predicate to exact-or-dotted-prefix matching (`name == p or name.startswith(p + ".")`) so e.g. `pydantic_helpers` is not falsely matched as part of `pydantic` (johnnygreco nit). Allow callers to pass a custom `skip_prefixes` for flexibility. Drop the "skip frame 0+1 unconditionally" guard now that prefix matching covers it.
  2. `get_default_provider_name`: switch from `warnings.warn(stacklevel=2)` to `warn_at_caller`. The previous stacklevel pointed into `default_model_settings.py`, which is a library file → silenced under default filters. Verified the fix empirically with `python -W default`: the warning is now attributed to the user's call site and rendered.

  johnnygreco (PR #594): add the missing `test_explicit_default_none_does_not_emit_deprecation_warning` regression for the `self.default is not None` predicate landed in the prior round.

  Tests:
  - New `test_warning_helpers.py` pins prefix-matching precision (rejects `pydantic_helpers` / `data_designer_other`), default skip-list contents, attribution past skip-prefix frames, and per-call-site dedup behavior.
  - `test_get_default_provider_name_warning_attributes_to_user_frame` pins andreatgretel's repro for the YAML-default site.
  - `test_explicit_default_warning_attributes_to_user_frame` pins the multi-frame case: construction goes through `resolve_model_provider_registry`, so the walk has to escape both pydantic and data_designer before landing on the test file.
  - `test_explicit_default_none_does_not_emit_deprecation_warning` pins johnnygreco's predicate-tightening regression.

  3,124 tests pass (540 config + 1,923 engine + 653 interface; +10 net from this round).

  Refs #589
  Made-with: Cursor

* fix(models): apply warn_at_caller to remaining deprecation sites

  greptile-apps (PR #594, r3189904028): `ProviderRepository.load`'s YAML-default `DeprecationWarning` was using `warnings.warn(stacklevel=2)`, which attributes to whichever data_designer frame called `load()` — controllers, services, list/reset commands, agent introspection. Every real call path lands on `data_designer.cli.*`, which falls under Python's default `ignore::DeprecationWarning` filter and is silenced.

  The audit found two more sites with the same problem:
  - `DatasetBuilder._resolve_async_compatibility` (`allow_resize` / issue #552) — was using `stacklevel=4` to walk past `_resolve_async_compatibility -> build/build_preview -> interface -> user`. Brittle: any added frame (decorator, async wrapping, the `try/except DeprecationWarning: raise` boundary) shifts attribution silently. The existing test passed only because it used `simplefilter("always") + record=True`, which records warnings regardless of attribution.
  - `ProviderController._handle_change_default` — was using `stacklevel=2`, which lands on the menu dispatcher in the same controller module. `print_warning` already shows the message visually, but programmatic observers (`pytest.warns`, `filterwarnings("error", ...)`) saw a library-attributed entry that default filters silenced.

  All three sites migrated to `warn_at_caller` (the helper from 247fa30) so attribution lands on the user's call site regardless of internal chain shape. `data_designer` is already in `DEFAULT_INTERNAL_PREFIXES`, so the walk escapes the entire library in one pass.

  Added attribution regression tests at each site asserting `warning.filename == __file__`. A future regression to `warnings.warn(stacklevel=N)` now fails CI instead of silently silencing the user-facing nudge:
  - `test_load_with_yaml_default_attributes_warning_to_caller` (test_provider_repository.py)
  - `test_resolve_async_compatibility` extended with the same assertion
  - `test_handle_change_default_emits_deprecation_warning` rewritten from `pytest.warns(...)` to a `catch_warnings(record=True)` block that filters for the message and asserts `filename == __file__` (`pytest.warns` does not check attribution, so the rewrite is required to actually catch the regression).

  3,125 tests pass (548 config + 1,923 engine + 654 interface).

  Refs #589
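The rewritten attribution assertion looks roughly like this. `deprecated_api` is a hypothetical stand-in for the library call under test; the shape of the `catch_warnings(record=True)` block and the `filename == __file__` check follow the commit description.

```python
import warnings


def deprecated_api():
    # Stand-in for a library call; stacklevel=2 attributes to the caller.
    warnings.warn("deprecated_api is deprecated", DeprecationWarning, stacklevel=2)


def test_warning_attributed_to_caller_file():
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        deprecated_api()
    matching = [w for w in caught if "deprecated" in str(w.message)]
    assert matching, "expected the deprecation warning to be recorded"
    # pytest.warns(...) would stop at the line above; asserting on .filename
    # is what pins attribution and catches a stacklevel regression.
    assert matching[0].filename == __file__
```

A `pytest.warns(DeprecationWarning)` context passes as long as any matching warning fires, regardless of which file it is attributed to, which is why the explicit filename assertion is needed.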
…, fingerprint feature available

- Update `allow_resize` framing: now logs DeprecationWarning and falls back to sync (#553), no longer hard-rejected. Async is the default as of #592.
- Reference `DataDesignerConfig.fingerprint()` (#587) as the per-stage hash for resume invalidation.
- Rename `_validate_async_compatibility()` to `_resolve_async_compatibility()` to match current code.
- Mark Phase 2 step 1 as done; list the concrete docs that still need updates.
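The per-stage hash composition mentioned above can be sketched as a small helper. This is a hypothetical illustration, not the scheme from #587 or the plan; the inputs (config fingerprint, `num_records`, DD version, upstream fingerprint) are the ones the plan names.

```python
import hashlib
from typing import Optional


def stage_fingerprint(config_fingerprint: str, num_records: int,
                      dd_version: str, upstream: Optional[str]) -> str:
    # Hypothetical composition: DataDesignerConfig.fingerprint() (#587)
    # combined with num_records, the data_designer version, and the
    # upstream stage's fingerprint, per the plan's resume-invalidation rule.
    parts = [config_fingerprint, str(num_records), dd_version, upstream or ""]
    # Join with a separator that cannot appear in the fields to avoid
    # ambiguous concatenations like ("ab", "c") vs ("a", "bc").
    return hashlib.sha256("\x1f".join(parts).encode()).hexdigest()
```

Any change to the config, record count, version, or an upstream stage then yields a different fingerprint, which is exactly what resume invalidation needs.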
…ant, on-disk handoffs, DAG-ready, acreate sidecar

- Resolve in-memory vs on-disk handoff to always-on-disk inside `Pipeline`; reserve in-memory for the `to_config_builder()` notebook ergonomic.
- Add Composability section: parent `DataDesigner` reuse is a load-bearing API contract for throttle coordination across stages and parallel branches.
- Add Engine API surface section: `acreate()` as a small additive sidecar, independent of chaining v1 but a hard dependency for Phase 4.
- Promote DAG semantics from "future work" to "designed-in"; add Phase 4 (parallel branches via `asyncio.gather` over `acreate`); demote auto-chaining to Phase 5.
- New Resolved decisions section captures the three load-bearing API decisions; trim the Open questions list accordingly.
- Mention possible future external orchestration only as a vague composability constraint, with no commitment.
- Soften "Door open for external orchestration": drop the throttle-backend-as-seam framing; cross-reference Future considerations.
- Make `acreate()` scope explicit (in-process); cross-process orchestration is not the same problem.
- Add a Phase 4 scope clarifier: branch parallelism, not stage pipelining.
- New Future considerations section: external orchestration (vague, uncommitted) and pipelined execution of dependent stages.
Review: PR #552 — docs: add plan for workflow chaining

Summary

This PR adds a single planning document at
No code changes.

Findings

Architectural alignment — strong
Completeness — mostly complete, with a few areas to tighten before implementation
Feasibility — feasible as structured
Minor observations
Verdict

Strong plan. Architecturally aligned with the project's layering and invariants, feasibility-grounded in existing primitives, and thoughtful about forward compatibility (DAG-internal-v1, acreate-as-sidecar). The one place the plan would benefit from a tightening pass before implementation is the stage data contract and fingerprint scope: callback output layout, callback fingerprinting for resume, and inclusion of

Recommend approving the plan with a note to resolve items 1–3 above before starting Phase 1 implementation.
Greptile Summary

This PR adds a design plan for workflow chaining — a
| Filename | Overview |
|---|---|
| plans/workflow-chaining/workflow-chaining.md | Design plan for workflow chaining (Pipeline class, to_config_builder(), allow_resize removal, resume, DAG phases). Well-structured but has a gap: the resume use case assumes a fresh dd.pipeline() can locate prior run artifacts, while the plan never specifies how pipeline-level artifact directories are named or identified across invocations. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["dd.pipeline()\n(holds parent DataDesigner)"] --> B["add_stage('stage-0', config, num_records)"]
B --> C["add_stage('stage-1', config, num_records, after=callback?)"]
C --> D["pipeline.run()"]
D --> E["DataDesigner.create() Stage 0"]
E --> F["Write parquet output\nartifacts/pipeline-name/stage-0-*/"]
F --> G{"Between-stage callback?"}
G -- yes --> H["callback(stage_output_path) returns filtered path"]
G -- no --> I["LocalFileSeedSource points to stage-0 output"]
H --> I
I --> J["DataDesigner.create() Stage 1"]
J --> K["Write parquet output\nartifacts/pipeline-name/stage-1-*/"]
K --> L["PipelineResults dict"]
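The flow in the chart can be approximated by a toy skeleton. This is an illustration only, under the assumption that the planned (unimplemented) `Pipeline` runs stages linearly, hands data off on disk, and lets a between-stage callback rewrite the handoff path; the real implementation would call `DataDesigner.create()` and write parquet.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Stage:
    name: str
    config: object
    num_records: Optional[int] = None
    after: Optional[Callable[[str], str]] = None  # between-stage callback


class Pipeline:
    """Toy model of the planned Pipeline: linear stages, on-disk handoff
    represented here as a path string."""

    def __init__(self, name: str = "pipeline-name"):
        self.name = name
        self.stages: list = []

    def add_stage(self, name, config, num_records=None, after=None):
        self.stages.append(Stage(name, config, num_records, after))
        return self

    def run(self):
        results = {}
        seed_path = None  # becomes a LocalFileSeedSource in the real design
        for stage in self.stages:
            # Stand-in for DataDesigner.create(); real stages write parquet here.
            out_path = f"artifacts/{self.name}/{stage.name}-output"
            results[stage.name] = {"output": out_path, "seeded_from": seed_path}
            # Callback (if any) filters this stage's output before the next
            # stage consumes it; otherwise the raw output is the seed.
            seed_path = stage.after(out_path) if stage.after else out_path
        return results
```

For example, attaching `after=lambda path: path + "-filtered"` to stage 0 makes stage 1 seed from the filtered path rather than the raw output.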
### Issue 1 of 2
plans/workflow-chaining/workflow-chaining.md:300-303
**Resume identity gap: how does `pipeline_v2` find the prior run?**
The use case creates `pipeline_v2 = dd.pipeline()` as a fresh object (no arguments) and calls `pipeline_v2.run(resume=True)`. For that to skip stage 1, the pipeline must locate `pipeline-metadata.json` from the previous run. The artifact layout section shows `artifacts/pipeline-name/` as the root, but the `dd.pipeline()` factory is never shown accepting a name or artifact path, and the plan doesn't specify how the pipeline-level directory is named.
If that name defaults to anything non-deterministic (timestamp, UUID, or a hash of the `DataDesigner` instance), `pipeline_v2` would silently re-run all stages rather than skipping the completed one — the exact footgun the resume logic is meant to prevent. Either require an explicit name (`dd.pipeline(name="gen-judge")`) that users can repeat across invocations, or derive a deterministic pipeline fingerprint from the combination of stage configs; the choice should be made explicit in the design before Phase 1 ships, because it determines the `pipeline-metadata.json` path structure from day one.
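The reviewer's second option — a deterministic fingerprint derived from the stage configs — could look roughly like this. The helper name and spec shape are hypothetical; only the determinism requirement comes from the review.

```python
import hashlib
import json


def pipeline_dirname(stage_specs: list) -> str:
    # Hypothetical helper: derive a deterministic directory name from the
    # ordered stage specs, so a rebuilt pipeline with the same stages finds
    # the same artifacts/<name>/ root (and pipeline-metadata.json) on a
    # later invocation.
    canonical = json.dumps(stage_specs, sort_keys=True)
    return "pipeline-" + hashlib.sha256(canonical.encode()).hexdigest()[:12]
```

Under this scheme a fresh `dd.pipeline()` with the same stages resolves to the same directory, so `run(resume=True)` can find the prior run's metadata without an explicit name.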
### Issue 2 of 2
plans/workflow-chaining/workflow-chaining.md:363
**DAG fingerprint hash ordering is underspecified**
Phase 4 states that a multi-parent stage's upstream fingerprint is "the hash of all its parents' fingerprints." If the implementation concatenates parent fingerprints in the order they appear in `depends_on`, then `depends_on=["A","B"]` and `depends_on=["B","A"]` produce different fingerprints for the same logical dependency set. The resume logic would then treat re-ordering the `depends_on` list as a reason to invalidate and re-run that stage. The plan should specify that parent fingerprints are sorted (e.g., lexicographically by stage name) before hashing so the fingerprint is stable regardless of declaration order.
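The sorted-before-hashing fix the review asks for is small; a sketch (hypothetical helper name):

```python
import hashlib


def upstream_fingerprint(parent_fingerprints: dict) -> str:
    # Sort parents by stage name before hashing so depends_on=["A", "B"]
    # and depends_on=["B", "A"] yield the same upstream fingerprint for
    # the same logical dependency set.
    joined = "\x1f".join(
        f"{name}={fp}" for name, fp in sorted(parent_fingerprints.items())
    )
    return hashlib.sha256(joined.encode()).hexdigest()
```

With this, reordering the `depends_on` list no longer invalidates a completed stage on resume.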
Reviews (1): Last reviewed commit: "docs: align plan framing with cross-proc..."
pipeline_v2 = dd.pipeline()
pipeline_v2.add_stage("generated", config_gen, num_records=1000)
pipeline_v2.add_stage("judged", config_judge_v2)
results_v2 = pipeline_v2.run(resume=True)  # skips stage 1
- Extend `add_stage()` with an optional `depends_on=[stage_name, ...]` argument; default keeps the linear behavior.
- `pipeline.run()` walks the resulting DAG, gathering independent branches via `asyncio.gather` over `dd.acreate()` calls.
- Per-stage fingerprint composition (Phase 3) generalizes naturally: a stage's upstream fingerprint becomes the hash of all its parents' fingerprints.
Summary
`allow_resize` and simplification of sync/async engine convergence (deprecation already shipped in chore: async engine readiness - blockers and polish before default #553).

What's in the plan
- `add_stage()`, `run()`, between-stage callbacks. Reuses the parent `DataDesigner` so all stages share one `ModelRegistry`/`ThrottleManager`.
- `to_config_builder()` convenience on results for lightweight notebook chaining.
- On-disk handoffs via `LocalFileSeedSource`. In-memory `DataFrameSeedSource` is reserved for the `to_config_builder()` notebook ergonomic and is explicitly not a Pipeline.
- DAG-ready stage dependencies (`depends_on=[...]`).
- `acreate()` engine sidecar. Small additive async API on `DataDesigner`. Independent of chaining v1; hard dependency for Phase 4. Enables in-process parallel-independent workflows via `asyncio.gather`.
- `allow_resize` removal following the deprecation already in `main` from chore: async engine readiness - blockers and polish before default #553.
- `DataDesignerConfig.fingerprint()` (feat(config): add deterministic fingerprint for workflow configs #587) composed with `num_records`, DD version, and upstream stage fingerprint.

Phases
- `to_config_builder()` (can ship independently).
- `acreate()` on `DataDesigner` (independent track; can land before/alongside/after Phase 1).
- `allow_resize` removal (deprecation already shipped in chore: async engine readiness - blockers and polish before default #553; this phase finishes the removal).
- Parallel branches via `asyncio.gather` over `acreate()`. Hard dependency on the sidecar.

Resolved decisions
- On-disk handoffs inside `Pipeline`. In-memory mode reserved for `to_config_builder()`.
- `Pipeline` is constructed via `dd.pipeline()` and reuses the parent `DataDesigner` across all stages - load-bearing for throttle coordination.

Future considerations (uncommitted)
- External orchestration: the chaining v1 decisions (`DataDesigner` reuse, on-disk handoffs, no new engine surface) compose naturally with such a system.
Open questions

Preview support, config serialization for auto-chaining, naming, image/media column forwarding, downstream seeding scope.
No code changes - plan document only.