
docs: add plan for workflow chaining#552

Open
andreatgretel wants to merge 7 commits into main from andreatgretel/docs/workflow-chaining

Conversation

andreatgretel commented Apr 15, 2026

Summary

  • Adds a design plan for workflow chaining - sequencing multiple generation stages where each stage's output seeds the next.
  • Primary use cases: explode (few seeds -> many records), filter-then-enrich, generate-then-judge, multi-turn construction.
  • Secondary benefit: enables full removal of allow_resize and simplification of sync/async engine convergence (deprecation already shipped in chore: async engine readiness - blockers and polish before default #553).

What's in the plan

  • Pipeline class in the interface layer with add_stage(), run(), between-stage callbacks. Reuses the parent DataDesigner so all stages share one ModelRegistry / ThrottleManager.
  • to_config_builder() convenience on results for lightweight notebook chaining.
  • On-disk handoffs between stages via LocalFileSeedSource. In-memory DataFrameSeedSource is reserved for the to_config_builder() notebook ergonomic and is explicitly not a Pipeline.
  • DAG-shaped internal stage model. v1 accepts linear inputs only; Phase 4 adds parallel branches as an additive API change (depends_on=[...]).
  • acreate() engine sidecar. Small additive async API on DataDesigner. Independent of chaining v1; hard dependency for Phase 4. Enables in-process parallel-independent workflows via asyncio.gather.
  • allow_resize removal following the deprecation already in main from chore: async engine readiness - blockers and polish before default #553.
  • Pre-batch processor resize lockdown (fail-fast on row-count changes).
  • Stage-level checkpointing and resume using DataDesignerConfig.fingerprint() (feat(config): add deterministic fingerprint for workflow configs #587) composed with num_records, DD version, and upstream stage fingerprint.

Phases

  1. Phase 1: Pipeline class + to_config_builder() (can ship independently).
  2. Sidecar: acreate() on DataDesigner (independent track; can land before/alongside/after Phase 1).
  3. Phase 2: Remove allow_resize (deprecation already shipped in chore: async engine readiness - blockers and polish before default #553; this phase finishes the removal).
  4. Phase 3: Stage-level resume via per-stage fingerprints.
  5. Phase 4: DAG-shaped stages with parallel branches via asyncio.gather over acreate(). Hard dependency on the sidecar.
  6. Phase 5 (future): Auto-chaining from a single config.

Resolved decisions

  • Stage handoff is always on disk inside Pipeline. In-memory mode reserved for to_config_builder().
  • DAG semantics are designed-in but v1 accepts linear inputs only. Phase 4 ships parallel branches.
  • Pipeline is constructed via dd.pipeline() and reuses the parent DataDesigner across all stages - load-bearing for throttle coordination.
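Under these decisions, the planned surface might look roughly like the toy below. Everything here is hypothetical: the class, method signatures, and callback shape are sketches of what the plan describes, with JSON files standing in for the real parquet handoff and plain functions standing in for per-stage DataDesigner.create() calls.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Optional

# Toy sketch of the planned Pipeline surface (hypothetical names/signatures):
# add_stage() / run() with an on-disk handoff between stages.
class Pipeline:
    def __init__(self, root: Path):
        self.root = root
        self.stages = []  # (name, generate, optional between-stage callback)

    def add_stage(self, name: str, generate: Callable,
                  after: Optional[Callable[[Path], Path]] = None) -> "Pipeline":
        self.stages.append((name, generate, after))
        return self

    def run(self) -> dict:
        results, seed_path = {}, None
        for name, generate, after in self.stages:
            seed = json.loads(seed_path.read_text()) if seed_path else None
            records = generate(seed)          # stands in for a stage build
            out = self.root / f"{name}.json"  # handoff always lives on disk
            out.write_text(json.dumps(records))
            seed_path = after(out) if after else out
            results[name] = records
        return results

def keep_even(path: Path) -> Path:
    # Between-stage callback: read a stage's output, write a filtered copy,
    # and return the path the next stage should seed from.
    records = json.loads(path.read_text())
    filtered = path.with_suffix(".filtered.json")
    filtered.write_text(json.dumps([r for r in records if r % 2 == 0]))
    return filtered

root = Path(tempfile.mkdtemp())
pipe = Pipeline(root)
pipe.add_stage("generate", lambda seed: list(range(6)), after=keep_even)
pipe.add_stage("enrich", lambda seed: [r * 10 for r in seed])
results = pipe.run()  # results["enrich"] == [0, 20, 40]
```

The callback receives the stage's output path and returns the path the next stage seeds from, which keeps filtering a between-stage concern rather than an in-stage resize.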

Future considerations (uncommitted)

  • External orchestration for cross-process / distributed execution. The plan's design choices (parent DataDesigner reuse, on-disk handoffs, no new engine surface) compose naturally with such a system.
  • Pipelined execution of dependent stages (streaming seed sources, edge-streaming resume) - flagged so the stage data contract isn't quietly closed off.

Open questions

Preview support, config serialization for auto-chaining, naming, image/media column forwarding, downstream seeding scope.

No code changes - plan document only.

Proposes replacing the in-place allow_resize mechanism with a Pipeline
class that chains multiple generation stages. Each stage gets a fresh
fixed-size tracker, and resize becomes a between-stage concern.
nabinchha added a commit that referenced this pull request May 5, 2026
nabinchha added a commit that referenced this pull request May 5, 2026
* feat(models): deprecate implicit default provider routing

Emit DeprecationWarning whenever the legacy "implicit default
provider" path is exercised: `ModelConfig.provider=None`, the
registry-level `ModelProviderRegistry.default`, the YAML
`default:` key in `~/.data-designer/model_providers.yaml`, and
the CLI's "Change default provider" workflow.

`resolve_model_provider_registry` skips passing `default=` in the
single-provider case so the common construction path stays quiet.
Multi-provider registries still pass `default` (per
`check_implicit_default`) and warn accordingly.

Update docs, the package README, and test fixtures to specify
`provider=` explicitly on every `ModelConfig`. New tests cover
each warning entry point and pin the post-deprecation happy paths.

Refs #589

Made-with: Cursor

* fix(models): address PR #594 review feedback

Greptile P1: ProviderRepository.load emitted its DeprecationWarning
inside a `try/except Exception` block. Under
`filterwarnings("error", DeprecationWarning)` the warn would raise,
the except would swallow it, and `load()` would silently return None
(losing the registry). Move the warn outside the catch-all so the
strict-warning path no longer drops valid configs.

Greptile P2 / johnnygreco: `_warn_on_implicit_provider` and
`_warn_on_explicit_default` use `stacklevel=2`, which lands inside
pydantic v2's validator dispatch rather than at the user's
`ModelConfig(...)` / `ModelProviderRegistry(...)` call. That broke
both attribution (the source line was unhelpful) and Python's
once-per-location dedup (every call collapsed to the same
pydantic-internal key, suppressing all but the first warning).
Introduce `data_designer.config.utils.warning_helpers.warn_at_caller`,
which walks past the helper, validator, and any pydantic frames to
find the user's call site and emits via `warnings.warn_explicit` with
the user frame's `__warningregistry__`. Keeps attribution accurate
and dedup keyed on the user's (filename, lineno).

johnnygreco: align the `provider_repository.py` warning copy with the
sibling site in `default_model_settings.py` ("specify provider=
explicitly on each ModelConfig instead") so both YAML-default warning
sites give the same migration instruction. The previous wording
pointed users at "ModelConfig entries" inside `model_providers.yaml`,
where ModelConfig entries don't actually live.

johnnygreco: dedup the cascade in `DataDesigner.__init__`. With
`model_providers=None` and a YAML `default:`, the user previously saw
two DeprecationWarnings for the same root cause —
`get_default_provider_name()` warns about the YAML key, then
`resolve_model_provider_registry(...)` re-warns from
`_warn_on_explicit_default`. Suppress the registry-level duplicate in
the YAML-fallback branch via `warnings.catch_warnings()` so users see
exactly one warning per user action.

johnnygreco: tighten `_warn_on_explicit_default` to fire only when
`default is not None`. Passing `default=None` explicitly is
semantically equivalent to omitting it (caller is opting *out* of a
registry-level default), and shouldn't trigger the deprecation
nudge.

johnnygreco: add a `model_validate({...})` regression test for
`ModelConfig` so the deserialization path (legacy on-disk configs)
is pinned alongside the construction path.

Tests:
- Update `test_load_exists` and `test_save` to omit `default=` so the
  roundtrip stops exercising the deprecated YAML-default path
  unguarded (Greptile note).
- Wrap `test_resolve_model_provider_registry_with_explicit_default`,
  `test_get_provider`, and
  `test_init_user_supplied_providers_preserve_first_wins_over_yaml_default`
  in `pytest.warns` so the suite stays green under
  `-W error::DeprecationWarning` (Greptile note).
- Add `test_explicit_default_none_does_not_emit_deprecation_warning`
  to pin the tightened predicate.
- Add `test_init_yaml_default_emits_single_deprecation_warning` to
  pin the cascade-dedup behavior.

Refs #589

Made-with: Cursor

* fix(models): make deprecation warnings visible under default filters

andreatgretel (PR #594): the YAML-default warning in
`get_default_provider_name` and the registry-default warning emitted
from inside DataDesigner helpers were attributing to data_designer
library frames, not user code. Python's default filter chain includes
`ignore::DeprecationWarning`, so library-attributed entries are
silenced — meaning a normal `DataDesigner()` call with a YAML
`default:` set showed nothing, and `resolve_model_provider_registry`
warnings were similarly invisible. Two related changes:

1. `warn_at_caller`: extend the default skip-list from `("pydantic",)`
   to `("pydantic", "pydantic_core", "data_designer")` so the walk
   escapes both pydantic's validator-dispatch frames and data_designer
   helper frames before attributing. Also tighten the prefix predicate
   to exact-or-dotted-prefix matching (`name == p or
   name.startswith(p + ".")`) so e.g. `pydantic_helpers` is not
   falsely matched as part of `pydantic` (johnnygreco nit). Allow
   callers to pass a custom `skip_prefixes` for flexibility. Drop the
   "skip frame 0+1 unconditionally" guard now that prefix matching
   covers it.

2. `get_default_provider_name`: switch from
   `warnings.warn(stacklevel=2)` to `warn_at_caller`. The previous
   stacklevel pointed into `default_model_settings.py`, which is a
   library file → silenced under default filters. Verified the fix
   empirically with `python -W default`: warning is now attributed to
   the user's call site and rendered.

johnnygreco (PR #594): add the missing
`test_explicit_default_none_does_not_emit_deprecation_warning`
regression for the `self.default is not None` predicate landed in
the prior round.

Tests:
- New `test_warning_helpers.py` pins prefix-matching precision
  (rejects `pydantic_helpers` / `data_designer_other`), default
  skip-list contents, attribution past skip-prefix frames, and
  per-call-site dedup behavior.
- `test_get_default_provider_name_warning_attributes_to_user_frame`
  pins andreatgretel's repro for the YAML-default site.
- `test_explicit_default_warning_attributes_to_user_frame` pins the
  multi-frame case: construction goes through
  `resolve_model_provider_registry`, so the walk has to escape both
  pydantic and data_designer before landing on the test file.
- `test_explicit_default_none_does_not_emit_deprecation_warning`
  pins johnnygreco's predicate-tightening regression.

3,124 tests pass (540 config + 1,923 engine + 653 interface; +10 net
from this round).

Refs #589

Made-with: Cursor

* fix(models): apply warn_at_caller to remaining deprecation sites

greptile-apps (PR #594, r3189904028): `ProviderRepository.load`'s
YAML-default `DeprecationWarning` was using `warnings.warn(stacklevel=2)`,
which attributes to whichever data_designer frame called `load()` —
controllers, services, list/reset commands, agent introspection. Every
real call path lands on `data_designer.cli.*`, which falls under
Python's default `ignore::DeprecationWarning` filter and is silenced.
Audit found two more sites with the same problem:

- `DatasetBuilder._resolve_async_compatibility` (`allow_resize` /
  issue #552) — was using `stacklevel=4` to walk past
  `_resolve_async_compatibility -> build/build_preview -> interface ->
  user`. Brittle: any added frame (decorator, async wrapping, the
  `try/except DeprecationWarning: raise` boundary) shifts attribution
  silently. The existing test passed only because it used
  `simplefilter("always") + record=True`, which records warnings
  regardless of attribution.
- `ProviderController._handle_change_default` — was using
  `stacklevel=2`, which lands on the menu dispatcher in the same
  controller module. `print_warning` already shows the message
  visually, but programmatic observers (`pytest.warns`,
  `filterwarnings("error", ...)`) saw a library-attributed entry that
  default filters silenced.

All three migrated to `warn_at_caller` (the helper from 247fa30) so
attribution lands on the user's call site regardless of internal
chain shape. `data_designer` is already in
`DEFAULT_INTERNAL_PREFIXES`, so the walk escapes the entire library
in one pass.

Added attribution regression tests at each site asserting
`warning.filename == __file__`. A future regression to
`warnings.warn(stacklevel=N)` now fails CI instead of silently
silencing the user-facing nudge:

- `test_load_with_yaml_default_attributes_warning_to_caller`
  (test_provider_repository.py)
- `test_resolve_async_compatibility` extended with the same assertion
- `test_handle_change_default_emits_deprecation_warning` rewritten
  from `pytest.warns(...)` to a `catch_warnings(record=True)` block
  that filters for the message and asserts `filename == __file__`
  (`pytest.warns` does not check attribution, so the rewrite is
  required to actually catch the regression).

3,125 tests pass (548 config + 1,923 engine + 654 interface).

Refs #589
…, fingerprint feature available

- Update allow_resize framing: now logs DeprecationWarning and falls back to sync (#553), no longer hard-rejected. Async is default as of #592.
- Reference DataDesignerConfig.fingerprint() (#587) as the per-stage hash for resume invalidation.
- Rename _validate_async_compatibility() to _resolve_async_compatibility() to match current code.
- Mark Phase 2 step 1 as done; list the concrete docs that still need updates.
…ant, on-disk handoffs, DAG-ready, acreate sidecar

- Resolve in-memory vs on-disk handoff to always-on-disk inside Pipeline; reserve in-memory for to_config_builder() notebook ergonomic.
- Add Composability section: parent DataDesigner reuse is a load-bearing API contract for throttle coordination across stages and parallel branches.
- Add Engine API surface section: acreate() as a small additive sidecar, independent of chaining v1 but a hard dependency for Phase 4.
- Promote DAG semantics from "future work" to "designed-in"; add Phase 4 (parallel branches via asyncio.gather over acreate); demote auto-chaining to Phase 5.
- New Resolved decisions section captures the three load-bearing API decisions; trim the Open questions list accordingly.
- Mention possible future external orchestration only as a vague composability constraint, no commitment.
- Soften "Door open for external orchestration" - drop throttle-backend-as-seam framing; cross-reference Future considerations.
- Make acreate() scope explicit (in-process); cross-process orchestration is not the same problem.
- Add Phase 4 scope clarifier - branch parallelism, not stage pipelining.
- New Future considerations section: external orchestration (vague, uncommitted) and pipelined execution of dependent stages.
andreatgretel marked this pull request as ready for review May 7, 2026 22:06
andreatgretel requested a review from a team as a code owner May 7, 2026 22:06
andreatgretel deployed to agentic-ci May 7, 2026 22:06 — with GitHub Actions
github-actions Bot commented May 7, 2026

Review: PR #552 — docs: add plan for workflow chaining

Summary

This PR adds a single planning document at plans/workflow-chaining/workflow-chaining.md (+410/-0). It proposes a Pipeline class in the data_designer.interface layer that sequences multiple DataDesigner.create() calls, hands off between stages via on-disk parquet, and reuses the parent DataDesigner's ModelRegistry / ThrottleManager across stages. It also outlines to_config_builder() notebook chaining, allow_resize removal, per-stage resume via fingerprinting, and a Phase 4 DAG extension for parallel branches.

No code changes.

Findings

Architectural alignment — strong

  • Layering is respected. Pipeline lives in data_designer.interface; the engine stays ignorant of pipelines; each stage is a regular DatasetBuilder.build() call. Consistent with the interface → engine → config import direction in AGENTS.md.
  • Config layer stays declarative. The plan explicitly keeps Pipeline imperative in v1 (no new config models), preserving the "declarative config, imperative engine" core principle.
  • Throttle invariant is load-bearing and correctly framed. The rationale for reusing a single DataDesigner (one ModelRegistry → one ThrottleManager → one AIMD state) is the right reason to pin this in the API contract now rather than discover it later. The same argument is extended cleanly to Phase 4 branches.
  • Internal-DAG, linear-API-v1 split. Keeping a DAG stage representation from day one while exposing only linear add_stage() avoids a disruptive restructure at Phase 4. Good forward-compat hygiene.
  • acreate() framed as a sidecar, not a pipeline dependency. Correctly identifies that sequential v1 doesn't need async, but Phase 4 does. The asyncio.wrap_future bridge over the existing singleton event loop is the minimal, correct shape.
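The sidecar shape can be sketched as follows. Only `asyncio.wrap_future` and `asyncio.gather` are taken from the plan; `create` stands in for the blocking build, and the thread pool stands in for the existing singleton background event loop.

```python
import asyncio
import concurrent.futures

# Illustrative only: `create` is a stand-in for the blocking
# DataDesigner.create(); the executor is a stand-in for the singleton loop.
_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def create(num_records: int) -> list[int]:
    return list(range(num_records))  # pretend this is an expensive build

async def acreate(num_records: int) -> list[int]:
    fut = _executor.submit(create, num_records)  # concurrent.futures.Future
    return await asyncio.wrap_future(fut)        # awaitable bridge

async def run_branches():
    # Phase 4 shape: independent branches gathered in-process.
    return await asyncio.gather(acreate(3), acreate(5))

branch_a, branch_b = asyncio.run(run_branches())
```

Note the caveat raised in item 9 below: cancelling the asyncio wrapper does not reliably cancel the underlying concurrent future, so cancellation semantics need an explicit decision.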

Completeness — mostly complete, with a few areas to tighten before implementation

  1. Between-stage callback data contract is under-specified. The signature is (stage_output_path: Path) -> Path, but what the returned path must contain (a parquet-files/ subdirectory? a flat data.parquet? any LocalFileSeedSource-compatible layout?) is left implicit. The two examples are inconsistent: the pipeline reads stage_output_path / "parquet-files" but the callbacks write a flat data.parquet under a new directory. Before implementation, nail down exactly what the callback must produce so LocalFileSeedSource can consume it uniformly.

  2. Callback fingerprinting for resume is acknowledged but unresolved. The Resume Safety section lists callbacks as something that "may have changed between runs," but the fingerprint composition (DataDesignerConfig.fingerprint() + num_records + DD version + upstream stage fingerprint) does not capture the callback. A user who edits their filter predicate and reruns with resume=True will silently get stale filtered data. Options worth noting in the plan: hash the callback source, require a user-supplied version string, or document that callbacks invalidate resume. Better to decide now than post-ship.

  3. Stage fingerprint composition is missing sampling/selection. Phase 3 composes fingerprints from config + num_records + DD version + upstream. But add_stage() also takes sampling_strategy and selection_strategy, which change the data consumed by the stage. These should either be folded into the stage fingerprint or explicitly declared non-fingerprinted. Same question applies to allow_empty.

  4. allow_empty=True short-circuit: result-dict semantics unspecified. If stage 2 of 4 empties out, does results["stage_3"] raise KeyError, return None, or return a DatasetCreationResults with an empty dataset? The user-facing behavior of pipeline.run() in this branch isn't spelled out.

  5. Image/media column forwarding is flagged as an open question but unprioritized. Option (c) "document as unsupported in v1" may be fine, but users with image-producing stages will hit this immediately. Worth an explicit call: ship v1 without media forwarding and document the limitation, or require it in v1.

  6. Docs-audit scope may be incomplete. Phase 2 lists three docs files that reference allow_resize (custom_columns.md, plugins/example.md, agent-rollout-ingestion.md). Tutorials, example notebooks, and the skills/data-designer/ skill instructions aren't mentioned. A grep -r allow_resize across docs/, tutorials/ (if present), skills/, and packages/ before Phase 2 would be cheap insurance.

  7. ArtifactStorage bypass. The pipeline "owns its directory layout directly, bypassing ArtifactStorage's default auto-rename behavior." This is a reasonable choice but implies a new path through ArtifactStorage (or around it). Before implementation, confirm whether this requires an additive ArtifactStorage API (e.g., disable_auto_rename=True) or a fully separate layout manager, and whether the latter risks drift from single-run storage conventions.

  8. Migration example for allow_resize users is light. Phase 2 says "Migration path: users with allow_resize=True columns split their config into a pipeline with a stage boundary at the resize column." A concrete before/after example in the plan (or a tracked doc deliverable) would reduce user friction. Currently only listed under "Tests: verify rejection, migration path examples."

  9. acreate() cancellation semantics. asyncio.wrap_future over concurrent.futures.Future leaves cancellation propagation partial (cancellation of the asyncio wrapper doesn't reliably cancel the underlying future). Given the singleton background event loop, this is probably acceptable, but the plan should note whether cancellation is supported or deliberately not, so expectations don't drift during implementation.
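On item 2, one cheap option can be sketched: fingerprint the callback by hashing its compiled bytecode and constants, so editing the filter predicate invalidates resume. This is illustrative only and nothing here is decided in the plan; hashing source text via `inspect.getsource` is stricter, and a user-supplied version string is simpler still.

```python
import hashlib

# Hypothetical option for folding a between-stage callback into the stage
# fingerprint: hash the function's bytecode plus its constants (the constants
# matter so that changing 0.8 -> 0.9 changes the fingerprint).
def callback_fingerprint(fn) -> str:
    code = fn.__code__
    payload = code.co_code + repr(code.co_consts).encode()
    return hashlib.sha256(payload).hexdigest()

def keep_high_scores(rows):
    return [r for r in rows if r["score"] >= 0.8]

def keep_stricter(rows):
    return [r for r in rows if r["score"] >= 0.9]
```

A bytecode hash is deliberately loose (renamed variables may not change it); the choice among bytecode hash, source hash, and explicit version string is exactly the decision the item asks the plan to make.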

Feasibility — feasible as structured

  • Phase 1 is a thin orchestration layer over existing components (DataDesigner.create(), LocalFileSeedSource, ArtifactStorage). No deep engine surgery required.
  • The sidecar acreate() is a ~one-file addition; the heavy lifting (singleton loop, concurrent future) already exists in _build_async.
  • Phase 2's engine simplifications (_cell_resize_mode, _finalize_fan_out resize branch, _resolve_async_compatibility) all unblock once Phase 1 gives users a migration path. Order is sound.
  • Phase 3 depends on Phase 1 metadata format. The plan calls out "the metadata format in phase 1 should record enough information to support it" — good. Worth making the exact metadata schema a deliverable of Phase 1, not a Phase 3 discovery.
  • Phase 4 additively extends add_stage(depends_on=...) without changing existing call sites. Clean.

Minor observations

  • dd.pipeline() factory is named consistently with dd.create(); good.
  • The to_config_builder() scoping (explicitly not a Pipeline, in-memory only, notebook ergonomic) is well-drawn. The plan is unambiguous that on-disk is the production path.
  • "Resolved decisions" section at the end reiterates choices already made in-body; this is slightly redundant but useful for reviewers skimming the plan as a decision log.
  • Naming question ("Pipeline vs Chain vs WorkflowChain") probably worth resolving before Phase 1 ships, since it's user-visible on dd.pipeline().

Verdict

Strong plan. Architecturally aligned with the project's layering and invariants, feasibility-grounded in existing primitives, and thoughtful about forward compatibility (DAG-internal-v1, acreate-as-sidecar). The one place the plan would benefit from a tightening pass before implementation is the stage data contract and fingerprint scope: callback output layout, callback fingerprinting for resume, and inclusion of sampling_strategy/selection_strategy in stage fingerprints. These are small clarifications, not redesigns. Media forwarding and migration-example concreteness are secondary polish items.

Recommend approving the plan with a note to resolve items 1–3 above before starting Phase 1 implementation.

greptile-apps Bot commented May 7, 2026

Greptile Summary

This PR adds a design plan for workflow chaining — a Pipeline class that sequences multiple DataDesigner generation stages, passing each stage's output on disk as the next stage's seed. It also covers to_config_builder() for lightweight notebook chaining, allow_resize removal, per-stage resume via fingerprinting, and a Phase 4 DAG extension for parallel branches.

  • Pipeline class (Phase 1): dd.pipeline() factory creates a Pipeline that reuses the parent DataDesigner's ModelRegistry/ThrottleManager across stages, with on-disk handoffs via LocalFileSeedSource and optional between-stage callbacks for filtering.
  • Resume (Phase 3): Per-stage fingerprints composed from DataDesignerConfig.fingerprint(), num_records, DD version, and upstream stage fingerprint; the plan has a gap around how a fresh dd.pipeline() invocation locates a prior run's pipeline-metadata.json to actually skip completed stages.
  • DAG / Phase 4: Parallel branches via asyncio.gather over acreate(); parent fingerprint ordering for multi-parent stages is unspecified and could cause spurious cache misses.

Confidence Score: 4/5

Safe to merge as a documentation-only plan; no production code is changed.

The resume design has a concrete gap: a fresh dd.pipeline() with no identifier cannot locate a prior run's pipeline-metadata.json, so the flagship resume use case (generate-then-judge) would silently re-run all stages instead of skipping the completed one. This needs to be resolved before Phase 3 implementation begins, but since this is a plan document it does not block merging.

plans/workflow-chaining/workflow-chaining.md — the resume API section and Phase 4 DAG fingerprint ordering both need clarification before implementation starts.

Important Files Changed

plans/workflow-chaining/workflow-chaining.md: Design plan for workflow chaining (Pipeline class, to_config_builder(), allow_resize removal, resume, DAG phases). Well-structured but has a gap: the resume use case assumes a fresh dd.pipeline() can locate prior run artifacts, while the plan never specifies how pipeline-level artifact directories are named or identified across invocations.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["dd.pipeline()\n(holds parent DataDesigner)"] --> B["add_stage('stage-0', config, num_records)"]
    B --> C["add_stage('stage-1', config, num_records, after=callback?)"]
    C --> D["pipeline.run()"]
    D --> E["DataDesigner.create() Stage 0"]
    E --> F["Write parquet output\nartifacts/pipeline-name/stage-0-*/"]
    F --> G{"Between-stage callback?"}
    G -- yes --> H["callback(stage_output_path) returns filtered path"]
    G -- no --> I["LocalFileSeedSource points to stage-0 output"]
    H --> I
    I --> J["DataDesigner.create() Stage 1"]
    J --> K["Write parquet output\nartifacts/pipeline-name/stage-1-*/"]
    K --> L["PipelineResults dict"]

Last reviewed commit: "docs: align plan framing with cross-proc..."

Comment on lines +300 to +303
pipeline_v2 = dd.pipeline()
pipeline_v2.add_stage("generated", config_gen, num_records=1000)
pipeline_v2.add_stage("judged", config_judge_v2)
results_v2 = pipeline_v2.run(resume=True) # skips stage 1

P1 Resume identity gap: how does pipeline_v2 find the prior run?

The use case creates pipeline_v2 = dd.pipeline() as a fresh object (no arguments) and calls pipeline_v2.run(resume=True). For that to skip stage 1, the pipeline must locate pipeline-metadata.json from the previous run. The artifact layout section shows artifacts/pipeline-name/ as the root, but the dd.pipeline() factory is never shown accepting a name or artifact path, and the plan doesn't specify how the pipeline-level directory is named.

If that name defaults to anything non-deterministic (timestamp, UUID, or a hash of the DataDesigner instance), pipeline_v2 would silently re-run all stages rather than skipping the completed one — the exact footgun the resume logic is meant to prevent. Either require an explicit name (dd.pipeline(name="gen-judge")) that users can repeat across invocations, or derive a deterministic pipeline fingerprint from the combination of stage configs; the choice should be made explicit in the design before Phase 1 ships, because it determines the pipeline-metadata.json path structure from day one.
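The explicit-name option can be sketched as below. The API and layout are hypothetical, mirroring the plan's artifacts/pipeline-name/pipeline-metadata.json; the point is only that a user-repeatable name makes the metadata path deterministic across invocations.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sketch: a user-supplied pipeline name keys the artifact
# directory, so a fresh dd.pipeline(name=...) in a later run can locate the
# prior run's metadata deterministically.
def metadata_path(artifacts_root: Path, name: str) -> Path:
    return artifacts_root / name / "pipeline-metadata.json"

def load_completed_stages(artifacts_root: Path, name: str) -> dict:
    path = metadata_path(artifacts_root, name)
    return json.loads(path.read_text()) if path.exists() else {}

root = Path(tempfile.mkdtemp())
# A first run records what finished; a later run with the same name sees it.
meta = metadata_path(root, "gen-judge")
meta.parent.mkdir(parents=True)
meta.write_text(json.dumps({"generated": "fp-abc"}))
```

A timestamped or UUID-suffixed default directory would make `load_completed_stages` return empty on every fresh run, which is exactly the silent re-run footgun the comment describes.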



- Extend `add_stage()` with an optional `depends_on=[stage_name, ...]` argument; default keeps the linear behavior.
- `pipeline.run()` walks the resulting DAG, gathering independent branches via `asyncio.gather` over `dd.acreate()` calls.
- Per-stage fingerprint composition (Phase 3) generalizes naturally: a stage's upstream fingerprint becomes the hash of all its parents' fingerprints.

P2 DAG fingerprint hash ordering is underspecified

Phase 4 states that a multi-parent stage's upstream fingerprint is "the hash of all its parents' fingerprints." If the implementation concatenates parent fingerprints in the order they appear in depends_on, then depends_on=["A","B"] and depends_on=["B","A"] produce different fingerprints for the same logical dependency set. The resume logic would then treat re-ordering the depends_on list as a reason to invalidate and re-run that stage. The plan should specify that parent fingerprints are sorted (e.g., lexicographically by stage name) before hashing so the fingerprint is stable regardless of declaration order.
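The suggested fix is small: sort parent fingerprints before hashing so declaration order cannot perturb the result. A sketch under assumed hex-string fingerprints (sorted here by fingerprint value; sorting by stage name works equally):

```python
import hashlib

# Illustrative sketch of the suggested fix: sorted parents make
# depends_on=["A", "B"] and depends_on=["B", "A"] hash identically.
def upstream_fingerprint(parent_fps: list[str]) -> str:
    h = hashlib.sha256()
    for fp in sorted(parent_fps):  # order-independence comes from the sort
        h.update(fp.encode())
        h.update(b"\x00")  # separator between parent fingerprints
    return h.hexdigest()
```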

