Add two deeper concept pages: Schema as a Workflow Specification + Comparison to Workflow Languages #185

Merged: dimitri-yatsenko merged 2 commits into main from add/schema-as-spec-and-comparison on Jun 14, 2026
@dimitri-yatsenko

Member

Context

The Relational Workflow Model concept page (overview / paradigm) and the
component pages under Concepts > Data Model (Entity Integrity,
Normalization, Computation Model) leave two reader needs unmet:

  1. How is the schema a formal language? An informatics-knowledgeable
    reader asks for the grammar, the typed semantics, the algebra, and the
    machine-readable surface — Hal Stern's question on the June 12 call:
    "Python is not a formal spec — is there a grammar? Can it be published
    as YAML? Is there an API set for it?"
  2. How does DataJoint relate to the workflow-language landscape they
    already know? The reader needs a fair structural comparison against
    CWL, Snakemake, Nextflow, Airflow, Argo, Prefect, and Dagster, plus
    guidance on when each fits.

This PR adds two new pages that close those gaps and integrates them
with the existing concept set.

Changes

New pages

  • explanation/schema-as-workflow-specification.md (~1,150 words)

    • Names the Relational Workflow Model as DataJoint's major innovation
      and positions the schema as the formal language expressing it
    • Grammar — annotated DDL excerpt (Scan, AverageFrame,
      SegmentationParam, Segmentation) showing the --- separator, ->
      foreign keys, codec types, tier decoration
    • Semantics — three-condition existence rule for a Computed row,
      make() as a typed function, git-hash code provenance per row
    • The query algebra (brief + link)
    • Types (brief + link)
    • Self-healing operational semantics — populate() brings the world
      into compliance with the schema
    • Machine-readability and export — DOT/Mermaid, YAML/JSON, W3C PROV,
      OpenLineage, PROV-O, workflow-language conversion
    • The schema as control plane — declarative, queryable, enforceable,
      observable (parallel to network routing tables)
  • explanation/comparison-to-workflow-languages.md (~870 words)

    • Fair structural comparison against file-based workflow systems (CWL,
      Snakemake, Nextflow) and task orchestrators (Airflow, Argo, Prefect,
      Dagster), with adjacent categories (data catalogs, lakehouses) noted
      but separated
    • Side-by-side table across nine concerns (data structure, types, FK
      integrity, computation spec, execution order, provenance, drift
      detection, query interface, retry/idempotence)
    • What workflow languages offer, what they omit, DataJoint's deliberate
      trade-off (paraphrased from Yatsenko & Nguyen 2026 Section 5)
    • Convertibility — any CWL workflow translates mechanically to a
      DataJoint schema and back; DataJoint adds the data-structure layer
      that workflow languages omit; GATK WGS example referenced
    • When to choose what, including the "use both" production pattern
      (DataJoint inside an Airflow / Argo / Prefect orchestration)
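
To make the Grammar bullet above concrete, here is a minimal sketch in plain Python of how a definition string in that style could be split at the `---` separator into primary and secondary parts, with `->` lines collected as foreign keys. Everything specific here is an assumption for illustration: the definition text, the `mask` attribute, the `<blob>` codec name, the dependency of Segmentation on AverageFrame and SegmentationParam, and the `parse_definition` helper itself. It is not DataJoint's parser and ignores defaults, nullability, and indexes.

```python
from dataclasses import dataclass, field

# Hypothetical definition text; attribute and codec names are illustrative only.
SEGMENTATION_DEFINITION = """
-> AverageFrame        # upstream result this row depends on (assumed)
-> SegmentationParam   # parameter set used for this run (assumed)
---
mask : <blob>          # segmentation output; codec name is an assumption
"""


@dataclass
class ParsedDefinition:
    primary_foreign_keys: list = field(default_factory=list)
    primary_attributes: list = field(default_factory=list)
    secondary_foreign_keys: list = field(default_factory=list)
    secondary_attributes: list = field(default_factory=list)


def parse_definition(text):
    """Split a definition into primary/secondary foreign keys and attributes."""
    parsed = ParsedDefinition()
    in_primary = True  # lines above the --- separator belong to the primary key
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        if line.startswith("---"):
            in_primary = False
            continue
        if line.startswith("->"):
            target = line[2:].strip()
            bucket = parsed.primary_foreign_keys if in_primary else parsed.secondary_foreign_keys
            bucket.append(target)
            continue
        name, _, type_spec = line.partition(":")
        bucket = parsed.primary_attributes if in_primary else parsed.secondary_attributes
        bucket.append((name.strip(), type_spec.strip()))
    return parsed


parsed = parse_definition(SEGMENTATION_DEFINITION)
print(parsed.primary_foreign_keys)
print(parsed.secondary_attributes)
```

The point of the sketch is only that the declaration surface is regular enough to be read by a few lines of code, which is the property the machine-readability bullet relies on.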

Integration with existing concept set

  • Nav (mkdocs.yaml): place the two new pages at the end of the
    Data Model group so the progression reads
    paradigm > components > synthesis > comparison:
    RWM > Entity Integrity > Normalization > Computation Model >
    Schema as a Workflow Specification > Comparison to Workflow Languages.
  • Concepts landing page (explanation/index.md): cards added for
    both new pages.
  • FAQ (faq.md): the "Is DataJoint a Workflow Management System?"
    answer overlapped substantively with the new Comparison page; trimmed
    it to a two-paragraph pointer.
  • Data Pipelines (data-pipelines.md): the "Comparing Approaches"
    table was a mini-version of the new Comparison page; trimmed to a
    short paragraph + pointer.

Merge order with PR #184

Both new pages cross-reference the expanded Relational Workflow Model
page from PR #184. Suggested merge order:

  1. PR #184 (Expand Relational Workflow Model concept page) first
  2. This PR second

If merged in the opposite order, the new pages still resolve their links
correctly — the cross-references just read against the older, shorter
RWM page until #184 lands.

…mparison to Workflow Languages

Two new pages under Concepts > Data Model that follow from the
Relational Workflow Model overview and address the informed-reader
questions the overview page cannot answer in its scope:

1. Schema as a Workflow Specification
   - Names the Relational Workflow Model as DataJoint's major innovation
   - Describes the schema as a formal language: grammar (annotated DDL
     excerpt for the Scan / AverageFrame / SegmentationParam /
     Segmentation pipeline), typed semantics (three-condition existence
     rule for a Computed row), the make() contract recording the git
     hash of the producing code, the five-operator algebra with
     closure, the type system, populate() as the self-healing engine
     that brings the world into compliance with the schema, and
     machine-readability / export pathways (DOT, Mermaid, YAML, JSON,
     W3C PROV, OpenLineage, PROV-O, workflow-language conversion).
   - Closes with the schema-as-control-plane framing (parallel to
     routing tables in a network control plane).

2. Comparison to Workflow Languages
   - Fair, structural comparison against CWL, Snakemake, Nextflow
     (file-based workflows) and Airflow, Argo, Prefect, Dagster (task
     orchestrators). Adjacent categories (data catalogs, lakehouses)
     noted but flagged as solving different problems.
   - Side-by-side table across nine concerns (data structure, types,
     FK integrity, computation, execution order, provenance, drift
     detection, query interface, retry semantics).
   - What workflow languages offer, what they omit, DataJoint's
     deliberate trade-off (paraphrasing Section 5 of Yatsenko & Nguyen
     2026).
   - Convertibility: any CWL workflow translates mechanically to a
     DataJoint schema and back, with the data-structure layer the
     workflow language omits supplied on conversion. GATK WGS pipeline
     used as the empirical reference.
   - "When to choose what" guidance including the "use both" pattern
     (DataJoint inside an Airflow / Argo / Prefect orchestration).
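
The "populate() as the self-healing engine" idea in item 1 can be illustrated with a toy in-memory model. This is a sketch under stated assumptions, not DataJoint's implementation: the table contents, the `key_source`, `make`, and `populate` functions below are plain-Python stand-ins, and transactions, job reservation, and error handling are omitted. It also does not attempt to restate the three-condition existence rule; it only shows the "compute whatever the schema implies but the table lacks" loop.

```python
from itertools import product

# Toy stand-ins for tables; each key is a tuple of primary-key values.
average_frame = {("scan1",), ("scan2",)}   # upstream results (illustrative)
segmentation_param = {("p1",)}             # parameter sets (illustrative)
segmentation = {}                          # computed table: key -> result


def key_source():
    """All keys the computed table should contain, given its upstream tables."""
    return {frame + param for frame, param in product(average_frame, segmentation_param)}


def make(key):
    """Placeholder computation for one key."""
    scan_id, param_id = key
    return f"mask for {scan_id} with {param_id}"


def populate():
    """Compute every key that is implied upstream but missing downstream."""
    pending = sorted(key_source() - set(segmentation))
    for key in pending:
        segmentation[key] = make(key)
    return pending


first = populate()                 # fills everything that is missing
second = populate()                # nothing left to do
average_frame.add(("scan3",))      # new upstream data arrives
third = populate()                 # only the new combination is computed
print(first, second, third)
```

Calling `populate()` repeatedly is safe in this model because the pending set is always recomputed from the current state, which is the sense in which the loop is idempotent.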

Nav: both pages inserted under Concepts > Data Model after Relational
Workflow Model and before Entity Integrity, in mkdocs.yaml.
…ines

Cohesion pass after adding Schema as a Workflow Specification and
Comparison to Workflow Languages:

- Nav (mkdocs.yaml): move the two new pages to the end of the Data Model
  group so the progression reads paradigm > components > synthesis >
  comparison: Relational Workflow Model > Entity Integrity > Normalization
  > Computation Model > Schema as a Workflow Specification > Comparison
  to Workflow Languages.
- Concepts index (explanation/index.md): add cards for both new pages.
- FAQ (faq.md): the "Is DataJoint a Workflow Management System?" answer
  was duplicating the Comparison page; trim it to a two-paragraph
  pointer to the new page.
- Data Pipelines (data-pipelines.md): the "Comparing Approaches" table
  was a mini version of the new Comparison page; trim to a short
  paragraph + pointer.
dimitri-yatsenko merged commit 903f8ef into main on Jun 14, 2026
3 checks passed
dimitri-yatsenko added a commit that referenced this pull request Jun 14, 2026
Placeholder for follow-up work after #184 (expand RWM) and #185 (deeper
concept pages) merge. Tracker file outlines what to trim, why, and how to
pick the work up once both upstream PRs land.

No content changes to docs source in this PR. The tracker file is to be
deleted in the same commit that applies the trim.
dimitri-yatsenko added a commit that referenced this pull request Jun 14, 2026
The developed argument lives on the Comparison to Workflow Languages
page (added in #185). The RWM page now mentions the trade-off in one
paragraph and links out, preventing drift between two homes for the
same argument.

Removes the .github/follow-ups/ tracker that scheduled this work.
dimitri-yatsenko added a commit that referenced this pull request Jun 26, 2026
…ucture concrete-first (#186)

* WIP tracker: trim "deliberate trade-off" prose from RWM concept page

Placeholder for follow-up work after #184 (expand RWM) and #185 (deeper
concept pages) merge. Tracker file outlines what to trim, why, and how to
pick the work up once both upstream PRs land.

No content changes to docs source in this PR. The tracker file is to be
deleted in the same commit that applies the trim.

* docs(rwm): trim "deliberate trade-off" prose; link to Comparison page

The developed argument lives on the Comparison to Workflow Languages
page (added in #185). The RWM page now mentions the trade-off in one
paragraph and links out, preventing drift between two homes for the
same argument.

Removes the .github/follow-ups/ tracker that scheduled this work.

* docs(rwm): align worked-example diagram with dj.Diagram notation

Match the conventions from datajoint-python's dj.Diagram
(diagram.py:1017-1082):

- Manual: green rectangle (unchanged)
- Lookup: plaintext — no border/fill (was a filled rectangle)
- Imported: blue stadium-shaped node — closest Mermaid approximation
  to dj.Diagram's ellipse
- Computed: red stadium-shaped node — same

Drop the inline tier-name and make() annotations on each node; tier
is now conveyed by shape and color alone, as in the real diagrams.
A new lead paragraph spells out the convention so the reader can
decode the diagram without a separate legend.

* docs(rwm): restructure concrete-first; reframe as added interpretation

Two structural cleanups on relational-workflow-model.md:

Concrete-first ordering. Open with a tight paragraph naming the
model, then lead with the worked example (diagram + walkthrough).
The historical lineage (Codd/Chen/RWM three interpretations) now
follows the example, placing DataJoint's contribution in context
once the reader has a concrete pipeline to anchor on. The closing
side-by-side reading table moves to the end of the page.

Reframe as interpretation, not departure. Classical relational
concepts (tables, rows, foreign keys, normalization, the query
algebra) apply unchanged; RWM adds a semantic interpretation on
top. Renamed and rewrote two sections to reflect this:

- "Four shifts from the classical relational model"
  → "A semantic interpretation, not a departure"
  Bullets now read additively ("tables also represent workflow
  steps") rather than contrastively ("not merely categories").

- "From transactions to transformations"
  → "Two readings of the same schema"
  Lead-in clarifies both readings hold simultaneously. Column
  header changes from "Traditional view" to "Classical reading."
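
The tier conventions listed in the dj.Diagram alignment commit above (rectangle for Manual, plaintext for Lookup, stadium for Imported and Computed) can be sketched as a small Mermaid generator. This is an illustration only: the tier assignments for the four tables, the hex colors, and the `render_mermaid` helper are assumptions, and rendering Lookup as a rectangle with fill and stroke set to `none` is one possible way to approximate plaintext in Mermaid, not necessarily what the docs page does.

```python
# Node delimiters per tier: [..] is a Mermaid rectangle, ([..]) is a stadium.
TIER_SHAPES = {
    "manual": ("[", "]"),
    "lookup": ("[", "]"),      # rectangle, made invisible by the style below
    "imported": ("([", "])"),
    "computed": ("([", "])"),
}

# Colors are placeholders; only green/blue/red and "no border/fill" come from the commit text.
TIER_STYLES = {
    "manual": "fill:#ccffcc,stroke:#008800",
    "lookup": "fill:none,stroke:none",
    "imported": "fill:#ccddff,stroke:#0000aa",
    "computed": "fill:#ffcccc,stroke:#aa0000",
}


def render_mermaid(tables, edges):
    """Return Mermaid flowchart text for tables (name -> tier) and dependency edges."""
    lines = ["flowchart TD"]
    for name, tier in tables.items():
        open_, close = TIER_SHAPES[tier]
        lines.append(f"    {name}{open_}{name}{close}")
    for parent, child in edges:
        lines.append(f"    {parent} --> {child}")
    for name, tier in tables.items():
        lines.append(f"    style {name} {TIER_STYLES[tier]}")
    return "\n".join(lines)


# Tier assignments here are assumed for illustration.
tables = {
    "Scan": "manual",
    "SegmentationParam": "lookup",
    "AverageFrame": "computed",
    "Segmentation": "computed",
}
edges = [
    ("Scan", "AverageFrame"),
    ("AverageFrame", "Segmentation"),
    ("SegmentationParam", "Segmentation"),
]

diagram = render_mermaid(tables, edges)
print(diagram)
```

Keeping shape and style in two lookup tables mirrors the commit's intent that tier is conveyed by shape and color alone, with no per-node text annotations.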