Cross-class reference resolution between derived artifacts #237

@amc-corey-cox

Motivation

When two derived artifacts must share a name to form a reference — typical case: a slot_definition.range pointing at an enum_definition.name where both were derived from the same source instance — there's no mechanism today to declare that pairing. slot() (133a9e2) addresses intra-class binding but doesn't reach across derivations.

Authors today work around this by writing the same naming expression in both derivations and trusting the two copies to stay consistent. That's fragile, and it makes the cross-reference invisible to the planner, which can't detect that the two artifacts must agree, validate the round-trip, or optimize execution.

Concrete driver: schema-automator's EML importer (linkml/schema-automator#208), where an EML attribute with an enumeratedDomain derives both a slot (in the parent class's attributes) and an enum (in the schema-level enums map), and the slot's range must equal the enum's name. Same pattern for future XSD/JSON-Schema importers.

Proposed direction

Allow a derivation to publish a named binding scoped to its source instance, and another derivation to consume it:

class_derivations:
  AttributeToEnum:
    populated_from: Attribute
    target_class: enum_definition
    publishes: { enum_name_for: self }
    slot_derivations:
      name: { expr: "<naming expression>" }

  AttributeToSlot:
    populated_from: Attribute
    target_class: slot_definition
    slot_derivations:
      range: { expr: "ref(enum_name_for=self)" }

The binding key (enum_name_for) plus the source-instance identifier (self) form a lookup; the runtime guarantees the consumer sees what the producer published.

Execution model — opt-in scratch store

Cross-references require persistence between derivations, which conflicts with streaming and with the deliberate earlier decision to remove implicit memoization from the runtime. The proposal is opt-in two-pass execution backed by a DuckDB scratch store:

  • Pass 1: stream the source. Producer derivations write bindings into a DuckDB temp table (source_id, binding_key, value). Artifacts that don't depend on cross-refs are emitted normally.
  • Pass 2: consumer derivations resolve ref() calls against the scratch table and emit dependent artifacts.
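
The two passes above can be sketched roughly as follows. This is an illustration under stated assumptions, not linkml-map code: Python's built-in sqlite3 stands in for the DuckDB temp table so the example runs without extra dependencies, and `ScratchStore`, `run_two_pass`, the attribute dicts, and the `_enum` naming rule are all hypothetical.

```python
import sqlite3


class ScratchStore:
    """Stand-in for the proposed scratch table; sqlite3 replaces DuckDB here."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        # Same shape as the proposed table: (source_id, binding_key, value).
        self.db.execute(
            "CREATE TABLE bindings ("
            "source_id TEXT, binding_key TEXT, value TEXT, "
            "PRIMARY KEY (source_id, binding_key))"
        )

    def publish(self, binding_key, source_id, value):
        self.db.execute(
            "INSERT INTO bindings VALUES (?, ?, ?)",
            (source_id, binding_key, value),
        )

    def ref(self, binding_key, source_id):
        row = self.db.execute(
            "SELECT value FROM bindings WHERE source_id = ? AND binding_key = ?",
            (source_id, binding_key),
        ).fetchone()
        if row is None:
            # A clear diagnostic instead of a silent null.
            raise LookupError(f"no binding {binding_key}={source_id}")
        return row[0]


def run_two_pass(attributes):
    store = ScratchStore()

    # Pass 1: producers emit enums and publish their names.
    enums = []
    for attr in attributes:
        enum_name = attr["name"] + "_enum"  # hypothetical naming expression
        enums.append({"name": enum_name, "permissible_values": attr["codes"]})
        store.publish("enum_name_for", attr["id"], enum_name)

    # Pass 2: consumers resolve ref() only after every write has completed.
    slots = [
        {"name": attr["name"], "range": store.ref("enum_name_for", attr["id"])}
        for attr in attributes
    ]
    return enums, slots
```

Called with a single attribute such as `{"id": "Attribute_1", "name": "habitat", "codes": ["forest", "marsh"]}`, the slot's range comes back equal to the enum's name without the naming expression being written twice.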

Properties:

  • Opt-in per trans-spec. Specs that declare no publishes/ref pairs stay single-pass and streamable. The memory cost is paid only by specs that ask for it.
  • Deterministic by construction. Read-before-write can't happen — all writes in pass 1 complete before any reads in pass 2. Missing-binding errors surface as clear diagnostics ("no binding enum_name_for=Attribute_42") rather than silent nulls.
  • Aligned with project direction. linkml-map is already investing in DuckDB (SQL compiler, engine work). The scratch store fits idiomatically; the same infrastructure also serves descendant iteration (#236, "populated_from_descendants: iterate source instances regardless of containment path").
  • Optimizable. When the dependency graph is tractable — producer and consumer in the same derivation, or unambiguously ordered — the planner can choose single-pass topo-sort instead. Two-pass is the safe default.
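
One way the single-pass optimization might be decided is a topological sort over publishes/ref dependencies. The sketch below uses Python's standard `graphlib`; the `plan` function and the `publishes`/`consumes` dict shape are assumptions made for illustration, not an existing planner API.

```python
from graphlib import CycleError, TopologicalSorter


def plan(derivations):
    """Map each derivation name to {"publishes": {...}, "consumes": {...}}
    and return ("single-pass", order) or ("two-pass", None)."""
    producers = {}
    for name, spec in derivations.items():
        for key in spec.get("publishes", ()):
            producers.setdefault(key, []).append(name)

    graph = {name: set() for name in derivations}
    for name, spec in derivations.items():
        for key in spec.get("consumes", ()):
            found = producers.get(key, [])
            if len(found) != 1:
                # Missing or ambiguous producer: keep the safe default and
                # let the two-pass run surface the diagnostic.
                return ("two-pass", None)
            if found[0] != name:  # same-derivation pairs need no ordering edge
                graph[name].add(found[0])

    try:
        order = list(TopologicalSorter(graph).static_order())
    except CycleError:
        # No unambiguous order; a real planner might reject this outright.
        return ("two-pass", None)
    return ("single-pass", order)
```

For the AttributeToEnum / AttributeToSlot pair from the proposal, this orders the producer before the consumer; anything it cannot order falls back to two-pass.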

Reversibility

Cross-references make structural pairing explicit and therefore reversible by the inverse engine: given a slot with range: foo and an enum with name: foo, the inverse derivation reconstructs the source Attribute. Reversibility of the binding key expression depends on its own invertibility (e.g., slugify is one-way and breaks round-trips; equality / FK-style joins are fine). This inherits linkml-map's existing reversibility-where-lossless principle without adding new rules.
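
To make the invertibility point concrete, here is a small sketch of why a slugify-style key breaks round-trips while an identity key does not. The `slugify` below is a simplified, hypothetical version (not a specific library's function), and `is_injective` only checks the names it is given, so it is a spot check rather than a proof of invertibility.

```python
import re


def slugify(text):
    # Simplified, hypothetical slugify: lowercase, then collapse every run
    # of non-alphanumeric characters into a single underscore.
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")


def is_injective(key_fn, source_names):
    """True if key_fn maps the given names to distinct keys."""
    seen = {}
    for name in source_names:
        key = key_fn(name)
        if key in seen and seen[key] != name:
            return False  # two sources share a key; the inverse is ambiguous
        seen[key] = name
    return True


names = ["Soil Type", "soil-type"]
slug_ok = is_injective(slugify, names)          # both collapse to "soil_type"
identity_ok = is_injective(lambda n: n, names)  # names are kept as-is
```

Here `slug_ok` is `False` and `identity_ok` is `True`: once two source names share a slug, the inverse derivation cannot tell which Attribute to reconstruct.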

Open questions

  • Source-key scope. Restrict bindings to source-instance identity (must be hashable, drawn from identifier slots), or allow arbitrary computed keys? Lean toward source-instance identity first — keeps the scratch table small and the inverse mechanical.
  • Multiple producers, one consumer. Should a binding key be allowed to be published by more than one derivation? Probably error unless explicitly declared multi-valued.
  • Surface syntax. publishes / ref is the strawman; alternatives (e.g., explicit bindings: section at the spec level) might be cleaner.
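
For the multiple-producers question, the "error unless explicitly declared multi-valued" option could be a check at spec-load time. This is only a sketch of that one option: `check_publishers` and the `multivalued_keys` argument are hypothetical, and the input mirrors the strawman `publishes` syntax above.

```python
def check_publishers(class_derivations, multivalued_keys=frozenset()):
    """Reject binding keys published by more than one derivation,
    unless the key is explicitly listed as multi-valued."""
    publishers = {}
    for name, derivation in class_derivations.items():
        for key in derivation.get("publishes", {}):
            publishers.setdefault(key, []).append(name)

    for key, names in publishers.items():
        if len(names) > 1 and key not in multivalued_keys:
            raise ValueError(
                f"binding key {key!r} is published by {', '.join(names)}"
            )
    return publishers
```

A spec in which two derivations both declare `publishes: { enum_name_for: self }` would fail this check unless `enum_name_for` is passed in `multivalued_keys`.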
