Transform-time validation against target schema constraints #241

@amc-corey-cox

Motivation

linkml-map's existing strict mode (f2363c3) catches one class of expression bugs: unbound names (typos, stale source references). Row-level errors during transformation already flow through TransformationError + the on_error callback path, with --continue-on-error toggling fail-fast vs collect-and-report at the CLI.

There's a category of error that falls outside both: the transform completes without exceptions, but the output is structurally invalid against the target schema. A trans-spec might:

  • Leave a required: true slot empty because the source field was missing
  • Produce a value for a slot with range: integer from a string-valued expression
  • Emit multiple values into a singular slot, or a scalar into a multivalued one
  • Produce an enum value not in the target enum's permissible values
  • Build a class with no identifier when the target requires one
  • Fail to populate a dictionary_key slot needed for map-keying
  • Violate a pattern: declared on the target slot

Today these surface only after the fact via linkml-validate on the produced output. That works, but it loses source-instance provenance — by the time linkml-validate runs, the error reads "class X has no name" without telling you that the source Attribute with attributeName=' ' at row 47 of the input was the cause. For schema-construction trans-specs (importers like EML, XSD, JSON-Schema) source provenance is the most valuable piece of the diagnostic.

The collect-all-errors flow that the CLI already implements for row-level transformation errors is exactly the right ergonomics for these structural errors too: run the transform once, get all the malformed-input diagnostics in a single report, fix them in one pass.

Proposed direction

Extend transform-time validation to cover target-schema structural constraints, surfacing errors through the existing TransformationError + on_error infrastructure. Scope:

Cardinality / presence

  • required: true slots evaluating to None
  • identifier: true slots evaluating to None
  • Slots used as dictionary_key evaluating to None
  • Scalar slots getting list values
  • multivalued: true slots getting non-iterable scalars
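
The cardinality and presence checks listed above could be sketched roughly as follows. This is a minimal illustration, not linkml-map code: `SlotSpec` and `check_cardinality` are hypothetical names, and `SlotSpec` is a simplified stand-in for a target slot definition. Dict-valued (inlined) slots are deliberately out of scope here.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class SlotSpec:
    """Hypothetical, simplified stand-in for a target slot definition."""
    name: str
    required: bool = False
    identifier: bool = False
    dictionary_key: bool = False
    multivalued: bool = False


def check_cardinality(slot: SlotSpec, value: Any) -> List[str]:
    """Return violation kinds for one derived slot value (empty if none)."""
    if value is None:
        # Required, identifier, and dictionary-key slots must be populated.
        if slot.required or slot.identifier or slot.dictionary_key:
            return ["required_missing"]
        return []
    # Strings are treated as scalars even though they are iterable.
    is_collection = isinstance(value, (list, tuple, set))
    if is_collection != slot.multivalued:
        return ["cardinality_mismatch"]
    return []
```

The kind strings reuse the names suggested later in this issue; whether presence failures on identifier and dictionary-key slots deserve their own kinds is left open.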

Type and range

  • Range coercion failures (target range is integer, expression produced "abc")
  • Range membership for typed ranges (date strings that don't parse, URIs that don't pattern-match)
  • Enum membership: produced value not in the target enum's permissible values
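
A rough sketch of the type and range checks above, using only the standard library. `check_range` is a hypothetical helper, the range names are passed as plain strings, and the permissible values are assumed to be available as a set of strings; a real implementation would resolve all of this from the target schema.

```python
from datetime import date
from typing import Any, Optional, Set


def check_range(value: Any, target_range: str,
                permissible_values: Optional[Set[str]] = None) -> Optional[str]:
    """Return a violation kind, or None if the value fits the target range."""
    if permissible_values is not None:
        # Enum-ranged slot: the value must be one of the permissible values.
        return None if value in permissible_values else "enum_violation"
    try:
        if target_range == "integer":
            int(value)  # note: int(3.7) succeeds; lossy coercion is not flagged here
        elif target_range == "date":
            date.fromisoformat(str(value))
    except (TypeError, ValueError):
        return "range_mismatch"
    return None
```

For example, `check_range("abc", "integer")` reports a `range_mismatch`, while `check_range("42", "integer")` passes.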

Structural integrity

Each violation builds a TransformationError populated with the class_derivation, slot_derivation, source row, and a discriminating kind field (or subclass) so consumers can filter. Errors flow through on_error when provided; otherwise they raise (fail-fast), matching the existing semantics. Non-strict mode produces output as-is with violations collected and reported at the end of the run.
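
The collect-versus-raise behaviour described above might look like the sketch below. `TargetValidationError` is a hypothetical stand-in for a `TransformationError` carrying the proposed `kind` field, and `report` is an illustrative helper; neither is existing linkml-map API.

```python
from typing import Any, Callable, Dict, List, Optional


class TargetValidationError(Exception):
    """Hypothetical stand-in for TransformationError with an added `kind`."""

    def __init__(self, kind: str, class_derivation: str, slot_derivation: str,
                 source_row: Dict[str, Any], row_index: int) -> None:
        super().__init__(
            f"{kind}: {class_derivation}.{slot_derivation} (row {row_index})"
        )
        self.kind = kind
        self.class_derivation = class_derivation
        self.slot_derivation = slot_derivation
        self.source_row = source_row
        self.row_index = row_index


def report(error: TargetValidationError,
           on_error: Optional[Callable[[TargetValidationError], None]] = None) -> None:
    """Hand the error to on_error if provided; otherwise raise (fail-fast)."""
    if on_error is None:
        raise error
    on_error(error)


# Collect mode: the callback accumulates errors for an end-of-run report.
collected: List[TargetValidationError] = []
report(
    TargetValidationError("required_missing", "Person", "name",
                          {"attributeName": " "}, 47),
    on_error=collected.append,
)
```

Because the source row travels with the error, the end-of-run report can point back at the offending input instead of only naming the invalid target object.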

Relationship to existing infrastructure

  • TransformationError in transformer/errors.py — already carries the right shape (derivation names, source_row, row_index, cause). Add a kind: str field or small subclass hierarchy distinguishing required_missing, range_mismatch, cardinality_mismatch, enum_violation, pattern_violation, unresolved_reference.
  • on_error callback in transformer/engine.py:59 — already the collect-vs-raise hook. No change needed; validation errors plug into the same path.
  • --continue-on-error CLI flag at cli/cli.py:141 — already wired. Target-validation errors flow through the same path.
  • --strict (f2363c3) — orthogonal: that one is about unbound names in expressions. The naming could become confusing as both grow; possibly rename one for clarity, or document the distinction prominently. See open questions.
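
If the `kind: str` option is chosen over a subclass hierarchy, one possible shape is a string-valued enum, which keeps plain-string comparisons working for consumers. `ViolationKind`, `ErrorRecord`, and `by_kind` below are hypothetical names for illustration; only the six kind values come from the list above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List, Optional, Union


class ViolationKind(str, Enum):
    REQUIRED_MISSING = "required_missing"
    RANGE_MISMATCH = "range_mismatch"
    CARDINALITY_MISMATCH = "cardinality_mismatch"
    ENUM_VIOLATION = "enum_violation"
    PATTERN_VIOLATION = "pattern_violation"
    UNRESOLVED_REFERENCE = "unresolved_reference"


@dataclass
class ErrorRecord:
    """Hypothetical reduced view of a collected error."""
    kind: ViolationKind
    slot_derivation: str
    row_index: Optional[int] = None


def by_kind(errors: Iterable[ErrorRecord],
            kind: Union[ViolationKind, str]) -> List[ErrorRecord]:
    """Filter collected errors; accepts either the enum member or its string value."""
    return [e for e in errors if e.kind == kind]
```

A subclass hierarchy would allow `except`-based filtering instead; the enum keeps the error class flat and serializes cleanly into reports.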

Relationship to linkml-validate

This does not replace linkml-validate. The two layers compose:

  • Transform-time validation: structural constraints native to the target schema, evaluated during emission, with source-instance provenance in each error. Covers the cases listed above.
  • Post-transform linkml-validate: semantic / cross-record / dataset-level checks. Anything that depends on the whole output being assembled (uniqueness across records, inverse-slot consistency, complex constraint expressions).

The split is roughly: things you can know inside one derivation invocation → transform-time. Things that require the whole output → linkml-validate.
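
To make the split concrete, here is a small sketch contrasting a check that needs only one value with one that needs the whole output. Both function names are hypothetical, and the use of `re.fullmatch` is an assumption made for the sketch; the actual matching semantics should follow whatever LinkML defines for `pattern:`.

```python
import re
from collections import Counter
from typing import Any, Dict, List


def matches_pattern(value: str, pattern: str) -> bool:
    """Transform-time candidate: decidable from a single derived value."""
    return re.fullmatch(pattern, value) is not None


def duplicate_ids(records: List[Dict[str, Any]], id_slot: str = "id") -> List[Any]:
    """Post-transform candidate: needs every emitted record to decide."""
    counts = Counter(record[id_slot] for record in records)
    return sorted(key for key, n in counts.items() if n > 1)
```

The first can run inside a single derivation invocation with the source row still in hand; the second cannot report anything until emission has finished.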

Reversibility implications

When inverting a trans-spec, the same validation infrastructure applies — the source schema becomes the target. One mechanism covers validation in both directions; no special-casing for reverse runs.

Open questions

  • Validation cost. Running each derivation result through a meta-schema-driven check at emission time adds per-row overhead. Run sample benchmarks before committing to default-on for all violation kinds; some (e.g., enum membership against a 10K-permissible-value enum) may need to be opt-in.
  • --strict overload. With this proposal, linkml-map would have two notions of "strict": expression strict (f2363c3) and target-validation strict (this proposal). A single --strict flag toggling both is convenient but coarse. Consider per-area flags (--expression-strict, --validation-strict) with --strict as a meta-flag enabling all.
  • Output shape on violation in non-strict mode. Two options for the emitted artifact: (a) emit the invalid value anyway (consistent with how runtime errors already behave under continue_on_error — output is as-the-trans-spec-produced-it, errors are advisory), or (b) emit a sentinel / skip the slot. Lean toward (a): predictable, debuggable, and matches existing semantics.
  • Interaction with unrestricted_eval. No interaction expected; validation runs after expression evaluation regardless of eval mode.
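
The per-area flag idea from the --strict question above could resolve as in the sketch below. It uses `argparse` only to stay within the standard library and makes no claim about how cli/cli.py is actually wired; `build_parser` and `resolve_strictness` are hypothetical helpers.

```python
import argparse
from typing import Dict


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="strictness-sketch")
    parser.add_argument("--strict", action="store_true",
                        help="meta-flag: enable every strict area")
    parser.add_argument("--expression-strict", action="store_true",
                        help="fail on unbound names in expressions")
    parser.add_argument("--validation-strict", action="store_true",
                        help="fail on target-schema violations")
    return parser


def resolve_strictness(args: argparse.Namespace) -> Dict[str, bool]:
    """--strict turns on every area; per-area flags turn on just their own."""
    return {
        "expression": args.strict or args.expression_strict,
        "validation": args.strict or args.validation_strict,
    }
```

With this resolution, `--strict` keeps its convenience while `--validation-strict` alone leaves expression handling lenient.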

References / contrasts
