Motivation
linkml-map's existing strict mode (f2363c3) catches one class of expression bugs: unbound names (typos, stale source references). Row-level errors during transformation already flow through TransformationError + the on_error callback path, with --continue-on-error toggling fail-fast vs collect-and-report at the CLI.
There's a category of error that falls outside both: the transform completed without exceptions, but the output is structurally invalid against the target schema. A trans-spec might:
- Leave a required: true slot empty because the source field was missing
- Produce a value for a slot with range: integer from a string-valued expression
- Emit multiple values into a singular slot, or a scalar into a multivalued one
- Produce an enum value not in the target enum's permissible values
- Build a class with no identifier when the target requires one
- Fail to populate a dictionary_key slot needed for map-keying
- Violate a pattern: declared on the target slot
Today these surface only after the fact via linkml-validate on the produced output. That works, but it loses source-instance provenance — by the time linkml-validate runs, the error reads "class X has no name" without telling you that the source Attribute with attributeName=' ' at row 47 of the input was the cause. For schema-construction trans-specs (importers like EML, XSD, JSON-Schema) source provenance is the most valuable piece of the diagnostic.
The collect-all-errors flow that the CLI already implements for row-level transformation errors is exactly the right ergonomics for these structural errors too: run the transform once, get all the malformed-input diagnostics in a single report, fix them in one pass.
Proposed direction
Extend transform-time validation to cover target-schema structural constraints, surfacing errors through the existing TransformationError + on_error infrastructure. Scope:
Cardinality / presence
- required: true slots evaluating to None
- identifier: true slots evaluating to None
- Slots used as dictionary_key evaluating to None
- Scalar slots getting list values
- multivalued: true slots getting non-iterable scalars
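
The presence and cardinality checks above could be sketched roughly as follows. This is a minimal illustration only: the slot definition is a plain dict with a hypothetical shape, the function name is invented for this example, and nothing here is the linkml-map or linkml-runtime API. Inlined-as-dict collections are deliberately ignored.

```python
from typing import Any, Optional


def check_presence_and_cardinality(slot: dict, value: Any) -> Optional[str]:
    """Return a violation kind for one derived slot value, or None if it passes.

    `slot` is a plain dict standing in for a target slot definition
    (hypothetical shape, for illustration only).
    """
    must_be_present = (
        slot.get("required") or slot.get("identifier") or slot.get("dictionary_key")
    )
    if value is None:
        # Optional slots may legitimately evaluate to None.
        return "required_missing" if must_be_present else None

    is_collection = isinstance(value, (list, tuple, set))
    if slot.get("multivalued"):
        # A multivalued slot needs a collection; a bare scalar (strings included) fails.
        return None if is_collection else "cardinality_mismatch"
    # A singular slot must not receive a collection.
    return "cardinality_mismatch" if is_collection else None
```

The kind strings reuse the names suggested later in this proposal, so a caller could attach them directly to the reported error.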
Type and range
- Range coercion failures (target range is integer, expression produced "abc")
- Range membership for typed ranges (date strings that don't parse, URIs that don't pattern-match)
- Enum membership: produced value not in the target enum's permissible values
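
A rough sketch of the type and range checks, under the same assumptions as before: slot definitions are plain dicts, `enums` is a hypothetical mapping from enum name to its permissible values, and only the integer and date ranges are handled. The helper names are invented for this example and are not part of linkml-map.

```python
from datetime import date
from typing import Any, FrozenSet, Mapping, Optional


def _parses_as_integer(value: Any) -> bool:
    if isinstance(value, bool):  # bool subclasses int; treated as a mismatch here
        return False
    if isinstance(value, int):
        return True
    if isinstance(value, str):
        try:
            int(value)
        except ValueError:
            return False
        return True
    return False


def _parses_as_date(value: Any) -> bool:
    if isinstance(value, date):
        return True
    if isinstance(value, str):
        try:
            date.fromisoformat(value)
        except ValueError:
            return False
        return True
    return False


def check_range(
    slot: dict, value: Any, enums: Mapping[str, FrozenSet[str]]
) -> Optional[str]:
    """Return a violation kind for a non-None scalar value, or None if it passes."""
    range_name = slot.get("range", "string")
    if range_name in enums:
        ok = isinstance(value, str) and value in enums[range_name]
        return None if ok else "enum_violation"
    if range_name == "integer" and not _parses_as_integer(value):
        return "range_mismatch"
    if range_name == "date" and not _parses_as_date(value):
        return "range_mismatch"
    return None
```

Holding permissible values in a frozenset keeps each membership test constant-time on average, which bears on the validation-cost question for large enums.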
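Structural integrity
- Pattern violations on slots with a pattern: declared
- Unresolved references (a ref() consumer with no matching producer surfaces here)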
Each violation builds a TransformationError populated with the class_derivation, slot_derivation, source row, and a discriminating kind field (or subclass) so consumers can filter. Errors flow through on_error when provided; otherwise they raise (fail-fast), matching the existing semantics. Non-strict mode produces output as-is with violations collected and reported at the end of the run.
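
As a sketch of that flow, the stand-in class below mimics the fields this proposal attributes to TransformationError and adds the proposed kind field. It is not the real class from transformer/errors.py, and `report` is a hypothetical helper. It only illustrates the collect-vs-raise semantics described above.

```python
from typing import Any, Callable, List, Optional


class SketchTransformationError(Exception):
    """Stand-in for TransformationError with a discriminating `kind` (illustrative only)."""

    def __init__(
        self,
        kind: str,
        message: str,
        class_derivation: Optional[str] = None,
        slot_derivation: Optional[str] = None,
        source_row: Any = None,
        row_index: Optional[int] = None,
    ) -> None:
        super().__init__(message)
        self.kind = kind
        self.class_derivation = class_derivation
        self.slot_derivation = slot_derivation
        self.source_row = source_row
        self.row_index = row_index


def report(
    error: SketchTransformationError,
    on_error: Optional[Callable[[SketchTransformationError], None]] = None,
) -> None:
    """Hand the error to on_error when one is provided; otherwise raise (fail-fast)."""
    if on_error is None:
        raise error
    on_error(error)


# Collect-and-report: the callback accumulates errors, and consumers filter by kind.
collected: List[SketchTransformationError] = []
report(
    SketchTransformationError(
        "required_missing",
        "slot 'name' is required but evaluated to None",
        class_derivation="Attribute",
        slot_derivation="name",
        source_row={"attributeName": " "},
        row_index=47,
    ),
    on_error=collected.append,
)
missing = [e for e in collected if e.kind == "required_missing"]
```

A plain string kind keeps filtering simple; a small subclass hierarchy would allow `except`-based filtering at the cost of more types to maintain.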
Relationship to existing infrastructure
- TransformationError (transformer/errors.py): already carries the right shape (derivation names, source_row, row_index, cause). Add a kind: str field or small subclass hierarchy distinguishing required_missing, range_mismatch, cardinality_mismatch, enum_violation, pattern_violation, unresolved_reference.
- on_error callback (transformer/engine.py:59): already the collect-vs-raise hook. No change needed; validation errors plug into the same path.
- --continue-on-error CLI flag (cli/cli.py:141): already wired. Target-validation errors flow through the same path.
- --strict (f2363c3): orthogonal; that one is about unbound names in expressions. The naming could become confusing as both grow; possibly rename one for clarity, or document the distinction prominently. See open questions.
Relationship to linkml-validate
This does not replace linkml-validate. The two layers compose:
- Transform-time validation: structural constraints native to the target schema, evaluated during emission, with source-instance provenance in each error. Covers the cases listed above.
- Post-transform linkml-validate: semantic / cross-record / dataset-level checks. Anything that depends on the whole output being assembled (uniqueness across records, inverse-slot consistency, complex constraint expressions).
The split is roughly: things you can know inside one derivation invocation → transform-time. Things that require the whole output → linkml-validate.
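
To illustrate why some checks cannot run inside a single derivation invocation, here is a hypothetical uniqueness check. It needs every emitted record before it can report anything, which is what places it on the linkml-validate side of the split. The function name and the record shape (plain dicts) are assumptions for this example.

```python
from collections import Counter
from typing import Any, Iterable, List, Mapping


def duplicate_identifiers(
    records: Iterable[Mapping[str, Any]], id_slot: str = "id"
) -> List[Any]:
    """Return identifier values that occur more than once across the whole output."""
    counts = Counter(
        record[id_slot] for record in records if record.get(id_slot) is not None
    )
    return [value for value, n in counts.items() if n > 1]


output = [{"id": "a"}, {"id": "b"}, {"id": "a"}]
duplicates = duplicate_identifiers(output)
```

Each record here is individually well-formed; the violation only exists at the dataset level.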
Reversibility implications
When inverting a trans-spec, the same validation infrastructure applies — the source schema becomes the target. One mechanism covers validation in both directions; no special-casing for reverse runs.
Open questions
- Validation cost. Running each derivation result through a meta-schema-driven check at emission time has measurable cost. Sample benchmarks before committing to default-on for all violation kinds; some (e.g., enum membership against a 10K-permissible-value enum) may want opt-in.
- --strict overload. linkml-map now has multiple notions of "strict": expression strict (f2363c3), and target-validation strict (this proposal). A single --strict flag toggling both is convenient but coarse. Consider per-area flags (--expression-strict, --validation-strict) with --strict as a meta-flag enabling all.
- Output shape on violation in non-strict mode. Two options for the emitted artifact: (a) emit the invalid value anyway (consistent with how runtime errors already behave under continue_on_error: output is as-the-trans-spec-produced-it, errors are advisory), or (b) emit a sentinel / skip the slot. Lean toward (a): predictable, debuggable, and matches existing semantics.
- Interaction with unrestricted_eval. No interaction expected; validation runs after expression evaluation regardless of eval mode.
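
The per-area flag idea from the --strict question could look something like the sketch below. It uses the standard-library argparse purely for illustration; the actual CLI in cli/cli.py may be built on a different framework, and --expression-strict and --validation-strict are only the names floated in this proposal, not existing options.

```python
import argparse
from typing import Dict, List


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="strictness-sketch")
    parser.add_argument("--strict", action="store_true",
                        help="meta-flag: enable every strict area")
    parser.add_argument("--expression-strict", action="store_true",
                        help="fail on unbound names in expressions")
    parser.add_argument("--validation-strict", action="store_true",
                        help="fail on target-schema structural violations")
    return parser


def resolve_strictness(argv: List[str]) -> Dict[str, bool]:
    """Resolve the effective per-area strictness from the parsed flags."""
    args = build_parser().parse_args(argv)
    return {
        "expression": args.strict or args.expression_strict,
        "validation": args.strict or args.validation_strict,
    }
```

Under this arrangement --strict stays a convenient default while each area can still be toggled on its own.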
References / contrasts
- f2363c3: strict expression evaluation (unbound names); orthogonal but related vocabulary
- TransformationError (transformer/errors.py): the infrastructure this extends
- on_error callback (transformer/engine.py:59): existing collect-vs-raise hook
- --continue-on-error (cli/cli.py:141): existing CLI surface
- slugify returns None on unusable input; "must-not-be-None" enforcement for required slots lives here