Transform-time validation against target schema constraints #241

## Motivation

linkml-map's existing strict mode (`f2363c3`) catches one class of expression bugs: unbound names (typos, stale source references). Row-level errors during transformation already flow through `TransformationError` + the `on_error` callback path, with `--continue-on-error` toggling fail-fast vs collect-and-report at the CLI.

There's a category of error that falls outside both: **the transform completed without exceptions, but the output is structurally invalid against the target schema.** A trans-spec might:

- Leave a `required: true` slot empty because the source field was missing
- Produce a value for a slot with `range: integer` from a string-valued expression
- Emit multiple values into a singular slot, or a scalar into a multivalued one
- Produce an enum value not in the target enum's permissible values
- Build a class with no identifier when the target requires one
- Fail to populate a `dictionary_key` slot needed for map-keying
- Violate a `pattern:` declared on the target slot

Today these surface only after the fact via `linkml-validate` on the produced output. That works, but it loses **source-instance provenance** — by the time linkml-validate runs, the error reads "class X has no name" without telling you that the source `Attribute` with `attributeName='  '` at row 47 of the input was the cause. For schema-construction trans-specs (importers like EML, XSD, JSON-Schema) source provenance is the most valuable piece of the diagnostic.

The collect-all-errors flow that the CLI already implements for row-level transformation errors has the right ergonomics for these structural errors too: run the transform once, get every malformed-input diagnostic in a single report, and fix them in one pass.

## Proposed direction

Extend transform-time validation to cover target-schema structural constraints, surfacing errors through the existing `TransformationError` + `on_error` infrastructure. Scope:

**Cardinality / presence**
- `required: true` slots evaluating to None
- `identifier: true` slots evaluating to None
- Slots used as `dictionary_key` evaluating to None
- Scalar slots getting list values
- `multivalued: true` slots getting non-iterable scalars

**Type and range**
- Range coercion failures (target range is `integer`, expression produced `"abc"`)
- Range membership for typed ranges (date strings that don't parse, URIs that don't pattern-match)
- Enum membership: produced value not in the target enum's permissible values

**Structural integrity**
- Pattern violations on slots with `pattern:` declared
- Unresolved cross-references (when the cross-ref proposal in #237 lands, a `ref()` consumer with no matching producer surfaces here)

Each violation builds a `TransformationError` populated with the class_derivation, slot_derivation, source row, and a discriminating `kind` field (or subclass) so consumers can filter by violation type. Errors flow through `on_error` when one is provided; otherwise they raise (fail-fast), matching the existing semantics. Non-strict mode emits the output as produced, with violations collected and reported at the end of the run (see the output-shape item under open questions).
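A minimal sketch of the per-slot checks described above, to make the scope concrete. `SlotSpec`, `Violation`, `check_slot`, and `report` are hypothetical stand-ins invented for illustration; they are not linkml-map or linkml-runtime APIs, and the use of `re.search` for `pattern:` is an assumption about anchoring semantics.

```python
import re
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class SlotSpec:
    """Hypothetical, simplified stand-in for a target slot definition."""
    name: str
    required: bool = False
    multivalued: bool = False
    range: str = "string"
    pattern: Optional[str] = None
    permissible_values: Optional[frozenset] = None


@dataclass
class Violation:
    """Hypothetical stand-in for a TransformationError carrying a `kind`."""
    kind: str
    slot_name: str
    row_index: int
    message: str


def check_slot(slot: SlotSpec, value: Any, row_index: int) -> List[Violation]:
    """Return every violation of `slot`'s constraints by one derived value."""
    found: List[Violation] = []

    def add(kind: str, message: str) -> None:
        found.append(Violation(kind, slot.name, row_index, message))

    if value is None:
        if slot.required:
            add("required_missing", "required slot evaluated to None")
        return found

    is_list = isinstance(value, (list, tuple))
    if is_list != slot.multivalued:
        add("cardinality_mismatch",
            "list value in a singular slot" if is_list
            else "scalar value in a multivalued slot")

    for item in (list(value) if is_list else [value]):
        if slot.range == "integer":
            try:
                int(item)  # simplistic coercion test, for illustration only
            except (TypeError, ValueError):
                add("range_mismatch", f"{item!r} is not coercible to integer")
        if slot.permissible_values is not None and item not in slot.permissible_values:
            add("enum_violation", f"{item!r} is not a permissible value")
        if slot.pattern is not None and re.search(slot.pattern, str(item)) is None:
            add("pattern_violation", f"{item!r} does not match {slot.pattern!r}")
    return found


def report(violations: List[Violation],
           on_error: Optional[Callable[[Violation], None]] = None) -> None:
    """Route violations to `on_error` if given; otherwise fail fast."""
    for violation in violations:
        if on_error is None:
            raise ValueError(f"{violation.kind}: {violation.message}")
        on_error(violation)


# Collect-and-report usage: one bad value at source row 47.
collected: List[Violation] = []
age = SlotSpec("age", required=True, range="integer")
report(check_slot(age, "abc", row_index=47), on_error=collected.append)
```

In the sketch, omitting `on_error` raises on the first violation, mirroring the fail-fast default described above.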

## Relationship to existing infrastructure

- **`TransformationError`** in `transformer/errors.py` — already carries the right shape (derivation names, source_row, row_index, cause). Add a `kind: str` field or small subclass hierarchy distinguishing `required_missing`, `range_mismatch`, `cardinality_mismatch`, `enum_violation`, `pattern_violation`, `unresolved_reference`.
- **`on_error` callback** in `transformer/engine.py:59` — already the collect-vs-raise hook. No change needed; validation errors plug into the same path.
- **`--continue-on-error` CLI flag** at `cli/cli.py:141` — already wired. Target-validation errors flow through the same path.
- **`--strict` (f2363c3)** — orthogonal: that one is about unbound names in expressions. The naming could become confusing as both grow; possibly rename one for clarity, or document the distinction prominently. See open questions.
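One way the `kind` discriminator could look, sketched as a string-valued enum plus a field on the error. `SketchTransformationError` is a hypothetical stand-in that only mirrors the fields named above (derivation names, source_row, row_index, cause); the real class in `transformer/errors.py` may be shaped differently.

```python
from collections import Counter
from enum import Enum
from typing import Any, Iterable, Optional


class ErrorKind(str, Enum):
    """Violation kinds listed in this proposal."""
    REQUIRED_MISSING = "required_missing"
    RANGE_MISMATCH = "range_mismatch"
    CARDINALITY_MISMATCH = "cardinality_mismatch"
    ENUM_VIOLATION = "enum_violation"
    PATTERN_VIOLATION = "pattern_violation"
    UNRESOLVED_REFERENCE = "unresolved_reference"


class SketchTransformationError(Exception):
    """Hypothetical stand-in; not the real TransformationError."""

    def __init__(self, message: str, *, kind: ErrorKind,
                 class_derivation: Optional[str] = None,
                 slot_derivation: Optional[str] = None,
                 source_row: Any = None,
                 row_index: Optional[int] = None,
                 cause: Optional[BaseException] = None) -> None:
        super().__init__(message)
        self.kind = kind
        self.class_derivation = class_derivation
        self.slot_derivation = slot_derivation
        self.source_row = source_row
        self.row_index = row_index
        self.cause = cause


def count_by_kind(errors: Iterable[SketchTransformationError]) -> Counter:
    """Summarize collected errors so a report can group them by kind."""
    return Counter(error.kind.value for error in errors)


errors = [
    SketchTransformationError("name is empty", kind=ErrorKind.REQUIRED_MISSING,
                              class_derivation="ClassDefinition",
                              slot_derivation="name", row_index=47),
    SketchTransformationError("'abc' is not an integer",
                              kind=ErrorKind.RANGE_MISMATCH, row_index=48),
    SketchTransformationError("id is empty", kind=ErrorKind.REQUIRED_MISSING,
                              row_index=52),
]
summary = count_by_kind(errors)
required_only = [e for e in errors if e.kind is ErrorKind.REQUIRED_MISSING]
```

A field keeps filtering cheap (`e.kind is ErrorKind.REQUIRED_MISSING`); a subclass hierarchy would instead allow `except`-style dispatch. The sketch does not settle that choice.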

## Relationship to linkml-validate

This does **not** replace `linkml-validate`. The two layers compose:

- **Transform-time validation:** structural constraints native to the target schema, evaluated during emission, with source-instance provenance in each error. Covers the cases listed above.
- **Post-transform `linkml-validate`:** semantic / cross-record / dataset-level checks. Anything that depends on the whole output being assembled (uniqueness across records, inverse-slot consistency, complex constraint expressions).

The split is roughly: things you can know inside one derivation invocation → transform-time. Things that require the whole output → linkml-validate.
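The split can be illustrated with two toy checks: one needs only the record a derivation just produced, the other needs the fully assembled output. Both functions are hypothetical illustrations operating on plain dicts; neither is a linkml-map or linkml-validate API.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List


def missing_required(record: Dict[str, Any], required_slots: Iterable[str]) -> List[str]:
    """Transform-time style check: decidable from a single derived record."""
    return [slot for slot in required_slots if record.get(slot) is None]


def duplicate_identifiers(records: Iterable[Dict[str, Any]], id_slot: str) -> List[Any]:
    """Post-transform style check: needs every record in the output."""
    counts = Counter(r[id_slot] for r in records if r.get(id_slot) is not None)
    return sorted(value for value, n in counts.items() if n > 1)


output = [
    {"id": "a", "name": "Alpha"},
    {"id": "b", "name": None},
    {"id": "a", "name": "Alpha again"},
]

# Each record can be checked the moment it is produced...
per_record = [missing_required(r, ["id", "name"]) for r in output]
# ...but the identifier clash is only visible once all records exist.
clashes = duplicate_identifiers(output, "id")
```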

## Reversibility implications

When inverting a trans-spec, the same validation infrastructure applies — the source schema becomes the target. One mechanism covers validation in both directions; no special-casing for reverse runs.

## Open questions

- **Validation cost.** Running each derivation result through a meta-schema-driven check at emission time adds cost on every emitted record. Benchmark representative trans-specs before committing to default-on for all violation kinds; some (e.g., enum membership against a 10K-permissible-value enum) may be better as opt-in.
- **`--strict` overload.** With this proposal, linkml-map would have two notions of "strict": expression strict (f2363c3) and target-validation strict (this proposal). A single `--strict` flag toggling both is convenient but coarse. Consider per-area flags (`--expression-strict`, `--validation-strict`) with `--strict` as a meta-flag enabling all.
- **Output shape on violation in non-strict mode.** Two options for the emitted artifact: (a) emit the invalid value anyway (consistent with how runtime errors already behave under `continue_on_error` — output is as-the-trans-spec-produced-it, errors are advisory), or (b) emit a sentinel / skip the slot. Lean toward (a): predictable, debuggable, and matches existing semantics.
- **Interaction with `unrestricted_eval`.** No interaction expected; validation runs after expression evaluation regardless of eval mode.
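The per-area flag idea from the `--strict` item above could be resolved roughly as follows. This sketch uses `argparse` only to stay within the standard library; it makes no claim about how `cli/cli.py` is actually built, and `build_parser` and `resolve_strictness` are hypothetical names.

```python
import argparse
from typing import Dict


def build_parser() -> argparse.ArgumentParser:
    """Parser with the proposed per-area flags plus the `--strict` meta-flag."""
    parser = argparse.ArgumentParser(prog="sketch-map")
    parser.add_argument("--strict", action="store_true",
                        help="enable every strictness area")
    parser.add_argument("--expression-strict", action="store_true",
                        help="fail on unbound names in expressions")
    parser.add_argument("--validation-strict", action="store_true",
                        help="fail on target-schema violations")
    return parser


def resolve_strictness(args: argparse.Namespace) -> Dict[str, bool]:
    """`--strict` implies each area; per-area flags enable just one."""
    return {
        "expression": args.strict or args.expression_strict,
        "validation": args.strict or args.validation_strict,
    }


only_validation = resolve_strictness(build_parser().parse_args(["--validation-strict"]))
everything = resolve_strictness(build_parser().parse_args(["--strict"]))
```

Under this scheme existing `--strict` users would get the broader behavior automatically, which is a compatibility trade-off to weigh alongside the naming question.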

## References / contrasts

- `f2363c3` — strict expression evaluation (unbound names); orthogonal but related vocabulary
- `TransformationError` in `transformer/errors.py` — the infrastructure this extends
- `on_error` callback in `transformer/engine.py:59` — existing collect-vs-raise hook
- `--continue-on-error` in `cli/cli.py:141` — existing CLI surface
- #237 — cross-ref resolution; unresolved bindings are a violation kind covered here
- #239 — multi-artifact emission; required-slot violations on secondary artifacts surface through this path
- #242 — `slugify` returns None on unusable input; "must-not-be-None" enforcement for required slots lives here
- schema-automator's EML importer (https://github.com/linkml/schema-automator/issues/208) — concrete consumer wanting transform-time provenance in errors

