Motivation
Trans-specs that emit identifier-shaped slots from human-authored source labels routinely need to sanitize names: replace whitespace and punctuation, fold case, ensure leading-character validity. Concrete driver: schema-automator's EML importer (linkml/schema-automator#208), where EML <entityName> and <attributeName> fields contain spaces, punctuation, and mixed case but the target LinkML class_definition.name / slot_definition.name requires valid identifiers.
The current eval helpers (replace, lower, etc., from 0f803b6) can approximate slugify via 6-10 chained calls, but the chain is fragile and obscures intent. Regex was deliberately kept out for ReDoS reasons; slugify is bounded-time string manipulation and doesn't open that surface.
Proposed direction
Add slugify(s, separator='_') to the safe builtin set in eval_utils.py. Default behavior:
- ASCII-fold (Unicode → ASCII transliteration)
- Lowercase
- Collapse non-alphanumeric runs to the separator
- Strip leading and trailing separators
- Ensure leading character is non-digit (prepend separator if needed)
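The default steps above can be sketched in a single regex-free pass. This is only an illustration under assumptions, not the proposed implementation: NFKD normalization followed by dropping non-ASCII bytes stands in for "ASCII-fold" (it is lossy for characters with no decomposition, such as ß), and the `None` handling anticipates the return semantics discussed later in this issue.

```python
import unicodedata
from typing import List, Optional


def slugify(s: Optional[str], separator: str = "_") -> Optional[str]:
    """Sketch of the proposed default behavior; linear in len(s), no regex."""
    if s is None:
        return None
    # ASCII-fold (assumption): decompose accents, then drop anything non-ASCII.
    folded = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("ascii")
    lowered = folded.lower()

    # Collect alphanumeric runs; every non-alphanumeric run acts as one break.
    parts: List[str] = []
    current: List[str] = []
    for ch in lowered:
        if ch.isalnum():
            current.append(ch)
        elif current:
            parts.append("".join(current))
            current = []
    if current:
        parts.append("".join(current))

    # Nothing extractable: signal "doesn't apply" instead of raising.
    if not parts:
        return None

    # Joining runs means there are no leading or trailing separators to strip.
    slug = separator.join(parts)
    # A leading digit is not identifier-safe, so prepend the separator.
    if slug[0].isdigit():
        slug = separator + slug
    return slug
```

For example, `slugify("Entity Name (v2)")` gives `entity_name_v2`, and `slugify("2nd sample")` gives `_2nd_sample`.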
Optionally include sibling helpers in the same bounded-time class: to_snake(s), to_camel(s), to_pascal(s). Often needed in tandem when trans-specs target schemas with different naming conventions per location.
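A rough sketch of what the sibling helpers might look like, to show they stay in the same bounded-time class. The word-splitting rule (break on non-alphanumerics and on a lowercase-or-digit to uppercase transition), the `_words` helper, and returning `None` for empty input are all assumptions made for illustration; the issue does not specify them, and acronym handling (e.g. `HTTPResponse`) is deliberately left unresolved here.

```python
from typing import List, Optional


def _words(s: str) -> List[str]:
    """Hypothetical splitter: one linear pass, no regex."""
    words: List[str] = []
    current: List[str] = []
    prev = ""
    for ch in s:
        if not ch.isalnum():
            # Punctuation or whitespace ends the current word.
            if current:
                words.append("".join(current))
                current = []
        else:
            # A camelCase hump (lower/digit followed by upper) starts a new word.
            if current and ch.isupper() and (prev.islower() or prev.isdigit()):
                words.append("".join(current))
                current = []
            current.append(ch)
        prev = ch
    if current:
        words.append("".join(current))
    return words


def _cap(word: str) -> str:
    return word[:1].upper() + word[1:].lower()


def to_snake(s: str) -> Optional[str]:
    words = _words(s)
    return "_".join(w.lower() for w in words) if words else None


def to_pascal(s: str) -> Optional[str]:
    words = _words(s)
    return "".join(_cap(w) for w in words) if words else None


def to_camel(s: str) -> Optional[str]:
    words = _words(s)
    if not words:
        return None
    return words[0].lower() + "".join(_cap(w) for w in words[1:])
```

Under these assumptions, `to_snake("attributeName")` yields `attribute_name` and `to_pascal("entity-name")` yields `EntityName`.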
Return semantics: None on no-extractable-content
slugify returns None when the input has no extractable identifier content — empty string, all-whitespace, all-punctuation, or input that collapses to empty after sanitization. This matches linkml-map's existing expression-layer convention: None is the SQL-style "doesn't apply" signal, propagates through case() arms, and composes with or for fallback chains:
range: { expr: "slugify(attributeName) or slugify(attributeLabel) or 'anonymous'" }
If slugify raised on empty input, the raise would short-circuit the trans-spec rather than letting the or chain do its job. Returning None keeps slugify composable with the rest of linkml-map's expression vocabulary.
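The fallback behavior can be illustrated in plain Python, outside linkml-map's evaluator. The `slugify` below is a compact stand-in (no ASCII-folding or leading-digit handling), and `resolve_name` is a hypothetical wrapper that mirrors the expression above; neither is part of any existing API.

```python
from typing import Optional


def slugify(s: Optional[str], separator: str = "_") -> Optional[str]:
    """Compact stand-in: lowercase, split on non-alphanumerics, None if nothing is left."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in (s or "").lower())
    return separator.join(cleaned.split()) or None


def resolve_name(row: dict) -> str:
    # Same shape as the trans-spec expression: each None falls through to the next arm.
    return (
        slugify(row.get("attributeName"))
        or slugify(row.get("attributeLabel"))
        or "anonymous"
    )
```

With a row whose `attributeName` is `"???"` and whose `attributeLabel` is `"Water Temp"`, the first arm yields `None` and the chain falls through to `water_temp`. A raising `slugify` would have aborted at the first arm instead.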
Enforcement of "this slot can't be None" lives at the schema layer
Schemas that require non-empty identifiers — class names, slot names, dictionary keys — enforce that requirement at the schema-derivation layer, not inside slugify. Reporting "this source instance has unusable data for required slot X" with full source provenance is the job of transform-time target validation (#241). slugify stays a simple, total string function; structural requirements live where they belong.
Implementation location
slugify could land in linkml-runtime's utility set if there's a reasonable home there, with linkml-map re-exporting it for the eval namespace. The same normalization is useful in schemasheets, schema-automator, and other tooling that today each re-roll their own variants. If linkml-runtime has no obvious home, ship it in linkml-map as a standalone module with a clear path to upstream later.
Open questions
- Separator default. _ vs -. LinkML identifier conventions skew snake-case → _ default. Configurable per-call.
- Unicode policy. ASCII-fold by default (predictable, identifier-safe) with slugify(s, allow_unicode=True) opt-in for cases where Unicode identifiers are wanted.
- Sibling helpers. Bundle to_snake / to_camel / to_pascal here, or scope this tight to slugify and follow up separately?
References / contrasts
- 0f803b6: safe builtin registry that this extends