Add slugify safe builtin for identifier-shaped expressions #242

@amc-corey-cox

Motivation

Trans-specs that emit identifier-shaped slots from human-authored source labels routinely need to sanitize names: replace whitespace and punctuation, fold case, and ensure the leading character is valid. Concrete driver: schema-automator's EML importer (linkml/schema-automator#208), where EML <entityName> and <attributeName> fields contain spaces, punctuation, and mixed case, but the target LinkML class_definition.name / slot_definition.name fields require valid identifiers.

The current eval helpers (replace, lower, etc., from 0f803b6) can approximate slugify via 6-10 chained calls, but the chain is fragile and obscures intent. Regex was deliberately kept out for ReDoS reasons; slugify is bounded-time string manipulation and doesn't open that surface.

Proposed direction

Add slugify(s, separator='_') to the safe builtin set in eval_utils.py. Default behavior:

  • ASCII-fold (Unicode → ASCII transliteration)
  • Lowercase
  • Collapse non-alphanumeric runs to the separator
  • Strip leading and trailing separators
  • Ensure leading character is non-digit (prepend separator if needed)
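The default behavior listed above could be sketched roughly as follows. This is an illustrative sketch, not the code in eval_utils.py: it assumes that NFKD decomposition plus dropping non-ASCII bytes is an acceptable stand-in for "ASCII-fold" (characters with no decomposition, such as ß, are simply dropped rather than transliterated), and it scans character by character so that no regex is involved.

```python
import unicodedata
from typing import Optional


def slugify(s: Optional[str], separator: str = "_") -> Optional[str]:
    """Hypothetical sketch of the proposed builtin (not linkml-map's actual code)."""
    if s is None:
        return None
    # ASCII-fold (assumption): decompose accented characters, drop what is not ASCII.
    folded = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("ascii")
    # Collect runs of alphanumerics in a single linear pass.
    parts = []
    current = []
    for ch in folded.lower():
        if ch.isalnum():
            current.append(ch)
        elif current:
            parts.append("".join(current))
            current = []
    if current:
        parts.append("".join(current))
    if not parts:
        # No extractable identifier content.
        return None
    # Joining the runs collapses every non-alphanumeric run to one separator
    # and leaves no leading or trailing separators.
    slug = separator.join(parts)
    # Ensure the leading character is not a digit.
    if slug[0].isdigit():
        slug = separator + slug
    return slug


print(slugify("Air Temp (°C)"))            # air_temp_c
print(slugify("2024 totals"))              # _2024_totals
print(slugify("Site ID", separator="-"))   # site-id
```

Each step touches every character at most once, so the running time is linear in the input length.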

Optionally include sibling helpers in the same bounded-time class: to_snake(s), to_camel(s), to_pascal(s). Often needed in tandem when trans-specs target schemas with different naming conventions per location.
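A rough sketch of what the sibling helpers might look like, under the same no-regex constraint. The word-splitting rule (split on non-alphanumerics and on lower-to-upper transitions) and the helper `_words` are assumptions made for illustration; the issue does not specify how acronyms or leading digits should be handled, and this sketch does nothing special for either.

```python
from typing import List, Optional


def _words(s: str) -> List[str]:
    """Hypothetical helper: split on non-alphanumerics and lower-to-upper boundaries."""
    words: List[str] = []
    current = ""
    for ch in s:
        if not ch.isalnum():
            if current:
                words.append(current)
            current = ""
        elif current and ch.isupper() and not current[-1].isupper():
            # camelCase boundary, e.g. "entity|Name"
            words.append(current)
            current = ch
        else:
            current += ch
    if current:
        words.append(current)
    return [w.lower() for w in words]


def to_snake(s: str) -> Optional[str]:
    return "_".join(_words(s)) or None


def to_pascal(s: str) -> Optional[str]:
    return "".join(w.capitalize() for w in _words(s)) or None


def to_camel(s: str) -> Optional[str]:
    words = _words(s)
    if not words:
        return None
    return words[0] + "".join(w.capitalize() for w in words[1:])


print(to_snake("entityName"))       # entity_name
print(to_camel("Attribute Name"))   # attributeName
print(to_pascal("site ID"))         # SiteId
```

Like the slugify proposal, these return None when nothing usable remains, so they would compose with `or` in the same way.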

Return semantics: None on no-extractable-content

slugify returns None when the input has no extractable identifier content — empty string, all-whitespace, all-punctuation, or input that collapses to empty after sanitization. This matches linkml-map's existing expression-layer convention: None is the SQL-style "doesn't apply" signal, propagates through case() arms, and composes with or for fallback chains:

range: { expr: "slugify(attributeName) or slugify(attributeLabel) or 'anonymous'" }

If slugify raised on empty input, the raise would short-circuit the trans-spec rather than letting the or chain do its job. Returning None keeps slugify composable with the rest of linkml-map's expression vocabulary.
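The fallback behavior can be illustrated in plain Python, assuming the expression evaluator follows Python's `or` semantics for None. The `slugify` here is a deliberately simplified stand-in (no ASCII folding or leading-digit handling), and `derive_name` is a hypothetical wrapper that mirrors the expression above.

```python
from typing import Optional


def slugify(s: Optional[str], separator: str = "_") -> Optional[str]:
    # Simplified stand-in for the proposed builtin: lowercase, treat every
    # non-alphanumeric character as a break, return None if nothing remains.
    cleaned = "".join(ch if ch.isalnum() else " " for ch in (s or "").lower())
    return separator.join(cleaned.split()) or None


def derive_name(row: dict) -> str:
    # Mirrors: slugify(attributeName) or slugify(attributeLabel) or 'anonymous'
    return (
        slugify(row.get("attributeName"))
        or slugify(row.get("attributeLabel"))
        or "anonymous"
    )


print(derive_name({"attributeName": "Air Temp", "attributeLabel": "unused"}))  # air_temp
print(derive_name({"attributeName": "   ", "attributeLabel": "Site ID"}))      # site_id
print(derive_name({"attributeName": "???"}))                                   # anonymous
```

With a raising slugify, the second and third rows would abort at the first call and never reach their fallbacks.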

Enforcement of "this slot can't be None" lives at the schema layer

Schemas that require non-empty identifiers — class names, slot names, dictionary keys — enforce that requirement at the schema-derivation layer, not inside slugify. Reporting "this source instance has unusable data for required slot X" with full source provenance is the job of transform-time target validation (#241). slugify stays a simple, total string function; structural requirements live where they belong.

Implementation location

slugify could land in linkml-runtime's utility set if there's a reasonable home there, with linkml-map re-exporting it for the eval namespace. The same normalization is useful in schemasheets, schema-automator, and other tooling, each of which re-rolls its own variant today. If linkml-runtime has no obvious home, ship it in linkml-map as a standalone module with a clear path to upstreaming later.

Open questions

  • Separator default. _ vs -. LinkML identifier conventions skew toward snake_case, which argues for _ as the default; the separator stays configurable per call.
  • Unicode policy. ASCII-fold by default (predictable, identifier-safe) with slugify(s, allow_unicode=True) opt-in for cases where Unicode identifiers are wanted.
  • Sibling helpers. Bundle to_snake / to_camel / to_pascal here, or scope this tight to slugify and follow up separately?

References / contrasts
