bug(schema): _slug mixed ASCII+non-ASCII collision #305

@egouilliard-leyton

Context

Discovered during review of the #277 fix (Unicode slug handling).

Description

_slug() strips non-ASCII characters after NFKD normalization and keeps only the ASCII portion. When a concept name contains both ASCII and non-ASCII characters, the non-ASCII part is silently dropped, so two different names with the same ASCII portion collide:

_slug("api_世界")  # → "api"
_slug("api_你好")  # → "api"

This produces identical ConceptNode.id values for semantically different concepts.

The pure-non-ASCII case is already handled by #277's hash fallback — this only affects mixed strings where the ASCII portion is non-empty.
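
The collision can be reproduced with a small stand-in. This is a minimal sketch, assuming _slug() normalizes with NFKD, drops non-ASCII characters, lowercases, and collapses other characters into underscores. The name `slug_before_fix`, the `u_` prefix, and the 8-character fallback hash are hypothetical and only approximate the #277 behaviour.

```python
import hashlib
import re
import unicodedata


def slug_before_fix(name: str) -> str:
    """Hypothetical stand-in for _slug(); the real implementation may differ."""
    normalised = unicodedata.normalize("NFKD", name)
    # Drop every character that cannot be encoded as ASCII.
    ascii_only = normalised.encode("ascii", "ignore").decode("ascii")
    # Collapse runs of non-alphanumeric characters and trim the edges.
    slug = re.sub(r"[^a-z0-9]+", "_", ascii_only.lower()).strip("_")
    if not slug:
        # Assumed shape of the #277 fallback for names with no ASCII content.
        slug = "u_" + hashlib.sha256(normalised.encode("utf-8")).hexdigest()[:8]
    return slug


print(slug_before_fix("api_世界"))  # api
print(slug_before_fix("api_你好"))  # api
```

Both calls print the same slug because only "api" survives the ASCII filter, while the pure-non-ASCII inputs take the hash branch and stay distinct.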

Suggested approach

When the ASCII-only slug is shorter than the original (after normalization), append a short hash suffix to disambiguate:

# Requires `import hashlib` at module level.
if len(slug) < len(normalised.strip()):
    slug += "_" + hashlib.sha256(normalised.encode("utf-8")).hexdigest()[:6]

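This preserves readability (the ASCII portion survives) while making collisions very unlikely: the 6-hex-digit suffix carries 24 bits of the SHA-256 digest, so uniqueness is probabilistic rather than guaranteed.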
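
A possible shape for the fix, as a sketch rather than the project's actual code. The function name `slug_with_suffix`, the regex, and the `u_` fallback are hypothetical. One deliberate deviation from the snippet above: the length check compares the ASCII-filtered string against the normalised string instead of the final slug, an assumption made here so that pure-ASCII names whose punctuation gets collapsed keep their current ids.

```python
import hashlib
import re
import unicodedata


def slug_with_suffix(name: str) -> str:
    """Hypothetical sketch of _slug() with a disambiguating hash suffix."""
    normalised = unicodedata.normalize("NFKD", name)
    digest = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
    # Drop every character that cannot be encoded as ASCII.
    ascii_only = normalised.encode("ascii", "ignore").decode("ascii")
    slug = re.sub(r"[^a-z0-9]+", "_", ascii_only.lower()).strip("_")
    if not slug:
        # Assumed shape of the #277 fallback for names with no ASCII content.
        return "u_" + digest[:8]
    # Add the suffix only when the ASCII filter actually removed characters.
    if len(ascii_only) < len(normalised):
        slug += "_" + digest[:6]
    return slug


print(slug_with_suffix("api_世界") != slug_with_suffix("api_你好"))
print(slug_with_suffix("Hello, World!"))  # hello_world
```

Names that lose characters to the filter (including accented Latin names such as "café", whose combining mark is dropped after NFKD) get a suffix. Whether existing ids may change for such names is a migration question this sketch does not address.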

Metadata

Labels: bug (Something isn't working)
