fix: transliterate non-Latin titles in URL slugs #1526
Open
ahmedqasid wants to merge 1 commit into
Pure-Arabic (and other non-Latin) titles previously got stripped by slugify and collapsed to the "topic" fallback, so every Arabic question landed at /questions/<id>/topic. Mirror the existing convertChinese pre-step using go-unidecode so titles in Arabic, Cyrillic, Hebrew, Thai etc. produce a readable ASCII slug. Latin-only and Chinese-only inputs short-circuit and remain byte-identical to the previous output. Gated by a package-level atomic flag (default on) exposed via SetTransliterateNonLatin so an admin toggle can be wired up in a follow-up PR without re-plumbing call sites.
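The pre-step described in the commit message can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code in the PR: `hasNonLatinLetter` and the `translit` parameter are hypothetical names, the detector approximates "Latin and CJK are skipped" with the standard library's `unicode.Latin` and `unicode.Han` tables, and the real transliterator (go-unidecode in the PR) is replaced by a caller-supplied function so the sketch stays standard-library only. The bodies of `SetTransliterateNonLatin`, `IsTransliterateNonLatinEnabled`, and `convertNonLatin` are likewise guesses at the shape, not the actual implementation.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"unicode"
)

// transliterateNonLatin is the package-level switch. The zero value of
// atomic.Bool is false, so init turns it on to match "default on".
var transliterateNonLatin atomic.Bool

func init() { transliterateNonLatin.Store(true) }

// SetTransliterateNonLatin lets a future admin setting flip the behavior.
func SetTransliterateNonLatin(enabled bool) { transliterateNonLatin.Store(enabled) }

// IsTransliterateNonLatinEnabled reports the current state of the switch.
func IsTransliterateNonLatinEnabled() bool { return transliterateNonLatin.Load() }

// hasNonLatinLetter (hypothetical name) reports whether s contains a letter
// that is neither ASCII, Latin-script, nor Han. Emoji, punctuation, and
// symbols are not letters, so they never trigger transliteration.
func hasNonLatinLetter(s string) bool {
	for _, r := range s {
		switch {
		case r <= unicode.MaxASCII:
			continue
		case !unicode.IsLetter(r):
			continue
		case unicode.Is(unicode.Latin, r), unicode.Is(unicode.Han, r):
			continue
		}
		return true
	}
	return false
}

// convertNonLatin calls translit only when the switch is on and the detector
// fires; otherwise the input is returned unchanged. translit stands in for
// the real transliteration library (an assumption of this sketch).
func convertNonLatin(s string, translit func(string) string) string {
	if !IsTransliterateNonLatinEnabled() || !hasNonLatinLetter(s) {
		return s
	}
	return translit(s)
}

func main() {
	// Placeholder transliterator, used only to show which inputs reach it.
	stub := func(string) string { return "[transliterated]" }
	for _, title := range []string{"hello world", "你好", "🎉", "كيف حالك"} {
		fmt.Printf("%q -> %q\n", title, convertNonLatin(title, stub))
	}
}
```

Passing the transliterator in as a function keeps the short-circuit logic testable on its own; in the PR itself the call is presumably made directly inside `pkg/htmltext`.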
Summary
- Pure-Arabic (and other non-Latin) titles were stripped by `slugify.Slugify` and collapsed to the literal `"topic"` fallback, so on a live site every Arabic question ended up at `/questions/<id>/topic`.
- This PR mirrors the existing `convertChinese` pre-step pattern in `pkg/htmltext/htmltext.go` using `github.com/mozillazg/go-unidecode` (same author as `go-pinyin`, which is already in the repo, to minimise new-dependency friction).

The fix
`UrlTitle()` now runs `convertNonLatin` after `convertChinese`. The detector skips ASCII, Latin-1 Supplement, Latin Extended, and CJK (which is handled by the existing pinyin step), so emoji / punctuation / symbols still flow into `clearEmoji` + `slugify` unchanged. Only when non-Latin letters are present does it pay the unidecode cost.

Example:
- كيف حالك → `kyf-hlk` (was `topic`).

Live deployment / real-world verification
This patch has been running in production on ask.namasoft.com (an Apache Answer instance we operate) since deployment, built directly from this branch via `docker compose build`. The site has Arabic-language questions, so the fix exercises the affected code path on every page load.

Sample question URL on the deployed instance:

Click the link and you'll see the slug is the transliterated Arabic title rather than `topic`. No data migration was needed since `url_title` is computed on every request from `Title` and never persisted (see "Why this is safe to ship" below).

Admin-configurable
The transliteration is gated by a package-level `atomic.Bool` (default on, since the current behavior is objectively broken for affected users):

- `htmltext.SetTransliterateNonLatin(enabled bool)`
- `htmltext.IsTransliterateNonLatinEnabled() bool`

This is deliberately the minimum surface needed to satisfy "the setting must be readable from `UrlTitle()`". A follow-up PR can add the admin UI section (Non-Latin Languages Handling) that calls `SetTransliterateNonLatin` on save and on startup, without having to re-plumb every `htmltext.UrlTitle` call site through `context.Context`.

Why this is safe to ship
`url_title` is not a persisted column. It's not on the `Question` entity in `internal/entity/question_entity.go`, no migration has ever added/dropped it, and every call site (`internal/service/content/question_service.go`, `revision_service.go`, `vote_service.go`, etc.) recomputes it from `Title` at response-build time via `htmltext.UrlTitle(...)`.

Test coverage
`pkg/htmltext/htmltext_test.go`:

- `TestUrlTitleTable` (table-driven): empty, pure Latin (unchanged), pure Chinese (unchanged, pins existing pinyin behavior), pure Arabic, mixed Latin+Arabic, emoji-only (still collapses to `topic` as before), very long Arabic (exercises `cutLongTitle`'s 150-byte cap and UTF-8 boundary safety).
- `TestUrlTitleTransliterationToggle`: with the toggle off, Arabic collapses to `topic` (pre-fix behavior); with it on, transliterates.
- `TestUrlTitle` left untouched.

Test plan for reviewers:
- `go test ./pkg/htmltext/...`: all pass locally
- […] `topic`
- […] `main` (covered by table tests)

Out of scope (intentionally)
- […] `Non-Latin Languages Handling` admin page + `SiteType` + service / controller / migration in a follow-up if maintainers want it.
- […] `"topic"` empty-result fallback.
- […] `convertChinese` pre-step pattern instead.

Issues / discussion
I didn't find an existing upstream issue covering this — happy to be pointed at one if there is.
🤖 Generated with Claude Code