Feature Proposal: Slang & Roman Urdu Normalization for spaCy #13939
MudassarGill started this conversation in New Features & Project Ideas
Replies: 0 comments
Feature Proposal: Smart Text Normalizer for Roman Urdu & English Slang
Problem Statement
Current NLP pipelines struggle with informal text from social media, WhatsApp chats, and casual conversations. While spaCy handles standard English well, it has no built-in support for normalizing Roman Urdu or English slang and abbreviations.
Roman Urdu matters especially for South Asian users: millions of people write Urdu in Latin script on social media, yet no major NLP library supports it.
Example
Input:
u kya kar raha hai bhai? idk btw
Current spaCy Output (without feature):
u → u
kya → kya
kar → kar
raha → raha
hai → hai
bhai → bhai
idk → idk
btw → btw
Expected Output (with feature):
u → you
kya → what
kar → kar
raha → going
hai → is
bhai → brother
idk → I don't know
btw → by the way
Proposed Implementation
1. Roman Urdu Dictionary (New File: spacy/lang/ur/norm_exceptions.py)
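The contents of the proposed dictionary file are not shown here, so the following is only a rough sketch of what a lookup-based normalizer could look like. It uses plain Python with no spaCy dependency. The names `ROMAN_URDU_NORMS`, `SLANG_NORMS`, `NORM_EXCEPTIONS`, and `normalize_tokens` are hypothetical, and the entries are limited to the mappings given in the example above. In an actual spaCy integration, the looked-up value would presumably be written to the token's norm attribute (`Token.norm_`) rather than returned as a pair.

```python
# Hypothetical sketch: table names and function are assumptions for illustration.
# Entries come only from the example in this proposal, not from a real dictionary.
ROMAN_URDU_NORMS = {
    "kya": "what",
    "raha": "going",
    "hai": "is",
    "bhai": "brother",
}

SLANG_NORMS = {
    "u": "you",
    "idk": "I don't know",
    "btw": "by the way",
}

# Merge both tables into a single lookup, keyed by lowercase surface form.
NORM_EXCEPTIONS = {**ROMAN_URDU_NORMS, **SLANG_NORMS}


def normalize_tokens(text):
    """Return (token, norm) pairs using whitespace splitting and a lowercase lookup.

    Tokens without an entry are returned unchanged as their own norm.
    """
    pairs = []
    for raw in text.split():
        token = raw.strip("?!.,")  # crude punctuation stripping, not a real tokenizer
        if not token:
            continue
        pairs.append((token, NORM_EXCEPTIONS.get(token.lower(), token)))
    return pairs


if __name__ == "__main__":
    for token, norm in normalize_tokens("u kya kar raha hai bhai? idk btw"):
        print(f"{token} → {norm}")
```

Run on the example input, this reproduces the expected output listed above, including the unchanged `kar → kar`, because `kar` has no entry in the table. A plain dictionary lookup like this cannot resolve tokens whose meaning depends on context, so it should be read as a starting point only.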