fix: implement number normalizers for German, Italian, enhance Dutch normalization with digit words #20
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,67 @@ | ||
| """German number normalizer using text2num's alpha2digit. | ||
|
|
||
| Converts spelled-out numbers to digits (e.g. zwanzig → 20) and handles | ||
| mixed digit+word forms (e.g. 2 hundert → zwei hundert) before conversion | ||
| so alpha2digit does not misinterpret them. | ||
|
|
||
| A post-pass replaces words alpha2digit leaves unconverted in isolation: | ||
| - 'null' → '0' (alpha2digit skips it standalone) | ||
| - 'zwei' → '2' (alpha2digit skips it standalone and in plain noun phrases) | ||
| 'ein'/'eins' are intentionally excluded — 'ein' is the German indefinite | ||
| article and cannot be safely replaced without context. | ||
| """ | ||
|
|
||
| import re | ||
|
|
||
| from text_to_num import alpha2digit | ||
|
|
||
| _DIGIT_TO_GERMAN: dict[str, str] = { | ||
| "0": "null", | ||
| "1": "ein", | ||
| "2": "zwei", | ||
| "3": "drei", | ||
| "4": "vier", | ||
| "5": "fünf", | ||
| "6": "sechs", | ||
| "7": "sieben", | ||
| "8": "acht", | ||
| "9": "neun", | ||
| } | ||
|
|
||
| _RE_MIXED_NUMBER = re.compile( | ||
| r"\b(\d+)\s+(hundert|tausend|millionen?|milliarden?|billionen?)\b", | ||
| re.IGNORECASE, | ||
| ) | ||
|
|
||
| _RE_ZWEI = re.compile(r"\bzwei\b", re.IGNORECASE) | ||
| _RE_NULL = re.compile(r"\bnull\b", re.IGNORECASE) | ||
|
|
||
|
|
||
| def _normalize_mixed_numbers(text: str) -> str: | ||
| """Convert '2 hundert' → 'zwei hundert' so alpha2digit yields 200, not '2 100'.""" | ||
|
|
||
| def replace(match: re.Match) -> str: | ||
| number = match.group(1) | ||
| multiplier = match.group(2) | ||
| if len(number) == 1 and number in _DIGIT_TO_GERMAN: | ||
| return f"{_DIGIT_TO_GERMAN[number]} {multiplier}" | ||
|
Comment on lines +31 to +47

Fix mixed digit + German scale preprocessing for singular scale words.

Line 32 misses valid singular scale words: `millionen?` and `billionen?` make only the final `n` optional, so `million` and `billion` never match. The proposed fix makes the whole plural suffix optional and emits `eine` instead of `ein` before the feminine singular scales.

🐛 Proposed fix
+_FEMININE_SINGULAR_SCALES = {"million", "milliarde", "billion"}
+
 _RE_MIXED_NUMBER = re.compile(
-    r"\b(\d+)\s+(hundert|tausend|millionen?|milliarden?|billionen?)\b",
+    r"\b(\d+)\s+(hundert|tausend|million(?:en)?|milliarde(?:n)?|billion(?:en)?)\b",
     re.IGNORECASE,
 )

         number = match.group(1)
         multiplier = match.group(2)
         if len(number) == 1 and number in _DIGIT_TO_GERMAN:
-            return f"{_DIGIT_TO_GERMAN[number]} {multiplier}"
+            digit_word = _DIGIT_TO_GERMAN[number]
+            if number == "1" and multiplier.lower() in _FEMININE_SINGULAR_SCALES:
+                digit_word = "eine"
+            return f"{digit_word} {multiplier}"
         return match.group(0)
|
||
| return match.group(0) | ||
|
|
||
| return _RE_MIXED_NUMBER.sub(replace, text) | ||
|
|
||
|
|
||
| def _fix_remaining_words(text: str) -> str: | ||
| """Replace number words alpha2digit did not convert.""" | ||
| text = _RE_ZWEI.sub("2", text) | ||
| text = _RE_NULL.sub("0", text) | ||
| return text | ||
|
|
||
|
|
||
| class GermanNumberNormalizer: | ||
| """Convert German spelled-out numbers to digits via text2num.alpha2digit.""" | ||
|
|
||
| def __call__(self, text: str) -> str: | ||
| text = _normalize_mixed_numbers(text) | ||
| text = alpha2digit(text, "de") | ||
| text = _fix_remaining_words(text) | ||
| return text | ||
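The review comment on `_RE_MIXED_NUMBER` above can be checked with a minimal sketch that uses only the standard `re` module. The two patterns are copied from the diff and from the proposed fix. `DIGIT_TO_GERMAN` is abbreviated, and the names `ORIGINAL`, `PROPOSED`, `hits`, and `replace_mixed` are illustrative assumptions, not part of the PR.

```python
import re

# Pattern as written in the diff: only the trailing "n" is optional.
ORIGINAL = re.compile(
    r"\b(\d+)\s+(hundert|tausend|millionen?|milliarden?|billionen?)\b",
    re.IGNORECASE,
)

# Pattern from the proposed fix: the whole plural suffix is optional.
PROPOSED = re.compile(
    r"\b(\d+)\s+(hundert|tausend|million(?:en)?|milliarde(?:n)?|billion(?:en)?)\b",
    re.IGNORECASE,
)

# Abbreviated digit table (assumption: the PR's table covers 0-9).
DIGIT_TO_GERMAN = {"1": "ein", "2": "zwei"}
FEMININE_SINGULAR_SCALES = {"million", "milliarde", "billion"}


def hits(pattern: re.Pattern, samples: list[str]) -> dict[str, bool]:
    """Report which samples the pattern matches."""
    return {sample: bool(pattern.search(sample)) for sample in samples}


def replace_mixed(text: str) -> str:
    """Spell out a single leading digit before a scale word, as in the proposed fix."""

    def replace(match: re.Match) -> str:
        number, multiplier = match.group(1), match.group(2)
        if len(number) == 1 and number in DIGIT_TO_GERMAN:
            digit_word = DIGIT_TO_GERMAN[number]
            if number == "1" and multiplier.lower() in FEMININE_SINGULAR_SCALES:
                digit_word = "eine"
            return f"{digit_word} {multiplier}"
        return match.group(0)

    return PROPOSED.sub(replace, text)


SAMPLES = ["1 million", "2 millionen", "1 milliarde", "1 billion", "3 hundert"]
original_hits = hits(ORIGINAL, SAMPLES)
proposed_hits = hits(PROPOSED, SAMPLES)
rewritten = replace_mixed("1 million und 2 hundert")
```

Under these assumptions the original pattern skips `1 million` and `1 billion` but already matches `1 milliarde`, while the proposed pattern matches all five samples. Whether `alpha2digit` then converts `eine million` as intended is not verified here.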
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,10 +1,25 @@ | ||
| from normalization.languages.base import LanguageConfig, LanguageOperators | ||
| from normalization.languages.german.number_normalizer import GermanNumberNormalizer | ||
| from normalization.languages.german.replacements import GERMAN_REPLACEMENTS | ||
| from normalization.languages.german.sentence_replacements import ( | ||
| GERMAN_SENTENCE_REPLACEMENTS, | ||
| ) | ||
| from normalization.languages.registry import register_language | ||
|
|
||
| _GERMAN_DIGIT_WORDS: dict[str, str] = { | ||
| "null": "0", | ||
| "ein": "1", | ||
| "eins": "1", | ||
|
Comment on lines +9 to +12

Keep ambiguous `ein` out of `_GERMAN_DIGIT_WORDS`.

Line 11 makes `ein` a digit word that maps to `1`, although it is also the German indefinite article. The proposed fix removes it from the mapping and lists it in `number_words` instead.

🐛 Proposed fix
 _GERMAN_DIGIT_WORDS: dict[str, str] = {
     "null": "0",
-    "ein": "1",
     "eins": "1",
     "zwei": "2",

     digit_words=_GERMAN_DIGIT_WORDS,
     number_words=[
+        "ein",
         *_GERMAN_DIGIT_WORDS,
         "zehn",

Also applies to: 49-50, 79-79
|
||
| "zwei": "2", | ||
| "drei": "3", | ||
| "vier": "4", | ||
| "fünf": "5", | ||
| "sechs": "6", | ||
| "sieben": "7", | ||
| "acht": "8", | ||
| "neun": "9", | ||
| } | ||
|
|
||
| GERMAN_CONFIG = LanguageConfig( | ||
| code="de", | ||
| decimal_separator=",", | ||
|
|
@@ -31,13 +46,48 @@ | |
| }, | ||
| filler_words=["äh", "ähm", "hm", "also", "naja", "halt"], | ||
| sentence_replacements=GERMAN_SENTENCE_REPLACEMENTS, | ||
| digit_words=_GERMAN_DIGIT_WORDS, | ||
| number_words=[ | ||
| *_GERMAN_DIGIT_WORDS, | ||
| "zehn", | ||
| "elf", | ||
| "zwölf", | ||
| "dreizehn", | ||
| "vierzehn", | ||
| "fünfzehn", | ||
| "sechzehn", | ||
| "siebzehn", | ||
| "achtzehn", | ||
| "neunzehn", | ||
| "zwanzig", | ||
| "dreißig", | ||
| "vierzig", | ||
| "fünfzig", | ||
| "sechzig", | ||
| "siebzig", | ||
| "achtzig", | ||
| "neunzig", | ||
| "hundert", | ||
| "tausend", | ||
| "million", | ||
| "millionen", | ||
| "milliarde", | ||
| "milliarden", | ||
| "billion", | ||
| "billionen", | ||
| ], | ||
| plus_word="plus", | ||
| ) | ||
|
|
||
|
|
||
| @register_language | ||
| class GermanOperators(LanguageOperators): | ||
| def __init__(self): | ||
| super().__init__(GERMAN_CONFIG) | ||
| self._number_normalizer = GermanNumberNormalizer() | ||
|
|
||
| def get_word_replacements(self) -> dict[str, str]: | ||
| return GERMAN_REPLACEMENTS | ||
|
|
||
| def expand_written_numbers(self, text: str) -> str: | ||
| return self._number_normalizer(text) | ||
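To illustrate the concern about `ein` raised in the review comment above, here is a small sketch of what a blind word-to-digit substitution would do to an ordinary sentence. The helper `replace_digit_words` is hypothetical: this page does not show how the pipeline actually consumes `digit_words`, so treat it as a stand-in, not as the repository's logic. The mappings are abbreviated.

```python
import re


def replace_digit_words(text: str, digit_words: dict[str, str]) -> str:
    """Hypothetical step: replace every standalone digit word with its digit."""
    if not digit_words:
        return text
    alternation = "|".join(re.escape(word) for word in digit_words)
    pattern = re.compile(rf"\b({alternation})\b", re.IGNORECASE)
    return pattern.sub(lambda m: digit_words[m.group(1).lower()], text)


# Mapping as in the diff (abbreviated) and as in the proposed fix.
WITH_EIN = {"null": "0", "ein": "1", "eins": "1", "zwei": "2"}
WITHOUT_EIN = {word: digit for word, digit in WITH_EIN.items() if word != "ein"}

sentence = "ich habe ein Haus und zwei Autos"
with_ein = replace_digit_words(sentence, WITH_EIN)
without_ein = replace_digit_words(sentence, WITHOUT_EIN)
```

With `ein` in the mapping the article is rewritten to `1`; without it only `zwei` changes. The trade-off is that a genuinely numeric `ein` then has to be handled by a context-aware step.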
| Original file line number | Diff line number | Diff line change | ||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,58 @@ | ||||||||||||||||||
| """Italian number normalizer using text2num's alpha2digit. | ||||||||||||||||||
|
|
||||||||||||||||||
| Converts spelled-out numbers to digits (e.g. venti → 20) and handles | ||||||||||||||||||
| mixed digit+word forms (e.g. 2 cento → due cento) before conversion | ||||||||||||||||||
| so alpha2digit does not misinterpret them. | ||||||||||||||||||
|
|
||||||||||||||||||
| A post-pass replaces words alpha2digit leaves unconverted in isolation: | ||||||||||||||||||
| - 'uno' → '1' | ||||||||||||||||||
| - 'due' → '2' | ||||||||||||||||||
| """ | ||||||||||||||||||
|
|
||||||||||||||||||
| import re | ||||||||||||||||||
|
|
||||||||||||||||||
| from text_to_num import alpha2digit | ||||||||||||||||||
|
|
||||||||||||||||||
| _RE_MIXED_NUMBER = re.compile( | ||||||||||||||||||
| r"\b(\d+)\s+(cento|mila?|milioni?|miliardi?)\b", | ||||||||||||||||||
| re.IGNORECASE, | ||||||||||||||||||
| ) | ||||||||||||||||||
|
Comment on lines +16 to +19

🧩 Analysis chain (repository: gladiaio/normalization)

🏁 Script executed:
#!/bin/bash
python - <<'PY'
import re
pat = re.compile(r"\b(\d+)\s+(cento|mila?|milioni?|miliardi?)\b", re.IGNORECASE)
for text in ("2 mila", "1 mille", "2 mil"):
    print(f"{text!r}: {bool(pat.search(text))}")
PY

🏁 Script executed:
find . -type f -name "*.py" -o -name "*.json" | head -20

🏁 Script executed:
fd -e py -e json | grep -i italian | head -20

🏁 Script executed:
rg -i "mille|mila" --type py | head -30

🏁 Script executed:
cat -n normalization/languages/italian/operators.py | head -50

🏁 Script executed:
cat -n normalization/languages/italian/number_normalizer.py

Match `mille` and `mila` explicitly.

Line 17 currently matches `mil` and `mila` (through `mila?`) but not `mille`.

Proposed fix
 _RE_MIXED_NUMBER = re.compile(
-    r"\b(\d+)\s+(cento|mila?|milioni?|miliardi?)\b",
+    r"\b(\d+)\s+(cento|mille|mila|milioni?|miliardi?)\b",
     re.IGNORECASE,
 )
|
||||||||||||||||||
|
|
||||||||||||||||||
| _RE_UNO = re.compile(r"\buno\b", re.IGNORECASE) | ||||||||||||||||||
| _RE_DUE = re.compile(r"\bdue\b", re.IGNORECASE) | ||||||||||||||||||
|
|
||||||||||||||||||
|
|
||||||||||||||||||
| def _fix_remaining_words(text: str) -> str: | ||||||||||||||||||
| """Replace number words alpha2digit did not convert.""" | ||||||||||||||||||
| text = _RE_UNO.sub("1", text) | ||||||||||||||||||
| text = _RE_DUE.sub("2", text) | ||||||||||||||||||
| return text | ||||||||||||||||||
|
|
||||||||||||||||||
|
|
||||||||||||||||||
| class ItalianNumberNormalizer: | ||||||||||||||||||
| """Convert Italian spelled-out numbers to digits via text2num.alpha2digit. | ||||||||||||||||||
|
|
||||||||||||||||||
| Accepts digit_words (word→digit mapping from LanguageConfig) to derive | ||||||||||||||||||
| the digit→word mapping used for mixed-form pre-passes (e.g. '2 cento' → 'due cento'). | ||||||||||||||||||
| """ | ||||||||||||||||||
|
|
||||||||||||||||||
| def __init__(self, digit_words: dict[str, str]) -> None: | ||||||||||||||||||
| self._digit_to_word = {v: k for k, v in digit_words.items()} | ||||||||||||||||||
|
|
||||||||||||||||||
| def _normalize_mixed_numbers(self, text: str) -> str: | ||||||||||||||||||
| """Convert '2 cento' → 'due cento' so alpha2digit yields 200, not '2 100'.""" | ||||||||||||||||||
|
|
||||||||||||||||||
| def replace(match: re.Match) -> str: | ||||||||||||||||||
| number = match.group(1) | ||||||||||||||||||
| multiplier = match.group(2) | ||||||||||||||||||
| if len(number) == 1 and number in self._digit_to_word: | ||||||||||||||||||
| return f"{self._digit_to_word[number]} {multiplier}" | ||||||||||||||||||
| return match.group(0) | ||||||||||||||||||
|
|
||||||||||||||||||
| return _RE_MIXED_NUMBER.sub(replace, text) | ||||||||||||||||||
|
|
||||||||||||||||||
| def __call__(self, text: str) -> str: | ||||||||||||||||||
| text = self._normalize_mixed_numbers(text) | ||||||||||||||||||
| text = alpha2digit(text, "it") | ||||||||||||||||||
| text = _fix_remaining_words(text) | ||||||||||||||||||
| return text | ||||||||||||||||||
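The Italian pre-pass and the review comment on its regex can be reproduced with a short sketch using only `re`. Both patterns are copied from this page. `DIGIT_WORDS` is an abbreviated stand-in for `LanguageConfig.digit_words`, and `normalize_mixed` is an illustrative free function rather than the class method from the diff.

```python
import re

# Pattern from the diff: "mila?" accepts "mil" and "mila" only.
ORIGINAL = re.compile(
    r"\b(\d+)\s+(cento|mila?|milioni?|miliardi?)\b", re.IGNORECASE
)

# Pattern from the proposed fix: "mille" and "mila" listed explicitly.
PROPOSED = re.compile(
    r"\b(\d+)\s+(cento|mille|mila|milioni?|miliardi?)\b", re.IGNORECASE
)

# Abbreviated word -> digit mapping (assumption), inverted as in __init__.
DIGIT_WORDS = {"uno": "1", "due": "2", "tre": "3"}
DIGIT_TO_WORD = {digit: word for word, digit in DIGIT_WORDS.items()}


def normalize_mixed(text: str, pattern: re.Pattern) -> str:
    """Spell out a single digit that precedes a multiplier word."""

    def replace(match: re.Match) -> str:
        number, multiplier = match.group(1), match.group(2)
        if len(number) == 1 and number in DIGIT_TO_WORD:
            return f"{DIGIT_TO_WORD[number]} {multiplier}"
        # Multi-digit numbers are left untouched.
        return match.group(0)

    return pattern.sub(replace, text)


samples = ("2 mila", "1 mille", "2 mil")
original_hits = {s: bool(ORIGINAL.search(s)) for s in samples}
proposed_hits = {s: bool(PROPOSED.search(s)) for s in samples}
```

Note that inverting a word-to-digit mapping keeps only one word per digit, so a mapping with two spellings for the same digit would silently drop one of them.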
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,10 +1,12 @@ | ||
| import re | ||
|
|
||
| from normalization.languages.base import LanguageConfig, LanguageOperators | ||
| from normalization.languages.italian.number_normalizer import ItalianNumberNormalizer | ||
| from normalization.languages.italian.replacements import ITALIAN_REPLACEMENTS | ||
| from normalization.languages.italian.sentence_replacements import ( | ||
| ITALIAN_SENTENCE_REPLACEMENTS, | ||
| ) | ||
| from normalization.languages.registry import register_language | ||
|
|
||
| # Single digits 1–9: shared by digit_words and any future time/compound helpers. | ||
| # Single digits 1-9: shared by digit_words and any future time/compound helpers. | ||
| _ONE_TO_NINE: dict[str, str] = { | ||
| "uno": "1", | ||
| "due": "2", | ||
|
|
@@ -17,11 +19,6 @@ | |
| "nove": "9", | ||
| } | ||
|
|
||
| ITALIAN_SENTENCE_REPLACEMENTS: dict[str, str] = { | ||
| # Spoken percentages (“dieci per cento”) → one canonical form aligned with “%” → percento | ||
| "per cento": "percento", | ||
| } | ||
|
|
||
| ITALIAN_CONFIG = LanguageConfig( | ||
| code="it", | ||
| decimal_separator=",", | ||
|
|
@@ -101,16 +98,12 @@ | |
| class ItalianOperators(LanguageOperators): | ||
| def __init__(self): | ||
| super().__init__(ITALIAN_CONFIG) | ||
|
|
||
| def fix_one_word_in_numeric_contexts(self, text: str) -> str: | ||
| text = re.sub(r"(\d+)\s+uno\s+uno\b", r"\1 1 1", text) | ||
| text = re.sub(r"\buno\s+uno\s+(\d)", r"1 1 \1", text) | ||
| text = re.sub(r"(\d+)\s+uno\s+(\d)", r"\1 1 \2", text) | ||
| text = re.sub(r"(\d+)\s+uno\b", r"\1 1", text) | ||
| text = re.sub(r"\b(\d+)uno\b", r"\1 1", text) | ||
| text = re.sub(r"\buno\s+(\d)", r"1 \1", text) | ||
| text = re.sub(r"^uno\s+(?=[a-z])", "1 ", text) | ||
| return text | ||
| self._number_normalizer = ItalianNumberNormalizer( | ||
| ITALIAN_CONFIG.digit_words or {} | ||
| ) | ||
|
|
||
| def get_word_replacements(self) -> dict[str, str]: | ||
| return ITALIAN_REPLACEMENTS | ||
|
|
||
| def expand_written_numbers(self, text: str) -> str: | ||
| return self._number_normalizer(text) | ||
|
Comment on lines +108 to +109

🧩 Analysis chain (repository: gladiaio/normalization)

🏁 Script executed:
#!/bin/bash
# Description: Inspect where the numeric-context step is included relative to n_to_digits.
# Expectation: If fix_one_word_in_numeric_contexts can run for Italian, ItalianOperators should still override it.
rg -n -C3 '\bfix_one_word_in_numeric_contexts\b|\bn_to_digits\b|ExpandWrittenNumbersToDigitsStep|FixOneWordInNumericContextsStep'

🏁 Script executed:
grep -n "fix_one_word_in_numeric_contexts" normalization/languages/italian/operators.py

🏁 Script executed:
cat -n normalization/languages/italian/operators.py | head -130

Implement `fix_one_word_in_numeric_contexts` in `ItalianOperators` if that step can still run for Italian; this diff removes the existing override.
|
||
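For reference when weighing the comment above, this sketch shows what the removed override did. The regular expressions are copied verbatim from the lines deleted from `ItalianOperators` in this diff; wrapping them in a free function and the sample inputs are illustrative choices.

```python
import re


def fix_one_word_in_numeric_contexts(text: str) -> str:
    """Rewrite 'uno' as '1' only when it sits next to digits or starts the text."""
    text = re.sub(r"(\d+)\s+uno\s+uno\b", r"\1 1 1", text)
    text = re.sub(r"\buno\s+uno\s+(\d)", r"1 1 \1", text)
    text = re.sub(r"(\d+)\s+uno\s+(\d)", r"\1 1 \2", text)
    text = re.sub(r"(\d+)\s+uno\b", r"\1 1", text)
    text = re.sub(r"\b(\d+)uno\b", r"\1 1", text)
    text = re.sub(r"\buno\s+(\d)", r"1 \1", text)
    text = re.sub(r"^uno\s+(?=[a-z])", "1 ", text)
    return text


examples = {
    sample: fix_one_word_in_numeric_contexts(sample)
    for sample in ("5 uno 3", "3uno", "uno due", "ho visto uno")
}
```

The removed method was context-sensitive, whereas the new `_fix_remaining_words` replaces every standalone `uno`. Whether both are needed depends on the order of pipeline steps, which this page does not show.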
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,3 @@ | ||
| ITALIAN_SENTENCE_REPLACEMENTS: dict[str, str] = { | ||
| "per cento": "percento", | ||
| } |
🧩 Analysis chain (repository: gladiaio/normalization)

🏁 Script executed:
# Read the full implementation of fix_dot_adjacent_number_words
cat -n normalization/steps/text/fix_dot_adjacent_number_words.py

🏁 Script executed:
# Read the full implementation of protect_plus_word_before_digit_words
cat -n normalization/steps/text/protect_plus_word_before_digit_words.py

🏁 Script executed:
# Let's also check the Dutch config to understand the full context
cat -n normalization/languages/dutch/operators.py

`"een"` matching in numeric contexts is intentional but warrants test coverage.

The functions `fix_dot_adjacent_number_words` and `protect_plus_word_before_digit_words` use word boundaries (`\b`) in their regex patterns, which provides protection against false positives in arbitrary text. However, `"een"` will still match when it appears in the specific numeric patterns these steps are designed for:

- `"punt"` (dot): `"versie punt een"` → `"versie punt 1"` ✓ (intended for IPs/versions)
- `"plus"` (plus): `"plus een"` → converts to phone context marker (intended for `+1` country codes)

In these contexts, the behavior is correct. However, there's a narrow edge case risk: ambiguous sentences like `"Dit kost plus een euro"` could be misparsed. Adding explicit tests for `"een"` in these numeric-context patterns would confirm the behavior is safe for your typical input corpus.