fix: ROUGE-1 eval fails for non-English languages (ASCII-only tokenizer)
The default RougeScorer tokenizer uses the regex r'\w+', which only matches
ASCII [a-zA-Z0-9_]. For non-Latin scripts (Thai, Chinese, Japanese,
etc.), it returns zero tokens, so ROUGE scores are 0.0 even when
the response matches the expected output exactly.
Add a _unicode_tokenize function that uses the re.UNICODE flag and falls
back to character-level tokenization for non-ASCII scripts.
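
A minimal sketch of the idea described above, not the code from this commit. The name unicode_tokenize_sketch and the rule "split any word containing a non-ASCII character into single characters" are assumptions made for illustration; the actual _unicode_tokenize may differ.

```python
import re

# Hypothetical illustration only; the real _unicode_tokenize may be different.
# re.UNICODE is already the default for str patterns in Python 3; it is
# spelled out here to mirror the commit description.
_WORD_RE = re.compile(r"\w+", re.UNICODE)


def unicode_tokenize_sketch(text):
    """Lowercase the text and return a list of tokens.

    Pure-ASCII words are kept whole. Any word that contains a non-ASCII
    character is split into single characters (assumed fallback rule),
    so scripts without spaces between words still produce tokens.
    """
    tokens = []
    for word in _WORD_RE.findall(text.lower()):
        if word.isascii():
            tokens.append(word)      # e.g. "hello"
        else:
            tokens.extend(word)      # e.g. "你好" becomes "你", "好"
    return tokens


english = unicode_tokenize_sketch("Hello, world!")
chinese = unicode_tokenize_sketch("你好世界")
mixed = unicode_tokenize_sketch("GPT 你好")
```

With this rule, identical non-Latin strings yield identical, non-empty token lists, so a unigram overlap metric such as ROUGE-1 no longer collapses to 0.0 on an exact match. The trade-off is that accented Latin words are also split into characters under this assumed rule.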
Closes #3111