Fixes language detection when a word mixes more than one language #1
pravj wants to merge 2 commits into libindic:master from
|
Can you make sure that this won't break anything? (I think it will.) I think we should detect the language that has more characters, rather than returning an error message. What do you think, @copyninja? |
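The "language with more characters" idea could be sketched roughly as below. This is only an illustration, not the code in 'langdetect.py': the names `SCRIPT_RANGES`, `script_of` and `detect_majority_lang` are hypothetical, and mapping the whole Devanagari block to hi_IN is a simplifying assumption (the same script is used by other languages too).

```python
from collections import Counter

# Hypothetical table for illustration only; not taken from langdetect.py.
# Each language code is mapped to the Unicode block of its script.
SCRIPT_RANGES = {
    "hi_IN": (0x0900, 0x097F),  # Devanagari block (assumed to mean Hindi here)
    "kn_IN": (0x0C80, 0x0CFF),  # Kannada block
}


def script_of(char):
    """Return the language code whose block contains char, or None."""
    code = ord(char)
    for lang, (start, end) in SCRIPT_RANGES.items():
        if start <= code <= end:
            return lang
    return None


def detect_majority_lang(word):
    """Return the language that owns the most characters in word.

    Characters outside the known blocks (punctuation, Latin, digits)
    are ignored. Returns None when no character matches any block.
    On a tie, the language seen first in the word wins.
    """
    counts = Counter(lang for lang in map(script_of, word) if lang is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

With this sketch, a mixed word is assigned to whichever script contributes more characters instead of producing an error, so callers keep getting a plain language code.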
|
It sure will. detect_lang(u'यहಎಂದ') => {u'\u092f\u0939\u0c8e\u0c82\u0ca6': 'kn_IN'} This isn't a valid word, either in Hindi or in Kannada! What happened here is mixing. @santhoshtr, being the original author of the module, I think you can give more insight on this issue. Can you please share your thoughts here? :-) @jishnu7, what do you mean by the language which has more characters? |
|
I would say that u'यहಎಂದ' is not a valid test case. If it returns kn_IN or hi_IN, it is not terribly wrong, so I won't recommend returning an error. But now that you have pointed out this case, a related valid case we need to test is Kannada (just an example) text surrounded by punctuation such as quotes, parentheses, etc. I guess our current code needs some improvement there. |
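One way to exercise the punctuation case is to strip surrounding punctuation before looking at the script of the word. The helper below is a hypothetical sketch (`strip_punctuation` is not a function from 'langdetect.py'); it relies only on the standard `unicodedata` module, where every punctuation category starts with "P".

```python
import unicodedata


def strip_punctuation(word):
    """Remove leading and trailing punctuation from word.

    A character counts as punctuation when its Unicode general
    category starts with 'P' (quotes, brackets, danda, and so on).
    Punctuation inside the word is left untouched.
    """
    def is_punct(ch):
        return unicodedata.category(ch).startswith("P")

    start, end = 0, len(word)
    while start < end and is_punct(word[start]):
        start += 1
    while end > start and is_punct(word[end - 1]):
        end -= 1
    return word[start:end]
```

Test inputs such as u'"ಎಂದ"' or u'(यह)' can then be passed through this helper and on to the detector, to check that quoted or parenthesised words get the same result as the bare word.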
|
Yes, you are all right; it will break some modules, as 'langdetect.py' is used in almost all of them. |
|
For every module that depends on 'langdetect', the breakage can be handled with that 'error' dict key. |
|
@pravj, before jumping to implementation I would suggest reading the reply from @santhoshtr. As I said, it's not really a valid word in either language, and as @santhoshtr said, returning an error is not recommended. But since you brought up this test case, consider testing the case suggested by @santhoshtr, i.e. language text surrounded by punctuation, parentheses, etc., and see if you can fix that. |
|
@santhoshtr mentioned punctuation, but the 'langdetect' module already handles that fine.
Actually, the previous method of detecting the language checked only the first letter of a word, and hence gave a wrong result for a word with mixed languages.
For example, in the previous method:
So I tried to fix this, and this is the change: