If we are going to use fastText, we should lowercase the text before language identification. At least with the official lid.176 model, uppercased text completely breaks identification for mid- and low-resource languages: they end up identified as the highest-resource language of the script (Russian for Cyrillic; English, Spanish, or French for Latin).
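A minimal sketch of what lowercasing before identification could look like, assuming the `fasttext` Python package and a locally downloaded `lid.176.bin`. The helper names (`normalize_for_lid`, `identify_language`, `load_lid_model`) and the model path are hypothetical, not part of any existing pipeline. Collapsing whitespace is an extra step beyond lowercasing, added because `predict()` rejects input that contains newlines.

```python
def normalize_for_lid(text: str) -> str:
    """Lowercase and collapse whitespace before language identification."""
    # Lowercasing is the fix proposed above. Collapsing whitespace is an
    # additional assumption: fastText's predict() raises on embedded newlines.
    return " ".join(text.lower().split())


def identify_language(model, text: str):
    """Return (language_code, probability) for the top prediction.

    `model` is anything exposing a fastText-style predict(text, k=...),
    which returns a tuple of labels and a sequence of probabilities.
    """
    labels, probs = model.predict(normalize_for_lid(text), k=1)
    # fastText labels look like "__label__uk"; strip the prefix.
    return labels[0].replace("__label__", "", 1), float(probs[0])


def load_lid_model(path: str = "lid.176.bin"):
    """Load the language-ID model; the default path is an assumption."""
    # Imported lazily so the helpers above can be used and tested
    # without the fasttext package installed.
    import fasttext

    return fasttext.load_model(path)
```

Usage would be something like `identify_language(load_lid_model(), "ПРИВІТ, СВІТЕ")`. Whether lowercasing recovers the right label for a given language is worth checking on a held-out sample before enabling it everywhere.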