Description
How to use GitHub
- Please use the 👍 reaction to show that you are interested in the same feature.
- Please don't comment if you have no relevant information to add. It's just extra noise for everyone subscribed to this issue.
- Subscribe to receive notifications on status change and new comments.
Feature request
Which Nextcloud Version are you currently using: v32.0.0
Is your feature request related to a problem? Please describe.
Large input texts get cut off at some point in the corresponding translated output. Where the cutoff happens seems to vary with the chosen target language, and the max_decoding_length parameter does not help much here, even with high values.
Describe the solution you'd like
Chunk the input text, perhaps into pieces of around 100 words, to keep the translation inputs small and digestible for the model.
Note: splitting and joining the texts will need some special care depending on the language of the input text, to handle different separators, RTL languages and languages written without spaces.
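A minimal sketch of the naive word-based chunking described above (the function name and chunk size are illustrative, not part of the app; as noted, this whitespace split only works for languages that use spaces):

```python
def chunk_text(text: str, max_words: int = 100) -> list[str]:
    """Split text into chunks of at most max_words whitespace-separated words.

    Naive sketch: assumes the language uses spaces as word separators.
    RTL and no-space languages (e.g. Japanese) would need a different
    segmentation strategy.
    """
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]
```

Joining the translated chunks back together would then be the reverse step, with the same language-dependent caveats.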
Describe alternatives you've considered
Split the input text by hand.
Additional context
- related: Setting the maximum number of words #68
- this function can be used for this purpose. It uses the same translate_batch function under the hood that we use here now: https://opennmt.net/CTranslate2/python/ctranslate2.Translator.html#ctranslate2.Translator.translate_iterable
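A rough sketch of how chunked input could be streamed through translate_iterable. The model directory and the tokenize() helper are placeholders (the real app would use the model's own tokenizer), so only the chunk generator is meant literally here:

```python
def tokenize(chunk: str) -> list[str]:
    # Placeholder: the real app would tokenize with the model's
    # SentencePiece/BPE tokenizer, not a whitespace split.
    return chunk.split()

def iter_token_chunks(text: str, max_words: int = 100):
    """Yield tokenized ~max_words-word chunks, ready for translate_iterable."""
    words = text.split()
    for i in range(0, len(words), max_words):
        yield tokenize(" ".join(words[i:i + max_words]))

def translate_long_text(text: str, model_dir: str = "model/") -> str:
    # Requires a converted CTranslate2 model at model_dir (placeholder path).
    import ctranslate2

    translator = ctranslate2.Translator(model_dir)
    # translate_iterable consumes an iterable of token lists and streams
    # TranslationResult objects, batching via translate_batch internally.
    results = translator.translate_iterable(iter_token_chunks(text))
    return " ".join(" ".join(r.hypotheses[0]) for r in results)
```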