Hi there, thanks for this amazing library! It seems perfect for a project I'm working on.
I am training a ColBERT model from scratch using a base BERT model that was pre-trained on my own custom, non-linguistic data.
My data is highly structured and symbolic. A typical sequence looks like this: 0_common 1_rare [MASK] 0_low_frequency. It's essentially a space-delimited list of custom tokens, not a natural language.
I'm looking at the RAGTrainer documentation and I have a question about the language_code parameter. The docs explain that this is used to get relevant processing utilities for languages like English ('en') or Japanese ('ja'), which makes perfect sense for things like sentence splitting.
My question is: What is the best practice for the language_code parameter when working with custom, non-linguistic data like mine?
Should I:
1. Omit the language_code parameter entirely?
2. Explicitly set it to language_code=None?
3. Use a specific value (e.g., 'any', 'raw', 'none') that signifies raw/generic processing, if one exists?
My main goal is to ensure that RAGatouille does not apply any language-specific normalization or sentence-splitting logic to my data, and that it's treated simply as a sequence of space-delimited tokens.
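To make that goal concrete, here is a minimal sketch in plain Python of the treatment I have in mind. It is not RAGatouille code and says nothing about what the library does internally; split_on_spaces is a hypothetical helper name used only for illustration.

```python
from typing import List


def split_on_spaces(sequence: str) -> List[str]:
    """Split a sequence on single spaces and do nothing else.

    Hypothetical helper for illustration only: no lowercasing, no
    punctuation handling, no sentence splitting.
    """
    return sequence.split(" ")


example = "0_common 1_rare [MASK] 0_low_frequency"
tokens = split_on_spaces(example)
print(tokens)

# Joining the tokens back together should reproduce the input exactly,
# i.e. the processing is lossless.
assert " ".join(tokens) == example
```

In other words, I would like each sequence to reach the model with its tokens unchanged and in their original order.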
Any guidance on this would be greatly appreciated. Thank you!