Question: Guidance on using language_code for custom, non-linguistic data #273

Hi there, thanks for this amazing library! It seems perfect for a project I'm working on.

I am training a ColBERT model from scratch using a base BERT model that was pre-trained on my own custom, non-linguistic data.

My data is highly structured and symbolic. A typical sequence looks like this: 0_common 1_rare [MASK] 0_low_frequency. It is essentially a space-delimited list of custom tokens, not natural language.

I'm looking at the RAGTrainer documentation and I have a question about the language_code parameter. The docs explain that this is used to get relevant processing utilities for languages like English ('en') or Japanese ('ja'), which makes perfect sense for things like sentence splitting.

My question is: What is the best practice for the language_code parameter when working with custom, non-linguistic data like mine?

Should I:

- Omit the language_code parameter entirely?
- Explicitly set it to language_code=None?
- Use a specific value (e.g., 'any', 'raw', 'none') that signifies raw/generic processing, if one exists?

My main goal is to ensure that RAGatouille does not apply any language-specific normalization or sentence-splitting logic to my data, and that each sequence is treated simply as a list of space-delimited tokens.
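
To make the intended behaviour concrete, here is a minimal sketch in plain Python of what I mean by "space-delimited tokens". It does not use RAGatouille at all, and split_tokens is a hypothetical helper name I made up for illustration; it only shows the processing I would like to end up with (whitespace splitting, no lowercasing, no normalization, no sentence splitting).

```python
from typing import List


def split_tokens(sequence: str) -> List[str]:
    """Hypothetical helper: split on whitespace only.

    Nothing is lowercased, normalized, or split into sentences;
    symbols such as [MASK] are kept as single tokens.
    """
    return sequence.split()


example = "0_common 1_rare [MASK] 0_low_frequency"
tokens = split_tokens(example)
print(tokens)

# Re-joining with single spaces reproduces the input exactly,
# which is the round-trip property I want to preserve.
assert " ".join(tokens) == example
```

If any step keyed on language_code alters a sequence so that this round trip no longer holds, that is the behaviour I am trying to avoid.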

Any guidance on this would be greatly appreciated. Thank you!
