Question: Guidance on using language_code for custom, non-linguistic data #273

@wangbaonan

Hi there, thanks for this amazing library! It seems perfect for a project I'm working on.

I am training a ColBERT model from scratch using a base BERT model that was pre-trained on my own custom, non-linguistic data.

My data is highly structured and symbolic. A typical sequence looks like this: 0_common 1_rare [MASK] 0_low_frequency. It's essentially a space-delimited list of custom tokens, not a natural language.

I'm looking at the RAGTrainer documentation and I have a question about the language_code parameter. The docs explain that this is used to get relevant processing utilities for languages like English ('en') or Japanese ('ja'), which makes perfect sense for things like sentence splitting.

My question is: What is the best practice for the language_code parameter when working with custom, non-linguistic data like mine?

Should I:

- Omit the language_code parameter entirely?
- Explicitly set it to language_code=None?
- Use a specific value (e.g., 'any', 'raw', 'none') that signifies raw/generic processing, if one exists?

My main goal is to ensure that RAGatouille does not apply any language-specific normalization or sentence-splitting logic to my data, and that each sequence is treated simply as a list of space-delimited tokens.
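
To make the intended treatment concrete, here is a minimal sketch in plain Python. It is not RAGatouille code and does not reflect how the library processes text internally; the helper name split_on_whitespace is hypothetical and only illustrates the behavior I am hoping for: splitting on whitespace with no lowercasing, normalization, or sentence splitting.

```python
# Illustrative only: plain Python, not part of RAGatouille's API.
# split_on_whitespace is a hypothetical helper name used for this example.
def split_on_whitespace(sequence: str) -> list[str]:
    """Split a sequence into tokens on whitespace, leaving each token untouched."""
    return sequence.split()


example = "0_common 1_rare [MASK] 0_low_frequency"
tokens = split_on_whitespace(example)
print(tokens)
# ['0_common', '1_rare', '[MASK]', '0_low_frequency']
```

Joining the tokens back with single spaces reproduces the original sequence, which is the round-trip property I would like any preprocessing to preserve.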

Any guidance on this would be greatly appreciated. Thank you!
