language-taxonomy

Unsupervised LMs for Building Data Driven Language Taxonomy for SEA languages

TO DO

  • implement tokenizer byte fallback
  • log model training progress (train and eval loss) to a file instead of printing
  • make running things more verbose (print out messages about what is being done now, etc.) (?)
  • decide on default training params (# of decoder blocks, etc.)
  • add cluster labels from agglomerative clustering algorithm to .json files in experiment_metadata/ instead
  • make model training also based on the experiment metadata .json file (model path, training data used, etc)

Running

Experiments are defined by .json files in experiment_metadata/ (create this dir). The name of this file acts as the experiment name. This file logs everything about the experiment (the model and tokenizer used, which training and test data are used, the generated ppl vectors and their labels, etc.)

1. make a basic experiment file, for example all_languages.json, and save it to experiment_metadata/

{
    "languages": {
        "eng": {
            "train_data": [
                "data/eng/*train*.txt"
            ],
            "test_data": [
                "data/eng/*test*.txt"
            ]
        },
        "ind": {
            "train_data": [
                "data/ind/*train*.txt"
            ],
            "test_data": [
                "data/ind/*test*.txt"
            ]
        },
        "zho": {
            "train_data": [
                "data/zho/*train*.txt"
            ],
            "test_data": [
                "data/zho/*test*.txt"
            ]
        }
    }
}
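
for reference, a minimal sketch of how such a file can be read and its data globs expanded (the experiment name and print format are illustrative, not part of the repo's code):

import glob
import json

# read the experiment metadata and expand each language's train/test globs
with open("experiment_metadata/all_languages.json") as f:
    meta = json.load(f)

for lang, cfg in meta["languages"].items():
    train_files = [p for pattern in cfg["train_data"] for p in glob.glob(pattern)]
    test_files = [p for pattern in cfg["test_data"] for p in glob.glob(pattern)]
    print(f"{lang}: {len(train_files)} train files, {len(test_files)} test files")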

2. make a universal tokenizer for the group of languages in all_languages.json

./run_train_tokenizer.sh

make sure that experiment_name is set to the name of the .json file in experiment_metadata/ (in this case all_languages). after running, all_languages.json should be updated with the attribute

"tokenizer": "trained_tokenizers/all_languages"

and all_languages is now a valid tokenizer in trained_tokenizers/
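
run_train_tokenizer.sh wraps the actual tokenizer training. as a rough sketch of what training a shared tokenizer over all languages in the experiment could look like (Hugging Face's ByteLevelBPETokenizer and the vocab size of 8000 are illustrative stand-ins, not the repo's actual implementation):

import glob
import json
import os

from tokenizers import ByteLevelBPETokenizer  # stand-in for the repo's tokenizer training code

experiment = "all_languages"
meta_path = f"experiment_metadata/{experiment}.json"
with open(meta_path) as f:
    meta = json.load(f)

# pool training files from every language so the tokenizer is shared
files = []
for cfg in meta["languages"].values():
    for pattern in cfg["train_data"]:
        files.extend(glob.glob(pattern))

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=files, vocab_size=8000, min_frequency=2)

out_dir = f"trained_tokenizers/{experiment}"
os.makedirs(out_dir, exist_ok=True)
tokenizer.save_model(out_dir)

# record the tokenizer path in the experiment metadata, as described above
meta["tokenizer"] = out_dir
with open(meta_path, "w") as f:
    json.dump(meta, f, indent=4)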

3. train models

run for each language

./run_train_model.sh

make sure the args are set correctly, example:

python src/train_model.py chinese_lm \
                            'data/zho/*.txt' \
                            --tokenizer 'trained_tokenizers/all_languages' \
                            --n_embd 256 \
                            --n_layer 2 \
                            --n_head 8
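
if you prefer to launch all runs from one place, a small driver along these lines works too; it simply shells out to src/train_model.py with the same arguments as above (the <lang>_lm naming and the fixed hyperparameters are illustrative, and per the TODO list, metadata-driven training is not part of the repo yet):

import json
import subprocess

experiment = "all_languages"
with open(f"experiment_metadata/{experiment}.json") as f:
    meta = json.load(f)

for lang, cfg in meta["languages"].items():
    # train_model.py takes a single data glob, as in the example above
    subprocess.run(
        ["python", "src/train_model.py", f"{lang}_lm", cfg["train_data"][0],
         "--tokenizer", meta["tokenizer"],
         "--n_embd", "256", "--n_layer", "2", "--n_head", "8"],
        check=True,
    )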

4. manually update the experiment metadata file (all_languages.json in this case)

this is not done automatically to avoid conflicts in the experiment metadata file in case we want to train models in parallel

update the .json file for each language to include a "model" key that indicates which model should be used

...
        "eng": {
            "model": "trained_models/english_lm/english_lm_finalmodel",
            "train_data": [
                "data/eng/*train*.txt"
            ],
            "test_data": [
                "data/eng/*test*.txt"
            ]
        }
...

5. run the perplexity vector generator

./run_ppl_vectors.sh

this script only needs the name of the experiment (in experiment_metadata/). after running, each language should have a "ppl_vector" attribute, and there's also a "ppl_vector_order" key that tells us which language each dimension of the perplexity vector corresponds to

{
    "languages": {
        "eng": {
            ...
            "ppl_vector": [
                149.94302368164062,
                165.8548583984375,
                178.46302795410156
            ]
    ...
    "ppl_vector_order": [
        "eng",
        "ind",
        "zho"
    ]
}
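
each entry of the vector is the perplexity of that language's test data under one of the trained models. the repo's own evaluation code isn't shown here, but the computation itself is standard: exponentiate the mean next-token negative log-likelihood. a sketch, assuming a GPT-style decoder whose forward pass returns logits of shape (batch, time, vocab), which may differ from the repo's model interface:

import math

import torch
import torch.nn.functional as F

def perplexity(model, token_ids, block_size=256, device="cpu"):
    # corpus perplexity = exp(mean next-token NLL in nats);
    # assumes model(x) -> logits of shape (B, T, vocab)
    model.eval()
    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for start in range(0, len(token_ids) - 1, block_size):
            chunk = token_ids[start:start + block_size + 1]
            if len(chunk) < 2:
                break
            x = torch.tensor(chunk[:-1], device=device).unsqueeze(0)
            y = torch.tensor(chunk[1:], device=device).unsqueeze(0)
            logits = model(x)
            nll = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1), reduction="sum")
            total_nll += nll.item()
            total_tokens += y.numel()
    return math.exp(total_nll / total_tokens)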

6. run the clustering (will generate a dendrogram in dendrogram/)

after setting up the right arguments:

 ./run_clustering.sh
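
the clustering itself amounts to hierarchical (agglomerative) clustering over the ppl vectors. a minimal sketch with scipy (the "average" linkage and output filename are illustrative choices, not necessarily what run_clustering.sh uses):

import json
import os

import numpy as np
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

with open("experiment_metadata/all_languages.json") as f:
    meta = json.load(f)

labels = list(meta["languages"].keys())
vectors = np.array([meta["languages"][lang]["ppl_vector"] for lang in labels])

# agglomerative clustering over the perplexity vectors, plotted as a dendrogram
Z = linkage(vectors, method="average")
dendrogram(Z, labels=labels)

os.makedirs("dendrogram", exist_ok=True)
plt.savefig("dendrogram/all_languages.png")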
