Unsupervised LMs for Building a Data-Driven Language Taxonomy for SEA Languages
- implement tokenizer byte fallback
- log model training progress (train and eval loss) to a file instead of printing (see the sketch after this list)
- make runs more verbose (print messages about what is currently being done, etc.) (?)
- decide on default training params (# of decoder blocks, …)
- add cluster labels from the agglomerative clustering algorithm to the `.json` files in `experiment_metadata/` instead
- make model training also based on the experiment metadata `.json` file (model path, training data used, etc.)
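For the logging item above, a minimal sketch using Python's standard `logging` module; the file name, logger name, and `log_progress` helper are placeholders, not the project's actual choices:

```python
import logging

# Hypothetical setup for the "log training progress to a file" TODO item.
logging.basicConfig(
    filename="train.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
logger = logging.getLogger("train_model")

def log_progress(step: int, train_loss: float, eval_loss: float) -> None:
    """Append train/eval loss to the log file instead of printing."""
    logger.info("step=%d train_loss=%.4f eval_loss=%.4f", step, train_loss, eval_loss)
```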
Experiments are defined by `.json` files in `experiment_metadata/` (create this directory). The name of the file acts as the experiment name. The file logs everything: the model and tokenizer used, the training and test data, the generated perplexity (ppl) vectors and their labels, etc. For example:
```json
{
    "languages": {
        "eng": {
            "train_data": [
                "data/eng/*train*.txt"
            ],
            "test_data": [
                "data/eng/*test*.txt"
            ]
        },
        "ind": {
            "train_data": [
                "data/ind/*train*.txt"
            ],
            "test_data": [
                "data/ind/*test*.txt"
            ]
        },
        "zho": {
            "train_data": [
                "data/zho/*train*.txt"
            ],
            "test_data": [
                "data/zho/*test*.txt"
            ]
        }
    }
}
```
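As a concrete illustration, one way the scripts might load such a file and expand the glob patterns; this is a sketch, and the helper names (`load_experiment`, `resolve_files`) are hypothetical:

```python
import glob
import json

def load_experiment(name: str) -> dict:
    """Load an experiment definition from experiment_metadata/<name>.json."""
    with open(f"experiment_metadata/{name}.json") as f:
        return json.load(f)

def resolve_files(patterns: list[str]) -> list[str]:
    """Expand the glob patterns stored under train_data / test_data."""
    return sorted(path for pattern in patterns for path in glob.glob(pattern))

experiment = load_experiment("all_languages")
for lang, spec in experiment["languages"].items():
    print(lang, resolve_files(spec["train_data"]))
```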
`./run_train_tokenizer.sh` — make sure that `experiment_name` is set to the name of the `.json` file in `experiment_metadata/` (in this case `all_languages`). After running, `all_languages.json` should be updated to have the attribute `"tokenizer": "trained_tokenizers/all_languages"`, and `all_languages` is now a valid tokenizer in `trained_tokenizers/`.
`./run_train_model.sh` — run once per language, making sure the args are set correctly. Example:
```bash
python src/train_model.py chinese_lm \
    'data/zho/*.txt' \
    --tokenizer 'trained_tokenizers/all_languages' \
    --n_embd 256 \
    --n_layer 2 \
    --n_head 8
```

Registering the trained model in the experiment metadata file is not done automatically, to avoid conflicts in case we want to train models in parallel.
Instead, update the `.json` file for each language to include a `"model"` key that indicates which model should be used:
```json
...
"eng": {
    "model": "trained_models/english_lm/english_lm_finalmodel",
    "train_data": [
        "data/eng/*train*.txt"
    ],
    "test_data": [
        "data/eng/*test*.txt"
    ]
}
...
```
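The manual update can also be scripted. A sketch, assuming the metadata layout above; `register_model` is a hypothetical helper, and note that it still rewrites the whole file, so it should not run concurrently with another update:

```python
import json

def register_model(experiment: str, lang: str, model_path: str) -> None:
    """Add a "model" key for one language in the experiment metadata file."""
    path = f"experiment_metadata/{experiment}.json"
    with open(path) as f:
        metadata = json.load(f)
    metadata["languages"][lang]["model"] = model_path
    with open(path, "w") as f:
        json.dump(metadata, f, indent=4)

register_model("all_languages", "eng", "trained_models/english_lm/english_lm_finalmodel")
```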
`./run_ppl_vectors.sh` — this script only needs the name of the experiment (in `experiment_metadata/`). After running, each language should have a `"ppl_vector"` attribute, and there is also a `"ppl_vector_order"` key that tells us the language of each dimension in the perplexity vector:
```json
{
    "languages": {
        "eng": {
            ...
            "ppl_vector": [
                149.94302368164062,
                165.8548583984375,
                178.46302795410156
            ]
            ...
            "ppl_vector_order": [
                "eng",
                "ind",
                "zho"
            ]
        }
```
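For reference, each perplexity entry is presumably the exponentiated mean per-token negative log-likelihood of one test set under the language's model. A minimal sketch of that relationship; the mean NLL values below are placeholders roughly back-solved from the example above:

```python
import math

def perplexity(mean_nll: float) -> float:
    """Perplexity is exp of the mean per-token negative log-likelihood."""
    return math.exp(mean_nll)

# e.g. the eng model scoring the eng/ind/zho test sets:
mean_nlls = {"eng": 5.0103, "ind": 5.1111, "zho": 5.1843}
ppl_vector = [perplexity(mean_nlls[lang]) for lang in ["eng", "ind", "zho"]]
```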
After setting up the right arguments, run `./run_clustering.sh`.
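A minimal sketch of what the clustering step might do with the per-language perplexity vectors, assuming scikit-learn's `AgglomerativeClustering`; the function and argument names are illustrative, not the script's actual interface:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_languages(experiment: dict, n_clusters: int = 2) -> dict[str, int]:
    """Cluster languages by their perplexity vectors; return label per language."""
    langs = sorted(experiment["languages"])
    vectors = np.array([experiment["languages"][l]["ppl_vector"] for l in langs])
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(vectors)
    return dict(zip(langs, labels.tolist()))
```

The resulting labels could then be written back into the experiment metadata `.json` file, per the TODO item above.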