
Missing key tokenizer in NeMo NMT multilingual fine-tuning #210


Description

@syedhamza671

Hello there,

I followed the tutorial for fine-tuning multilingual NMT models, but in the last step, where the tutorial says to run megatron_nmt_training.py, it fails with an error about a missing key: tokenizer is not found in the pretrained model's config file.

I ran this command:

HYDRA_FULL_ERROR=1 python /opt/NeMo/examples/nlp/machine_translation/megatron_nmt_training.py \
    trainer.precision=32 \
    trainer.devices=1 \
    trainer.max_epochs=5 \
    trainer.max_steps=200000 \
    trainer.val_check_interval=5000 \
    trainer.log_every_n_steps=5000 \
    model.multilingual=True \
    model.pretrained_model_path=workspace/model/pretrained_ckpt/megatronnmt_any_en_500m.nemo \
    model.micro_batch_size=1 \
    model.global_batch_size=2 \
    model.encoder_tokenizer.library=sentencepiece \
    model.decoder_tokenizer.library=sentencepiece \
    model.encoder_tokenizer.model=workspace/tokenizer/spm_64k_all_32_langs_plus_en_nomoses.model \
    model.decoder_tokenizer.model=workspace/tokenizer/spm_64k_all_32_langs_plus_en_nomoses.model \
    model.src_language=[es,pt] \
    model.tgt_language=en \
    model.train_ds.src_file_name=workspace/data/train_src_files \
    model.train_ds.tgt_file_name=workspace/data/train_tgt_files \
    model.test_ds.src_file_name=workspace/data/en_es_final_es_test_filepath \
    model.test_ds.tgt_file_name=workspace/data/en_es_final_en_test_filepath \
    model.validation_ds.src_file_name=workspace/data/val_src_files \
    model.validation_ds.tgt_file_name=workspace/data/val_tgt_files \
    model.optim.lr=0.00001 \
    model.train_ds.concat_sampling_probabilities=[0.1,0.1] \
    ++model.pretrained_language_list=None \
    +model.optim.sched.warmup_steps=500 \
    ~model.optim.sched.warmup_ratio \
    exp_manager.resume_if_exists=True \
    exp_manager.resume_ignore_no_checkpoint=True \
    exp_manager.create_checkpoint_callback=True \
    exp_manager.checkpoint_callback_params.monitor=val_sacreBLEU_avg \
    exp_manager.checkpoint_callback_params.mode=max \
    exp_manager.checkpoint_callback_params.save_top_k=5 \
    +exp_manager.checkpoint_callback_params.save_best_model=true
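
(As an aside, the SentencePiece file passed to both model.encoder_tokenizer.model and model.decoder_tokenizer.model can be sanity-checked on its own before training; a minimal sketch, assuming only the path used in the command above:)

# Minimal sanity check (not part of the tutorial): confirm the SentencePiece
# model referenced by the encoder/decoder tokenizer overrides loads correctly.
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("workspace/tokenizer/spm_64k_all_32_langs_plus_en_nomoses.model")
print(sp.GetPieceSize())                # presumably ~64k pieces, per the file name
print(sp.EncodeAsPieces("Hola mundo"))  # should return subword pieces, not fail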

Running this command gives the following error:

Traceback (most recent call last):
  File "/opt/NeMo/examples/nlp/machine_translation/megatron_nmt_training.py", line 113, in main
    pretrained_cfg.encoder_tokenizer = pretrained_cfg.tokenizer
omegaconf.errors.ConfigAttributeError: Missing key tokenizer
    full_key: tokenizer
    object_type=dict
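
The failing line (line 113 of megatron_nmt_training.py) copies pretrained_cfg.tokenizer into pretrained_cfg.encoder_tokenizer, so the exception means the config stored inside megatronnmt_any_en_500m.nemo has no top-level tokenizer key; NMT checkpoints typically carry encoder_tokenizer/decoder_tokenizer sections instead. A quick way to confirm which keys this particular checkpoint actually contains (a minimal sketch, assuming the .nemo file is a standard NeMo tar archive holding a model_config.yaml):

# Inspect the config packed inside the .nemo archive without loading weights.
import tarfile
from omegaconf import OmegaConf

nemo_path = "workspace/model/pretrained_ckpt/megatronnmt_any_en_500m.nemo"
with tarfile.open(nemo_path) as tar:
    member = next(m for m in tar.getmembers() if m.name.endswith("model_config.yaml"))
    cfg = OmegaConf.create(tar.extractfile(member).read().decode())

# Which tokenizer keys does the checkpoint define?
print("tokenizer:        ", "tokenizer" in cfg)
print("encoder_tokenizer:", "encoder_tokenizer" in cfg)
print("decoder_tokenizer:", "decoder_tokenizer" in cfg)

If encoder_tokenizer/decoder_tokenizer show up but tokenizer does not, the script's pretrained-model branch appears to expect a single-tokenizer (LM-style) checkpoint rather than an NMT .nemo.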
