Yang Han1,2, Pengyu Wang1,2, Kai Yu1,2, Xin Chen2, Lu Chen1,2
1 X-LANCE Lab, Shanghai Jiao Tong University, Shanghai. 2 Suzhou Laboratory, Suzhou.
TL;DR: MS-BART is the first approach to leverage a language model for mass-spectrum structure elucidation, introducing a unified vocabulary and enabling end-to-end pretraining, fine-tuning, and alignment.
```bash
conda env create -f environment.yml
conda activate ms-bart
```

You can download the preprocessed data and model weights from Figshare and put them in the `data` folder.
The folder tree is:

```
data
├─ CANOPUS
│  ├─ mist
│  ├─ model-weights     # the final MS-BART model on the CANOPUS dataset
│  ├─ pretrain-data     # clean pretrain data (filtered at Tanimoto similarity > 0.5)
│  ├─ pretrained-model  # pretrained on the clean 4M pretrain dataset
│  ├─ train
│  ├─ test
│  └─ val
└─ MassSpecGym
   ├─ mist              # retrained with the clean CANOPUS dataset
   ├─ model-weights     # the final MS-BART model on the MassSpecGym dataset
   ├─ pretrain-data
   ├─ pretrained-model
   ├─ train
   ├─ test
   └─ val
```
Alternatively, you can download the original data, preprocess it, and train from scratch:
```bash
# Generate the pretrain dataset and hold out 10,000 samples for validation,
# used to choose the best model for fine-tuning and alignment
python preprocess/generate_pretrain_data.py
python preprocess/split_pretrain_dataset.py

# Generate the fingerprints and split into train/val/test following the raw division
python preprocess/generate_canopus_and_lables.py
python preprocess/generate_mgf_and_lables.py
python preprocess/fp_pred_main.py
python preprocess/prepare_test_data.py
```
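The "clean" pretrain data mentioned above drops molecules whose Tanimoto similarity exceeds 0.5, presumably against the evaluation molecules. A minimal sketch of that filter, assuming fingerprints are represented as sets of on-bit indices (the helper names here are hypothetical, not the actual logic of `generate_pretrain_data.py`):

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def filter_by_similarity(pretrain_fps, test_fps, threshold=0.5):
    """Keep pretrain molecules whose maximum Tanimoto similarity to any
    held-out test fingerprint does not exceed the threshold."""
    kept = []
    for mol_id, fp in pretrain_fps:
        if max((tanimoto(fp, t) for t in test_fps), default=0.0) <= threshold:
            kept.append(mol_id)
    return kept
```

In practice this would be computed on RDKit bit-vector fingerprints; the set representation above keeps the sketch self-contained.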
```bash
bash scripts/pretrain.sh

bash scripts/canopus/finetune.sh
bash scripts/canopus/align.sh

bash scripts/msg/finetune.sh
bash scripts/msg/align.sh
bash scripts/msg/eval.sh
```

If you have any questions, please reach out to csyanghan@sjtu.edu.cn.
