
[NeurIPS 2025] MS-BART: Unified Modeling of Mass Spectra and Molecules for Structure Elucidation


OpenDFM/MS-BART


MS-BART: Unified Modeling of Mass Spectra and Molecules for Structure Elucidation (NeurIPS 2025)

Yang Han¹,², Pengyu Wang¹,², Kai Yu¹,², Xin Chen², Lu Chen¹,²

¹ X-LANCE Lab, Shanghai Jiao Tong University, Shanghai; ² Suzhou Laboratory, Suzhou

arXiv

TL;DR: MS-BART is the first to leverage a language model for structure elucidation from mass spectra, by introducing a unified vocabulary and enabling end-to-end pretraining, fine-tuning, and alignment.

Environment Setup

conda env create -f environment.yml
conda activate ms-bart

Preprocessed Dataset and Model Weights

You can download the preprocessed dataset and model weights from Figshare and put them in the data folder.

The folder tree is:

data
├─ CANOPUS
│  ├─ mist
│  ├─ model-weights # The final MS-BART model on CANOPUS dataset
│  ├─ pretrain-data # clean pretrain data (filter Tanimoto similarity > 0.5)
│  ├─ pretrained-model # pretrain on clean 4M pretrain dataset
│  ├─ train
│  ├─ test
│  ├─ val
├─ MassSpecGym
│  ├─ mist # retrained with clean CANOPUS dataset
│  ├─ model-weights # The final MS-BART model on MassSpecGym dataset
│  ├─ pretrain-data
│  ├─ pretrained-model
│  ├─ train
│  ├─ test
│  ├─ val
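
After unpacking the download, it can help to confirm that the layout matches the tree above before launching any script. The following is a minimal sketch, not part of the repository: the helper name `missing_folders` is hypothetical, and it only checks that the folders exist, not what they contain.

```python
from pathlib import Path

# Folder names copied from the tree above; both datasets list the same subfolders.
DATASETS = ("CANOPUS", "MassSpecGym")
SUBFOLDERS = (
    "mist",
    "model-weights",
    "pretrain-data",
    "pretrained-model",
    "train",
    "test",
    "val",
)


def missing_folders(data_root):
    """Return expected folders that are absent under data_root, as sorted relative paths."""
    root = Path(data_root)
    missing = []
    for dataset in DATASETS:
        for sub in SUBFOLDERS:
            if not (root / dataset / sub).is_dir():
                missing.append(f"{dataset}/{sub}")
    return sorted(missing)


if __name__ == "__main__":
    gaps = missing_folders("data")
    if gaps:
        print("missing: " + ", ".join(gaps))
    else:
        print("data layout looks complete")
```

Run it from the repository root so that the relative path `data` resolves to the folder described above.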

Or you can download the original data, preprocess it, and train from scratch:

# Generate the pretraining dataset and hold out 10,000 samples for validation, used to choose the best model for fine-tuning and alignment
python preprocess/generate_pretrain_data.py
python preprocess/split_pretrain_dataset.py

# Generate the fingerprints and split into train/val/test following the original split
python preprocess/generate_canopus_and_lables.py
python preprocess/generate_mgf_and_lables.py
python preprocess/fp_pred_main.py
python preprocess/prepare_test_data.py
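
The folder tree describes the pretraining data as "clean", filtered at Tanimoto similarity > 0.5. The sketch below illustrates what such a filter could look like on fingerprints stored as sets of on-bit indices. It is an illustration only: it assumes the filter drops pretraining molecules that are too similar to a set of reference (for example, held-out) molecules, which this README does not spell out, and the names `tanimoto` and `filter_similar` are hypothetical rather than functions from the preprocessing scripts.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as collections of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


def filter_similar(candidates, reference, threshold=0.5):
    """Keep candidates whose similarity to every reference fingerprint is <= threshold."""
    kept = []
    for fp in candidates:
        if all(tanimoto(fp, ref) <= threshold for ref in reference):
            kept.append(fp)
    return kept


# Toy fingerprints (assumed data, not taken from the datasets).
reference = [{1, 2, 3, 4}]
candidates = [{1, 2, 3, 5}, {1, 9, 10, 11}, {20, 21}]

# The first candidate shares 3 of 5 distinct bits with the reference
# (similarity 0.6), so it is dropped; the other two are kept.
print(filter_similar(candidates, reference))
```

For real molecules the fingerprints would come from a cheminformatics toolkit rather than hand-written sets, and the all-pairs comparison shown here would need batching to scale to millions of molecules.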

Step1: Unified Multi-Task Pretraining on Reliably Computed Fingerprints

bash scripts/pretrain.sh

Step2: Finetuning on Experimental Spectra

bash scripts/msg/finetune.sh

bash scripts/canopus/finetune.sh

Step3: Contrastive Alignment via Chemical Feedback

bash scripts/msg/align.sh

bash scripts/canopus/align.sh
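
The three steps above run in a fixed order: pretraining, then fine-tuning, then alignment. A small driver can make that order explicit for either dataset. This is a sketch under assumptions: `build_pipeline` and `run_pipeline` are hypothetical helpers, not part of the repository, and they rely only on the script paths listed in the steps above.

```python
import subprocess

# Script paths as listed in Steps 1-3; "msg" and "canopus" are the two
# dataset subfolders that appear under scripts/.
PRETRAIN_SCRIPT = "scripts/pretrain.sh"
DATASET_STAGES = ("finetune", "align")
KNOWN_DATASETS = ("msg", "canopus")


def build_pipeline(dataset, include_pretrain=True):
    """Return the ordered list of commands for one dataset."""
    if dataset not in KNOWN_DATASETS:
        raise ValueError(f"unknown dataset: {dataset!r}")
    commands = [["bash", PRETRAIN_SCRIPT]] if include_pretrain else []
    for stage in DATASET_STAGES:
        commands.append(["bash", f"scripts/{dataset}/{stage}.sh"])
    return commands


def run_pipeline(dataset, dry_run=True, include_pretrain=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_pipeline(dataset, include_pretrain):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing stage.
            subprocess.run(cmd, check=True)


# Dry run: prints the commands without launching any training.
run_pipeline("msg")
```

Passing `include_pretrain=False` skips Step 1, which fits the case where the downloaded pretrained-model folder is used instead of pretraining from scratch.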

Evaluation

bash scripts/msg/eval.sh

bash scripts/canopus/eval.sh

Contact

If you have any questions, please reach out to csyanghan@sjtu.edu.cn
