MetGenX is a structure-informed generative model for metabolite annotation from MS2 spectra. This command-line tool processes a single MS2 spectrum file (.msp/.mgf) and outputs generated structures.
- Python >= 3.10
- Dependencies:

      numpy==1.26.4
      pandas==2.2.3
      faiss-cpu==1.8.0
      transformers==4.42.3
      torch==2.4.1
      rdkit==2023.9.5
      gensim==4.3.2
      lightgbm==4.5.0
      pytorch_lightning==2.2.5
      six==1.16.0
      more_itertools==10.5.0
      scipy==1.12.0
      jpype1==1.5.0

- The model requires Java SE Development Kit 11.0.23 (JDK 11.0.23)
You can install MetGenX by cloning the repository from GitHub:

    git clone https://github.com/ZhuMetLab/MetGenX.git
    cd MetGenX
    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
    pip install -r requirements.txt

Download the weights and databases from Zenodo and copy the weights and database directories into the project structure:

    MetGenX/
    - weights/
    - database/
    python run.py --spec_path <input_file.mgf> [options]

    # Run demo data in positive mode
    python run.py --spec_path ./test/demo_positive.mgf --mode Restricted --output ./test/generation_results_restricted.csv

| Argument | Type | Default | Description |
|---|---|---|---|
| --spec_path | str | (Required) | Path to input .mgf file of MS2 spectra |
| --polarity | str | "positive" | Spectrum polarity: positive or negative |
| --mode | str | "Free" | Generation mode: Free or Restricted |
| --output | str | "generation_results.csv" | Output CSV file path |
| --db_cutoff | float | 0.4 | Similarity cutoff for template filtering |
MGF format:

    BEGIN IONS
    Name= <Name>
    IONMODE= <ionization mode>
    PEPMASS= <precursor m/z>
    Formula= <neutral formula>
    m/z1 intensity1
    m/z2 intensity2
    m/z3 intensity3
    ...
    END IONS

MSP format:

    Name: <Name>
    IONMODE: <ionization mode>
    PEPMASS: <precursor m/z>
    Formula: <neutral formula>
    Num Peaks: <number of peaks>
    m/z1 intensity1
    m/z2 intensity2
    m/z3 intensity3
    ...

Here we offer a demo workflow for training and evaluating the model using a publicly accessible dataset, MassSpecGym.
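For reference, the MGF layout shown above can be read with a minimal standalone parser. This sketch is illustrative only (it is not part of MetGenX) and assumes exactly the field layout of the template:

```python
# Minimal reader for the MGF layout above: one dict per BEGIN IONS /
# END IONS block, with metadata fields and (m/z, intensity) peak pairs.
def parse_mgf(text):
    spectra, block = [], None
    for line in text.splitlines():
        line = line.strip()
        if line == "BEGIN IONS":
            block = {"meta": {}, "peaks": []}
        elif line == "END IONS":
            spectra.append(block)
            block = None
        elif block is not None and line:
            if "=" in line:                      # metadata line, e.g. Formula=...
                key, value = line.split("=", 1)
                block["meta"][key.strip()] = value.strip()
            else:                                # peak line: "m/z intensity"
                mz, intensity = line.split()
                block["peaks"].append((float(mz), float(intensity)))
    return spectra

# Made-up demo spectrum following the template above.
demo = """BEGIN IONS
Name=demo_compound
IONMODE=positive
PEPMASS=180.0634
Formula=C6H12O6
85.0284 1200.0
127.0390 860.5
END IONS"""

spectra = parse_mgf(demo)
print(spectra[0]["meta"]["Formula"])  # C6H12O6
print(len(spectra[0]["peaks"]))       # 2
```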
Training MetGenX involves three main steps:
- Preparing your training data
- Constructing the training dataset
- Training the model
Prepare your training data in a directory structured as follows:

    {datasetname}/
    - MS2_spectra.mgf
    - metaData.csv
    - splits.tsv (optional)
Note:
- The ID in MS2 data should be the same as the ID in metaData.csv
- If no splits file is provided, the split fold should be included in the metaData.
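To catch ID mismatches before constructing the dataset, a small pre-flight check can compare the IDs in the two files. This is an illustrative sketch, not part of MetGenX; it assumes the MGF Name field carries the spectrum ID and that metaData.csv has an ID column (adjust both to your actual field names):

```python
import csv
import io
import re

# Return spectrum IDs that appear in the MGF file but have no row in
# metaData.csv. Assumes "Name=" lines hold the IDs and the CSV has an
# "ID" column -- both are assumptions to adapt to your data.
def check_ids(mgf_text, metadata_csv_text):
    mgf_ids = set(re.findall(r"^Name=\s*(\S+)", mgf_text, flags=re.M))
    meta_ids = {row["ID"] for row in csv.DictReader(io.StringIO(metadata_csv_text))}
    return mgf_ids - meta_ids

# Tiny made-up example: one spectrum, one matching metadata row.
mgf = "BEGIN IONS\nName=spec_001\nPEPMASS=180.06\nEND IONS\n"
meta = "ID,smiles\nspec_001,C1CCCCC1\n"
print(check_ids(mgf, meta))  # set() -> every spectrum has metadata
```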
Modify the parameters in ./Scripts/01_Construct_dataset.py as needed, then run the following command:
    python ./Scripts/01_Construct_dataset.py

If the training dataset is constructed successfully, the following files will be generated in the ./results/{datasetname} directory:
- input_dataset.dataset
- Query_index
- temp
- weights
Note: During training of the spectra embedding model, we observed some randomness that may lead to slight differences in the training results. To reproduce our exact experimental results, you can use the pre-trained SpecEmbed_model weights provided by us (https://doi.org/10.5281/zenodo.17578304). Place the weights in:

    # Our pre-trained weights: ./results/MassSpecGym/weights/word2vec/SpecEmbed_model
    ./results/{datasetname}/weights/word2vec/SpecEmbed_model
Run the following command:
    . ./Scripts/02_model_training.sh

Parameters:
| Argument | Type | Default | Description |
|---|---|---|---|
| --datasetname | str | "MassSpecGym" | Name of the dataset used for training (consistent with the directory name) |
| --path_train | str | None | Path to the training dataset |
| --checkpoint_path | str | None | Path to pre-trained checkpoint to initialize the model |
| --batch_size | int | 64 | Training batch size |
| --num_workers | int | 4 | Number of DataLoader worker processes |
| --lr | float | 5e-6 | Learning rate |
| --accelerator | str | "gpu" | Training device: 'gpu' or 'cpu' |
Note:
- The pretrained weights (./weights/Pretrained_Weight_MetGenX.pth) can be downloaded from Zenodo. This model was pretrained on 2.17 million biological molecules using structure similarity.
- If you want to change the hyperparameters, please modify the parameters in the following files:
  - Parameters for the model: ./weight/generation/config.json
  - Parameters for the database used: ./weight/generation/config_database.json
  - Parameters for the generation process: ./weight/generation/config_generation.json
- The trained model weights will be saved in ./results/{datasetname}/weights/generation/Trained_Weight.pth
The evaluation process of MetGenX has been adapted to the standard format of MassSpecGym. If you want to evaluate the model, please install the massspecgym package (this is not necessary for using MetGenX).
Note:
The MassSpecGym package was constructed using Python 3.11. To avoid conflicts, if you want to use it in your current environment, you can run the following commands:
    pip install massspecgym --no-deps
    pip install -r requirements_msg.txt

Alternatively, you can follow the instructions provided by MassSpecGym to create a separate environment for evaluation:
    conda create -n massspecgym python==3.11
    conda activate massspecgym
    pip install massspecgym

Note that installing MassSpecGym is not necessary for training or using MetGenX.
To evaluate the model, please run the following command:
    . Scripts/MassSpecGym/Evaluation_denovo.sh

or

    . Scripts/MassSpecGym/Evaluation_retrieval.sh

The evaluation supports two modes:
To evaluate the model in database-free mode, the constructed dataset can be used directly.
To evaluate the model in database-restricted mode, a candidate list should be provided. A demo candidate list is available at ./MassSpecGym/MassSpecGym_retrieval_candidates_formula_canoical.json. The candidate list should be in the following format:
    {"True SMILES": ["Candidate_SMILES1", "Candidate_SMILES2", "Candidate_SMILES3", ...]}
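The expected shape of the candidate list can be illustrated with a short round-trip through Python's json module (the SMILES strings below are made-up examples, not taken from the demo file):

```python
import json

# Candidate list shape for database-restricted evaluation: each true
# SMILES maps to the list of candidate SMILES to rank against it.
candidates = {"CCO": ["CCO", "COC", "OCC"]}

serialized = json.dumps(candidates)   # what the .json file would contain
loaded = json.loads(serialized)

# Basic shape check: every value is a non-empty list of SMILES strings.
for true_smiles, cand_list in loaded.items():
    assert isinstance(cand_list, list) and cand_list
    assert all(isinstance(s, str) for s in cand_list)

print(loaded["CCO"])  # ['CCO', 'COC', 'OCC']
```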
Convert the dataset into the retrieval dataset:
    python ./Scripts/MassSpecGym/Convert.py

The converted dataset will be saved in ./results/{datasetname}/input_dataset_retrival.dataset
Parameters:
| Argument | Type | Default | Description |
|---|---|---|---|
| --datasetname | str | None | Name of the dataset |
| --Evaluation_mode | str | "denovo" | Evaluation mode: 'denovo' for database-free evaluation, 'retrieval' for database-restricted evaluation |
| --dataset | str | (Required) | Path to dataset for evaluation (.dataset) |
| --checkpoint | str | (Required) | Path to model checkpoint file |
| --output_dir | str | (Required) | Directory to save test results |
The evaluation results will be saved in ./results/{datasetname}/eval_result/
- Create a one-step model training and evaluation pipeline
- "No formula provided in spectra data"
Ensure each spectrum in the .mgf file includes a Formula field in its metadata.
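A short standalone check (not part of MetGenX) can list the spectra that lack a Formula line before you run the tool:

```python
# Report the Name of every BEGIN IONS / END IONS block that has no
# "Formula=" line, following the MGF layout from the input-format section.
def blocks_missing_formula(mgf_text):
    missing, name, has_formula = [], None, False
    for line in mgf_text.splitlines():
        line = line.strip()
        if line == "BEGIN IONS":
            name, has_formula = None, False
        elif line.startswith("Name="):
            name = line.split("=", 1)[1].strip()
        elif line.startswith("Formula="):
            has_formula = True
        elif line == "END IONS" and not has_formula:
            missing.append(name)
    return missing

# Made-up two-spectrum file: the second block has no Formula field.
demo = ("BEGIN IONS\nName=ok\nFormula=C6H12O6\nEND IONS\n"
        "BEGIN IONS\nName=bad\nEND IONS\n")
print(blocks_missing_formula(demo))  # ['bad']
```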
