MetGenX CLI Manual

MetGenX is a structure-informed generative model for metabolite annotation based on MS2 spectra. This command-line tool processes a single MS2 spectra file (.msp/.mgf) and outputs generated structures.

Requirements

Python >= 3.10
Dependencies:
- numpy==1.26.4
- pandas==2.2.3
- faiss-cpu==1.8.0
- transformers==4.42.3
- torch==2.4.1
- rdkit==2023.9.5
- gensim==4.3.2
- lightgbm==4.5.0
- pytorch_lightning==2.2.5
- six==1.16.0
- more_itertools==10.5.0
- scipy==1.12.0
- jpype1==1.5.0
The model requires Java SE Development Kit 11.0.23 (JDK 11.0.23)
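
As a quick sanity check after installation, the pins above can be compared against what is actually installed. The following is a minimal sketch, not part of MetGenX: the helper name `check_environment` and the subset of packages in `PINNED` are illustrative assumptions, and the check uses only the standard library.

```python
import sys
from importlib.metadata import PackageNotFoundError, version

# A few of the pins listed above (illustrative subset; extend as needed).
PINNED = {
    "numpy": "1.26.4",
    "pandas": "2.2.3",
    "torch": "2.4.1",
    "rdkit": "2023.9.5",
}


def check_environment(pinned=PINNED, min_python=(3, 10)):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if sys.version_info < min_python:
        problems.append(f"Python {min_python[0]}.{min_python[1]}+ is required")
    for name, wanted in pinned.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            problems.append(f"{name} is not installed (expected {wanted})")
            continue
        if found != wanted:
            problems.append(f"{name}=={found} is installed, expected {wanted}")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["environment matches the listed pins"]:
        print(line)
```

The check does not cover the JDK, which has to be verified separately.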

Installation

1. Clone the project

Install MetGenX by cloning the repository from GitHub:

git clone https://github.com/ZhuMetLab/MetGenX.git
cd MetGenX

2. Install Dependencies

python -m venv venv
source venv/bin/activate   # On Windows: venv\Scripts\activate
pip install -r requirements.txt

3. Download Model Weights and Databases

Download the weights and databases from Zenodo, then copy the weights and database directories into the project so that it has the following structure:

MetGenX/
- weights/
- database/
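
Before the first run, it can help to confirm that both directories are where the layout above expects them. This is a small sketch under that assumption; `missing_dirs` is a hypothetical helper, not a MetGenX function.

```python
from pathlib import Path


def missing_dirs(project_root, required=("weights", "database")):
    """Return the required sub-directories that do not exist under project_root."""
    root = Path(project_root)
    return [name for name in required if not (root / name).is_dir()]


if __name__ == "__main__":
    # Run from the MetGenX project root.
    print(missing_dirs(".") or "weights/ and database/ are in place")
```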

Usage

Basic Command

python run.py --spec_path <input_file.mgf> [options]

# Run the demo data in positive mode (the default polarity) with Restricted generation
python run.py --spec_path ./test/demo_positive.mgf --mode Restricted --output ./test/generation_results_restricted.csv

Arguments

Argument	Type	Default	Description
`--spec_path`	`str`	(Required)	Path to input `.mgf` or `.msp` file of MS2 spectra
`--polarity`	`str`	`"positive"`	Spectrum polarity: `positive` or `negative`
`--mode`	`str`	`"Free"`	Generation mode: `Free` or `Restricted`
`--output`	`str`	`"generation_results.csv"`	Output CSV file path
`--db_cutoff`	`float`	`0.4`	Similarity cutoff for template filtering
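
For readers who want to see how the table maps onto a parser, the arguments could be declared with `argparse` roughly as follows. This is a sketch built only from the table above; the actual definitions in run.py may differ, and the use of `choices` to restrict values is an assumption.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented arguments (illustrative only)."""
    parser = argparse.ArgumentParser(description="Sketch of the MetGenX CLI arguments")
    parser.add_argument("--spec_path", type=str, required=True,
                        help="Path to the input file of MS2 spectra")
    parser.add_argument("--polarity", type=str, default="positive",
                        choices=["positive", "negative"], help="Spectrum polarity")
    parser.add_argument("--mode", type=str, default="Free",
                        choices=["Free", "Restricted"], help="Generation mode")
    parser.add_argument("--output", type=str, default="generation_results.csv",
                        help="Output CSV file path")
    parser.add_argument("--db_cutoff", type=float, default=0.4,
                        help="Similarity cutoff for template filtering")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--spec_path", "./test/demo_positive.mgf"])
    print(args)
```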

MS2 spectra format

mgf

BEGIN IONS
Name= <Name>
IONMODE= <ionization mode>
PEPMASS= <precursor m/z>
Formula= <neutral formula>
m/z1 intensity1
m/z2 intensity2
m/z3 intensity3
...
END IONS
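
To make the block structure concrete, here is a minimal reader for the layout shown above. It is a sketch for inspecting your own files, not the parser MetGenX uses; `read_mgf` is a hypothetical name, and upper-casing the field names is a choice of this example.

```python
def read_mgf(text):
    """Parse MGF text into a list of {"meta": dict, "peaks": [(mz, intensity), ...]}."""
    spectra, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.upper() == "BEGIN IONS":
            current = {"meta": {}, "peaks": []}
        elif line.upper() == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line:
                # Metadata line such as "Formula= <neutral formula>"
                key, value = line.split("=", 1)
                current["meta"][key.strip().upper()] = value.strip()
            else:
                # Peak line: m/z followed by intensity
                mz, intensity = line.split()[:2]
                current["peaks"].append((float(mz), float(intensity)))
    return spectra
```

Calling `read_mgf(open(path).read())` returns one dictionary per BEGIN IONS / END IONS block, with fields available under keys such as `"PEPMASS"` and `"FORMULA"`.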

msp

Name: <Name>
IONMODE: <ionization mode>
PEPMASS: <precursor m/z>
Formula: <neutral formula>
Num Peaks: <number of peaks>
m/z1 intensity1
m/z2 intensity2
m/z3 intensity3
...
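
If your spectra live in another data structure, a record in the layout above can be produced with a few lines of Python. This is a sketch only; `format_msp_record` is a hypothetical helper, and it writes exactly the fields listed above.

```python
def format_msp_record(name, ionmode, pepmass, formula, peaks):
    """Render one spectrum as an MSP record with the fields shown above."""
    lines = [
        f"Name: {name}",
        f"IONMODE: {ionmode}",
        f"PEPMASS: {pepmass}",
        f"Formula: {formula}",
        f"Num Peaks: {len(peaks)}",
    ]
    # One "m/z intensity" pair per line, in the order given
    lines.extend(f"{mz} {intensity}" for mz, intensity in peaks)
    return "\n".join(lines) + "\n"
```

When writing several records to one file, MSP files commonly separate them with a blank line.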

Train and evaluate your custom model (optional)

Here we offer a demo workflow for training and evaluating the model using a publicly accessible dataset, MassSpecGym.

Training MetGenX involves three main steps:

1. Preparing your training data
2. Constructing the training dataset
3. Training the model

1. Preparing your training data

Arrange your training data in the following directory structure:

 {datasetname}/
- MS2_spectra.mgf
- metaData.csv
- splits.tsv (optional)

Note:

The ID of each spectrum in the MS2 data must match the corresponding ID in metaData.csv.
If no splits file is provided, metaData.csv must include the split fold.
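
The ID requirement above can be checked before dataset construction. The sketch below compares a list of spectrum IDs with the IDs in the metadata table; the column name `ID` and the helper `unmatched_ids` are assumptions, so adjust them to the actual headers of your metaData.csv.

```python
import csv
import io


def unmatched_ids(spectrum_ids, metadata_csv_text, id_column="ID"):
    """Return (IDs only in the spectra, IDs only in the metadata) as sorted lists."""
    reader = csv.DictReader(io.StringIO(metadata_csv_text))
    meta_ids = {row[id_column] for row in reader}
    spec_ids = set(spectrum_ids)
    return sorted(spec_ids - meta_ids), sorted(meta_ids - spec_ids)
```

Both returned lists should be empty when the spectra and the metadata describe the same set of entries.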

2. Constructing the Training Dataset

Modify the parameters in ./Scripts/01_Construct_dataset.py to match your dataset, then run the following command:

  python ./Scripts/01_Construct_dataset.py

If the training dataset is constructed successfully, the following files will be generated in the ./results/{datasetname} directory:

input_dataset.dataset
Query_index
temp
weights

Note: Training the spectra embedding model involves some randomness, which may lead to slight differences in the training results. To reproduce our exact experimental results, use our pre-trained SpecEmbed_model weights, available at [./results/MassSpecGym/weights/word2vec/SpecEmbed_model](https://doi.org/10.5281/zenodo.17578304), and place them in:
./results/{datasetname}/weights/word2vec/SpecEmbed_model

3. Training the Model

Run the following command:

  . ./Scripts/02_model_training.sh

Parameters:

Argument	Type	Default	Description
`--datasetname`	`str`	`"MassSpecGym"`	Name of the dataset used for training (consistent with the directory name)
`--path_train`	`str`	`None`	Path to the training dataset
`--checkpoint_path`	`str`	`None`	Path to pre-trained checkpoint to initialize the model
`--batch_size`	`int`	`64`	Training batch size
`--num_workers`	`int`	`4`	Number of DataLoader worker processes
`--lr`	`float`	`5e-6`	Learning rate
`--accelerator`	`str`	`"gpu"`	Training device: `'gpu'` or `'cpu'`

Note:

The pretrained weights (./weights/Pretrained_Weight_MetGenX.pth) are available for download. This model was pretrained on 2.17 M biological molecules by structure similarity.
To change the hyperparameters, modify the following files:
- Parameters for the model: ./weight/generation/config.json
- Parameters for the database used: ./weight/generation/config_database.json
- Parameters for the generation process: ./weight/generation/config_generation.json
The trained model weights will be saved in ./results/{datasetname}/weights/generation/Trained_Weight.pth

Model evaluation

The evaluation process of MetGenX has been adapted to the standard format of MassSpecGym. To evaluate the model, install the massspecgym package (this is not necessary for using MetGenX).

Note:

The MassSpecGym package was built with Python 3.11. To avoid dependency conflicts when installing it into your current environment, run the following commands:

  pip install massspecgym --no-deps
  pip install -r requirements_msg.txt

Alternatively, you can follow the instructions provided by MassSpecGym to create a separate environment for evaluation:

  conda create -n massspecgym python==3.11
  conda activate massspecgym
  pip install massspecgym

Note that installing MassSpecGym is not necessary for training or using MetGenX.

To evaluate the model, please run the following command:

  . Scripts/MassSpecGym/Evaluation_denovo.sh

or

  . Scripts/MassSpecGym/Evaluation_retrieval.sh

The evaluation supports two modes:

1. De novo molecule generation (database-free)

To evaluate the model in database-free mode, the constructed dataset can be used directly.

2. Retrieval molecule generation (database-restricted)

To evaluate the model in database-restricted mode, a candidate list must be provided. A demo candidate list is available at ./MassSpecGym/MassSpecGym_retrieval_candidates_formula_canoical.json. The candidate list should be in the following format:

{"True SMILES": ["Candidate_SMILES1", "Candidate_SMILES2", "Candidate_SMILES3", ...]}
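
A custom candidate list can be checked against this shape before it is passed to the conversion script. The following sketch only validates the JSON structure (an object mapping each true SMILES to a list of candidate SMILES strings); `validate_candidates` is a hypothetical helper, and it does not verify that the strings are chemically valid SMILES.

```python
import json


def validate_candidates(json_text):
    """Check that the JSON maps each SMILES string to a list of SMILES strings."""
    data = json.loads(json_text)
    if not isinstance(data, dict):
        raise ValueError("top level must be a JSON object")
    for true_smiles, candidates in data.items():
        is_list = isinstance(candidates, list)
        if not is_list or not all(isinstance(c, str) for c in candidates):
            raise ValueError(f"candidates for {true_smiles!r} must be a list of strings")
    return data
```

The function returns the parsed dictionary on success and raises `ValueError` on the first entry that does not match the expected shape.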

Convert the dataset into the retrieval dataset:

  python ./Scripts/MassSpecGym/Convert.py

The converted dataset will be saved in ./results/{datasetname}/input_dataset_retrival.dataset

Parameters:

Argument	Type	Default	Description
`--datasetname`	`str`	`None`	Name of the dataset
`--Evaluation_mode`	`str`	`"denovo"`	Evaluation mode: `'denovo'` for database-free evaluation, `'retrieval'` for database-restricted evaluation
`--dataset`	`str`	(Required)	Path to dataset for evaluation (.dataset).
`--checkpoint`	`str`	(Required)	Path to model checkpoint file
`--output_dir`	`str`	(Required)	Directory to save test results

The evaluation results will be saved in ./results/{datasetname}/eval_result/

To do

Create a one-step model training and evaluation pipeline

Troubleshooting

"No formula provided in spectra data"
Ensure each spectrum in the .mgf file includes a formula field in its metadata.
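
To locate the offending spectra in a large file, the blocks can be scanned for a missing or empty formula line. This is a sketch, assuming the `Formula=` field shown in the MS2 spectra format section and case-insensitive matching; how MetGenX itself reads the field may differ, and `spectra_missing_formula` is a hypothetical name.

```python
def spectra_missing_formula(mgf_text):
    """Return 0-based indices of BEGIN IONS blocks without a non-empty formula field."""
    missing, index, in_block, has_formula = [], -1, False, False
    for raw in mgf_text.splitlines():
        line = raw.strip()
        upper = line.upper()
        if upper == "BEGIN IONS":
            index += 1
            in_block, has_formula = True, False
        elif upper == "END IONS":
            if in_block and not has_formula:
                missing.append(index)
            in_block = False
        elif in_block and upper.startswith("FORMULA="):
            # An empty value after "=" still counts as missing
            has_formula = bool(line.split("=", 1)[1].strip())
    return missing
```

An empty result means every block carries a formula; otherwise the indices point to the blocks that need fixing.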

Maintainers

@ Hongmiao Wang
