This directory contains SLURM-compatible files to generate protein language model embeddings from various models. These files run evolvepro/plm/[protein_language_model]/extract.py to generate embeddings in CSV format from a FASTA file containing all single amino acid mutants of interest.
The usage of extract.py files may vary depending on the model. Below is a general example of how to use the scripts:
python evolvepro/plm/[model_name]/extract.py [model_location] [fasta_file] [output_dir] [additional_options]To generate a CSV file of ESM-2 (15B parameter model) embeddings of the last hidden layer from a FASTA file of single amino acid mutants:
python evolvepro/plm/esm/extract.py esm2_t48_15B_UR50D output/dms/brenan.fasta output/plm/esm/brenan --toks_per_batch 512 --include mean --concatenate_dir output/plm/esm/In this example:
esm2_t48_15B_UR50Dis the model nameoutput/dms/brenan.fastais the input FASTA fileoutput/plm/esm/brenanis the output directory for the embeddings--toks_per_batch 512sets the number of tokens per batch--include meanspecifies to include mean pooled embeddings--concatenate_dir output/plm/esm/sets the directory for concatenated results
Note: Adjust --toks_per_batch based on your GPU memory capacity.
Different protein language models may require specific commands or have unique parameters. Refer to the extract.py files in each model's subdirectory and/or the SLURM-compatible files in this directory for model-specific instructions. We recommend using a GPU powered HPC to extract protein language model embeddings, but these SLURM-compatible files can be easily adapted into a command line call.