EnsembleDesign: Messenger RNA Design Minimizing Ensemble Free Energy via Probabilistic Lattice Parsing
This repository hosts the source code for the research paper titled EnsembleDesign: Messenger RNA Design Minimizing Ensemble Free Energy via Probabilistic Lattice Parsing by Ning Dai, Tianshuo Zhou, Wei Yu Tang, David H. Mathews, and Liang Huang, to appear in the Proceedings of ISMB 2025 (Bioinformatics, 2025).
If you use this repository in your research, please cite the above paper as well as the LinearDesign paper.
- Clang 11.0.0 (or above) or GCC 4.8.5 (or above)
- Python 3
Run the following commands:
chmod +x setup.sh
./setup.sh
The mRNA Design program is designed to process protein sequences provided in a FASTA file and apply our mRNA Design algorithm to each sequence. The results will be stored in output_dir, with each sequence having its own subfolder containing the raw outputs from all executions. The optimal solution for each sequence, determined across all runs, will be displayed to the user.
Use the following command format to run the mRNA Design program:
python EnsembleDesign.py [--fasta <path>] [--output_dir <path>] [--beam_size <int>] [--lr <float>] [--epsilon <float>] [--num_iters <int>] [--num_runs <int>] [--num_threads <int>]
--fasta <path>: Specifies the path to the input protein FASTA file. The default is./examples.fasta.--output_dir <path>: Sets the directory for saving output files. The default is./outputs.--beam_size <int>: Determines the beam size for beam pruning. The default is200. A smaller beam size can speed up the process but may result in search errors.--lr <float>: Sets the learning rate for projected gradient descent. The default is0.03.--epsilon <float>: Defines the epsilon parameter for soft-MFE initialization. The default is0.5.--num_iters <int>: Specifies the number of optimization steps. The default is30.--num_runs <int>: Indicates the number of execution runs per sample file. The default is20. More runs increase the likelihood of finding optimal solutions but require more computational resources.--num_threads <int>: Sets the number of threads in the thread pool. The default is8. Adjust this based on your CPU's core count to optimize parallel processing without overloading your system.
To run the program with a specific set of parameters, you can use a command similar to the following (note that we are using smaller parameters to make it run faster):
python EnsembleDesign.py --fasta data/examples.fasta --output_dir ./outputs --beam_size 100 --lr 0.03 --epsilon 0.5 --num_iters 3 --num_runs 2 --num_threads 4
The expected output is:
>seq1|Ensemble Free Energy: -18.0 kcal/mol
AUGAAUACCUAUCAUAUUACUCUGCCGUGGCCGCCGAGUAAUAAUAGGUAU
>seq2|Ensemble Free Energy: -43.71 kcal/mol
AUGAAUGAGUACCAGUUUGUGCUCCCCUAUCCGCCGUCAUUGAAUACUUAUUGGCGGCGGAGGGGGAGCCAGUACUACAUU
>seq3|Ensemble Free Energy: -70.29 kcal/mol
AUGGCGACCGUUCUCCUGGCGCUUCUGGUUUAUUUGGGUGCGCUGGUGGAUGCGUACCCAAUUAAGCCAGAAGCGCCAGGAGAGGACGCCUUCUUAGGG
Note that following LinearDesign, we use -V -d0 -b0 options in LinearPartition to evaluate the ensemble free energy, which uses the Vienna energy model with no dangling ends and no beam search (although the mRNA design code does use beam search).
Alongside the main application, this repository includes additional files that are useful for testing and understanding the capabilities of the mRNA Design tool:
-
data/examples.fasta: This file contains 3 example protein sequences. It is designed to offer a quick and straightforward way to test the functionality of the software with pre-defined input. This file serves as the default input for the tool. -
data/uniprot.fasta: Contains 20 protein sequences from the UniProt database used in our experiments. -
data/covid_spike.fasta: Contains the SARS-CoV-2 spike protein sequence used in our experiments, taken from the LinearDesign paper (Zhang et al., Nature 2023). -
coding_wheel.txt: RNA codon table (i.e., the genetic code). -
codon_usage_freq_table_human.csv: A human codon usage frequency table from the LinearDesign paper, which is used by LinearDesign code to compute the codon adapation index (CAI). It is originally from GenScript (choose human instead of yeast). -
tools/LinearDesign: LinearDesign codebase (baseline, for min MFE design). -
tools/LinearPartition: LinearPartition codebase (to calculate ensemble free energy). Note that we use-V -d0 -b0options to evaluate energies, following LinearDesign.