This project aims to predict expression profiles of pathogenic bacteria under stress by leveraging sequence-based models. Although prokaryotic genomes are comparatively simple, typically consisting of a single circular chromosome organized into operons, they have not been studied as extensively as eukaryotic genomes. Recent work has shown that language models can identify elements of prokaryotic genomes, prompting us to explore their potential for predicting gene expression, as has been done in yeast. Our objectives are to benchmark various model architectures using mean squared error, interpret the models to identify predictive regulatory elements, assess the specificity of these elements under different stress conditions, and identify co-regulated transcripts. By doing so, we aim to improve our understanding of bacterial gene regulation, providing insights with potential applications in biotechnology and industries such as food production.
Ensure you have conda installed on your machine.
To create the conda environment and install all necessary packages, follow these steps:
- Clone the repository:
    git clone https://github.com/yourusername/your-repo.git
    cd your-repo
- Create the conda environment from the environment.yml file:
    conda env create -f environment.yml
- Activate the environment:
    conda activate ml4rg
When the environment needs updates (e.g., adding new packages), update your local environment and re-export the environment.yml file as follows:
- Install the new package:
    conda install some_new_package
- Export the updated environment:
    conda env export --name ml4rg > environment.yml
- Commit and push the updated environment.yml file to the repository:
    git add environment.yml
    git commit -m "Update environment.yml with new packages"
    git push origin main
Run the script sync_drive.sh to build the data:
    bash sync_drive.sh

.fna Files
These are FASTA files containing nucleotide sequences. Each sequence in the file starts with a header line beginning with >, followed by lines of nucleotide sequences.

.gff Files
These are General Feature Format files, which contain information about gene features like gene locations, exons, introns, etc. Each line in a GFF file represents one feature with fields separated by tabs.
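
To make the two file formats concrete, here is a minimal sketch of how they could be read in plain Python. It is not the project's own loader: the function names `parse_fasta` and `parse_gff_line` and the sample records are hypothetical, and the nine column names follow the GFF3 specification.

```python
from typing import Dict, Iterable, List

# Column names from the GFF3 specification (nine tab-separated fields).
GFF_COLUMNS = [
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
]


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each record ID (first word after '>') to its joined sequence."""
    records: Dict[str, List[str]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].split()
            current = header[0] if header else ""
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def parse_gff_line(line: str) -> dict:
    """Split one GFF feature line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GFF_COLUMNS):
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    feature = dict(zip(GFF_COLUMNS, fields))
    feature["start"] = int(feature["start"])  # 1-based, inclusive
    feature["end"] = int(feature["end"])
    return feature


# Hypothetical sample data, for illustration only.
fasta_lines = [">seq1 example record", "ATGC", "GGTA", ">seq2", "TTAA"]
gff_line = "seq1\texample\tgene\t1\t8\t.\t+\t.\tID=gene1"

sequences = parse_fasta(fasta_lines)
feature = parse_gff_line(gff_line)
```

For real files, comment lines starting with `#` in a GFF file would need to be skipped before calling `parse_gff_line`.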
- Create a directory named data.
- Move the folders data_expression and data_sequences_upstream into the data directory.
- Run load_data.py to merge the expression data with sequence data.
- The resulting merged data is saved as merged_data.csv inside the data directory.
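
The merge step can be pictured as a join on a shared gene identifier. The sketch below is an illustration only, not the contents of load_data.py: the column names `gene_id`, `expression`, and `upstream_sequence`, the helper names, and the sample rows are all assumptions.

```python
import csv
import io
from typing import Dict, List


def merge_on_key(
    expression_rows: List[Dict[str, str]],
    sequence_rows: List[Dict[str, str]],
    key: str = "gene_id",  # hypothetical join column
) -> List[Dict[str, str]]:
    """Inner-join two lists of row dicts on a shared key column."""
    sequences_by_key = {row[key]: row for row in sequence_rows}
    merged = []
    for row in expression_rows:
        match = sequences_by_key.get(row[key])
        if match is not None:  # keep only genes present in both tables
            merged.append({**match, **row})
    return merged


def to_csv_text(rows: List[Dict[str, str]]) -> str:
    """Serialize row dicts as CSV text, using the first row's keys as header."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample rows, for illustration only.
expression_rows = [
    {"gene_id": "geneA", "expression": "2.5"},
    {"gene_id": "geneB", "expression": "0.1"},
]
sequence_rows = [
    {"gene_id": "geneA", "upstream_sequence": "ATGC"},
    {"gene_id": "geneC", "upstream_sequence": "TTAA"},
]

merged = merge_on_key(expression_rows, sequence_rows)
csv_text = to_csv_text(merged)
```

An inner join is assumed here, so genes that lack either an expression value or a sequence are dropped; a different join policy would change which rows reach the merged file.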
- Run process_data.py to process the merged data.
- The processed data is saved as processed_data.pkl in the data directory.
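
This README does not say which transformations process_data.py applies. As one plausible example, sequence models often take one-hot encoded input, so the sketch below assumes that step and then round-trips the result through `pickle`. The column names, helper names, and sample row are hypothetical.

```python
import pickle
from typing import Dict, List

ALPHABET = "ACGT"


def one_hot(sequence: str) -> List[List[int]]:
    """Encode each base as a 4-element row; unknown bases (e.g. N) become all zeros."""
    return [
        [1 if base == letter else 0 for letter in ALPHABET]
        for base in sequence.upper()
    ]


def process_rows(rows: List[Dict[str, str]]) -> List[dict]:
    """Convert merged rows (hypothetical columns) into model-ready records."""
    return [
        {
            "gene_id": row["gene_id"],
            "expression": float(row["expression"]),
            "encoded": one_hot(row["upstream_sequence"]),
        }
        for row in rows
    ]


# Hypothetical merged row, for illustration only.
merged_rows = [
    {"gene_id": "geneA", "upstream_sequence": "AcN", "expression": "2.5"},
]

processed = process_rows(merged_rows)

# Serialize to bytes and load back; pickle.dump(processed, file_handle)
# would write the same payload to a .pkl file on disk.
payload = pickle.dumps(processed)
restored = pickle.loads(payload)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.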
- Open the model_development notebook located in the notebooks directory.
- Train the model and review the results by following the instructions in the notebook.
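
The project description names mean squared error as the benchmarking metric. For reference when reviewing results, a minimal plain-Python version is sketched below; the function name and the sample values are hypothetical, and the notebook may compute the metric through a library instead.

```python
from typing import Sequence


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average of squared differences between observed and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one value is required")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical expression values, for illustration only.
observed = [1.0, 2.0, 3.0]
predicted = [1.0, 2.5, 2.0]

# Squared errors are 0, 0.25, and 1, so the mean is 1.25 / 3.
mse = mean_squared_error(observed, predicted)
```

Lower values indicate predictions closer to the measured expression, which makes the metric suitable for comparing architectures on the same held-out data.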