Forecasting Pathogenic Strain Responses Under Diverse Stress Conditions

This project aims to predict expression profiles of pathogenic bacteria under stress by leveraging sequence-based models. Despite the simplicity of prokaryotic genomes, which consist of a single circular chromosome organized into operons, they have not been as extensively studied as eukaryotic genomes. Recent advancements have shown that language models can identify elements of prokaryotic genomes, prompting us to explore their potential for predicting gene expression, similar to efforts in yeast. Our objectives include benchmarking various model architectures using mean squared error, interpreting models to identify predictive regulatory elements, assessing the specificity of these elements under different stress conditions, and identifying co-regulated transcripts. By doing so, we aim to enhance our understanding of bacterial gene regulation, providing insights with potential applications in biotechnology and industries such as food production.

Prerequisites

Ensure you have conda installed on your machine.

Setting Up the Development Environment

To create the conda environment and install all necessary packages, follow these steps:

Clone the repository:

git clone https://github.com/yourusername/your-repo.git
cd your-repo

Create the conda environment from the environment.yml file:
```
conda env create -f environment.yml
```
Activate the environment:
```
conda activate ml4rg
```

Updating the Environment

When the environment needs updates (e.g., adding new packages), update your local environment and re-export the environment.yml file as follows:

Install the new package:
```
conda install some_new_package
```

Export the updated environment:

conda env export --name ml4rg > environment.yml

Commit and push the updated environment.yml file to the repository:

git add environment.yml
git commit -m "Update environment.yml with new packages"
git push origin main

Building the Data

Run the script sync_drive.sh to build the data:

bash sync_drive.sh

Data Sequences

.fna Files These are FASTA files containing nucleotide sequences. Each sequence in the file starts with a header line beginning with >, followed by lines of nucleotide sequences.

.gff Files These are General Feature Format files, which contain information about gene features like gene locations, exons, introns, etc. Each line in a GFF file represents one feature with fields separated by tabs.

Formatter Status

Steps to Prepare and Analyze Data

1. Organize Data Directory

Create a directory named data.
Move the folders data_expression and data_sequences_upstream into the data directory.

2. Merge Data

Run load_data.py to merge the expression data with sequence data.
The resulting merged data is saved as merged_data.csv inside the data directory.

3. Process Data

Run process_data.py to process the merged data.
The processed data is saved as processed_data.pkl in the data directory.

4. Model Development

Open the model_development notebook located in the notebooks directory.
Train the model and review the results by following the instructions in the notebook.

Name		Name	Last commit message	Last commit date
Latest commit History 99 Commits
.github/workflows		.github/workflows
evo		evo
models		models
notebooks		notebooks
old_notebooks		old_notebooks
scripts		scripts
utils		utils
.gitignore		.gitignore
README.md		README.md
environment.yml		environment.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Forecasting Pathogenic Strain Responses Under Diverse Stress Conditions

Table of Contents

Prerequisites

Setting Up the Development Environment

Updating the Environment

Building the Data

Data Sequences

Formatter Status

Steps to Prepare and Analyze Data

1. Organize Data Directory

2. Merge Data

3. Process Data

4. Model Development

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 4

Uh oh!

Languages

memetc/ML4RG_Group1

Folders and files

Latest commit

History

Repository files navigation

Forecasting Pathogenic Strain Responses Under Diverse Stress Conditions

Table of Contents

Prerequisites

Setting Up the Development Environment

Updating the Environment

Building the Data

Data Sequences

Formatter Status

Steps to Prepare and Analyze Data

1. Organize Data Directory

2. Merge Data

3. Process Data

4. Model Development

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 4

Uh oh!

Languages

Packages