omics-brca

This repository provides a complete pipeline for generating DNA sequence datasets with pathogenic and non-pathogenic variants, and for training and evaluating a DNABERT2-based model for variant classification in BRCA1/2 genes.

Python Version

Python 3.12 or later is required.

Setup: Creating a Conda Environment

Clone the repository:

git clone https://github.com/OEAdebayo/omics-brca.git
cd omics-brca

Create and activate a Conda environment:

conda create -n dnabert2 python=3.12
conda activate dnabert2

Install dependencies:
```
pip install -r requirements.txt
```
Clone the DNABERT2 model repository:
```
git clone https://github.com/zhihan1996/DNABERT-2-117M.git
```
- Set dnabert2_finetuning.model_path in src/omics_brca/config.yml to the path of the cloned model directory (named DNABERT-2-117M by default).
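Before submitting a job, it can help to confirm that the configured path really points at a model directory. The following is a minimal sketch, not part of the repository: `check_model_dir` is a hypothetical helper, and the check for a `config.json` file assumes the cloned directory follows the usual Hugging Face model layout.

```python
from pathlib import Path


def check_model_dir(model_path: str) -> Path:
    """Return the resolved model directory, or raise if it looks wrong.

    Hypothetical helper: the config.json check assumes a Hugging Face
    style model directory; adjust it if your clone is laid out differently.
    """
    path = Path(model_path).expanduser().resolve()
    if not path.is_dir():
        raise FileNotFoundError(f"model_path is not a directory: {path}")
    if not (path / "config.json").is_file():
        raise FileNotFoundError(f"no config.json found in {path}")
    return path
```

Calling `check_model_dir` with the same string you put in config.yml fails early with a readable message instead of partway through a queued job.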

Pipeline Overview

This pipeline is designed to run on a High-Performance Computing (HPC) cluster using SLURM batch scripts. Each stage has its own batch script, which keeps execution fast and reproducible.

1. Sequence Generation

Entry point: sequence_gen.py
Batch script: seqGen_submit.sh

This script generates variant sequences from the variant records in the data/raw_data directory.
It injects each variant into the reference CDS, applies windowing, and produces a DNABERT2-ready input TSV file for training.
File paths and parameters are configured in config.yml.
Intermediate datasets, including the final dnabert2_input.tsv before splitting, are stored in data/preprocessed_data.
The windowed sequences (cut into windows of window_size=1000) are stored in the output/ directory.
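The injection and windowing steps can be illustrated with a small sketch. This is not the code in sequence_gen.py: the function names, the 0-based coordinates, the restriction to single-nucleotide substitutions, and the choice to centre the window on the variant are all assumptions made for illustration.

```python
def inject_snv(cds: str, pos: int, ref: str, alt: str) -> str:
    """Return a copy of `cds` with the base at 0-based `pos` changed from `ref` to `alt`."""
    if cds[pos] != ref:
        raise ValueError(
            f"reference mismatch at {pos}: expected {ref}, found {cds[pos]}"
        )
    return cds[:pos] + alt + cds[pos + 1:]


def window_around(seq: str, pos: int, window_size: int = 1000) -> str:
    """Cut a window of at most `window_size` bases that contains `pos`.

    The window is centred on `pos` where possible and shifted inward
    near the ends of the sequence so it never runs past them.
    """
    half = window_size // 2
    start = max(0, min(pos - half, len(seq) - window_size))
    return seq[start:start + window_size]


# Toy example with a 12-base "CDS" and a window of 6 bases.
cds = "ATGGCCTTAGGA"
mutated = inject_snv(cds, pos=4, ref="C", alt="T")
window = window_around(mutated, pos=4, window_size=6)
# One tab-separated row: sequence, then label (1 = pathogenic is an assumed encoding).
row = f"{window}\t1"
```

Checking the reference base before substituting catches coordinate mismatches between the variant file and the CDS, which are otherwise easy to miss.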

To run on an HPC (recommended):

cd src/omics_brca
sbatch seqGen_submit.sh

To run locally (for testing):

cd src/omics_brca
conda activate dnabert2
python sequence_gen.py

2. Model Training and Evaluation

Entry point: train_eval.py
Batch script: training_submit.sh

This script takes the DNABERT2-format input file and trains a DNABERT2-based classifier.
Provide the path to the DNABERT2 model (cloned in the last setup step above) as a string under dnabert2_finetuning.model_path in src/omics_brca/config.yml.
The script splits the data, fine-tunes the model, and evaluates its performance.
The train, dev, and test files (in DNABERT2 input format) are stored in the data/training_data directory.
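As an illustration of the splitting step, here is a minimal stratified split over (sequence, label) rows. It is a sketch under assumptions, not the logic in train_eval.py: the function name, the 80/10/10 ratios, the fixed seed, and stratification by label are all hypothetical choices.

```python
import random
from collections import defaultdict


def stratified_split(rows, ratios=(0.8, 0.1, 0.1), seed=42):
    """Split (sequence, label) rows into train/dev/test lists.

    Rows are grouped by label and each group is shuffled and split
    separately, so the label proportions stay similar across the splits.
    """
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, dev, test = [], [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_train = int(len(group) * ratios[0])
        n_dev = int(len(group) * ratios[1])
        train += group[:n_train]
        dev += group[n_train:n_train + n_dev]
        test += group[n_train + n_dev:]  # remainder goes to test
    return train, dev, test
```

Splitting per label matters when pathogenic and non-pathogenic variants are imbalanced, since a plain random split can leave the dev or test set with very few examples of the rarer class.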

To run on an HPC (recommended):

cd src/omics_brca
sbatch training_submit.sh

To run locally (for testing):

cd src/omics_brca
conda activate dnabert2
python train_eval.py

Notes

- All raw data, including the variants and the reference CDS, is in the data/raw_data directory.
- All configuration (paths, hyperparameters, etc.) is controlled via config.yml.
- Intermediate datasets, including the final dnabert2_input.tsv (before splitting), are stored in data/preprocessed_data.
- The train, dev, and test TSV files are stored in data/training_data.
- The windowed sequences (cut to window_size=1000) are stored in the output/ directory.
- When you run on the HPC cluster, the evaluation results are written to the output log specified in the batch script, seqGen_submit.sh or training_submit.sh.

License

This project is licensed under the MIT License.
