This repository provides a complete pipeline for generating DNA sequence datasets with pathogenic and non-pathogenic variants, and for training and evaluating a DNABERT2-based model for variant classification in BRCA1/2 genes.
- Python 3.12 or later is required.
1. Clone the repository:
    git clone https://github.com/OEAdebayo/omics-brca.git
    cd omics-brca
2. Create and activate a Conda environment:
    conda create -n dnabert2 python=3.12
    conda activate dnabert2
3. Install dependencies:
    pip install -r requirements.txt
4. Clone the DNABERT2 model repository:
    git clone https://github.com/zhihan1996/DNABERT-2-117M.git
    - Make sure to set model_path in your src/omics_brca/config.yml to the path of the cloned dnabert2_117m directory.
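Before submitting a job, it can help to confirm that the configured model path really points at the cloned directory. The following is a minimal sketch, not code from this repository: it assumes config.yml has already been parsed into a dictionary (for example with a YAML loader) and that the key is nested as dnabert2_finetuning.model_path. The helper name check_model_path is hypothetical.

```python
from pathlib import Path


def check_model_path(config: dict) -> Path:
    """Return the configured DNABERT2 directory, or raise if it is unusable.

    Assumes the key layout dnabert2_finetuning.model_path; adjust this if
    your config.yml is organized differently.
    """
    try:
        raw = config["dnabert2_finetuning"]["model_path"]
    except (KeyError, TypeError) as exc:
        raise ValueError("dnabert2_finetuning.model_path is not set") from exc
    if not isinstance(raw, str) or not raw.strip():
        raise ValueError("model_path must be a non-empty string")
    path = Path(raw).expanduser()
    if not path.is_dir():
        raise FileNotFoundError(f"model_path is not a directory: {path}")
    return path
```

Failing early on a missing or mistyped path is cheaper than discovering it after a batch job has waited in the queue.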
This pipeline is designed to be run efficiently on a High-Performance Computing (HPC) cluster using SLURM batch scripts. Fast and reproducible execution is achieved by using the provided batch scripts for each stage.
Entry point: sequence_gen.py
Batch script: seqGen_submit.sh
- This script generates variant sequences starting from the mutated variations located in the data/raw_data directory.
- It processes these variants, injects them into the reference CDS, applies windowing, and produces a DNABERT2-ready input TSV file for training.
- The configuration for file paths and parameters is managed in config.yml.
- Intermediary datasets are stored in data/preprocessed_data, including the final dnabert2_input.tsv before splitting.
- The windowed sequences (after cutting into windows of size window_size=1000) are stored in the output/ directory.
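The injection and windowing steps described above can be illustrated with a small sketch. This is not the logic of sequence_gen.py. It assumes single-nucleotide substitutions, 0-based positions, and a window centered on the variant that is shifted inward near the sequence ends. The function names inject_snv and window_around are hypothetical.

```python
def inject_snv(cds: str, pos: int, ref: str, alt: str) -> str:
    """Return a copy of cds with the base at 0-based pos changed from ref to alt."""
    if not 0 <= pos < len(cds):
        raise IndexError(f"position {pos} is outside the CDS (length {len(cds)})")
    if cds[pos].upper() != ref.upper():
        raise ValueError(f"expected {ref} at position {pos}, found {cds[pos]}")
    return cds[:pos] + alt.upper() + cds[pos + 1:]


def window_around(seq: str, pos: int, window_size: int = 1000) -> str:
    """Cut a window of at most window_size bases that contains pos.

    The window is centered on pos where possible and shifted inward near
    the ends so that it keeps its full length whenever the sequence allows.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if len(seq) <= window_size:
        return seq
    start = pos - window_size // 2
    start = max(0, min(start, len(seq) - window_size))
    return seq[start:start + window_size]


if __name__ == "__main__":
    reference = "ATGGATTTATCTGCTCTTCG"  # toy sequence, not a real CDS
    mutated = inject_snv(reference, pos=5, ref="T", alt="C")
    print(mutated)
    print(window_around(mutated, pos=5, window_size=8))
```

Checking the reference base before substituting catches coordinate mismatches between the variant table and the CDS, which would otherwise produce silently wrong sequences.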
To run on an HPC (recommended):
    cd src/omics_brca
    sbatch seqGen_submit.sh
To run locally (for testing):
    cd src/omics_brca
    conda activate dnabert2
    python sequence_gen.py
Entry point: train_eval.py
Batch script: training_submit.sh
- This script takes the DNABERT2 input format file and trains a DNABERT2-based classifier.
- You must provide the path to the DNABERT2 model (cloned in step 4 above) as a string in src/omics_brca/config.yml under dnabert2_finetuning.model_path.
- The script will split the data, fine-tune the model, and evaluate its performance.
- The train, test, and dev files (in DNABERT2 input format) are stored in the data/training_data directory.
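To make the splitting step concrete, here is a minimal sketch of a label-stratified train/dev/test split written out as TSV files. It is an illustration only and does not reproduce train_eval.py. The 80/10/10 ratios, the seed, the column names sequence and label, and the output file names train.tsv, dev.tsv, and test.tsv are all assumptions.

```python
import csv
import random
from collections import defaultdict
from pathlib import Path


def stratified_split(rows, label_key="label", ratios=(0.8, 0.1, 0.1), seed=42):
    """Split rows into train/dev/test while keeping label proportions similar."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    splits = {"train": [], "dev": [], "test": []}
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_train = int(len(group) * ratios[0])
        n_dev = int(len(group) * ratios[1])
        splits["train"] += group[:n_train]
        splits["dev"] += group[n_train:n_train + n_dev]
        splits["test"] += group[n_train + n_dev:]  # remainder goes to test
    return splits


def write_splits(splits, out_dir, fieldnames=("sequence", "label")):
    """Write each split as a tab-separated file named <split>.tsv."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, rows in splits.items():
        with open(out_dir / f"{name}.tsv", "w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter="\t")
            writer.writeheader()
            writer.writerows(rows)
```

Splitting within each label group, rather than over the whole file, keeps the pathogenic to non-pathogenic ratio roughly equal across the three files, which matters when one class is much rarer than the other.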
To run on an HPC (recommended):
    cd src/omics_brca
    sbatch training_submit.sh
To run locally (for testing):
    cd src/omics_brca
    conda activate dnabert2
    python train_eval.py
- All raw data containing the variants and CDS can be found in the data/raw_data directory.
- All configuration (paths, hyperparameters, etc.) is controlled via config.yml.
- Intermediary datasets are stored in data/preprocessed_data.
- The final dnabert2_input.tsv (before splitting) is in data/preprocessed_data.
- The train, test, and dev TSV files are stored in data/training_data.
- The windowed sequence results (after cutting to size window_size=1000) are stored in the output/ directory.
- The pipeline writes its intermediate and final outputs to the output/ and data/training_data/ directories.
- When you run on the HPC cluster, the evaluation results are written to the output log specified in the batch script (seqGen_submit.sh or training_submit.sh).
This project is licensed under the MIT License.