omics-brca

This repository provides a complete pipeline for generating DNA sequence datasets with pathogenic and non-pathogenic variants, and for training and evaluating a DNABERT2-based model for variant classification in BRCA1/2 genes.

Python Version

  • Python 3.12 or later is required.

Setup: Creating a Conda Environment

  1. Clone the repository:

    git clone https://github.com/OEAdebayo/omics-brca.git
    cd omics-brca
  2. Create and activate a Conda environment:

    conda create -n dnabert2 python=3.12
    conda activate dnabert2
  3. Install dependencies:

    pip install -r requirements.txt
  4. Clone the DNABERT2 model repository:

    git clone https://github.com/zhihan1996/DNABERT-2-117M.git
    • Set dnabert2_finetuning.model_path in src/omics_brca/config.yml to the path of the cloned model directory.
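A wrong model_path is easy to miss until a training job fails on the cluster, so it can help to check it before submitting. The following is a minimal sketch, not part of the repository: `resolve_model_path` is a hypothetical helper, the key layout `dnabert2_finetuning.model_path` is taken from this README, and resolving relative paths against the repository root is an assumption.

```python
from pathlib import Path


def resolve_model_path(config: dict, repo_root: Path) -> Path:
    """Return the DNABERT2 model directory named in the config, or raise.

    Hypothetical helper; `config` is the already-parsed content of config.yml.
    """
    # Key layout assumed from the README: dnabert2_finetuning.model_path
    raw = config["dnabert2_finetuning"]["model_path"]
    path = Path(raw).expanduser()
    if not path.is_absolute():
        # Assumption: relative paths are interpreted relative to the repo root.
        path = repo_root / path
    if not path.is_dir():
        raise FileNotFoundError(f"model_path does not exist: {path}")
    return path
```

The dictionary can be obtained by parsing src/omics_brca/config.yml, for example with `yaml.safe_load` from PyYAML, and passed in together with the repository root.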

Pipeline Overview

This pipeline is designed to run on a High-Performance Computing (HPC) cluster using SLURM batch scripts. Each stage has its own batch script, and submitting the stages through these scripts is the recommended way to get fast, reproducible runs.

1. Sequence Generation

Entry point: sequence_gen.py
Batch script: seqGen_submit.sh

  • This script generates variant sequences from the variants located in the data/raw_data directory.
  • It injects each variant into the reference CDS, applies windowing, and produces a DNABERT2-ready input TSV file for training.
  • The configuration for file paths and parameters is managed in config.yml.
  • Intermediary datasets are stored in data/preprocessed_data, including the final dnabert2_input.tsv before splitting.
  • The windowed sequences (after cutting into windows of size window_size=1000) are stored in the output/ directory.
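The injection and windowing steps described above can be illustrated with a small sketch. It is not the code in sequence_gen.py: the function names are hypothetical, only single-base substitutions are handled, positions are assumed to be 0-based, and centring the window on the variant is an assumption (the repository may instead tile the sequence into consecutive windows).

```python
def inject_snv(cds: str, pos: int, ref: str, alt: str) -> str:
    """Return a copy of cds with the base at 0-based pos changed from ref to alt."""
    if cds[pos] != ref:
        raise ValueError(
            f"reference mismatch at {pos}: expected {ref}, found {cds[pos]}"
        )
    return cds[:pos] + alt + cds[pos + 1:]


def window_around(seq: str, pos: int, window_size: int = 1000) -> str:
    """Return up to window_size bases of seq, centred on pos where possible."""
    half = window_size // 2
    # Clamp the start so the window stays inside the sequence at both ends.
    start = max(0, min(pos - half, len(seq) - window_size))
    return seq[start:start + window_size]


# Toy example on a 10-base sequence with a 4-base window.
cds = "ATGGCCAAGT"
mutated = inject_snv(cds, 3, "G", "T")
window = window_around(mutated, 3, window_size=4)
```

Checking the reference base before substituting guards against off-by-one errors in variant coordinates, which would otherwise silently produce wrong sequences.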

To run on an HPC (recommended):

cd src/omics_brca
sbatch seqGen_submit.sh

To run locally (for testing):

cd src/omics_brca
conda activate dnabert2
python sequence_gen.py

2. Model Training and Evaluation

Entry point: train_eval.py
Batch script: training_submit.sh

  • This script takes the DNABERT2 input format file and trains a DNABERT2-based classifier.
  • You must provide the path to the DNABERT2 model (cloned in step 4 above) as a string in the src/omics_brca/config.yml under dnabert2_finetuning.model_path.
  • The script will split the data, fine-tune the model, and evaluate its performance.
  • The train, test, and dev files (in DNABERT2 input format) are stored in the data/training_data directory.
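The data split mentioned above could look roughly like the following sketch. It is not the logic in train_eval.py: `stratified_split` is a hypothetical name, the (sequence, label) row layout and the 80/10/10 ratios are assumptions, and stratifying by label is one reasonable choice rather than the repository's confirmed behaviour.

```python
import random
from collections import defaultdict


def stratified_split(rows, train=0.8, dev=0.1, seed=42):
    """Split (sequence, label) rows into train/dev/test, label by label."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    splits = {"train": [], "dev": [], "test": []}
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_train = int(len(group) * train)
        n_dev = int(len(group) * dev)
        splits["train"] += group[:n_train]
        splits["dev"] += group[n_train:n_train + n_dev]
        # Whatever remains after train and dev goes to test.
        splits["test"] += group[n_train + n_dev:]
    return splits
```

Splitting within each label keeps the pathogenic to non-pathogenic ratio similar across the three files, which matters when one class is much rarer than the other.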

To run on an HPC (recommended):

cd src/omics_brca
sbatch training_submit.sh

To run locally (for testing):

cd src/omics_brca
conda activate dnabert2
python train_eval.py

Notes

  • All raw data containing the variants and CDS can be found in the data/raw_data directory.
  • All configuration (paths, hyperparameters, etc.) is controlled via config.yml.
  • Intermediary datasets are stored in data/preprocessed_data.
  • The final dnabert2_input.tsv (before splitting) is in data/preprocessed_data.
  • The train, test, and dev TSV files are stored in data/training_data.
  • The windowed sequence results (after cutting to size window_size=1000) are stored in the output/ directory.
  • The pipeline will generate intermediate and final outputs in the output/ and data/training_data/ directories.
  • When you run on the HPC cluster, the evaluation results are written to the output log specified in the batch scripts seqGen_submit.sh or training_submit.sh.

License

This project is licensed under the MIT License.
