Skip to content

de-Boer-Lab/dna_active_learning

Repository files navigation

DNA Active Learning

A framework for benchmarking active learning strategies on DNA sequence-to-expression models.

Installation


git clone https://github.com/de-Boer-Lab/dna_active_learning.git
cd dna_active_learning
pip install -e .

Set your project root directory:

python -m dna_active_learning.setup /path/to/your/data/root

Repository Overview


models/ - Implementation of DREAM Challenge model architectures (CNN, RNN, Attention) with training and evaluation utilities.

sequence_selection/ - Active learning strategies including ensemble-based methods, MC Dropout, k-means clustering, and LCMD. Contains the main AL loop implementation.

data/ - Demo datasets.

File Structure

The package expects this directory structure:

{PROJECT_ROOT}/
└── {dataset}/
    ├── round_0/common/
    │   ├── train.txt
    │   └── pool.txt
    ├── round_{i}/{strategy}/{arch}_{seed}/
    │   ├── data/
    │   │   ├── selected.txt
    │   │   ├── train.txt
    │   │   └── pool.txt
    │   └── model/
    │       └── model_best.pth
    └── val.txt

Basic Usage


Run Active Learning Loop

python -m dna_active_learning.sequence_selection.al_loop \
    <dataset> <strategy> <arch> <seed> [OPTIONS]

Example:

# Run 3-round AL experiment with MC Dropout
python -m dna_active_learning.sequence_selection.al_loop \
    yeast mcd cnn 42 --num-rounds 3 --num-selected 20000

Arguments:

  • dataset: Dataset name (yeast or human)
  • strategy: AL strategy (mcd, kmeans, or lcmd)
  • arch: Model architecture (cnn, rnn, or attn)
  • seed: Random seed for reproducibility
  • --num-rounds: Number of AL rounds (default: 3)
  • --num-selected: Sequences to select per round (default: 20,000)
  • --start-round: Resume from specific round (default: 1)

Train DREAM Challenge Model

# Using AL experiment structure
python -m dna_active_learning.models.train_model al <dataset> <arch> \
    --strategy <al_strategy> --round <round_num> --seed <seed>

# Using custom data paths
python -m dna_active_learning.models.train_model custom <dataset> <arch> \
    --train <train_path> --val <val_path> --model-dir <output_dir>

Examples:

# Train within AL structure
python -m dna_active_learning.models.train_model al yeast cnn \
    --strategy random --round 1 --seed 42

# Train with custom paths
python -m dna_active_learning.models.train_model custom human attn \
    --train data/my_train.txt --val data/my_val.txt --model-dir outputs/

Select Sequences with Different Strategies

Ensemble - Disagreement-based selection across multiple models. Includes multi- and same-architecture ensembles with customizable size and configuration.

# Multi-architecture ensemble
python -m dna_active_learning.sequence_selection.ensemble multi \
    yeast all_arch --round 2 --seed 42

# Same-architecture ensemble
python -m dna_active_learning.sequence_selection.ensemble same \
    yeast cnn --round 2 --seeds 1 2 3 4 5

MC Dropout - Uncertainty-based selection using Monte Carlo dropout to identify sequences where the model is most uncertain.

python -m dna_active_learning.sequence_selection.mc_dropout \
    yeast cnn 2 42 --num_passes 50 --num_selected 20000

K-means - Diversity-based selection using k-means clustering in embedding space to choose representative sequences.

python -m dna_active_learning.sequence_selection.diversity_strategies \
    yeast cnn kmeans 2 42 --num_selected 20000

LCMD - Iteratively selects cluster centers by identifying the largest cluster and choosing its furthest point, prioritizing maximally different sequences.

python -m dna_active_learning.sequence_selection.diversity_strategies \
    yeast cnn lcmd 2 42 --num_selected 20000

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •  

Languages