
DNANet


Welcome!

This is a Python repository for analyzing DNA profiles using deep learning. It contains functionality to parse .hid files and to train and evaluate models. The provided pre-trained U-Net can be used to call alleles in a DNA profile.

If you find this repository useful, please cite

@ARTICLE{Benschop2019,
      title     = "An assessment of the performance of the probabilistic genotyping
                   software {EuroForMix}: Trends in likelihood ratios and analysis
                   of Type {I} \& {II} errors",
      author    = "Benschop, Corina C G and Nijveld, Alwart and Duijs, Francisca E
                   and Sijen, Titia",
      journal   = "Forensic Sci. Int. Genet.",
      volume    =  42,
      pages     = "31--38",
      year      =  2019,
    }

for the data, and

@ARTICLE{de-Wit2025,
    title = {Making AI accessible for forensic DNA profile analysis},
    journal = {Forensic Science International: Genetics},
    volume = {81},
    pages = {103345},
    year = {2026},
    issn = {1872-4973},
    doi = {https://doi.org/10.1016/j.fsigen.2025.103345},
    url = {https://www.sciencedirect.com/science/article/pii/S1872497325001255},
    author = {
        Abel K.J.G. de Wit and Claire D. Wagenaar and Nathalie A.C. Janssen and Brechtje Hoegen 
        and Judith van de Wetering and Huub Hoofs and Simone Ariëns and Corina C.G. Benschop 
        and Rolf J.F. Ypma
        }
}

for the code and model.

For work related to the data synthesis, please cite:

@ARTICLE{Taylor2025,
    title = {Simulating realistic short tandem repeat capillary electrophoretic signal using a generative adversarial network},
    journal = {Expert Systems with Applications},
    volume = {280},
    pages = {127536},
    year = {2025},
    doi = {https://doi.org/10.1016/j.eswa.2025.127536},
    author = {D. A. Taylor and M. Humphries}
}

Requirements

Python >= 3.10, <=3.12

Cloning

The repository currently exceeds its Git LFS quota. Since some files are tracked with Git LFS, this causes problems during cloning. To clone the repository without issues, temporarily disable Git LFS while cloning:

GIT_LFS_SKIP_SMUDGE=1 git clone <REPO_URL> DNANet

cd DNANet
git lfs install --skip-smudge

Setup

Create a virtual environment. We have used pdm and a pyproject.toml file to manage environment dependencies. Ensure you have pdm installed:

$ pip install pdm

Then run the following command to install the dependencies:

$ pdm sync

Git LFS is used to track .pt (model) files. Make sure Git LFS is installed on your system. To retrieve the files from the remote, run:

$ git lfs pull

Synthetic data generation

For instructions on simulating DNA profiles and generating synthetic EPGs, see synthetic_profiles.

The Hugging Face datasets library is used to download the research data. This happens automatically whenever the data is missing from the root directory provided in your config. When triggered, data is pulled from the "NetherlandsForensicInstitute/DNANet_2p5pMixture_PPF6C_2024" Hugging Face repository.

Code overview

The repository is roughly organized into three sections:

  • Data
  • Models
  • Evaluation

Additionally, you can run a training script, an evaluation script and a cross validation script from the command line.

To load datasets and models or to load settings for the scripts, the code relies on config files that are read via the package confidence (see https://github.com/NetherlandsForensicInstitute/confidence). The config files are located in the config folder, and can be adjusted as desired.
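As an illustration, a dataset config might look roughly like this (a hypothetical sketch whose keys mirror the HIDDataset arguments used elsewhere in this README; see config/data/dnanet_rd.yaml for the authoritative layout):

```yaml
# Hypothetical dataset config sketch -- check config/data/dnanet_rd.yaml
# for the actual keys expected by the confidence loader.
root: resources/data/2p_5p_Dataset_NFI/Raw data .HID files
panel: resources/data/SGPanel_PPF6C.xml
annotations_path: resources/data/2p_5p_Dataset_NFI/txt_annotations_2024
hid_to_annotations_path: resources/data/2p_5p_Dataset_NFI/2p_5p_hid_to_annotation.csv
limit: 10
```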

Data

This directory contains all logic to parse a .hid DNA profile into an HIDImage object. Multiple HIDImages are stored in an HIDDataset. The HIDDataset class is specifically implemented to load the 2p-5p NFI dataset, which contains 350 raw .hid files and annotations in .txt files. This raw data is stored in the resources/data folder.

The HIDDataset inherits from InMemoryDataset, which lets an HIDDataset be treated as a list of HIDImages: the InMemoryDataset iterates over the instances in its ._data attribute. The InMemoryDataset class also provides functionality for shuffling and splitting the dataset.
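The idea can be sketched in a few lines of plain Python (a simplified illustration, not the actual DNANet implementation; the method names and signatures are assumptions):

```python
import random
from typing import Generic, Iterator, List, TypeVar

T = TypeVar("T")


class InMemoryDataset(Generic[T]):
    """Sketch of an in-memory dataset: iterable over ._data, with
    shuffle and split helpers (illustrative names, not DNANet's API)."""

    def __init__(self, data: List[T]):
        self._data = list(data)

    def __iter__(self) -> Iterator[T]:
        return iter(self._data)

    def __len__(self) -> int:
        return len(self._data)

    def shuffle(self, seed: int = 0) -> None:
        random.Random(seed).shuffle(self._data)

    def split(self, fraction: float):
        # First `fraction` of the items in one dataset, the rest in another
        cut = int(len(self._data) * fraction)
        return InMemoryDataset(self._data[:cut]), InMemoryDataset(self._data[cut:])


dataset = InMemoryDataset(list(range(10)))
train, test = dataset.split(0.8)
print(len(train), len(test))  # → 8 2
```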

In a similar fashion, the HIDImage inherits from the Image base class, enforcing the presence of raw data, an annotation and meta information in respectively the data, annotation and meta properties.

To load an HIDImage, you can provide the direct path to the .hid file (and optionally information to load annotations):

from DNAnet.data.data_models.hid_image import HIDImage
from DNAnet.data.data_models import Panel


panel = Panel("resources/data/SGPanel_PPF6C.xml")
image = HIDImage(
  path="resources/data/2p_5p_Dataset_NFI/Raw data .HID files/Mixture dataset 1/Inj5 2017-05-01-09-45-24-128/1A2_A01_01.hid",
  annotations_file="resources/data/2p_5p_Dataset_NFI/txt_annotations_2024/Dataset 1 DTL_AlleleReport.txt",
  panel=panel,
  meta={'annotations_name': '1L_11148_1A2'}
)

The image has a .data attribute containing the numpy array of peak heights and an .annotation attribute containing the binary segmentation of the ground-truth peak locations. These are based on the called alleles present in the annotations file, which can be found under called_alleles in the .meta attribute.

An HIDDataset can be easily loaded using a config file:

from config_io import load_dataset

hid_dataset = load_dataset("config/data/dnanet_rd.yaml")

or directly by providing arguments:

from DNAnet.data.data_models.hid_dataset import HIDDataset

hid_dataset = HIDDataset(
  root="resources/data/2p_5p_Dataset_NFI/Raw data .HID files",
  panel="resources/data/SGPanel_PPF6C.xml",
  annotations_path="resources/data/2p_5p_Dataset_NFI/txt_annotations_2024",
  hid_to_annotations_path="resources/data/2p_5p_Dataset_NFI/2p_5p_hid_to_annotation.csv",
  limit=10
)

The list of HIDImages is stored in the ._data attribute of the class.

Note that when loading the 2p-5p R&D dataset without limit, two hid files do not pass data validation, leaving the dataset with 348 images instead of 350.

Models

U-Net

We have implemented a U-Net model to identify peaks in a DNA profile. The U-Net architecture can be found in models.segmentation.unet_architecture.py. To load a trainable version of this exact model and make predictions, we can use:

from DNAnet.data.data_models.hid_dataset import HIDDataset
from config_io import load_dataset, load_model

hid_dataset = load_dataset("config/data/dnanet_rd.yaml")
unet_model = load_model("config/models/unet.yaml")
predictions = unet_model.predict_batch(hid_dataset)

This model creates a binary segmentation, where 1 indicates the presence of a peak and 0 otherwise.

We have also implemented an AlleleCaller (see DNAnet/allele_callers.py) to translate the binary segmentation into called alleles. This step is part of the predict_batch() function of the U-Net and will be applied when apply_allele_caller is set to True in the unet.yaml. The called alleles are stored in the meta attribute of a Prediction object.
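Conceptually, the translation from a binary segmentation to called alleles can be sketched as follows (a hypothetical, simplified illustration: the function name, bin layout, and pixel ranges here are assumptions, not the actual AlleleCaller API):

```python
from typing import Dict, List, Tuple


def call_alleles(segmentation: List[int],
                 bins: Dict[str, Tuple[int, int]]) -> List[str]:
    """Report an allele as called when any pixel inside its
    (start, stop) bin is marked 1 in the binary segmentation."""
    called = []
    for allele, (start, stop) in bins.items():
        if any(segmentation[start:stop + 1]):
            called.append(allele)
    return called


mask = [0] * 20
mask[4] = mask[5] = 1   # predicted peak around pixels 4-5
mask[12] = 1            # second predicted peak
bins = {"10": (3, 6), "11": (8, 10), "12": (11, 14)}
print(call_alleles(mask, bins))  # → ['10', '12']
```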

A trained U-Net is located in resources/model/current_best_unet. To load the model's weights:

unet_model.load("resources/model/current_best_unet")

Note that allele metrics (DNAnet/evaluation/segmentation/allele_metrics.py) cannot be used on predictions of the U-Net model if no AlleleCaller is applied.

HumanAnalysis

The HumanAnalysis model can be used to analyze the analyst's annotations, which are interesting to compare with the ground-truth donor alleles. For the 2p-5p R&D dataset, the actual donor alleles are known. By setting ground_truth_as_annotations: True in the dnanet_rd.yaml file, those ground-truth donor alleles will be stored in meta['called_alleles'] and the analyst annotations in meta['called_alleles_manual'] of the HIDImage when loading the dataset.

When applying the HumanAnalysis model to the dataset, the values in meta['called_alleles_manual'] of the HIDImage will be stored in the meta['called_alleles'] of a Prediction object. This way, the analyst annotations can be compared to the ground truth alleles.

Note that pixel metrics (DNAnet/evaluation/segmentation/pixel_metrics.py) cannot be used on predictions of the HumanAnalysis model, because this model does not predict an image: the .image attribute of a Prediction will remain None.

Evaluation

To evaluate the U-Net we have implemented several metrics. Metrics that analyze performance at the pixel level and at the allele level are located in DNAnet/evaluation/segmentation/pixel_metrics.py and DNAnet/evaluation/segmentation/allele_metrics.py, respectively.
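As an illustration of what a pixel-level metric computes, a minimal pixel F1 over two binary masks could look like this (a sketch under simplified assumptions, not the actual implementation in pixel_metrics.py):

```python
def pixel_f1_score(truth, pred):
    """F1 over equal-length binary masks: harmonic mean of
    pixel precision and pixel recall."""
    tp = sum(1 for t, p in zip(truth, pred) if t and p)
    fp = sum(1 for t, p in zip(truth, pred) if p and not t)
    fn = sum(1 for t, p in zip(truth, pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


truth = [0, 1, 1, 1, 0, 0]
pred  = [0, 1, 1, 0, 1, 0]
print(round(pixel_f1_score(truth, pred), 3))  # → 0.667
```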

To visualize the DNA profiles, their annotations (if present) and/or predictions (if present), you can use the plot_profile() function from visualizations.py. This will plot the profiles one by one.

For example:

from DNAnet.data.data_models.hid_dataset import HIDDataset
from DNAnet.evaluation import pixel_f1_score
from DNAnet.evaluation.visualizations import plot_profile
from config_io import load_model

hid_dataset = HIDDataset(
  root="resources/data/2p_5p_Dataset_NFI/Raw data .HID files",
  panel="resources/data/SGPanel_PPF6C.xml",
  annotations_path="resources/data/2p_5p_Dataset_NFI/txt_annotations_2024",
  hid_to_annotations_path="resources/data/2p_5p_Dataset_NFI/2p_5p_hid_to_annotation.csv",
  limit=10
)
unet_model = load_model("config/models/unet.yaml")
unet_model.load("resources/model/current_best_unet/")
predictions = unet_model.predict_batch(hid_dataset)

print(pixel_f1_score(hid_dataset, predictions))
plot_profile(hid_dataset, predictions)

It is also possible to plot a DNA profile per marker, or to plot a single marker of a DNA profile:

from DNAnet.data.data_models.hid_dataset import HIDDataset
from DNAnet.evaluation.visualizations import plot_profile_markers

hid_dataset = HIDDataset(
  root="resources/data/2p_5p_Dataset_NFI/Raw data .HID files",
  panel="resources/data/SGPanel_PPF6C.xml",
  annotations_path="resources/data/2p_5p_Dataset_NFI/txt_annotations_2024",
  hid_to_annotations_path="resources/data/2p_5p_Dataset_NFI/2p_5p_hid_to_annotation.csv",
  limit=10
)

plot_profile_markers(hid_dataset)
plot_profile_markers(hid_dataset, marker_name='TPOX')

Scripts

We have three scripts that can be run from the command line. To view a script's arguments, run: python <script.py> --help.

train.py

This script can be used to train models with specified settings. The user can provide training parameters in the training config file, see for instance config/training/segmentation.yaml.

Run for example:

python train.py \
  -m unet \  # load a model
  -d dnanet_rd \  # load a dataset
  -t segmentation \  # load training arguments
  -s 0.9 \  # apply a split to leave part for testing/evaluation
  -v 0.1 \  # apply a split for validation during training
  -o output/example_run_train  # write results and the trained model to this folder

evaluate.py

This script can evaluate (trained) models by computing metrics. It is also possible to store the predictions of the model as .json file. Metrics can be provided using an evaluation config file, see for instance: config/evaluation/segmentation.yaml. Metrics will be written to a .txt file.

Run for example:

python evaluate.py \
  -m unet \  # load a model
  -c resources/model/current_best_unet \  # load a checkpoint
  -d dnanet_rd \  # load a dataset
  -e segmentation \  # load evaluation metrics
  -s 0.1 \  # apply splitting
  -o output/example_run_eval \  # output folder to write results to
  -p  # also save the actual predictions

cross_validate.py

This script can be used to apply k-fold cross validation. The dataset will be split into k folds, then k train/test loops will be performed. Metrics will be averaged over those loops.
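The k-fold procedure can be sketched in plain Python (illustrative only; the actual script also handles models, configs and metric collection, and all names here are hypothetical):

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds whose sizes
    differ by at most one."""
    base, extra = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(items, k, train_and_score):
    """Run k train/test loops and average the resulting scores."""
    folds = k_fold_indices(len(items), k)
    scores = []
    for i, test_idx in enumerate(folds):
        train = [items[j] for f in folds[:i] + folds[i + 1:] for j in f]
        test = [items[j] for j in test_idx]
        scores.append(train_and_score(train, test))
    return sum(scores) / len(scores)


# Dummy scorer just to show the mechanics: score = test-fold size
avg = cross_validate(list(range(10)), k=5,
                     train_and_score=lambda train, test: len(test))
print(avg)  # → 2.0
```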

Run for example:

python cross_validate.py \
  -m unet \  # load a model
  -c resources/model/current_best_unet \  # load a checkpoint
  -d dnanet_rd \  # load a dataset
  -t segmentation \  # load training arguments
  -e segmentation \  # load evaluation metrics
  -k 5 \  # number of folds to use
  -o output/example_run_cross_val  # write results to this folder

New: ProvedIt / GlobalFiler support

  • Load ProvedIt GlobalFiler mixtures directly (backwards-compatible with the NFI R&D workflow).
  • Reusable strategies for dataset-specific parsing (DatasetStrategy), kit/panel metadata (Kit), and EPG scaling (EPGScalingStrategy).
  • Utilities to split genotype workbooks into per-contributor CSVs and to train from an already-instantiated dataset.

Quick start: working with ProvedIt

  1. Download a GlobalFiler ProvedIt zip (e.g. PROVEDIt_2-5-Person Profiles_3500 5sec_GF29cycles.zip) from https://lftdi.camden.rutgers.edu/provedit/files/ and extract it. The root should contain both the .hid files and a genotype Excel file.
  2. Extract contributor genotypes into per-sample CSVs (semicolon-separated) that the dataset strategy can load:
from pathlib import Path
from DNAnet.data.strategies.dataset_compatibility.format_conversion import find_genotype_file, individualize_genotypes

# Path to the extracted ProvedIt dataset (contains .hid files and the genotype Excel file)
dataset_root_path = "/path/to/PROVEDIt_2-5-Person Profiles_3500 5sec_GF29cycles"

# Locate the genotype Excel file inside the root directory
genotype_file_path = find_genotype_file(dataset_root_path)

# Extract the genotype data into individual CSV files that downstream loading uses
# `genotypes_dir` is where we choose to store the genotypes of the contributors of the dataset.
genotypes_dir = Path("resources/data/ProvedIt/individual_genotypes")
individualize_genotypes(
    input_path=genotype_file_path,
    output_dir=genotypes_dir,
)
  3. Instantiate the reusable kit/strategy objects and load the dataset:
from pathlib import Path
from DNAnet.data.data_models import Panel
from DNAnet.data.data_models.hid_dataset import HIDDataset
from DNAnet.data.strategies.dataset_strategy import ProvedItDatasetStrategy
from DNAnet.data.kit_compatibility.kit import GLOBALFILER_KIT
from DNAnet.data.kit_compatibility.scaling_strategy import ProvedItEPGScalingStrategy

panel = Panel(GLOBALFILER_KIT.panel_path)
dataset_strategy = ProvedItDatasetStrategy(
    panel=panel,
    genotypes_path=Path("resources/data/ProvedIt/individual_genotypes"),
)
scaling_strategy = ProvedItEPGScalingStrategy(kit=GLOBALFILER_KIT)

provedit = HIDDataset(
    root="/path/to/PROVEDIt_2-5-Person Profiles_3500 5sec_GF29cycles",
    panel=GLOBALFILER_KIT.panel_path,
    ground_truth_as_annotations=True,  # use contributor genotypes as annotations
    dataset_strategy=dataset_strategy,
    scaling_strategy=scaling_strategy,
    kit=GLOBALFILER_KIT,
)

This keeps legacy behavior intact: if you omit the strategies, the dataset/image loaders fall back to the original NFI R&D logic.

Strategy & kit reference (extensible)

  • Kit (see the GLOBALFILER_KIT and POWER_PLEX_FUSION_6C_KIT objects): captures the size standard and panel for a multiplex. Create a new kit with a name, an InternalSizeStandard, and a panel XML path.
  • DatasetStrategy defines how to categorize files, parse contributor IDs, and build Marker objects from contributor genotype CSVs. ProvedItDatasetStrategy handles filenames like F07_RD14-0003-30_31-... and reads per-contributor CSVs stored under genotypes_path.
  • EPGScalingStrategy parses the size-standard dye and rescales EPGs. ProvedItEPGScalingStrategy mirrors the ProvedIt parsing pipeline; NfiEPGScalingStrategy preserves the legacy GlobalFiler flow.

To extend to a new kit/dataset, subclass the relevant strategy and wire it up when constructing HIDDataset/HIDImage:

from DNAnet.data.strategies.dataset_compatibility import DatasetStrategy

class MyDatasetStrategy(DatasetStrategy):
    def categorize_file(self, file_name: str): ...
    def get_contributors(self, file_name: str): ...
    def build_marker(self, marker_name, allele_names): ...

and pass dataset_strategy=MyDatasetStrategy(panel, genotypes_path=Path(...)) plus a matching EPGScalingStrategy.

Training directly from an instantiated dataset

When working with custom datasets/strategies, you can skip data configs and train with an in-memory dataset:

from train_with_dataset import run_with_dataset

run_with_dataset(
    dataset=provedit,
    model_config="config/models/unet.yaml",
    training_config="config/training/segmentation.yaml",
    output_dir="output/provedit_unet",
)

scripts/select_ladder_for_images.py

This script selects the best ladder for every HIDImage in a dataset. The best ladder is determined by counting how often an annotated peak falls within its allele bin. To do this, we translate each allele name (e.g. 'AMEL-X') to a base pair location via a candidate ladder. We then translate the base pair locations to pixels (using the image's scaler), so we can retrieve the peak height data for that allele and detect the presence of a peak. The candidate ladder for which the most peaks are found is declared the best ladder.
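The selection idea can be sketched as follows (a simplified illustration: here the allele-to-pixel translation is assumed to have already happened per candidate ladder, and all names and the threshold are hypothetical):

```python
def select_best_ladder(peak_heights, annotated_pixels_per_ladder, threshold=50):
    """For each candidate ladder, count annotated allele positions
    (already converted to pixel indices) where the signal exceeds a
    peak threshold; the ladder with the most hits wins."""
    best_ladder, best_hits = None, -1
    for ladder, pixels in annotated_pixels_per_ladder.items():
        hits = sum(1 for p in pixels
                   if 0 <= p < len(peak_heights) and peak_heights[p] >= threshold)
        if hits > best_hits:
            best_ladder, best_hits = ladder, hits
    return best_ladder


signal = [0] * 100
for p in (10, 40, 75):
    signal[p] = 300  # three annotated peaks in the EPG signal
candidates = {
    "ladder_a.hid": [10, 40, 75],  # aligns with all three peaks
    "ladder_b.hid": [12, 42, 77],  # slightly shifted, misses them
}
print(select_best_ladder(signal, candidates))  # → 'ladder_a.hid'
```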

The output of this script is a csv file containing a mapping between the .hid filename stem (e.g. '1A2_A01_01') and the path to the ladder file that resulted in the 'best' ladder. For the high threshold (DTH) 2p-5p NFI data, the results can be found in resources/data/2p_5p_Dataset_NFI/best_ladder_paths_DTH.csv.

Note that for this algorithm, annotated images (having called alleles) are necessary.

About

An open source repository offering tools for parsing DNA profiles and integrating machine learning models.
