NanoSimFormer is a high-fidelity nanopore sequencing signal simulator built on Transformer architectures. It supports both ONT DNA (R10.4.1) and direct-RNA (RNA004) signal simulation, enabling users to generate synthetic POD5 files from references or existing basecalled reads.
- GPU: NVIDIA GPU with CUDA compute capability >= 8.0 (e.g., Ampere, Ada, or Hopper GPUs such as A100, RTX 3090, RTX 4090, H100)
- Driver: NVIDIA driver version >= 525
NanoSimFormer is compatible with Linux and has been fully tested on Ubuntu 22.04.
We recommend installing NanoSimFormer using the pre-built Docker image. Ensure you have Docker and the NVIDIA Container Toolkit installed by following this tutorial.
docker pull chobits323/nano-sim:v1.1
Alternatively, you can install NanoSimFormer in a conda environment.
conda create -n nanosim -c conda-forge python==3.10
conda activate nanosim
pip install nanosimformer -i https://pypi.org/simple --extra-index-url https://download.pytorch.org/whl/cu126
# install flash-attn
wget https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.4.17/flash_attn-2.7.4.post1+cu126torch2.9-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.7.4.post1+cu126torch2.9-cp310-cp310-linux_x86_64.whl
Note: The Docker image already contains all required pre-trained model files. For pip/conda installations, download the model checkpoints from Zenodo and unzip them into the package install path as follows:
# Find the package install path
python -c "import nano_signal_simulator; print(nano_signal_simulator.__path__[0])"
# Example output: /home/xsh/anaconda3/envs/nanosim/lib/python3.10/site-packages/nano_signal_simulator
# Download and unzip model checkpoints
wget https://zenodo.org/records/20364515/files/models.zip
unzip models.zip -d $(python -c "import nano_signal_simulator; print(nano_signal_simulator.__path__[0])")
NanoSimFormer provides a unified command-line interface with four subcommands:
python -m nano_signal_simulator <command> [options]
Commands:
simulate Simulate nanopore signals from a reference genome/transcriptome or reads
create_dataset Create a training dataset from aligned BAM + POD5 files
visualize Launch an interactive web viewer for inspecting dataset chunks
train Train a NanoSimFormer model
Use -h with any subcommand for detailed help:
python -m nano_signal_simulator simulate -h
python -m nano_signal_simulator create_dataset -h
python -m nano_signal_simulator visualize -h
python -m nano_signal_simulator train -h
# Define your working directory
EXAMPLE_DIR="[WORKING_DIRECTORY_OF_EXAMPLE_DATA]" # absolute path
# Print help
docker run --rm -it --gpus=all -v ${EXAMPLE_DIR}:${EXAMPLE_DIR} --ipc=host \
chobits323/nano-sim:v1.1 python -m nano_signal_simulator simulate -h
usage: python -m nano_signal_simulator simulate [-h] --input INPUT --output OUTPUT [--prefix PREFIX] [--basecall] [--emit-bam] --mode {Reference,Read} [--coverage COVERAGE] [--sample-reads SAMPLE_READS] [--sample-output SAMPLE_OUTPUT]
[--trans-profile TRANS_PROFILE] [--gpu GPU] [--batch-size BATCH_SIZE] [--config CONFIG] --preset {ont_r1041_dna_5khz,ont_rna004_4khz} [--noise-stdv NOISE_STDV] [--duration-stdv DURATION_STDV]
[--mean-read-length MEAN_READ_LENGTH] [--min-read-length MIN_READ_LENGTH] [--max-read-length MAX_READ_LENGTH] [--length-dist-mode {stat,expon}] [--seed SEED] [--multi-gpu] [--gpus GPUS]
[--version]
Nanopore sequencing signal simulator
options:
-h, --help show this help message and exit
--input INPUT FASTA file path for reference (genome or transcriptome) simulation; FASTQ file path for basecalled read simulation
--output OUTPUT output directory
--prefix PREFIX output prefix (default: simulate)
--basecall enable basecalling simulated reads (default: False)
--emit-bam basecalling simulated reads are stored in BAM files (default: False)
--mode {Reference,Read}
(Reference or Read) simulation mode
--coverage COVERAGE sequencing coverage (default: 1)
--sample-reads SAMPLE_READS
number of reads to simulate (default: -1)
--sample-output SAMPLE_OUTPUT
output sampled reads (FASTA format) from reference (default: None)
--trans-profile TRANS_PROFILE
3-column TSV file for simulating transcripts with specific abundance and truncation (default: None)
--gpu GPU GPU device id (default: 0)
--batch-size BATCH_SIZE
batch size (default: 64)
--config CONFIG model configuration file (default: None)
--preset {ont_r1041_dna_5khz,ont_rna004_4khz}
ont platform preset
--noise-stdv NOISE_STDV
noise sampler standard deviation (default: None)
--duration-stdv DURATION_STDV
duration sampler standard deviation (default: None)
--mean-read-length MEAN_READ_LENGTH
mean read length (default: None)
--min-read-length MIN_READ_LENGTH
min read length (default: 40)
--max-read-length MAX_READ_LENGTH
max read length (default: None)
--length-dist-mode {stat,expon}
simulated read length using exponential distribution or statistical model derived from the HG002 R10.4.1 sample (default: stat)
--seed SEED random seed (default: 42)
--multi-gpu enable multi-GPU inference using all available GPUs (default: False)
--gpus GPUS comma-separated GPU ids for multi-GPU inference, e.g. "0,1,2"
--version show program's version number and exit
Simulate reads from a reference genome (FASTA) given a specific read number or sequencing coverage.
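The relationship between a coverage target and a read count can be approximated with the usual definition of coverage (total simulated bases divided by reference length). The sketch below is only that back-of-the-envelope calculation; the function name and the example numbers are hypothetical, and it is not taken from NanoSimFormer's own sampling code, which may differ.

```python
import math


def reads_for_coverage(genome_length: int, coverage: float, mean_read_length: float) -> int:
    """Approximate number of reads needed to reach a target coverage."""
    if genome_length <= 0 or mean_read_length <= 0 or coverage < 0:
        raise ValueError("lengths must be positive and coverage non-negative")
    # coverage = (number of reads * mean read length) / genome length
    return math.ceil(coverage * genome_length / mean_read_length)


# Hypothetical numbers: a 1 Mb reference, 0.5x coverage, 5 kb mean read length.
print(reads_for_coverage(1_000_000, 0.5, 5_000))
```

This kind of estimate is mainly useful for choosing between --coverage and --sample-reads before launching a long run.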
# Simulate 1000 reads from the chromosome 22 reference.
python -m nano_signal_simulator simulate \
--input ${EXAMPLE_DIR}/DNA_R10.4.1/chr22.fasta \
--output ${EXAMPLE_DIR}/DNA_R10.4.1/output \
--mode Reference --sample-reads 1000 --gpu 0 --preset ont_r1041_dna_5khz
# Simulate reads with 0.1x sequencing coverage from the chromosome 22 reference.
python -m nano_signal_simulator simulate \
--input ${EXAMPLE_DIR}/DNA_R10.4.1/chr22.fasta \
--output ${EXAMPLE_DIR}/DNA_R10.4.1/output \
--mode Reference --coverage 0.1 --gpu 0 --preset ont_r1041_dna_5khz
Adjust the standard deviation of the amplitude noise or duration samplers to generate simulated signals of varying quality.
python -m nano_signal_simulator simulate \
--input ${EXAMPLE_DIR}/DNA_R10.4.1/chr22.fasta \
--output ${EXAMPLE_DIR}/DNA_R10.4.1/output \
--mode Reference --sample-reads 1000 --gpu 0 --preset ont_r1041_dna_5khz \
--noise-stdv 1.0 --duration-stdv 0.8
NanoSimFormer reads the circular=true/false tag in each FASTA header to detect circular sequences, so that simulated reads can wrap seamlessly around the end of a circular sequence.
Example FASTA format:
>contig_1 circular=true
ATCG...
>contig_2 circular=false
CGAA...
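To make the wrap-around behaviour concrete, here is a minimal sketch of parsing the circular tag and extracting a read that crosses the end of a circular contig. The function names are hypothetical, the case-insensitive tag match is an assumption, and this is an illustration of the idea rather than NanoSimFormer's actual sampling code.

```python
def is_circular(header: str) -> bool:
    """Return True when a FASTA header line carries a circular=true tag."""
    fields = header.lstrip(">").split()
    # The first field is the contig name; tags follow it (assumed case-insensitive).
    return any(field.lower() == "circular=true" for field in fields[1:])


def sample_read(seq: str, start: int, length: int, circular: bool) -> str:
    """Extract `length` bases from `start`, wrapping past the end if circular."""
    if circular:
        return "".join(seq[(start + i) % len(seq)] for i in range(length))
    return seq[start:start + length]  # linear contigs are cut off at the end


contig = "ATCGGA"
print(sample_read(contig, 4, 4, is_circular(">contig_1 circular=true")))
print(sample_read(contig, 4, 4, is_circular(">contig_2 circular=false")))
```

On a linear contig the same request returns a shorter fragment, which is why the header tag matters for plasmids and bacterial chromosomes.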
# Simulate 1000 reads from a circular E.coli reference genome (including plasmids)
# utilizing a custom mean read length and an exponential read length distribution.
python -m nano_signal_simulator simulate \
--input ${EXAMPLE_DIR}/DNA_R10.4.1/ecoli.fasta \
--output ${EXAMPLE_DIR}/DNA_R10.4.1/output \
--mode Reference --sample-reads 1000 --gpu 0 --preset ont_r1041_dna_5khz \
--mean-read-length 4394 --length-dist-mode expon
Generate signals from basecalled reads provided in a FASTQ file, one simulated signal per input read.
Note: Read IDs in FASTQ file must be in UUID format for POD5 compatibility.
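If your FASTQ read IDs are not UUIDs, they can be rewritten before simulation. The following is a rough sketch using only the Python standard library; the helper names are hypothetical, it assumes plain 4-line FASTQ records (no wrapped sequence lines), and it treats the canonical hyphenated form as the accepted UUID format, which is an assumption.

```python
import uuid


def is_uuid(read_id: str) -> bool:
    """Return True if read_id is a canonical hyphenated UUID string."""
    try:
        return str(uuid.UUID(read_id)) == read_id.lower()
    except ValueError:
        return False


def uuidify_fastq(lines):
    """Yield FASTQ lines with non-UUID read IDs replaced by fresh UUID4 values."""
    for i, line in enumerate(lines):
        if i % 4 == 0:  # header line of each 4-line record
            read_id, _, rest = line[1:].rstrip("\n").partition(" ")
            if not is_uuid(read_id):
                read_id = str(uuid.uuid4())
            line = "@" + read_id + (" " + rest if rest else "") + "\n"
        yield line


record = ["@read_1 sample=demo\n", "ACGT\n", "+\n", "!!!!\n"]
print("".join(uuidify_fastq(record)), end="")
```

The original IDs are discarded here; keep a mapping from old to new IDs if you need to trace simulated reads back to their source reads.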
python -m nano_signal_simulator simulate \
--input ${EXAMPLE_DIR}/DNA_R10.4.1/example.fastq \
--output ${EXAMPLE_DIR}/DNA_R10.4.1/output \
--mode Read --gpu 0 --preset ont_r1041_dna_5khz
The --basecall option automatically runs Dorado to basecall the simulated reads into a FASTQ file after signal simulation. This option is currently only available for Docker users (Dorado is pre-installed in the Docker image).
For non-Docker users, you can install Dorado separately and run basecalling on the output POD5 file yourself.
Use --emit-bam together with --basecall to output basecalled results in BAM format instead of FASTQ.
# Simulate 1000 reads from the chromosome 22 reference.
python -m nano_signal_simulator simulate \
--input ${EXAMPLE_DIR}/DNA_R10.4.1/chr22.fasta \
--output ${EXAMPLE_DIR}/DNA_R10.4.1/output \
--mode Reference --sample-reads 1000 --gpu 0 --preset ont_r1041_dna_5khz --basecall
# Output basecalled results in BAM format instead of FASTQ
python -m nano_signal_simulator simulate \
--input ${EXAMPLE_DIR}/DNA_R10.4.1/chr22.fasta \
--output ${EXAMPLE_DIR}/DNA_R10.4.1/output \
--mode Reference --sample-reads 1000 --gpu 0 --preset ont_r1041_dna_5khz --basecall --emit-bam
Use --multi-gpu to automatically distribute reads across all available GPUs, or --gpus to specify exact GPU IDs.
# Use all available GPUs
python -m nano_signal_simulator simulate \
--input ${EXAMPLE_DIR}/DNA_R10.4.1/chr22.fasta \
--output ${EXAMPLE_DIR}/DNA_R10.4.1/output \
--mode Reference --sample-reads 10000 --preset ont_r1041_dna_5khz --multi-gpu
# Use specific GPUs (e.g., GPU 0 and GPU 2)
python -m nano_signal_simulator simulate \
--input ${EXAMPLE_DIR}/DNA_R10.4.1/chr22.fasta \
--output ${EXAMPLE_DIR}/DNA_R10.4.1/output \
--mode Reference --sample-reads 10000 --preset ont_r1041_dna_5khz --gpus 0,2
# Multi-GPU with basecalling
python -m nano_signal_simulator simulate \
--input ${EXAMPLE_DIR}/DNA_R10.4.1/chr22.fasta \
--output ${EXAMPLE_DIR}/DNA_R10.4.1/output \
--mode Reference --sample-reads 10000 --preset ont_r1041_dna_5khz --multi-gpu --basecall
Note: To accelerate the simulation process, you can increase the --batch-size parameter (default: 64) if your GPU has sufficient memory.
For Direct RNA sequencing (DRS), NanoSimFormer requires a 3-column TSV profile defining transcript_name, trunc_start, and trunc_end to simulate realistic transcript abundances and 5'/3' truncations.
Each row in the profile TSV represents the metadata for a single simulated read.
Example TSV Profile:
| transcript_name | trunc_start | trunc_end |
|---|---|---|
| ENST000001 | 10 | 1200 |
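Because each row of the profile stands for one simulated read, transcript abundance is simply the number of rows per transcript. The sketch below summarizes a profile along those lines. It is an illustration only: the function name is hypothetical, the rows beyond the ENST000001 example are made up, and both the optional header row and the requirement that trunc_end exceeds trunc_start are assumptions about the format.

```python
import csv
import io
from collections import Counter


def summarize_profile(handle):
    """Count rows (simulated reads) per transcript and collect trunc_end - trunc_start spans."""
    counts, spans = Counter(), []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0] == "transcript_name":  # skip blank lines and an optional header
            continue
        name, start, end = row[0], int(row[1]), int(row[2])
        if end <= start:
            raise ValueError(f"trunc_end must exceed trunc_start for {name}")
        counts[name] += 1
        spans.append(end - start)
    return counts, spans


# Hypothetical profile: two reads from ENST000001, one from ENST000002.
example = "ENST000001\t10\t1200\nENST000001\t0\t800\nENST000002\t5\t300\n"
counts, spans = summarize_profile(io.StringIO(example))
print(counts.most_common(), spans)
```

A quick summary like this helps confirm that the profile encodes the abundances you intended before running the simulation.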
python -m nano_signal_simulator simulate \
--input ${EXAMPLE_DIR}/RNA004/trans_ref.fasta \
--trans-profile ${EXAMPLE_DIR}/RNA004/trans_profile.tsv \
--output ${EXAMPLE_DIR}/RNA004/output \
--mode Reference --gpu 0 --preset ont_rna004_4khz --basecall
This command creates training data from real nanopore sequencing data (aligned BAM + POD5 files). It extracts signal-sequence aligned chunks and outputs four .npy files.
- POD5 file: raw signal data
- Aligned BAM file: Dorado basecalled reads aligned to a reference using minimap2; must be indexed (samtools index) and contain the move table (Dorado --emit-moves option) as well as the 'MD' tag (minimap2 --MD option)
- K-mer model table: official ONT k-mer level table for signal refinement (download from ONT)
The dataset consists of four NumPy files in the output directory:
| File | Shape | Dtype | Description |
|---|---|---|---|
| signal.npy | (N, signal_chunk_len) | float32 | Z-normalized signal chunks |
| sequence.npy | (N, max_chunk_seq_len) | int8 | Encoded bases (A=1, C=2, G=3, T=4, pad=0) |
| sequence_length.npy | (N,) | int32 | Actual (unpadded) sequence length per chunk |
| dwells.npy | (N, max_chunk_seq_len) | int32 | Per-base dwell times (samples), sum = signal_chunk_len |
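The layout in the table implies a few invariants that are easy to check after a dataset has been written. The sketch below tests them on in-memory arrays; the function name and the synthetic arrays are hypothetical, NumPy is assumed to be installed, and the check that padded positions hold code 0 is inferred from the pad=0 convention rather than confirmed against the tool.

```python
import numpy as np


def check_chunks(signal, sequence, sequence_length, dwells):
    """Raise ValueError if the arrays break the tabulated layout; return (N, signal_chunk_len)."""
    n, chunk_len = signal.shape
    if sequence.shape != dwells.shape or sequence.shape[0] != n:
        raise ValueError("sequence and dwells must both be (N, max_chunk_seq_len)")
    if sequence_length.shape != (n,):
        raise ValueError("sequence_length must have shape (N,)")
    if sequence.min() < 0 or sequence.max() > 4:
        raise ValueError("base codes must lie in 0..4")
    # Columns at or beyond each chunk's unpadded length should hold the pad code 0.
    columns = np.arange(sequence.shape[1])[None, :]
    if np.any(sequence[columns >= sequence_length[:, None]] != 0):
        raise ValueError("non-zero base code found in the padded region")
    if np.any(dwells.sum(axis=1) != chunk_len):
        raise ValueError("per-chunk dwell times must sum to signal_chunk_len")
    return n, chunk_len


# Tiny synthetic example: 2 chunks, 12 samples each, up to 3 bases per chunk.
signal = np.zeros((2, 12), dtype=np.float32)
sequence = np.array([[1, 2, 0], [3, 4, 1]], dtype=np.int8)
sequence_length = np.array([2, 3], dtype=np.int32)
dwells = np.array([[5, 7, 0], [4, 4, 4]], dtype=np.int32)
print(check_chunks(signal, sequence, sequence_length, dwells))
```

For a real dataset, load each array with np.load from the output directory (signal.npy, sequence.npy, sequence_length.npy, dwells.npy) and pass them in the same order.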
usage: python -m nano_signal_simulator create_dataset [-h] --bam BAM --pod5 POD5 --kmer_model KMER_MODEL --output OUTPUT --chemistry {dna_r1041,rna004} [--pa_shift PA_SHIFT] [--pa_scale PA_SCALE] [--min_identity MIN_IDENTITY]
[--min_quality MIN_QUALITY] [--min_align_ratio MIN_ALIGN_RATIO] [--min_aligned_read_len MIN_ALIGNED_READ_LEN] [--min_mapq MIN_MAPQ] [--chrom_ids CHROM_IDS [CHROM_IDS ...]]
[--signal_chunk_len SIGNAL_CHUNK_LEN] [--max_chunk_seq_len MAX_CHUNK_SEQ_LEN] [--min_chunk_seq_len MIN_CHUNK_SEQ_LEN] [--min_dwell MIN_DWELL] [--max_dwell MAX_DWELL]
[--max_chunks MAX_CHUNKS] [--max_reads MAX_READS] [--seed SEED]
Create NanoSimFormer training dataset from aligned BAM and POD5 files.
options:
-h, --help show this help message and exit
Required arguments:
--bam BAM Path to the aligned BAM file (must be indexed with 'samtools index'). (default: None)
--pod5 POD5 Path to the POD5 file or directory containing POD5 files. (default: None)
--kmer_model KMER_MODEL
Path to the official k-mer level table for signal refinement (e.g., 'dna_r10.4.1_e8.2_400bps/9mer_levels_v1.txt'). (default: None)
--output OUTPUT Output directory for the generated .npy dataset files. (default: None)
Chemistry and normalization:
--chemistry {dna_r1041,rna004}
Sequencing chemistry. Determines default signal chunk length, PA normalization values, and read orientation. (default: dna_r1041)
--pa_shift PA_SHIFT Picoampere shift for z-normalization. If not provided, uses chemistry default (DNA R10.4.1: 93.69239463939118, RNA004: 80.8758975922949). (default: None)
--pa_scale PA_SCALE Picoampere scale for z-normalization. If not provided, uses chemistry default (DNA R10.4.1: 23.506745239082388, RNA004: 17.26975967138176). (default: None)
Read quality filters:
--min_identity MIN_IDENTITY
Minimum alignment identity to include a read. (default: 0.99)
--min_quality MIN_QUALITY
Minimum mean Q-score to include a read. (default: 20.0)
--min_align_ratio MIN_ALIGN_RATIO
Minimum alignment ratio (1 - clipped_fraction). (default: 0.995)
--min_aligned_read_len MIN_ALIGNED_READ_LEN
Minimum aligned read length (bases). (default: 1000)
--min_mapq MIN_MAPQ Minimum mapping quality (MAPQ). (default: 1)
--chrom_ids CHROM_IDS [CHROM_IDS ...]
Restrict to specific chromosome IDs (e.g., chr1 chr2). If not specified, all chromosomes are used. (default: None)
Chunk extraction:
--signal_chunk_len SIGNAL_CHUNK_LEN
Fixed signal chunk length in samples. Default: 5000 for DNA R10.4.1 (5 kHz), 4000 for RNA004 (4 kHz). (default: None)
--max_chunk_seq_len MAX_CHUNK_SEQ_LEN
Maximum number of bases per chunk. Default: auto-computed as int(1.25 × expected_bases). DNA R10.4.1: 500, RNA004: 162. (default: None)
--min_chunk_seq_len MIN_CHUNK_SEQ_LEN
Minimum number of bases per chunk. Default: auto-computed as int(0.75 × expected_bases). DNA R10.4.1: 300, RNA004: 97. (default: None)
--min_dwell MIN_DWELL
Minimum per-base dwell time (signal samples). (default: 4)
--max_dwell MAX_DWELL
Maximum per-base dwell time (signal samples). (default: 256)
--max_chunks MAX_CHUNKS
Maximum number of chunks to keep (randomly subsampled). If not specified, all extracted chunks are saved. (default: None)
--max_reads MAX_READS
Maximum number of reads to process (randomly subsampled after filtering). Useful for quick previews or debugging. If not specified, all filtered reads are used. (default: None)
Execution:
--seed SEED Random seed for reproducible subsampling. (default: 42)
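The --pa_shift and --pa_scale defaults listed in the help text can be read as the centre and spread of a z-normalization on picoampere values. The sketch below assumes the conventional form (pA - shift) / scale; that formula, the function name, and the DEFAULTS dictionary are assumptions for illustration, while the numeric values are copied from the help text.

```python
# Chemistry defaults as (pa_shift, pa_scale), copied from the create_dataset help text.
DEFAULTS = {
    "dna_r1041": (93.69239463939118, 23.506745239082388),
    "rna004": (80.8758975922949, 17.26975967138176),
}


def z_normalize(pa_values, chemistry="dna_r1041", pa_shift=None, pa_scale=None):
    """Z-normalize picoampere values, falling back to the chemistry defaults."""
    default_shift, default_scale = DEFAULTS[chemistry]
    shift = default_shift if pa_shift is None else pa_shift
    scale = default_scale if pa_scale is None else pa_scale
    # Assumed form of the normalization: subtract the shift, divide by the scale.
    return [(value - shift) / scale for value in pa_values]


print(z_normalize([70.0, 93.69239463939118, 120.0]))
print(z_normalize([80.8758975922949], chemistry="rna004"))
```

Overriding the defaults is mostly relevant when training on data from a chemistry whose current levels differ from the two built-in presets.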
Some parameters are automatically tuned based on sequencing pore type:
| Chemistry | Sample Rate | Seq Speed | Default Signal Chunk | Expected Bases | Min Seq Len | Max Seq Len |
|---|---|---|---|---|---|---|
| dna_r1041 | 5000 Hz | 400 bp/s | 5000 | 400 | 300 | 500 |
| rna004 | 4000 Hz | 130 bp/s | 4000 | 130 | 97 | 162 |
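The values in the table follow from simple arithmetic: a chunk of signal_chunk_len samples lasts signal_chunk_len / sample_rate seconds, and the expected number of bases is that duration times the sequencing speed. The sketch below reproduces the table under that reading. The function name is hypothetical, and the expected_bases formula is inferred from the table rather than taken from the source code; the 0.75 and 1.25 factors come from the help text.

```python
def chunk_defaults(sample_rate_hz, speed_bps, signal_chunk_len=None):
    """Derive the expected base count and the min/max sequence lengths for one chunk."""
    if signal_chunk_len is None:
        signal_chunk_len = sample_rate_hz  # table default: one second of signal
    expected = signal_chunk_len / sample_rate_hz * speed_bps
    return {
        "signal_chunk_len": signal_chunk_len,
        "expected_bases": expected,
        "min_chunk_seq_len": int(0.75 * expected),
        "max_chunk_seq_len": int(1.25 * expected),
    }


print(chunk_defaults(5000, 400))  # dna_r1041
print(chunk_defaults(4000, 130))  # rna004
```

The same calculation gives a starting point for --min_chunk_seq_len and --max_chunk_seq_len when you pass a non-default --signal_chunk_len.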
# DNA R10.4.1 dataset
python -m nano_signal_simulator create_dataset \
--bam /path/to/aligned.bam \
--pod5 /path/to/reads.pod5 \
--kmer_model /path/to/dna_r10.4.1_e8.2_400bps/9mer_levels_v1.txt \
--chemistry dna_r1041 \
--output /path/to/output_dataset
# RNA004 dataset
python -m nano_signal_simulator create_dataset \
--bam /path/to/rna_aligned.bam \
--pod5 /path/to/rna_reads.pod5 \
--kmer_model /path/to/rna004/9mer_levels_v1.txt \
--chemistry rna004 \
--output /path/to/rna_dataset
# Filter by chromosome
python -m nano_signal_simulator create_dataset \
--bam /path/to/aligned.bam \
--pod5 /path/to/reads.pod5 \
--kmer_model /path/to/9mer_levels_v1.txt \
--chemistry dna_r1041 \
--output /path/to/chr1_chr2_dataset \
--chrom_ids chr1 chr2
An interactive web-based viewer for inspecting a training dataset. Designed for remote machines with SSH port forwarding.
usage: python -m nano_signal_simulator visualize [-h] --data_dir DATA_DIR [--port PORT] [--host HOST]
NanoSimFormer Dataset Visualizer - interactive web viewer for training chunks. Bind to 0.0.0.0 for SSH tunnelling.
options:
-h, --help show this help message and exit
--data_dir DATA_DIR Path to directory containing signal.npy, sequence.npy, sequence_length.npy, and dwells.npy. (default: None)
--port PORT Port to serve on. Forward via: ssh -L <port>:localhost:<port> user@host (default: 8765)
--host HOST Host to bind to. Use 0.0.0.0 for remote access. (default: 0.0.0.0)
# On the remote machine (non-Docker):
python -m nano_signal_simulator visualize --data_dir /path/to/dataset --port 8765
# On your local machine, forward the port:
ssh -L 8765:localhost:8765 user@remote_ip
# Then open in your browser on local machine:
# http://localhost:8765
For Docker users running visualization on a remote machine, use Docker port forwarding with -p:
# On the remote machine (Docker):
docker run --rm -it -v /path/to/dataset:/data -p 8765:8765 \
chobits323/nano-sim:v1.1 python -m nano_signal_simulator visualize --data_dir /data --port 8765
# On your local machine, forward the port:
ssh -L 8765:localhost:8765 user@remote_ip
# Then open in your browser on local machine:
# http://localhost:8765
Train a NanoSimFormer model using basecaller-guided loss. Training uses multi-GPU Distributed Data Parallel (DDP) by default.
- A prepared dataset from create_dataset
- A pre-trained basecalling model: use --preset to load official ONT basecalling models (pre-installed in the Docker image; for non-Docker usage, run bonito download --models first), or provide a custom --calling_model directory
usage: python -m nano_signal_simulator train [-h] --dataset DATASET --output OUTPUT [--preset {dna_r1041,rna004}] [--calling_model CALLING_MODEL] [--caller_name CALLER_NAME] [--batch-size BATCH_SIZE] [--epochs EPOCHS] [--lr LR]
[--save_every_steps SAVE_EVERY_STEPS] [--log_freq LOG_FREQ] [--num_workers NUM_WORKERS] [--seed SEED] [--ngpus NGPUS]
options:
-h, --help show this help message and exit
--dataset DATASET Training dataset directory (default: None)
--output OUTPUT Output directory for logger and weights (default: None)
--preset {dna_r1041,rna004}
Supported sequencing preset (default: None)
--calling_model CALLING_MODEL
Directory of basecalling model (default: None)
--caller_name CALLER_NAME
Class name of implemented basecaller (default: ONTBasecaller)
--batch-size BATCH_SIZE
Total batch size (default: 64)
--epochs EPOCHS Number of epochs (default: 1)
--lr LR Learning rate (default: 0.0002)
--save_every_steps SAVE_EVERY_STEPS
Save checkpoint every steps (default: None)
--log_freq LOG_FREQ Logging step frequency (default: 1)
--num_workers NUM_WORKERS
Number of workers (default: 0)
--seed SEED Seed (default: 42)
--ngpus NGPUS Number of GPUs used (default: 1)
# Train with ONT's basecalling model for DNA R10.4.1
python -m nano_signal_simulator train \
--dataset /path/to/training_dataset \
--output /path/to/training_output \
--preset dna_r1041 \
--batch-size 128 \
--epochs 5 \
--lr 0.0002 \
--ngpus 2
# Save intermediate checkpoints every 5000 steps
python -m nano_signal_simulator train \
--dataset /path/to/training_dataset \
--output /path/to/training_output \
--preset dna_r1041 \
--epochs 5 \
--save_every_steps 5000
training_output/
├── log/                             # TensorBoard logs
│   └── events.out.*
└── weights/                         # Model checkpoints
    ├── epoch_1.checkpoint.pth
    ├── epoch_2.checkpoint.pth
    └── step_5000.checkpoint.pth     # (if --save_every_steps is set)
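When many checkpoints accumulate in weights/, it helps to sort them numerically rather than alphabetically (so that epoch_10 comes after epoch_2). The sketch below does this from the file names shown in the tree; the helper names are hypothetical, and the file name pattern is assumed from this example layout only.

```python
import re
from pathlib import Path

# File name pattern assumed from the example tree: epoch_N or step_N.
PATTERN = re.compile(r"^(epoch|step)_(\d+)\.checkpoint\.pth$")


def sort_checkpoints(names):
    """Return (kind, index, name) tuples sorted by kind, then by numeric index."""
    found = []
    for name in names:
        match = PATTERN.match(name)
        if match:
            found.append((match.group(1), int(match.group(2)), name))
    return sorted(found)


def latest_epoch_checkpoint(weights_dir):
    """Return the path of the highest-numbered epoch checkpoint, or None if there is none."""
    names = [p.name for p in Path(weights_dir).glob("*.checkpoint.pth")]
    epochs = [c for c in sort_checkpoints(names) if c[0] == "epoch"]
    return Path(weights_dir) / epochs[-1][2] if epochs else None


print(sort_checkpoints(["epoch_10.checkpoint.pth", "epoch_2.checkpoint.pth", "step_5000.checkpoint.pth"]))
```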
Monitor training:
tensorboard --logdir /path/to/training_output/log
NanoSimFormer is designed to be extensible. Here is a step-by-step guide for adapting the model to future ONT sequencing updates (e.g., new pore types or basecalling models).
Prepare training data following the dataset format below (the same format produced by the create_dataset command):
| File | Shape | Dtype | Description |
|---|---|---|---|
| signal.npy | (N, signal_chunk_len) | float32 | Z-normalized signal chunks |
| sequence.npy | (N, max_seq_len) | int8 | Encoded bases (A=1, C=2, G=3, T/U=4, pad=0) |
| sequence_length.npy | (N,) | int32 | Actual (unpadded) sequence length per chunk |
| dwells.npy | (N, max_seq_len) | int32 | Per-base dwell times in samples (sum must equal signal_chunk_len) |
As long as your training data follows this format, NanoSimFormer can be retrained for any new pore type or chemistry.
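As an illustration of the sequence.npy encoding in the table, the sketch below maps bases to integer codes and pads to a fixed length. The function name is hypothetical, and rejecting ambiguous bases such as N with an error is an assumption; the table does not say how such bases should be handled.

```python
# Base codes from the dataset table; 0 is reserved for padding.
BASE_CODES = {"A": 1, "C": 2, "G": 3, "T": 4, "U": 4}


def encode_sequence(seq, max_seq_len):
    """Return (padded integer codes, unpadded length) for one chunk sequence."""
    if len(seq) > max_seq_len:
        raise ValueError("sequence is longer than max_seq_len")
    try:
        codes = [BASE_CODES[base] for base in seq.upper()]
    except KeyError as err:
        raise ValueError(f"unsupported base: {err.args[0]}") from None
    # Pad on the right with zeros up to max_seq_len.
    return codes + [0] * (max_seq_len - len(codes)), len(codes)


print(encode_sequence("ACGU", 6))
print(encode_sequence("acgt", 4))
```

The two return values correspond to one row of sequence.npy and one entry of sequence_length.npy.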
The training pipeline uses a pluggable basecaller interface. To support a new basecaller, implement the BaseCaller abstract class in train.py:
from abc import ABC, abstractmethod

import torch.nn as nn


class BaseCaller(ABC):
    @abstractmethod
    def load_model(self, model_directory) -> nn.Module:
        """Load the pre-trained basecalling model from a directory."""
        pass

    @abstractmethod
    def basecalling_loss(self, model, x, targets, target_lengths):
        """Compute the basecalling loss for the basecaller on the given signal."""
        pass

Example: Adding a new basecaller implementation
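import os

import torch


class NewBasecaller(BaseCaller):
    def __init__(self):
        super().__init__()

    def load_model(self, model_directory):
        # Load your custom model or new official model.
        # load_config and NewModel are placeholders for your own config loader and model class.
        config = load_config(os.path.join(model_directory, "config.toml"))
        model = NewModel(config)
        state_dict = torch.load(os.path.join(model_directory, "weights.pth"), map_location="cpu")
        model.load_state_dict(state_dict)
        return model

    def basecalling_loss(self, model, x, targets, target_lengths):
        model.eval()
        # Forward pass through the basecaller; logits are assumed to be (batch, time, classes)
        logits = model(x)
        # Every chunk in the batch uses the full output length of the basecaller
        input_lengths = torch.full(
            (logits.size(0),), logits.size(1), dtype=torch.long, device=logits.device
        )
        # Compute CTC loss; ctc_loss expects log-probabilities shaped (time, batch, classes)
        loss = torch.nn.functional.ctc_loss(
            logits.log_softmax(dim=-1).permute(1, 0, 2),
            targets, input_lengths, target_lengths,
        )
        return loss

Then, train with your new basecaller: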
python -m nano_signal_simulator train \
--dataset /path/to/dataset \
--output /path/to/output \
--calling_model /path/to/new_calling_model_weight_dir \
--caller_name NewBasecaller \
--epochs 5
Some code snippets used to build the model were adapted from the torchtune library. We also integrated some preprocessing code snippets from seq2squiggle to handle read sampling at a given sequencing coverage. We use bonito for loading official ONT basecalling models and remora for signal-to-sequence alignment during training dataset preparation.
Please cite our publication if you use NanoSimFormer in your work:
@article{nanosimformer,
title={NanoSimFormer: An end-to-end Transformer-based simulator for nanopore sequencing signal data},
author={Xie, Shaohui and Ding, Lulu and Liu, Ling and Zhu, Zexuan},
journal={bioRxiv},
pages={2026--01},
year={2026},
publisher={Cold Spring Harbor Laboratory}
}
Copyright 2026 Zexuan Zhu <zhuzx@szu.edu.cn>.
This project is licensed under the Apache License 2.0. See the LICENSE file for details.