GitHub - BioinfoSZU/NanoSimFormer

NanoSimFormer: An end-to-end Transformer-based nanopore signal simulator with basecaller guidance

NanoSimFormer is a high-fidelity nanopore sequencing signal simulator built on Transformer architectures. It supports both ONT DNA (R10.4.1) and direct-RNA (RNA004) signal simulation, enabling users to generate synthetic POD5 files from references or existing basecalled reads.

🚀 Download and Installation

System Requirements

GPU: NVIDIA GPU with CUDA compute capability >= 8.x (e.g., Ampere, Ada, or Hopper GPUs like A100, RTX 3090, RTX 4090, H100)
Driver: NVIDIA driver version >= 525

NanoSimFormer is compatible with Linux and has been fully tested on Ubuntu 22.04.

Installation via Docker

We recommend installing NanoSimFormer using the pre-built Docker image. Ensure you have Docker and the NVIDIA Container Toolkit installed by following this tutorial.

docker pull chobits323/nano-sim:v1.1

Install from conda and pip

Alternatively, you can install NanoSimFormer in a conda environment.

conda create -n nanosim -c conda-forge python==3.10
conda activate nanosim
pip install nanosimformer -i https://pypi.org/simple --extra-index-url https://download.pytorch.org/whl/cu126

# install flash-attn
wget https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.4.17/flash_attn-2.7.4.post1+cu126torch2.9-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.7.4.post1+cu126torch2.9-cp310-cp310-linux_x86_64.whl

Note: The Docker image already contains all required pre-trained model files. For pip/conda installations, download the model checkpoints from Zenodo and unzip them to the package install path, following the instructions below:

# Find the package install path
python -c "import nano_signal_simulator; print(nano_signal_simulator.__path__[0])"
# Example output: /home/xsh/anaconda3/envs/nanosim/lib/python3.10/site-packages/nano_signal_simulator

# Download and unzip model checkpoints
wget https://zenodo.org/records/20364515/files/models.zip
unzip models.zip -d "$(python -c "import nano_signal_simulator; print(nano_signal_simulator.__path__[0])")"

⚙️ CLI Overview

NanoSimFormer provides a unified command-line interface with four subcommands:

python -m nano_signal_simulator <command> [options]

Commands:
  simulate          Simulate nanopore signals from a reference genome/transcriptome or reads
  create_dataset    Create a training dataset from aligned BAM + POD5 files
  visualize         Launch an interactive web viewer for inspecting dataset chunks
  train             Train a NanoSimFormer model

Use -h with any subcommand for detailed help:

python -m nano_signal_simulator simulate -h
python -m nano_signal_simulator create_dataset -h
python -m nano_signal_simulator visualize -h
python -m nano_signal_simulator train -h

🧬 Signal Simulation (`simulate`)

Quick Start with Docker

# Define your working directory
EXAMPLE_DIR="[WORKING_DIRECTORY_OF_EXAMPLE_DATA]"  # absolute path

# Print help
docker run --rm -it --gpus=all -v ${EXAMPLE_DIR}:${EXAMPLE_DIR} --ipc=host \
  chobits323/nano-sim:v1.1 python -m nano_signal_simulator simulate -h

Subcommand Options

usage: python -m nano_signal_simulator simulate [-h] --input INPUT --output OUTPUT [--prefix PREFIX] [--basecall] [--emit-bam] --mode {Reference,Read} [--coverage COVERAGE] [--sample-reads SAMPLE_READS] [--sample-output SAMPLE_OUTPUT]
                                                [--trans-profile TRANS_PROFILE] [--gpu GPU] [--batch-size BATCH_SIZE] [--config CONFIG] --preset {ont_r1041_dna_5khz,ont_rna004_4khz} [--noise-stdv NOISE_STDV] [--duration-stdv DURATION_STDV]
                                                [--mean-read-length MEAN_READ_LENGTH] [--min-read-length MIN_READ_LENGTH] [--max-read-length MAX_READ_LENGTH] [--length-dist-mode {stat,expon}] [--seed SEED] [--multi-gpu] [--gpus GPUS]
                                                [--version]

Nanopore sequencing signal simulator

options:
  -h, --help            show this help message and exit
  --input INPUT         FASTA file path for reference (genome or transcriptome) simulation; FASTQ file path for basecalled read simulation
  --output OUTPUT       output directory
  --prefix PREFIX       output prefix (default: simulate)
  --basecall            enable basecalling simulated reads (default: False)
  --emit-bam            basecalling simulated reads are stored in BAM files (default: False)
  --mode {Reference,Read}
                        (Reference or Read) simulation mode
  --coverage COVERAGE   sequencing coverage (default: 1)
  --sample-reads SAMPLE_READS
                        number of reads to simulate (default: -1)
  --sample-output SAMPLE_OUTPUT
                        output sampled reads (FASTA format) from reference (default: None)
  --trans-profile TRANS_PROFILE
                        3-column TSV file for simulating transcripts with specific abundance and truncation (default: None)
  --gpu GPU             GPU device id (default: 0)
  --batch-size BATCH_SIZE
                        batch size (default: 64)
  --config CONFIG       model configuration file (default: None)
  --preset {ont_r1041_dna_5khz,ont_rna004_4khz}
                        ont platform preset
  --noise-stdv NOISE_STDV
                        noise sampler standard deviation (default: None)
  --duration-stdv DURATION_STDV
                        duration sampler standard deviation (default: None)
  --mean-read-length MEAN_READ_LENGTH
                        mean read length (default: None)
  --min-read-length MIN_READ_LENGTH
                        min read length (default: 40)
  --max-read-length MAX_READ_LENGTH
                        max read length (default: None)
  --length-dist-mode {stat,expon}
                        simulated read length using exponential distribution or statistical model derived from the HG002 R10.4.1 sample (default: stat)
  --seed SEED           random seed (default: 42)
  --multi-gpu           enable multi-GPU inference using all available GPUs (default: False)
  --gpus GPUS           comma-separated GPU ids for multi-GPU inference, e.g. "0,1,2"
  --version             show program's version number and exit

DNA Sequencing Simulation Examples (R10.4.1)

Reference-Based Simulation

Simulate reads from a reference genome (FASTA) given a specific read number or sequencing coverage.

# Simulate 1000 reads from the chromosome 22 reference.
python -m nano_signal_simulator simulate \
  --input ${EXAMPLE_DIR}/DNA_R10.4.1/chr22.fasta \
  --output ${EXAMPLE_DIR}/DNA_R10.4.1/output \
  --mode Reference --sample-reads 1000 --gpu 0 --preset ont_r1041_dna_5khz 

# Simulate reads with 0.1x sequencing coverage from the chromosome 22 reference.
python -m nano_signal_simulator simulate \
  --input ${EXAMPLE_DIR}/DNA_R10.4.1/chr22.fasta \
  --output ${EXAMPLE_DIR}/DNA_R10.4.1/output \
  --mode Reference --coverage 0.1 --gpu 0 --preset ont_r1041_dna_5khz
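
The relationship between --coverage and --sample-reads can be approximated with simple arithmetic: total simulated bases are roughly coverage times reference length. The sketch below is an illustration only. The function name and the example numbers are hypothetical, and the simulator's own read sampling may differ.

```python
import math


def reads_for_coverage(genome_length: int, coverage: float, mean_read_length: float) -> int:
    """Approximate read count so that total bases ~= coverage * genome_length."""
    if genome_length <= 0 or mean_read_length <= 0 or coverage < 0:
        raise ValueError("lengths must be positive and coverage non-negative")
    return math.ceil(coverage * genome_length / mean_read_length)


# Hypothetical numbers: a 50 Mb reference, 0.1x coverage, 5 kb mean read length.
print(reads_for_coverage(50_000_000, 0.1, 5_000))
```

An estimate like this helps when choosing between a fixed read count and a coverage target for a given reference.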

Adjusting Noise and Duration parameters

Adjust the standard deviation of the amplitude-noise or duration sampler to generate simulated signals of varying quality.

python -m nano_signal_simulator simulate \
  --input ${EXAMPLE_DIR}/DNA_R10.4.1/chr22.fasta \
  --output ${EXAMPLE_DIR}/DNA_R10.4.1/output \
  --mode Reference --sample-reads 1000 --gpu 0 --preset ont_r1041_dna_5khz \
  --noise-stdv 1.0 --duration-stdv 0.8

Circular Reference-Based Simulation

NanoSimFormer detects circular sequences from a circular=true/false tag in the FASTA header. For circular sequences, simulated reads can wrap around the end of the sequence.

Example FASTA format:

>contig_1 circular=true
ATCG...
>contig_2 circular=false
CGAA...
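
As a rough illustration of the header convention above, the following sketch parses the circular tag and extracts a read that wraps around the end of a circular contig. Both function names are hypothetical and this is not NanoSimFormer's internal code. It assumes the tag appears as a whitespace-separated field after the contig name.

```python
def parse_circular_flags(fasta_text: str) -> dict[str, bool]:
    """Map each contig name to its circular flag (False when the tag is absent)."""
    flags = {}
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue
        fields = line[1:].split()
        if not fields:
            continue
        flags[fields[0]] = any(f.lower() == "circular=true" for f in fields[1:])
    return flags


def extract_read(sequence: str, start: int, length: int, circular: bool) -> str:
    """Take `length` bases from `start`, wrapping past the end only if circular."""
    if not sequence:
        return ""
    if not circular:
        return sequence[start:start + length]
    return "".join(sequence[(start + i) % len(sequence)] for i in range(length))


example = ">contig_1 circular=true\nATCG\n>contig_2 circular=false\nCGAA\n"
flags = parse_circular_flags(example)
print(flags)
print(extract_read("ATCG", 2, 4, flags["contig_1"]))
```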

# Simulate 1000 reads from a circular E. coli reference genome (including plasmids)
# using a custom mean read length and an exponential read length distribution.
python -m nano_signal_simulator simulate \
  --input ${EXAMPLE_DIR}/DNA_R10.4.1/ecoli.fasta \
  --output ${EXAMPLE_DIR}/DNA_R10.4.1/output \
  --mode Reference --sample-reads 1000 --gpu 0 --preset ont_r1041_dna_5khz \
  --mean-read-length 4394 --length-dist-mode expon
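
To give a feel for what an exponential read length distribution looks like, here is a minimal sketch that draws lengths with a given mean and rejects values outside the allowed range. It is an assumption-laden illustration, not the simulator's sampler. The function name is hypothetical, and the defaults for the minimum length (40) and seed (42) are simply copied from the CLI help above.

```python
import random


def sample_read_lengths(n, mean_length, min_length=40, max_length=None, seed=42):
    """Draw n read lengths from an exponential distribution via rejection sampling."""
    if max_length is not None and max_length < min_length:
        raise ValueError("max_length must be >= min_length")
    rng = random.Random(seed)
    lengths = []
    while len(lengths) < n:
        length = int(rng.expovariate(1.0 / mean_length))
        if length < min_length:
            continue  # too short, draw again
        if max_length is not None and length > max_length:
            continue  # too long, draw again
        lengths.append(length)
    return lengths


lengths = sample_read_lengths(1000, mean_length=4394)
print(min(lengths), max(lengths), sum(lengths) / len(lengths))
```

Note that rejecting short or long draws shifts the empirical mean away from the requested value, so the printed mean will not match 4394 exactly.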

Read-based simulation

Generate signals from basecalled reads in a FASTQ file; each input read yields one simulated signal.

Note: Read IDs in the FASTQ file must be in UUID format for POD5 compatibility.

python -m nano_signal_simulator simulate \
  --input ${EXAMPLE_DIR}/DNA_R10.4.1/example.fastq \
  --output ${EXAMPLE_DIR}/DNA_R10.4.1/output \
  --mode Read --gpu 0 --preset ont_r1041_dna_5khz
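
Because read IDs must be UUIDs, it can be useful to check a FASTQ file before simulation. The sketch below uses Python's standard uuid module. The helper names are hypothetical, and it assumes plain four-line FASTQ records (no wrapped sequence lines) and the canonical hyphenated UUID form.

```python
import uuid


def is_uuid(read_id: str) -> bool:
    """True if read_id is a canonical hyphenated UUID string."""
    try:
        return str(uuid.UUID(read_id)) == read_id.lower()
    except ValueError:
        return False


def non_uuid_read_ids(fastq_lines):
    """Yield read IDs from header lines (every 4th line) that are not UUIDs."""
    for index, line in enumerate(fastq_lines):
        if index % 4 != 0:
            continue
        fields = line[1:].split()
        read_id = fields[0] if fields else ""
        if not is_uuid(read_id):
            yield read_id


records = [
    "@123e4567-e89b-12d3-a456-426614174000 extra", "ACGT", "+", "IIII",
    "@read_1", "ACGT", "+", "IIII",
]
print(list(non_uuid_read_ids(records)))
```

For a real file, pass an open file handle (for example `open("example.fastq")`) instead of the in-memory list.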

Full Pipeline (Signal simulation + Basecalling)

The --basecall option automatically runs Dorado to basecall the simulated reads into a FASTQ file after signal simulation. This option is currently only available for Docker users (Dorado is pre-installed in the Docker image). For non-Docker users, you can install Dorado separately and run basecalling on the output POD5 file yourself. Use --emit-bam together with --basecall to output basecalled results in BAM format instead of FASTQ.

# Simulate 1000 reads from the chromosome 22 reference. 
python -m nano_signal_simulator simulate \
  --input ${EXAMPLE_DIR}/DNA_R10.4.1/chr22.fasta \
  --output ${EXAMPLE_DIR}/DNA_R10.4.1/output \
  --mode Reference --sample-reads 1000 --gpu 0 --preset ont_r1041_dna_5khz --basecall 

# Output basecalled results in BAM format instead of FASTQ
python -m nano_signal_simulator simulate \
  --input ${EXAMPLE_DIR}/DNA_R10.4.1/chr22.fasta \
  --output ${EXAMPLE_DIR}/DNA_R10.4.1/output \
  --mode Reference --sample-reads 1000 --gpu 0 --preset ont_r1041_dna_5khz --basecall --emit-bam

Multi-GPU Inference

Use --multi-gpu to automatically distribute reads across all available GPUs, or --gpus to specify exact GPU IDs.

# Use all available GPUs
python -m nano_signal_simulator simulate \
  --input ${EXAMPLE_DIR}/DNA_R10.4.1/chr22.fasta \
  --output ${EXAMPLE_DIR}/DNA_R10.4.1/output \
  --mode Reference --sample-reads 10000 --preset ont_r1041_dna_5khz --multi-gpu

# Use specific GPUs (e.g., GPU 0 and GPU 2)
python -m nano_signal_simulator simulate \
  --input ${EXAMPLE_DIR}/DNA_R10.4.1/chr22.fasta \
  --output ${EXAMPLE_DIR}/DNA_R10.4.1/output \
  --mode Reference --sample-reads 10000 --preset ont_r1041_dna_5khz --gpus 0,2

# Multi-GPU with basecalling
python -m nano_signal_simulator simulate \
  --input ${EXAMPLE_DIR}/DNA_R10.4.1/chr22.fasta \
  --output ${EXAMPLE_DIR}/DNA_R10.4.1/output \
  --mode Reference --sample-reads 10000 --preset ont_r1041_dna_5khz --multi-gpu --basecall
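
How reads are split across devices is not described here. As one simple possibility, the sketch below parses a --gpus style string and assigns reads to GPUs round-robin. Both function names are hypothetical and the actual scheduling in NanoSimFormer may differ.

```python
def parse_gpu_ids(spec: str) -> list[int]:
    """Parse a comma-separated id string such as "0,2" into [0, 2]."""
    ids = [int(part) for part in spec.split(",") if part.strip()]
    if not ids:
        raise ValueError("no GPU ids given")
    if len(set(ids)) != len(ids):
        raise ValueError("duplicate GPU ids")
    return ids


def partition_reads(read_ids, gpu_ids):
    """Assign reads to GPUs in round-robin order so shard sizes differ by at most one."""
    shards = {gpu: [] for gpu in gpu_ids}
    for index, read_id in enumerate(read_ids):
        shards[gpu_ids[index % len(gpu_ids)]].append(read_id)
    return shards


shards = partition_reads(list(range(5)), parse_gpu_ids("0,2"))
print(shards)
```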

Note: To accelerate the simulation process, you can increase the --batch-size parameter (default: 64) if your GPU has sufficient memory.

Direct-RNA Sequencing Simulation Examples (RNA004)

Transcriptome Reference-Based simulation

For Direct RNA sequencing (DRS), NanoSimFormer requires a 3-column TSV profile defining transcript_name, trunc_start, and trunc_end to simulate realistic transcript abundances and 5'/3' truncations. Each row in the profile TSV represents the metadata for a single simulated read.

Example TSV Profile:

transcript_name	trunc_start	trunc_end
ENST000001	10	1200

python -m nano_signal_simulator simulate \
  --input ${EXAMPLE_DIR}/RNA004/trans_ref.fasta \
  --trans-profile ${EXAMPLE_DIR}/RNA004/trans_profile.tsv \
  --output ${EXAMPLE_DIR}/RNA004/output \
  --mode Reference --gpu 0 --preset ont_rna004_4khz --basecall
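
Since each profile row corresponds to one simulated read, transcript abundance is expressed by repeating a transcript across rows. The following sketch writes such a TSV with Python's csv module. The function name, the second transcript ID, and the coordinate check (0 <= trunc_start < trunc_end) are assumptions for illustration; the exact coordinate convention is not specified above.

```python
import csv
import io


def write_trans_profile(rows, handle):
    """Write (transcript_name, trunc_start, trunc_end) rows as a tab-separated profile."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["transcript_name", "trunc_start", "trunc_end"])
    for name, start, end in rows:
        if not 0 <= start < end:
            raise ValueError(f"invalid truncation range for {name}: {start}-{end}")
        writer.writerow([name, start, end])


# Two reads from ENST000001 and one from a hypothetical ENST000002.
rows = [("ENST000001", 10, 1200), ("ENST000001", 0, 1150), ("ENST000002", 25, 800)]
buffer = io.StringIO()
write_trans_profile(rows, buffer)
print(buffer.getvalue(), end="")
```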

📊 Training Dataset Preparation (`create_dataset`)

This command creates training data from real nanopore sequencing data (aligned BAM + POD5 files). It extracts signal–sequence aligned chunks and outputs four .npy files.

Prerequisites

POD5 file: raw signal data.
Aligned BAM file: Dorado basecalled reads aligned to a reference with minimap2. The BAM must be indexed (samtools index) and must contain the move table (Dorado --emit-moves option) and the MD tag (minimap2 --MD option).
K-mer model table: the official ONT k-mer level table used for signal refinement (download from ONT).

Output Format

The dataset consists of four NumPy files in the output directory:

File	Shape	Dtype	Description
`signal.npy`	`(N, signal_chunk_len)`	`float32`	Z-normalized signal chunks
`sequence.npy`	`(N, max_chunk_seq_len)`	`int8`	Encoded bases (A=1, C=2, G=3, T=4, pad=0)
`sequence_length.npy`	`(N,)`	`int32`	Actual (unpadded) sequence length per chunk
`dwells.npy`	`(N, max_chunk_seq_len)`	`int32`	Per-base dwell times (samples), sum = `signal_chunk_len`
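
The base encoding in the table can be illustrated with a few lines of plain Python. This sketch only mirrors the mapping listed above (A=1, C=2, G=3, T=4, pad=0). The function name is hypothetical and it is not the package's own encoder.

```python
BASE_CODES = {"A": 1, "C": 2, "G": 3, "T": 4}
PAD = 0


def encode_bases(sequence: str, max_chunk_seq_len: int):
    """Return (padded integer codes, unpadded length) for one chunk sequence."""
    if len(sequence) > max_chunk_seq_len:
        raise ValueError("sequence is longer than max_chunk_seq_len")
    codes = [BASE_CODES[base] for base in sequence.upper()]
    padded = codes + [PAD] * (max_chunk_seq_len - len(codes))
    return padded, len(codes)


print(encode_bases("ACGT", 6))
```

The second return value corresponds to one entry of `sequence_length.npy`, and the padded list to one row of `sequence.npy`.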

Subcommand Options

usage: python -m nano_signal_simulator create_dataset [-h] --bam BAM --pod5 POD5 --kmer_model KMER_MODEL --output OUTPUT --chemistry {dna_r1041,rna004} [--pa_shift PA_SHIFT] [--pa_scale PA_SCALE] [--min_identity MIN_IDENTITY]
                                                      [--min_quality MIN_QUALITY] [--min_align_ratio MIN_ALIGN_RATIO] [--min_aligned_read_len MIN_ALIGNED_READ_LEN] [--min_mapq MIN_MAPQ] [--chrom_ids CHROM_IDS [CHROM_IDS ...]]
                                                      [--signal_chunk_len SIGNAL_CHUNK_LEN] [--max_chunk_seq_len MAX_CHUNK_SEQ_LEN] [--min_chunk_seq_len MIN_CHUNK_SEQ_LEN] [--min_dwell MIN_DWELL] [--max_dwell MAX_DWELL]
                                                      [--max_chunks MAX_CHUNKS] [--max_reads MAX_READS] [--seed SEED]

Create NanoSimFormer training dataset from aligned BAM and POD5 files.

options:
  -h, --help            show this help message and exit

Required arguments:
  --bam BAM             Path to the aligned BAM file (must be indexed with 'samtools index'). (default: None)
  --pod5 POD5           Path to the POD5 file or directory containing POD5 files. (default: None)
  --kmer_model KMER_MODEL
                        Path to the official k-mer level table for signal refinement (e.g., 'dna_r10.4.1_e8.2_400bps/9mer_levels_v1.txt'). (default: None)
  --output OUTPUT       Output directory for the generated .npy dataset files. (default: None)

Chemistry and normalization:
  --chemistry {dna_r1041,rna004}
                        Sequencing chemistry. Determines default signal chunk length, PA normalization values, and read orientation. (default: dna_r1041)
  --pa_shift PA_SHIFT   Picoampere shift for z-normalization. If not provided, uses chemistry default (DNA R10.4.1: 93.69239463939118, RNA004: 80.8758975922949). (default: None)
  --pa_scale PA_SCALE   Picoampere scale for z-normalization. If not provided, uses chemistry default (DNA R10.4.1: 23.506745239082388, RNA004: 17.26975967138176). (default: None)

Read quality filters:
  --min_identity MIN_IDENTITY
                        Minimum alignment identity to include a read. (default: 0.99)
  --min_quality MIN_QUALITY
                        Minimum mean Q-score to include a read. (default: 20.0)
  --min_align_ratio MIN_ALIGN_RATIO
                        Minimum alignment ratio (1 - clipped_fraction). (default: 0.995)
  --min_aligned_read_len MIN_ALIGNED_READ_LEN
                        Minimum aligned read length (bases). (default: 1000)
  --min_mapq MIN_MAPQ   Minimum mapping quality (MAPQ). (default: 1)
  --chrom_ids CHROM_IDS [CHROM_IDS ...]
                        Restrict to specific chromosome IDs (e.g., chr1 chr2). If not specified, all chromosomes are used. (default: None)

Chunk extraction:
  --signal_chunk_len SIGNAL_CHUNK_LEN
                        Fixed signal chunk length in samples. Default: 5000 for DNA R10.4.1 (5 kHz), 4000 for RNA004 (4 kHz). (default: None)
  --max_chunk_seq_len MAX_CHUNK_SEQ_LEN
                        Maximum number of bases per chunk. Default: auto-computed as int(1.25 × expected_bases). DNA R10.4.1: 500, RNA004: 162. (default: None)
  --min_chunk_seq_len MIN_CHUNK_SEQ_LEN
                        Minimum number of bases per chunk. Default: auto-computed as int(0.75 × expected_bases). DNA R10.4.1: 300, RNA004: 97. (default: None)
  --min_dwell MIN_DWELL
                        Minimum per-base dwell time (signal samples). (default: 4)
  --max_dwell MAX_DWELL
                        Maximum per-base dwell time (signal samples). (default: 256)
  --max_chunks MAX_CHUNKS
                        Maximum number of chunks to keep (randomly subsampled). If not specified, all extracted chunks are saved. (default: None)
  --max_reads MAX_READS
                        Maximum number of reads to process (randomly subsampled after filtering). Useful for quick previews or debugging. If not specified, all filtered reads are used. (default: None)

Execution:
  --seed SEED           Random seed for reproducible subsampling. (default: 42)
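
The --pa_shift and --pa_scale options above describe a z-normalization of picoampere values. A minimal sketch, assuming the conventional form (value - shift) / scale, is shown below. The function name is hypothetical, the default constants are copied from the help text, and the package's actual normalization code may differ.

```python
# Chemistry defaults copied from the help text: (pa_shift, pa_scale).
PA_DEFAULTS = {
    "dna_r1041": (93.69239463939118, 23.506745239082388),
    "rna004": (80.8758975922949, 17.26975967138176),
}


def z_normalize(picoamps, chemistry="dna_r1041", pa_shift=None, pa_scale=None):
    """Normalize picoampere samples, falling back to the chemistry defaults."""
    default_shift, default_scale = PA_DEFAULTS[chemistry]
    shift = default_shift if pa_shift is None else pa_shift
    scale = default_scale if pa_scale is None else pa_scale
    return [(value - shift) / scale for value in picoamps]


print(z_normalize([70.0, 93.69239463939118, 120.0]))
```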

Several defaults are derived automatically from the selected chemistry:

Chemistry	Sample Rate	Seq Speed	Default Signal Chunk	Expected Bases	Min Seq Len	Max Seq Len
`dna_r1041`	5000 Hz	400 bp/s	5000	400	300	500
`rna004`	4000 Hz	130 bp/s	4000	130	97	162
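
The values in this table follow from the sample rate and sequencing speed: a chunk of `signal_chunk_len` samples covers `signal_chunk_len / sample_rate` seconds, and the minimum and maximum sequence lengths are 0.75 and 1.25 times the expected number of bases, truncated to integers. The sketch below reproduces the table. The function and dictionary names are hypothetical, and scaling the expected bases for a custom chunk length is an assumption.

```python
CHEMISTRIES = {
    "dna_r1041": {"sample_rate_hz": 5000, "speed_bases_per_s": 400},
    "rna004": {"sample_rate_hz": 4000, "speed_bases_per_s": 130},
}


def chunk_defaults(chemistry: str, signal_chunk_len=None) -> dict:
    """Derive expected bases and min/max chunk sequence lengths for a chemistry."""
    params = CHEMISTRIES[chemistry]
    if signal_chunk_len is None:
        # The table's default chunk equals one second of signal.
        signal_chunk_len = params["sample_rate_hz"]
    expected = signal_chunk_len * params["speed_bases_per_s"] / params["sample_rate_hz"]
    return {
        "signal_chunk_len": signal_chunk_len,
        "expected_bases": expected,
        "min_chunk_seq_len": int(0.75 * expected),
        "max_chunk_seq_len": int(1.25 * expected),
    }


for name in CHEMISTRIES:
    print(name, chunk_defaults(name))
```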

Usage Example

# DNA R10.4.1 dataset
python -m nano_signal_simulator create_dataset \
  --bam /path/to/aligned.bam \
  --pod5 /path/to/reads.pod5 \
  --kmer_model /path/to/dna_r10.4.1_e8.2_400bps/9mer_levels_v1.txt \
  --chemistry dna_r1041 \
  --output /path/to/output_dataset

# RNA004 dataset
python -m nano_signal_simulator create_dataset \
  --bam /path/to/rna_aligned.bam \
  --pod5 /path/to/rna_reads.pod5 \
  --kmer_model /path/to/rna004/9mer_levels_v1.txt \
  --chemistry rna004 \
  --output /path/to/rna_dataset

# Restrict to specific chromosomes (here chr1 and chr2)
python -m nano_signal_simulator create_dataset \
  --bam /path/to/aligned.bam \
  --pod5 /path/to/reads.pod5 \
  --kmer_model /path/to/9mer_levels_v1.txt \
  --chemistry dna_r1041 \
  --output /path/to/chr1_chr2_dataset \
  --chrom_ids chr1 chr2

🔍 Dataset Visualization (`visualize`)

An interactive web-based viewer for inspecting a training dataset. It is designed for remote machines accessed through SSH port forwarding.

Subcommand Options

usage: python -m nano_signal_simulator visualize [-h] --data_dir DATA_DIR [--port PORT] [--host HOST]

NanoSimFormer Dataset Visualizer — interactive web viewer for training chunks. Bind to 0.0.0.0 for SSH tunnelling.

options:
  -h, --help           show this help message and exit
  --data_dir DATA_DIR  Path to directory containing signal.npy, sequence.npy, sequence_length.npy, and dwells.npy. (default: None)
  --port PORT          Port to serve on. Forward via: ssh -L <port>:localhost:<port> user@host (default: 8765)
  --host HOST          Host to bind to. Use 0.0.0.0 for remote access. (default: 0.0.0.0)

Usage

# On the remote machine (non-Docker):
python -m nano_signal_simulator visualize --data_dir /path/to/dataset --port 8765

# On your local machine, forward the port:
ssh -L 8765:localhost:8765 user@remote_ip

# Then open in your browser on local machine:
# http://localhost:8765

For Docker users running visualization on a remote machine, use Docker port forwarding with -p:

# On the remote machine (Docker):
docker run --rm -it -v /path/to/dataset:/data -p 8765:8765 \
  chobits323/nano-sim:v1.1 python -m nano_signal_simulator visualize --data_dir /data --port 8765

# On your local machine, forward the port:
ssh -L 8765:localhost:8765 user@remote_ip

# Then open in your browser on local machine:
# http://localhost:8765

🏋️ Training (`train`)

Train a NanoSimFormer model using a basecaller-guided loss. Training runs with Distributed Data Parallel (DDP); set --ngpus to train on more than one GPU (default: 1).

Prerequisites

A prepared dataset from create_dataset
A pre-trained basecalling model — use --preset to load official ONT basecalling models (pre-installed in the Docker image; for non-Docker usage, run bonito download --models first), or provide a custom --calling_model directory

Subcommand Options

usage: python -m nano_signal_simulator train [-h] --dataset DATASET --output OUTPUT [--preset {dna_r1041,rna004}] [--calling_model CALLING_MODEL] [--caller_name CALLER_NAME] [--batch-size BATCH_SIZE] [--epochs EPOCHS] [--lr LR]
                                             [--save_every_steps SAVE_EVERY_STEPS] [--log_freq LOG_FREQ] [--num_workers NUM_WORKERS] [--seed SEED] [--ngpus NGPUS]

options:
  -h, --help            show this help message and exit
  --dataset DATASET     Training dataset directory (default: None)
  --output OUTPUT       Output directory for logger and weights (default: None)
  --preset {dna_r1041,rna004}
                        Supported sequencing preset (default: None)
  --calling_model CALLING_MODEL
                        Directory of basecalling model (default: None)
  --caller_name CALLER_NAME
                        Class name of implemented basecaller (default: ONTBasecaller)
  --batch-size BATCH_SIZE
                        Total batch size (default: 64)
  --epochs EPOCHS       Number of epochs (default: 1)
  --lr LR               Learning rate (default: 0.0002)
  --save_every_steps SAVE_EVERY_STEPS
                        Save checkpoint every steps (default: None)
  --log_freq LOG_FREQ   Logging step frequency (default: 1)
  --num_workers NUM_WORKERS
                        Number of workers (default: 0)
  --seed SEED           Seed (default: 42)
  --ngpus NGPUS         Number of GPUs used (default: 1)

Usage Example

# Train with ONT's basecalling model for DNA R10.4.1
python -m nano_signal_simulator train \
  --dataset /path/to/training_dataset \
  --output /path/to/training_output \
  --preset dna_r1041 \
  --batch-size 128 \
  --epochs 5 \
  --lr 0.0002 \
  --ngpus 2

# Save intermediate checkpoints every 5000 steps
python -m nano_signal_simulator train \
  --dataset /path/to/training_dataset \
  --output /path/to/training_output \
  --preset dna_r1041 \
  --epochs 5 \
  --save_every_steps 5000
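
When choosing --batch-size, --ngpus, and --save_every_steps, it can help to estimate how many optimizer steps a run will take. The sketch below is a back-of-the-envelope calculation under two assumptions that are not confirmed by the help text: the total batch size is split evenly across GPUs, and the last partial batch of each epoch is kept. The function name and the dataset size are hypothetical.

```python
import math


def training_plan(num_chunks, batch_size=64, epochs=1, ngpus=1, save_every_steps=None):
    """Estimate per-GPU batch size, step counts, and step-based checkpoint positions."""
    if batch_size % ngpus != 0:
        raise ValueError("batch_size should be divisible by ngpus")
    steps_per_epoch = math.ceil(num_chunks / batch_size)
    total_steps = steps_per_epoch * epochs
    if save_every_steps is None:
        step_checkpoints = []
    else:
        step_checkpoints = list(range(save_every_steps, total_steps + 1, save_every_steps))
    return {
        "per_gpu_batch": batch_size // ngpus,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "step_checkpoints": step_checkpoints,
    }


# Hypothetical dataset of 1,280,000 chunks.
plan = training_plan(1_280_000, batch_size=128, epochs=5, ngpus=2, save_every_steps=5000)
print(plan["per_gpu_batch"], plan["steps_per_epoch"], plan["total_steps"])
```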

Training Output

training_output/
├── log/                 # TensorBoard logs
│   └── events.out.*
└── weights/             # Model checkpoints
    ├── epoch_1.checkpoint.pth
    ├── epoch_2.checkpoint.pth
    └── step_5000.checkpoint.pth   # (if --save_every_steps is set)
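
Given the layout above, a small helper can pick the most recent epoch checkpoint, for example before evaluating a trained model. This is a sketch based only on the file names shown in the tree. The function name is hypothetical, and epochs are compared numerically so that epoch_10 sorts after epoch_2.

```python
import re
from pathlib import Path

EPOCH_PATTERN = re.compile(r"^epoch_(\d+)\.checkpoint\.pth$")


def latest_epoch_checkpoint(output_dir):
    """Return the path of the highest-numbered epoch checkpoint, or None if absent."""
    weights_dir = Path(output_dir) / "weights"
    best = None
    for path in weights_dir.glob("epoch_*.checkpoint.pth"):
        match = EPOCH_PATTERN.match(path.name)
        if match is None:
            continue
        epoch = int(match.group(1))
        if best is None or epoch > best[0]:
            best = (epoch, path)
    return None if best is None else best[1]


print(latest_epoch_checkpoint("/path/to/training_output"))
```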

Monitor training:

tensorboard --logdir /path/to/training_output/log

🔧 Adapting to New Sequencing Chemistry or Update

NanoSimFormer is designed to be extensible. Here is a step-by-step guide for adapting the model to future ONT sequencing updates (e.g., new pore types or basecalling models).

Step 1: Prepare Training Data

Prepare training data following the dataset format below (the same format produced by the create_dataset command):

File	Shape	Dtype	Description
`signal.npy`	`(N, signal_chunk_len)`	`float32`	Z-normalized signal chunks
`sequence.npy`	`(N, max_seq_len)`	`int8`	Encoded bases (A=1, C=2, G=3, T/U=4, pad=0)
`sequence_length.npy`	`(N,)`	`int32`	Actual (unpadded) sequence length per chunk
`dwells.npy`	`(N, max_seq_len)`	`int32`	Per-base dwell times in samples (sum must equal `signal_chunk_len`)

As long as your training data follows this format, NanoSimFormer can be retrained for any new pore type or chemistry.
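
Before retraining on custom data, it may be worth checking each chunk against the constraints in the table. The following sketch validates a single chunk using plain Python sequences, so it should also accept rows of arrays loaded with numpy.load. The function name is hypothetical, and the checks reflect only what the table states.

```python
def validate_chunk(sequence_row, sequence_length, dwells_row, signal_chunk_len):
    """Return a list of problems found in one chunk (empty if it looks consistent)."""
    problems = []
    if len(sequence_row) != len(dwells_row):
        problems.append("sequence and dwells rows differ in length")
    if not 0 < sequence_length <= len(sequence_row):
        problems.append("sequence_length is outside the row")
        return problems
    if any(code not in (1, 2, 3, 4) for code in sequence_row[:sequence_length]):
        problems.append("unpadded region contains a code outside 1..4")
    if any(code != 0 for code in sequence_row[sequence_length:]):
        problems.append("padding region contains non-zero codes")
    if sum(dwells_row) != signal_chunk_len:
        problems.append("dwell times do not sum to signal_chunk_len")
    return problems


# A toy chunk: 4 bases padded to 6, with dwells summing to 10 samples.
print(validate_chunk([1, 2, 3, 4, 0, 0], 4, [2, 3, 2, 3, 0, 0], 10))
```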

Step 2: Implement a New Basecaller Class

The training pipeline uses a pluggable basecaller interface. To support a new basecaller, implement the BaseCaller abstract class in train.py:

from abc import ABC, abstractmethod

import torch.nn as nn


class BaseCaller(ABC):
    @abstractmethod
    def load_model(self, model_directory) -> nn.Module:
        """Load the pre-trained basecalling model from a directory."""
        pass

    @abstractmethod
    def basecalling_loss(self, model, x, targets, target_lengths):
        """Compute the basecalling loss for the basecaller on the given signal."""
        pass

Example: Adding a new basecaller implementation

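import os

import torch


class NewBasecaller(BaseCaller):
    def __init__(self):
        super().__init__()

    def load_model(self, model_directory):
        # Load your custom model or a new official model.
        # load_config and NewModel are placeholders that you must provide.
        config = load_config(os.path.join(model_directory, "config.toml"))
        model = NewModel(config)
        state_dict = torch.load(os.path.join(model_directory, "weights.pth"), map_location='cpu')
        model.load_state_dict(state_dict)
        return model

    def basecalling_loss(self, model, x, targets, target_lengths):
        model.eval()
        # Forward pass through the basecaller.
        # This example assumes logits have shape (batch, time, classes).
        logits = model(x)
        # Every item in the batch uses the full time axis as its input length.
        input_lengths = torch.full(
            (logits.shape[0],), logits.shape[1], dtype=torch.long, device=logits.device
        )
        # Compute CTC loss; ctc_loss expects log-probabilities shaped (time, batch, classes).
        loss = torch.nn.functional.ctc_loss(
            logits.log_softmax(dim=-1).permute(1, 0, 2),
            targets, input_lengths, target_lengths,
        )
        return loss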

Then, train with your new basecaller:

python -m nano_signal_simulator train \
  --dataset /path/to/dataset \
  --output /path/to/output \
  --calling_model /path/to/new_calling_model_weight_dir \
  --caller_name NewBasecaller \
  --epochs 5

🙏 Acknowledgement

Some code used to build the model was adapted from the torchtune library. We also integrated preprocessing code from seq2squiggle to handle read sampling at a given sequencing coverage. We use bonito to load official ONT basecalling models and remora for signal-to-sequence alignment during training dataset preparation.

📖 Citation

Please cite our publication if you use NanoSimFormer in your work:

@article{nanosimformer,
  title={NanoSimFormer: An end-to-end Transformer-based simulator for nanopore sequencing signal data},
  author={Xie, Shaohui and Ding, Lulu and Liu, Ling and Zhu, Zexuan},
  journal={bioRxiv},
  pages={2026--01},
  year={2026},
  publisher={Cold Spring Harbor Laboratory}
}

©️ Copyright

Copyright 2026 Zexuan Zhu zhuzx@szu.edu.cn.
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
