cgmlst-dists-py

A high-performance Python implementation of cgmlst-dists for calculating pairwise Hamming distances in cgMLST data.

Installation

Bioconda (recommended)

conda install -c bioconda cgmlst-dists-py

Docker

docker pull ghcr.io/genpat-it/cgmlst-dists-py

From source

git clone https://github.com/genpat-it/cgmlst-dists-py.git
cd cgmlst-dists-py
pip install -r requirements.txt

For GPU support, make sure you have a compatible CUDA Toolkit installed.

Overview

This is an enhanced Python implementation of cgmlst-dists, originally developed by Torsten Seemann. It calculates pairwise Hamming distances between genome profiles in core genome multilocus sequence typing (cgMLST) schemas.
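To make the metric concrete, here is a minimal pure-Python sketch of a pairwise Hamming distance over allele profiles. It is an illustration, not the tool's code: the function names are hypothetical, and it assumes that a locus is skipped whenever either profile carries the missing marker (`-`, the documented `--missing_char` default).

```python
from typing import List, Sequence

MISSING = "-"  # assumed missing-data marker (the documented --missing_char default)


def hamming(a: Sequence[str], b: Sequence[str], missing: str = MISSING) -> int:
    """Count loci where two profiles carry different alleles.

    Assumption: a locus is ignored when either profile has the missing marker.
    """
    if len(a) != len(b):
        raise ValueError("profiles must cover the same loci")
    return sum(1 for x, y in zip(a, b) if x != y and x != missing and y != missing)


def pairwise(profiles: Sequence[Sequence[str]]) -> List[List[int]]:
    """Build the symmetric distance matrix (zero diagonal)."""
    n = len(profiles)
    dist = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = hamming(profiles[i], profiles[j])
    return dist


profiles = [
    ["1", "2", "3", "4"],
    ["1", "2", "5", "4"],
    ["1", "-", "5", "7"],
]
print(pairwise(profiles))  # [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
```

The double loop is quadratic in the number of samples, which is why the vectorized and GPU kernels matter for large inputs.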

Key features in this version (0.1.3):

GPU Acceleration: Optional CUDA GPU support for much faster distance calculation (up to 123x on the calculation step, measured on an NVIDIA L4)
Vectorized CPU Computation: NumPy-based vectorized distance calculation
Optimized Memory Management: Batch processing to handle large datasets efficiently
Multithreaded Processing: Calculations parallelized across CPU cores (NumPy releases the GIL)
Intelligent I/O: Chunked file operations for better performance with large files
Advanced Filtering: Quality control via loci and sample completeness thresholds
Automatic System Detection: Optimizes settings based on available hardware
Binary Output Option: For extremely large matrices
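The vectorized, batched, multithreaded approach listed above can be sketched roughly as follows. This is an assumption-laden illustration rather than the project's kernel: the function names, the integer encoding of alleles, the use of `0` for missing data, and the default batch size are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def hamming_block(profiles: np.ndarray, start: int, stop: int, missing: int = 0) -> np.ndarray:
    """Distances from rows start..stop-1 to every row, using broadcasting."""
    block = profiles[start:stop, None, :]   # shape (batch, 1, loci)
    everyone = profiles[None, :, :]         # shape (1, samples, loci)
    differs = block != everyone
    both_present = (block != missing) & (everyone != missing)
    return (differs & both_present).sum(axis=2, dtype=np.int32)


def pairwise_hamming(profiles: np.ndarray, batch_size: int = 16, num_threads: int = 4) -> np.ndarray:
    """Fill the full distance matrix one batch of rows at a time."""
    n = profiles.shape[0]
    out = np.empty((n, n), dtype=np.int32)

    def work(start: int) -> None:
        stop = min(start + batch_size, n)
        out[start:stop] = hamming_block(profiles, start, stop)  # disjoint row slices

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        list(pool.map(work, range(0, n, batch_size)))  # consume so worker errors surface
    return out


profiles = np.array([[1, 2, 3, 4], [1, 2, 5, 4], [1, 0, 5, 7]])
print(pairwise_hamming(profiles).tolist())  # [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
```

Each batch allocates temporaries of roughly batch × samples × loci elements, so the batch size is the knob that trades memory for fewer, larger NumPy calls.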

Usage

$ python cgmlst-dists.py --help
usage: cgmlst-dists.py [-h] [--input INPUT] [--output OUTPUT] [--skip_input_replacements] 
                       [--input_sep INPUT_SEP] [--output_sep OUTPUT_SEP] [--index_name INDEX_NAME]
                       [--matrix-format {full,lower-tri,upper-tri}] [--num_threads NUM_THREADS] 
                       [--io_threads IO_THREADS] [--max_memory_gb MAX_MEMORY_GB] [--chunk_size CHUNK_SIZE]
                       [--missing_char MISSING_CHAR] [--locus-completeness LOCUS_COMPLETENESS]
                       [--sample-completeness SAMPLE_COMPLETENESS] [--gpu] [--binary-output] [--version]

Calculate pairwise Hamming distances. Version: 0.1.3

options:
  -h, --help            show this help message and exit
  --input INPUT         Path to the input TSV file
  --output OUTPUT       Path to save the output TSV file
  --skip_input_replacements
                        Skip input replacements when there are no strings in the input
  --input_sep INPUT_SEP
                        Input file separator (default: '\t')
  --output_sep OUTPUT_SEP
                        Output file separator (default: '\t')
  --index_name INDEX_NAME
                        Name for the index column (default: 'cgmlst-dists')
  --matrix-format {full,lower-tri,upper-tri}
                        Format for the output matrix (default: full)
  --num_threads NUM_THREADS
                        Number of threads for parallel execution (default: auto-detected)
  --io_threads IO_THREADS
                        Number of I/O threads for file operations
  --max_memory_gb MAX_MEMORY_GB
                        Maximum memory to use in GB for distance calculation
  --chunk_size CHUNK_SIZE
                        Size of chunks for reading/writing files (default: 1000)
  --missing_char MISSING_CHAR
                        Character used for missing data (default: '-')
  --locus-completeness LOCUS_COMPLETENESS
                        Minimum percentage of non-missing data required for a locus (0-100)
  --sample-completeness SAMPLE_COMPLETENESS
                        Minimum percentage of non-missing data required for a sample (0-100)
  --gpu                 Use GPU acceleration when available
  --binary-output       Also save results in binary format for large matrices
  --version             show program's version number and exit
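As a rough illustration of the three `--matrix-format` choices, the sketch below shows which cells each format would keep. The help text does not say whether the diagonal is kept or how omitted cells are written, so this sketch assumes the diagonal is kept and marks omitted cells with `None`; the function name is hypothetical.

```python
def apply_matrix_format(matrix, fmt="full"):
    """Return a copy of a square matrix with cells outside the chosen format set to None."""
    keep = {
        "full": lambda i, j: True,
        "lower-tri": lambda i, j: j <= i,   # assumption: diagonal is kept
        "upper-tri": lambda i, j: j >= i,   # assumption: diagonal is kept
    }[fmt]
    return [[cell if keep(i, j) else None for j, cell in enumerate(row)]
            for i, row in enumerate(matrix)]


dist = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
print(apply_matrix_format(dist, "lower-tri"))  # [[0, None, None], [1, 0, None], [2, 1, 0]]
print(apply_matrix_format(dist, "upper-tri"))  # [[0, 1, 2], [None, 0, 1], [None, None, 0]]
```

Because Hamming distance is symmetric, either triangle carries the same information as the full matrix.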

Examples

Basic Usage

python cgmlst-dists.py --input input.tsv --output output.tsv
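For downstream analysis, the resulting TSV can be read back with the standard library. The sketch below assumes the default `full` matrix format, integer cells, and a header row whose first cell is the index-column name (`cgmlst-dists` by default, per the help text); the function name is hypothetical.

```python
import csv
import io


def read_distance_matrix(handle, sep="\t"):
    """Parse a square distance table into (row_labels, column_labels, matrix)."""
    reader = csv.reader(handle, delimiter=sep)
    header = next(reader)
    columns = header[1:]  # first header cell is the index-column name
    rows, matrix = [], []
    for record in reader:
        if not record:
            continue  # tolerate blank lines
        rows.append(record[0])
        matrix.append([int(value) for value in record[1:]])
    return rows, columns, matrix


# Small in-memory stand-in for an output file.
example = io.StringIO("cgmlst-dists\ts1\ts2\ns1\t0\t3\ns2\t3\t0\n")
rows, columns, matrix = read_distance_matrix(example)
print(rows, matrix)  # ['s1', 's2'] [[0, 3], [3, 0]]
```

With a real file, pass an open handle instead: `with open("output.tsv", newline="") as fh: rows, columns, matrix = read_distance_matrix(fh)`.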

With GPU Acceleration (if available)

python cgmlst-dists.py --input input.tsv --output output.tsv --gpu

Data Filtering

Filter both loci and samples to include only those with ≥90% data completeness:

python cgmlst-dists.py --input input.tsv --output output.tsv --locus-completeness 90 --sample-completeness 90
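The effect of the two thresholds can be sketched in a few lines of Python. This is a hypothetical helper, not the tool's code, and it assumes loci are filtered first and sample completeness is then computed over the remaining loci; the tool's actual order may differ.

```python
def completeness_filter(samples, loci, table, locus_min=90.0, sample_min=90.0, missing="-"):
    """Drop loci, then samples, whose percentage of non-missing calls is below the threshold."""

    def pct_present(values):
        return 100.0 * sum(v != missing for v in values) / len(values) if values else 0.0

    # Assumption: loci first, then samples evaluated on the surviving loci.
    keep_loci = [j for j in range(len(loci))
                 if pct_present([row[j] for row in table]) >= locus_min]
    reduced = [[row[j] for j in keep_loci] for row in table]
    keep_samples = [i for i, row in enumerate(reduced) if pct_present(row) >= sample_min]
    return ([samples[i] for i in keep_samples],
            [loci[j] for j in keep_loci],
            [reduced[i] for i in keep_samples])


samples = ["s1", "s2", "s3", "s4"]
loci = ["a", "b", "c"]
table = [
    ["1", "2", "-"],
    ["1", "3", "-"],
    ["2", "2", "-"],
    ["-", "-", "4"],
]
kept_samples, kept_loci, kept_table = completeness_filter(
    samples, loci, table, locus_min=50, sample_min=60)
print(kept_samples, kept_loci)  # ['s1', 's2', 's3'] ['a', 'b']
```

Here locus `c` is only 25% complete and is dropped; sample `s4` then has no calls left and is dropped as well.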

Handling Large Datasets

For very large datasets, optimize memory and I/O:

python cgmlst-dists.py --input large_data.tsv --output large_output.tsv --max_memory_gb 16 --chunk_size 500 --binary-output

Performance Considerations

GPU Acceleration: Greatly speeds up the distance calculation kernel (up to 123x on an NVIDIA L4); requires a CUDA-capable NVIDIA GPU
CPU Vectorization: The NumPy-based CPU kernel is significantly faster than the previous Numba triple-loop approach and scales well with thread count
Memory Usage: Adjust --max_memory_gb based on your system's available RAM to prevent out-of-memory errors
I/O Performance: For large files, increase --io_threads on systems with fast storage
Binary Output: Useful for very large matrices (>5000 samples) as it provides faster saving/loading for future analysis
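When choosing a value for --max_memory_gb, a back-of-the-envelope estimate of the full distance matrix size helps. The sketch below assumes 4-byte integer cells, which is an assumption rather than a documented detail; the tool's real footprint also includes the input table and working buffers.

```python
def matrix_gib(n_samples: int, bytes_per_cell: int = 4) -> float:
    """Size of a full n x n distance matrix in GiB (assumes fixed-width cells)."""
    return n_samples * n_samples * bytes_per_cell / 2**30


for n in (5_000, 10_000, 50_000):
    print(f"{n:>6} samples: {matrix_gib(n):.2f} GiB")
# Prints 0.09, 0.37 and 9.31 GiB respectively.
```

The quadratic growth is the reason a 50,000-sample run needs batch processing and benefits from binary output.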

Performance Benchmarks

Test System Specifications

CPU: INTEL(R) XEON(R) GOLD 6542Y
CPU Cores: 80
Memory: 480 GB
GPU: NVIDIA L4
GPU Memory: 22 GB
OS: AlmaLinux 10

Distance Calculation Benchmarks (5,000 samples × 3,000 loci)

Method	Calc Time	Total Time	Speedup (calc)
v0.1.1 CPU (8 threads, numba)	55.5s	64.2s	1x
v0.1.3 CPU (8 threads, numpy)	8.5s	17.1s	6.5x
v0.1.3 CPU (16 threads, numpy)	5.1s	13.9s	10.9x
v0.1.3 GPU (NVIDIA L4)	0.45s	9.6s	123x

Distance Calculation Benchmarks (10,000 samples × 3,000 loci)

Method	Calc Time	Total Time
v0.1.1 CPU (8 threads)	50.8s	84.9s
v0.1.3 CPU (8 threads)	33.9s	68.7s
v0.1.3 GPU (NVIDIA L4)	1.3s	35.7s
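The 10,000-sample table has no speedup column; the ratios can be derived from the calculation times above, using the v0.1.1 run as the baseline:

```python
# Calculation times in seconds, copied from the 10,000-sample table.
calc_seconds = {
    "v0.1.1 CPU (8 threads)": 50.8,
    "v0.1.3 CPU (8 threads)": 33.9,
    "v0.1.3 GPU (NVIDIA L4)": 1.3,
}

baseline = calc_seconds["v0.1.1 CPU (8 threads)"]
speedups = {name: baseline / seconds for name, seconds in calc_seconds.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")
# v0.1.1 CPU (8 threads): 1.0x
# v0.1.3 CPU (8 threads): 1.5x
# v0.1.3 GPU (NVIDIA L4): 39.1x
```

The same calculation on the 5,000-sample table reproduces its reported 6.5x, 10.9x and 123x figures.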

Large-Scale Test (50,000 samples × 5,000 loci)

Implementation	Hardware	Runtime	Notes
Original C version	16-core CPU	Failed	Out of memory error
Python CPU version	16-core CPU	~32 minutes	Full processing time
Python GPU version	NVIDIA L4 GPU	~12 minutes	Full processing time

Both CPU and GPU implementations produce identical output (verified via MD5 checksum).

Docker Usage

docker run --rm -v "$(pwd):/app/data" ghcr.io/genpat-it/cgmlst-dists-py --input data/input.tab --output data/output.tab

With GPU support:

docker run --rm --gpus all -v "$(pwd):/app/data" ghcr.io/genpat-it/cgmlst-dists-py --input data/input.tab --output data/output.tab --gpu

Advantages Over Original Implementation

Scalability: Efficiently handles much larger datasets through batch processing and memory optimization
Speed: Faster on large matrices through multithreading and optional GPU acceleration (see Performance Benchmarks)
Data Quality: Locus and sample completeness thresholds for quality control
Hardware Optimization: Auto-detects and adapts to available system resources
More Output Options: Supports binary format for very large matrices

Limitations

Requires more dependencies than the C implementation
More complex configuration options (though with sensible defaults)
GPU acceleration requires a CUDA-capable NVIDIA GPU

Citation

If you use this tool in your research, please cite the original cgmlst-dists tool:

Seemann T, cgmlst-dists: https://github.com/tseemann/cgmlst-dists/

License

This project is licensed under the same terms as the original cgmlst-dists.

Contact

Please submit issues and feature requests through the GitHub repository.
