Skip to content

Latest commit

 

History

History
274 lines (216 loc) · 11.3 KB

File metadata and controls

274 lines (216 loc) · 11.3 KB

MDSearch: Minimum Discriminatory SNP Set Search

MDSearch is a Python tool designed to identify minimal sets of Single Nucleotide Polymorphisms (SNPs) that can discriminate between samples in a VCF (Variant Call Format) file based on a specified minimum Hamming distance. The tool is particularly useful for applications such as genetic barcoding, sample identification, and quality control in genomic studies.

Key Features

  • Robust genotype handling: Correctly processes missing genotypes and polyploid SNPs
  • Customizable discrimination: Set minimum Hamming distance between samples
  • Multiple SNP sets: Generate several distinct (complitely disjoint) discriminatory SNP sets
  • Advanced scoring: Customizable entropy and Minor Allele Frequency (MAF) weights
  • Comprehensive output: Human-readable SNP profiles, statistics, and VCF files
  • Memory monitoring: Dynamic memory usage tracking with system-aware thresholds

Core Techniques

The codebase implements several sophisticated algorithms:

  • Two-phase optimization: Forward selection followed by backward elimination for minimal SNP sets
  • Vectorized Hamming distance calculation: Optimized pairwise distance computation using NumPy arrays
  • Entropy-based SNP scoring: Combines Shannon entropy and MAF for optimal SNP selection
  • Incremental distance updates: Efficient backward elimination without full set reconstruction
  • Memory-aware processing: Dynamic memory monitoring with graceful degradation
  • Deterministic SNP selection: Stable, reproducible results with tie-breaking by SNP ID

Key Technologies & Libraries

Core Bioinformatics Libraries

  • pysam: High-performance VCF/BCF file reading and writing
  • NumPy: Vectorized numerical operations for genotype matrices
  • bcftools: VCF/BCF indexing and format support

System & Performance

  • psutil: Cross-platform system and process monitoring

Quality

  • pytest: Comprehensive testing framework with 29+ test cases

Project Structure

MDSearch/
├── src/                              # Core application code
│   ├── core/                         # Core algorithms and data structures
│   │   ├── vcf_parser.py             # VCF file parsing and validation
│   │   ├── distance_calculator.py    # Optimized Hamming distance computation
│   │   ├── genotype_utils.py         # Genotype conversion and utility functions
│   │   └── snp_selector.py           # SNP selection and optimization algorithms
│   ├── io/                           # Input/output operations
│   │   ├── vcf_writer.py             # VCF file output with format support
│   │   ├── summary_writer.py         # Statistical summary generation
│   │   ├── run_info_writer.py        # Execution information and metadata
│   │   ├── snp_profile_writer.py     # Human-readable SNP profiles
│   │   └── structure_info_writer.py  # Output directory documentation
│   └── utils/                        # Utility functions
│       ├── memory_monitor.py         # Memory usage monitoring and warnings
│       ├── logging_setup.py          # Structured logging configuration
│       ├── indexing.py               # VCF indexing utilities
│       └── validation.py             # Input validation functions
├── tests/                            # Comprehensive test suite
├── environment.yml                   # Conda environment specification
└── mdsearch.py                       # Main command-line interface

Key directories:

  • src/core/: Contains the core algorithms for SNP selection, distance calculation, and genotype processing
  • src/io/: Handles all output generation including VCF files, summaries, and human-readable profiles
  • tests/: Comprehensive test suite covering genotype handling, distance calculations, and CLI validation

Installation & Usage

Quick Start

git clone https://github.com/alermol/MDSearch.git
cd MDSearch
conda env create -f environment.yml
conda activate mdsearch
python mdsearch.py input.vcf results -md 2 -ns 3

Basic Command

python mdsearch.py input.vcf output_folder -pl 2 -md 2 -ns 3

This will:

  • Assume that input VCF file input.vcf is diploid
  • Ensure that minimum Hamming distance of 2 between samples in all final discriminatory VCF sets
  • Generate 3 distinct discriminatory SNP sets
  • Create comprehensive output in the output_folder

Command Line Interface

Basic Syntax

python mdsearch.py IVCF OUTPUT_FOLDER [options]

Positional Arguments

  • IVCF: Input VCF file (bi-allelic, with SNP IDs)
  • OUTPUT_FOLDER: Path to output folder (will be created if absent)

Selection Parameters

  • -pl, --ploidy PLOIDY (default: 2) — VCF ploidy
  • -md, --min-distance MIN_DIST (default: 1) — Minimal Hamming distance between samples
  • -ns, --num-sets N_SETS (default: 0) — Number of distinct SNP sets in output (0 or 'unlimited' for all possible sets)
  • -we, --weight-entropy WEIGHT_ENTROPY (default: 0.5) — Weight for entropy component in SNP scoring (0.0 to 1.0)
  • -wm, --weight-maf WEIGHT_MAF (default: 0.5) — Weight for MAF component in SNP scoring (0.0 to 1.0)

Output Options

  • -ch, --convert-het — Convert heterozygous calls into NA

Input/Output Formats

  • -I, --input-format {auto,v,z,u,b} (default: auto) — Input format (VCF/VCF.gz/BCF uncompressed/BCF compressed)
  • -O, --output-format {v,z,u,b} (default: v) — Output format (VCF/VCF.gz/BCF uncompressed/BCF compressed)

Logging Options

  • -q, --quiet — Suppress progress output
  • -L, --log-level {DEBUG,INFO,WARNING,ERROR} — Logging level
  • -F, --log-format {text,json} (default: text) — Logging format

Information

  • -V, --version — Show program version, commit hash, and Python version and exit
  • -h, --help — Show help message and exit

Examples

Basic Usage

python mdsearch.py data.vcf results -pl 2 -md 2 -ch -ns 3

Custom SNP Scoring Weights

python mdsearch.py data.vcf results -pl 2 -md 2 -we 0.8 -wm 0.2

Unlimited Mode (Find All Possible Disjoint SNP Sets)

python mdsearch.py data.vcf results -pl 2 -md 2 -ch -ns 0
# or
python mdsearch.py data.vcf results -pl 2 -md 2 -ch -ns unlimited

Advanced Usage with Logging

python mdsearch.py data.vcf results -md 2 -ns 3 -L INFO

JSON Logging for Automated Pipelines

python mdsearch.py data.vcf results -F json -q

BCF Format Processing

python mdsearch.py input.bcf results -I b -O b

Output Structure

The tool creates a well-organized output directory with comprehensive analysis results:

results/
├── mdss/                           # SNP set VCF files
│   ├── minimal_set_1.vcf           # First minimal discriminative set
│   ├── minimal_set_2.vcf           # Alternative minimal set
│   └── minimal_set_3.vcf           # Third alternative set
├── snp_profiles/                   # Human-readable SNP profiles
│   ├── set_1_profile.txt           # Profile for set 1
│   ├── set_2_profile.txt           # Profile for set 2
│   └── set_3_profile.txt           # Profile for set 3
├── summary.tsv                     # Comprehensive statistics for each SNP set
├── run_info.txt                    # Detailed execution information
├── output_structure.txt            # Output organization guide
├── best_set.vcf                    # Optimal set with highest entropy
└── best_set_profile.txt            # Profile for the optimal SNP set

Detailed File Descriptions

summary.tsv - Statistical Summary

A tab-separated values file containing comprehensive statistics for each generated SNP set:

Column Description
set_index 1-based index of the SNP set (1, 2, 3, ...)
output_vcf Path to the corresponding VCF file (e.g., "minimal_set_1.vcf")
num_snps Number of SNPs in the set
min_distance Achieved minimum Hamming distance between all sample pairs
shannon_entropy Shannon entropy score measuring chromosome distribution balance

The Shannon entropy indicates how well SNPs are distributed across chromosomes - higher values indicate more balanced distribution, which is often desirable for genetic studies.

snp_profiles/ - Human-Readable Profiles

Each profile file (set_N_profile.txt) provides a compact, human-readable representation of SNP sets:

Header Section:

  • SNP set identification and chromosome distribution summary
  • Format: #SNP Set N Profile and #Chromosome Distribution: Chr1(2); Chr2(3); Chr3(1)

Sample Profiles: Each sample is represented as a line showing all SNP genotypes in the format:

Sample_name: Chr1:Position1(Genotype1) Chr2:Position2(Genotype2) ...

Genotype Format:

  • Diploid: AA, AT, TT (homozygous reference, heterozygous, homozygous alternate)
  • Triploid: AAA, AAT, TTT (no separators between alleles)
  • Missing: -- (for diploid) or --- (for triploid)
  • Polyploid: Concatenated alleles without separators

Example Profile Content:

#SNP Set 1 Profile
#Chromosome Distribution: Chr1(2); Chr2(1); Chr3(2)
--------------------------------------------------
Sample_001: Chr1:12345(AA) Chr1:67890(TT) Chr2:11111(AT) Chr3:22222(AA) Chr3:33333(TT)
Sample_002: Chr1:12345(AT) Chr1:67890(AT) Chr2:11111(AA) Chr3:22222(AT) Chr3:33333(AT)

best_set.vcf and best_set_profile.txt

The optimal SNP set automatically selected based on highest Shannon entropy:

  • best_set.vcf: Standard VCF format of the optimal SNP set
  • best_set_profile.txt: Human-readable profile of the optimal set (same format as other profiles)

run_info.txt - Execution Metadata

Comprehensive execution information including:

  • Program version and build details
  • Input file specifications and parameters
  • Output statistics and timing information
  • System memory usage and thresholds
  • Complete configuration used for the run

output_structure.txt - Directory Guide

Detailed explanation of all output files and their purposes, serving as a reference for understanding the results.

Advanced Features

Custom SNP Scoring

python mdsearch.py input.vcf results -we 0.8 -wm 0.2

Control the balance between entropy (-we) and MAF (-wm) components in SNP selection.

Unlimited Mode

python mdsearch.py input.vcf results -ns unlimited

Find all possible disjoint discriminatory SNP sets.

Structured Logging

python mdsearch.py input.vcf results -F json -q

Generate JSON-formatted logs for automated pipeline integration.

Testing

Run the comprehensive test suite:

pytest -v

The test suite covers genotype handling, distance calculations, CLI validation, and VCF format compliance.

License

This project is open-source and available under the MIT License.


Note: MDSearch today represents a major advancement over the version described in the original publication, featuring extensive improvements and new capabilities. If you use MDSearch in your work, please cite the original article.