Imperfect Repeat Support for BWT-based Tandem Repeat Finder

Overview

This implementation extends the BWT-based tandem repeat finder to handle imperfect repeats (repeats with SNPs/mismatches) while maintaining the original exact-match capability. The enhancement follows a three-layer approach:

Seeding: Find candidate repeats without missing those with mismatches
Extension: Score and extend tandem arrays when copies are imperfect
Reporting: Report variability metrics (consensus motif, mismatch rate) for downstream tools

Key Features

1. Similarity Model

Uses Hamming distance for SNP detection (no indels)
Per-copy mismatch tolerance: r ≤ max(1, ⌈0.1 × motif_length⌉)
Adjustable mismatch rate (default: 10% per copy)

2. Seeding Strategy (Tier 1)

Seed-and-extend approach: Uses exact FM-index hits as seeds
Extends left/right from each seed allowing mismatches per copy
Filters low-complexity sequences using Shannon entropy (default: ≥1.0 bits)
Requires minimum copies (default: 3) and minimum array length (default: 12bp)

3. Consensus Motif Building

Uses majority vote per position across all copies
Incrementally updates consensus during extension
Tracks mismatch statistics per copy

4. Strand-Agnostic Canonicalization

Computes reverse complement for each motif
Selects lexicographically smallest rotation across both strands
Reports strand information (+/-)

5. Quality Filtering

Entropy filter: Skips homopolymers and low-complexity sequences
Copy count filter: Requires ≥3 copies by default
Array length filter: Minimum 12bp for Tier 1, 20bp for Tier 2
Maximality check: Ensures repeats cannot be extended further

New Output Fields

BED Format

chrom  start  end  consensus_motif  copies  tier  mismatch_rate  strand

VCF Format (INFO fields)

MOTIF: Original seed motif
CONS_MOTIF: Consensus motif from all copies
COPIES: Number of tandem copies
TIER: Detection tier (1=short, 2=medium/long, 3=very long)
CONF: Confidence score (0.0-1.0)
MM_RATE: Overall mismatch rate across all copies
MAX_MM_PER_COPY: Maximum mismatches in any single copy
N_COPIES_EVAL: Number of copies used for consensus
STRAND: Strand of canonical motif (+/-)

TRF-Compatible Formats

TRF Table Format (--format trf_table): 12 columns

Indices, Period, CopyNumber, ConsensusSize, PercentMatches, PercentIndels
Score, A, C, G, T, Entropy

TRF DAT Format (--format trf_dat): 14+ fields

Start, End, Period, CopyNumber, ConsensusSize, PercentMatches, PercentIndels
Score, A%, C%, G%, T%, Entropy, ConsensusPattern, Sequence

See TRF_COMPATIBILITY.md for details.

STRfinder-Compatible Format

STRfinder CSV Format (--format strfinder): 11 columns (tab-delimited)

STR_marker: Marker name or auto-generated ID (STR_chr_pos)
STR_position: Genomic position (chr:start-end)
STR_motif: Motif structure ([MOTIF]n)
STR_genotype_structure: Genotype with motif ([MOTIF]copies)
STR_genotype: Repeat numbers (haploid)
STR_core_seq: Core STR sequence
Allele_coverage: Confidence percentage
Alleles_ratio: - (haploid only)
Reads_Distribution: Copy number distribution (copies:count)
STR_depth: Total depth (equals copies for haploid)
Full_seq: 5' flanking + CORE + 3' flanking

See STRFINDER_FORMAT.md for details and performance optimizations.

Command-Line Options

Basic Usage

# Find imperfect repeats (default mode)
python bwt.py reference.fa -o repeats.bed

# Output in VCF format with detailed INFO fields
python bwt.py reference.fa -o repeats.vcf --format vcf

# Exact matches only (original behavior)
python bwt.py reference.fa --no-mismatches

Advanced Options

# Tier 1 only with custom motif length
python bwt.py reference.fa --tier1 --max-motif-len 10

# Tier 2 with custom period range
python bwt.py reference.fa --tier2 --min-period 20 --max-period 500

# Adjust quality filters
python bwt.py reference.fa --min-copies 5 --min-entropy 1.5

# Both tiers with custom settings
python bwt.py reference.fa --tier1 --tier2 \
    --max-motif-len 8 \
    --min-period 15 \
    --max-period 200 \
    --min-copies 4

All Options

Option	Default	Description
`--no-mismatches`	False	Disable mismatch tolerance (exact matches only)
`--max-motif-len`	6	Maximum motif length for Tier 1
`--min-period`	10	Minimum period for Tier 2
`--max-period`	1000	Maximum period for Tier 2
`--min-copies`	3	Minimum number of copies required
`--min-entropy`	1.0	Minimum Shannon entropy (bits per base)
`--sa-sample`	32	Suffix array sampling rate
`--progress`	False	Show progress information

Implementation Details

Tier 1: Short Tandem Repeats (1-10bp)

Algorithm:

Enumerate all canonical primitive motifs of length k
Filter by Shannon entropy (≥1.0 bits)
Use FM-index to find exact occurrences (seeds)
For each seed position:
- Extend left/right using Hamming distance
- Build consensus motif via majority vote
- Stop when distance > threshold
Check maximality and apply filters

Key Functions:

_find_tandems_with_mismatches(): Seed-and-extend strategy
_extend_tandem_array(): Bidirectional extension with consensus update
_is_maximal_repeat_approx(): Maximality check with mismatch tolerance

Tier 2: Medium/Long Repeats (10-1000bp)

Algorithm:

Scan genome with candidate periods p ∈ [min_period, max_period]
For each position, extract motif of length p
Filter by entropy and character validity (no '$' or 'N')
Extend with mismatch tolerance:
- Compare copies to evolving consensus
- Accept if Hamming distance ≤ threshold
Normalize to primitive period
Build final consensus and compute statistics

Key Functions:

_find_repeats_simple(): Main scanning loop with mismatch support
_extend_with_mismatches(): Bidirectional extension for longer periods
_get_max_mismatches_per_copy(): Dynamic threshold based on motif length

Consensus Building Algorithm

def build_consensus(copies, motif_len):
    consensus = [None] * motif_len
    for pos in range(motif_len):
        bases = [copy[pos] for copy in copies]
        # Majority vote
        consensus[pos] = most_common(bases)
    return consensus

Mismatch Rate Calculation:

mm_rate = total_mismatches / (n_copies × motif_len)

Performance Considerations

Memory

Sampling rates reduce memory: SA sampled every 32 positions, Occ every 128
Per-chromosome processing with explicit memory cleanup
Consensus building uses O(motif_len × n_copies) space

Speed

FM-index provides O(|pattern|) backward search
Hamming distance computed via NumPy vectorization
Seen-regions tracking prevents redundant processing
Early termination when entropy/copy filters fail

False Positives

Mitigated by:

Minimum entropy threshold (1.0 bits) → filters homopolymers
Minimum copy count (3) → reduces noise
Minimum array length (12-20bp) → avoids spurious short runs
Maximality check → prevents suboptimal calls

Example Output

BED Format

# chrom  start  end    consensus_motif  copies  tier  mismatch_rate  strand
chr1    1000   1036   ATCG             9.0     1     0.028          +
chr1    5230   5310   ATGATCATC        8.9     2     0.042          -

VCF Format

##fileformat=VCFv4.2
##INFO=<ID=MOTIF,Number=1,Type=String,Description="Original seed motif">
##INFO=<ID=CONS_MOTIF,Number=1,Type=String,Description="Consensus motif from all copies">
...
#CHROM  POS   ID   REF  ALT   QUAL  FILTER  INFO
chr1    1001  TR0  .    <TR>  .     PASS    MOTIF=ATCG;CONS_MOTIF=ATCG;COPIES=9.0;TIER=1;CONF=0.97;MM_RATE=0.028;MAX_MM_PER_COPY=2;N_COPIES_EVAL=9;STRAND=+

Testing

Small Test Case

# Create test FASTA
echo ">test" > test.fa
echo "ATCGATCGATCGATCGATCG" >> test.fa  # Perfect repeat
echo "ATCGATCGATGGGTCGATCG" >> test.fa  # Imperfect repeat (2 mismatches in one copy)

# Run finder
python bwt.py test.fa -o test.bed --tier1 --max-motif-len 4

Expected: Finds ATCG repeat with ~5 copies and mismatch_rate ≈ 0.05

Design Decisions

Why Hamming Distance (not edit distance)?

Simpler position arithmetic (no indels to track)
Faster computation (vectorized NumPy comparison)
SNPs are the dominant variant type in STRs for many applications

Why Majority Vote (not alignment)?

Linear time: O(motif_len × n_copies)
Robust to outliers (1-2 divergent copies)
No need for expensive MSA algorithms

Why Entropy Filter?

Prevents false positives from homopolymers (AAAA...)
Threshold of 1.0 bits allows dinucleotide repeats (AT: entropy ≈1.0)
Configurable via --min-entropy

Why Require ≥3 Copies?

2-copy arrays are often spurious or borderline
3+ copies provide statistical confidence
Reduces output size by ~30-50%

Limitations

No Indels: Current implementation uses Hamming distance only
- Extension to edit distance is possible but complicates position tracking
Fixed Mismatch Rate: 10% per copy is hardcoded
- Could be made configurable with --max-mismatch-rate option
Seed Dependency: Requires at least one exact seed occurrence
- Very divergent repeats (>20% mismatches) may be missed
- Bitap/Shift-Or k-mismatch search could improve sensitivity
Consensus Ambiguity: Uses simple majority vote
- Ties resolved arbitrarily (first in NumPy unique order)
- Could emit IUPAC ambiguity codes for low-confidence positions

Future Enhancements

Near-Term

Add --max-mismatch-rate CLI option for per-copy tolerance
Emit IUPAC consensus motifs for ambiguous positions
Implement Bitap fuzzy matching for Tier 1 (alternative to seed-and-extend)
Support for gzipped FASTA inputs

Long-Term

Edit distance support (Levenshtein) for indel tolerance
Multi-seed strategy with overlapping k-mers
Phasing support for diploid genomes
Integration with variant calling pipelines (BCFtools, GATK)

References

FM-index: Ferragina & Manzini (2000)
Kasai LCP algorithm: Kasai et al. (2001)
Tandem Repeat Finder: Benson (1999)
Shannon entropy for sequence complexity: Wootton & Federhen (1996)

Citation

If you use this tool in your research, please cite:

BWT-based Tandem Repeat Finder with Imperfect Repeat Support
[Your citation information here]

License

[Your license information here]

FilesExpand file tree

IMPERFECT_REPEATS_README.md

Latest commit

History

IMPERFECT_REPEATS_README.md

File metadata and controls

Imperfect Repeat Support for BWT-based Tandem Repeat Finder

Overview

Key Features

1. Similarity Model

2. Seeding Strategy (Tier 1)

3. Consensus Motif Building

4. Strand-Agnostic Canonicalization

5. Quality Filtering

New Output Fields

BED Format

VCF Format (INFO fields)

TRF-Compatible Formats

STRfinder-Compatible Format

Command-Line Options

Basic Usage

Advanced Options

All Options

Implementation Details

Tier 1: Short Tandem Repeats (1-10bp)

Tier 2: Medium/Long Repeats (10-1000bp)

Consensus Building Algorithm

Performance Considerations

Memory

Speed

False Positives

Example Output

BED Format

VCF Format

Testing

Small Test Case

Design Decisions

Why Hamming Distance (not edit distance)?

Why Majority Vote (not alignment)?

Why Entropy Filter?

Why Require ≥3 Copies?

Limitations

Future Enhancements

Near-Term

Long-Term

References

Citation

License