This implementation extends the BWT-based tandem repeat finder to handle imperfect repeats (repeats with SNPs/mismatches) while maintaining the original exact-match capability. The enhancement follows a three-layer approach:
- Seeding: Find candidate repeats without missing those with mismatches
- Extension: Score and extend tandem arrays when copies are imperfect
- Reporting: Report variability metrics (consensus motif, mismatch rate) for downstream tools
- Uses Hamming distance for SNP detection (no indels)
- Per-copy mismatch tolerance:
r ≤ max(1, ⌈0.1 × motif_length⌉) - Adjustable mismatch rate (default: 10% per copy)
- Seed-and-extend approach: Uses exact FM-index hits as seeds
- Extends left/right from each seed allowing mismatches per copy
- Filters low-complexity sequences using Shannon entropy (default: ≥1.0 bits)
- Requires minimum copies (default: 3) and minimum array length (default: 12bp)
- Uses majority vote per position across all copies
- Incrementally updates consensus during extension
- Tracks mismatch statistics per copy
- Computes reverse complement for each motif
- Selects lexicographically smallest rotation across both strands
- Reports strand information (+/-)
- Entropy filter: Skips homopolymers and low-complexity sequences
- Copy count filter: Requires ≥3 copies by default
- Array length filter: Minimum 12bp for Tier 1, 20bp for Tier 2
- Maximality check: Ensures repeats cannot be extended further
chrom start end consensus_motif copies tier mismatch_rate strand
MOTIF: Original seed motifCONS_MOTIF: Consensus motif from all copiesCOPIES: Number of tandem copiesTIER: Detection tier (1=short, 2=medium/long, 3=very long)CONF: Confidence score (0.0-1.0)MM_RATE: Overall mismatch rate across all copiesMAX_MM_PER_COPY: Maximum mismatches in any single copyN_COPIES_EVAL: Number of copies used for consensusSTRAND: Strand of canonical motif (+/-)
TRF Table Format (--format trf_table): 12 columns
- Indices, Period, CopyNumber, ConsensusSize, PercentMatches, PercentIndels
- Score, A, C, G, T, Entropy
TRF DAT Format (--format trf_dat): 14+ fields
- Start, End, Period, CopyNumber, ConsensusSize, PercentMatches, PercentIndels
- Score, A%, C%, G%, T%, Entropy, ConsensusPattern, Sequence
See TRF_COMPATIBILITY.md for details.
STRfinder CSV Format (--format strfinder): 11 columns (tab-delimited)
- STR_marker: Marker name or auto-generated ID (
STR_chr_pos) - STR_position: Genomic position (
chr:start-end) - STR_motif: Motif structure (
[MOTIF]n) - STR_genotype_structure: Genotype with motif (
[MOTIF]copies) - STR_genotype: Repeat numbers (haploid)
- STR_core_seq: Core STR sequence
- Allele_coverage: Confidence percentage
- Alleles_ratio:
-(haploid only) - Reads_Distribution: Copy number distribution (
copies:count) - STR_depth: Total depth (equals copies for haploid)
- Full_seq: 5' flanking + CORE + 3' flanking
See STRFINDER_FORMAT.md for details and performance optimizations.
# Find imperfect repeats (default mode)
python bwt.py reference.fa -o repeats.bed
# Output in VCF format with detailed INFO fields
python bwt.py reference.fa -o repeats.vcf --format vcf
# Exact matches only (original behavior)
python bwt.py reference.fa --no-mismatches# Tier 1 only with custom motif length
python bwt.py reference.fa --tier1 --max-motif-len 10
# Tier 2 with custom period range
python bwt.py reference.fa --tier2 --min-period 20 --max-period 500
# Adjust quality filters
python bwt.py reference.fa --min-copies 5 --min-entropy 1.5
# Both tiers with custom settings
python bwt.py reference.fa --tier1 --tier2 \
--max-motif-len 8 \
--min-period 15 \
--max-period 200 \
--min-copies 4| Option | Default | Description |
|---|---|---|
--no-mismatches |
False | Disable mismatch tolerance (exact matches only) |
--max-motif-len |
6 | Maximum motif length for Tier 1 |
--min-period |
10 | Minimum period for Tier 2 |
--max-period |
1000 | Maximum period for Tier 2 |
--min-copies |
3 | Minimum number of copies required |
--min-entropy |
1.0 | Minimum Shannon entropy (bits per base) |
--sa-sample |
32 | Suffix array sampling rate |
--progress |
False | Show progress information |
Algorithm:
- Enumerate all canonical primitive motifs of length k
- Filter by Shannon entropy (≥1.0 bits)
- Use FM-index to find exact occurrences (seeds)
- For each seed position:
- Extend left/right using Hamming distance
- Build consensus motif via majority vote
- Stop when distance > threshold
- Check maximality and apply filters
Key Functions:
_find_tandems_with_mismatches(): Seed-and-extend strategy_extend_tandem_array(): Bidirectional extension with consensus update_is_maximal_repeat_approx(): Maximality check with mismatch tolerance
Algorithm:
- Scan genome with candidate periods p ∈ [min_period, max_period]
- For each position, extract motif of length p
- Filter by entropy and character validity (no '$' or 'N')
- Extend with mismatch tolerance:
- Compare copies to evolving consensus
- Accept if Hamming distance ≤ threshold
- Normalize to primitive period
- Build final consensus and compute statistics
Key Functions:
_find_repeats_simple(): Main scanning loop with mismatch support_extend_with_mismatches(): Bidirectional extension for longer periods_get_max_mismatches_per_copy(): Dynamic threshold based on motif length
def build_consensus(copies, motif_len):
consensus = [None] * motif_len
for pos in range(motif_len):
bases = [copy[pos] for copy in copies]
# Majority vote
consensus[pos] = most_common(bases)
return consensusMismatch Rate Calculation:
mm_rate = total_mismatches / (n_copies × motif_len)- Sampling rates reduce memory: SA sampled every 32 positions, Occ every 128
- Per-chromosome processing with explicit memory cleanup
- Consensus building uses O(motif_len × n_copies) space
- FM-index provides O(|pattern|) backward search
- Hamming distance computed via NumPy vectorization
- Seen-regions tracking prevents redundant processing
- Early termination when entropy/copy filters fail
Mitigated by:
- Minimum entropy threshold (1.0 bits) → filters homopolymers
- Minimum copy count (3) → reduces noise
- Minimum array length (12-20bp) → avoids spurious short runs
- Maximality check → prevents suboptimal calls
# chrom start end consensus_motif copies tier mismatch_rate strand
chr1 1000 1036 ATCG 9.0 1 0.028 +
chr1 5230 5310 ATGATCATC 8.9 2 0.042 -
##fileformat=VCFv4.2
##INFO=<ID=MOTIF,Number=1,Type=String,Description="Original seed motif">
##INFO=<ID=CONS_MOTIF,Number=1,Type=String,Description="Consensus motif from all copies">
...
#CHROM POS ID REF ALT QUAL FILTER INFO
chr1 1001 TR0 . <TR> . PASS MOTIF=ATCG;CONS_MOTIF=ATCG;COPIES=9.0;TIER=1;CONF=0.97;MM_RATE=0.028;MAX_MM_PER_COPY=2;N_COPIES_EVAL=9;STRAND=+
# Create test FASTA
echo ">test" > test.fa
echo "ATCGATCGATCGATCGATCG" >> test.fa # Perfect repeat
echo "ATCGATCGATGGGTCGATCG" >> test.fa # Imperfect repeat (2 mismatches in one copy)
# Run finder
python bwt.py test.fa -o test.bed --tier1 --max-motif-len 4Expected: Finds ATCG repeat with ~5 copies and mismatch_rate ≈ 0.05
- Simpler position arithmetic (no indels to track)
- Faster computation (vectorized NumPy comparison)
- SNPs are the dominant variant type in STRs for many applications
- Linear time: O(motif_len × n_copies)
- Robust to outliers (1-2 divergent copies)
- No need for expensive MSA algorithms
- Prevents false positives from homopolymers (AAAA...)
- Threshold of 1.0 bits allows dinucleotide repeats (AT: entropy ≈1.0)
- Configurable via
--min-entropy
- 2-copy arrays are often spurious or borderline
- 3+ copies provide statistical confidence
- Reduces output size by ~30-50%
-
No Indels: Current implementation uses Hamming distance only
- Extension to edit distance is possible but complicates position tracking
-
Fixed Mismatch Rate: 10% per copy is hardcoded
- Could be made configurable with
--max-mismatch-rateoption
- Could be made configurable with
-
Seed Dependency: Requires at least one exact seed occurrence
- Very divergent repeats (>20% mismatches) may be missed
- Bitap/Shift-Or k-mismatch search could improve sensitivity
-
Consensus Ambiguity: Uses simple majority vote
- Ties resolved arbitrarily (first in NumPy unique order)
- Could emit IUPAC ambiguity codes for low-confidence positions
- Add
--max-mismatch-rateCLI option for per-copy tolerance - Emit IUPAC consensus motifs for ambiguous positions
- Implement Bitap fuzzy matching for Tier 1 (alternative to seed-and-extend)
- Support for gzipped FASTA inputs
- Edit distance support (Levenshtein) for indel tolerance
- Multi-seed strategy with overlapping k-mers
- Phasing support for diploid genomes
- Integration with variant calling pipelines (BCFtools, GATK)
- FM-index: Ferragina & Manzini (2000)
- Kasai LCP algorithm: Kasai et al. (2001)
- Tandem Repeat Finder: Benson (1999)
- Shannon entropy for sequence complexity: Wootton & Federhen (1996)
If you use this tool in your research, please cite:
BWT-based Tandem Repeat Finder with Imperfect Repeat Support
[Your citation information here]
[Your license information here]