-
Notifications
You must be signed in to change notification settings - Fork 138
Description
Hi,
I'm encountering an issue when using Cutadapt to filter sequences based on mismatches for a specific primer pair. The tool is not discarding sequences that have more than 6 mismatches in total, which is unexpected given the error threshold set.
Primer Information
Primer Name: mICOIintF_jgHCO2198
Forward Primer: GGWACWGGWTGAACWGTWTAYCCYCC
Reverse Primer: TAIACYTCIGGRTGICCRAARAAYCA
Reverse Complement (used in command): TGRTTYTTYGGICAYCCIGARGTITA
Cutadapt command being used
cutadapt -g "GGWACWGGWTGAACWGTWTAYCCYCC;min_overlap=26...TGRTTYTTYGGICAYCCIGARGTITA;min_overlap=26" --output amplicon/bmi-database.fasta bmi_database_processed.fasta --no-indels -e 3 --revcomp --quiet --minimum-length 50 --maximum-length 548 --discard-untrimmed --action retain
The goal is to discard any sequences that have more than 3 mismatches on either the forward or reverse primer.
Unexpected Behaviour
When we test the same command with a file that only contains sequences with more than 6 combined mismatches, none are filtered out:
cutadapt -g "GGWACWGGWTGAACWGTWTAYCCYCC;min_overlap=26...TGRTTYTTYGGICAYCCIGARGTITA;min_overlap=26" --output amplicon/morethan6seqsonly.fasta sequences_with_more_than_six_combined_mismatches_only.fasta --no-indels -e 3 --revcomp --quiet --minimum-length 50 --maximum-length 548 --discard-untrimmed --action retain
These sequences should be discarded, but they are retained.
How we're calculating mismatches
We are computing mismatches using a custom function that takes IUPAC codes into account. It compares each base in the primer to the corresponding base in the template and counts a mismatch if there is no overlap in possible base representations.
For each sequence, we calculate:
full_fwd_mismatches = calculate_iupac_mismatches(primer_seq_fwd, pbs_fwd_seq)
full_rev_mismatches = calculate_iupac_mismatches(rev_comp_primer_seq_rev, pbs_rev_seq)
full_len_mismatch_sum = full_fwd_mismatches + full_rev_mismatches
All sequences in the test set have more than 6 mismatches combined, as confirmed by this function. These values are recorded in mismatches_per_seq_id.tsv (link below).
We expected sequences with more than 3 mismatches on either the forward or reverse primer should be discarded. This works correctly for 8 other primer pairs we are testing, but only this one shows the issue. We're using version 4.9 from cutadapt, installed using pip from PyPI inside a conda environment.
All relevant input/output files mentioned are available at https://drive.google.com/drive/folders/14o33K_VXZVbOSufDm_5-sqAuIBDjuwnp?usp=drive_link.
We'd appreciate any help in better understanding what might be happening with this specific case. In the meantime, we’ll be applying a custom mismatch filter function after Cutadapt to ensure the output meet our criteria.
Thanks in advance!