Skip to content

Sequences exceeding mismatch threshold not discarded as expected in Cutadapt 4.9 #855

@camilababo

Description

@camilababo

Hi,

I'm encountering an issue when using Cutadapt to filter sequences based on mismatches for a specific primer pair. The tool is not discarding sequences that have more than 6 mismatches in total, which is unexpected given the error threshold set.

Primer Information

Primer Name: mICOIintF_jgHCO2198
Forward Primer: GGWACWGGWTGAACWGTWTAYCCYCC
Reverse Primer: TAIACYTCIGGRTGICCRAARAAYCA
Reverse Complement (used in command): TGRTTYTTYGGICAYCCIGARGTITA

Cutadapt command being used
cutadapt -g "GGWACWGGWTGAACWGTWTAYCCYCC;min_overlap=26...TGRTTYTTYGGICAYCCIGARGTITA;min_overlap=26" --output amplicon/bmi-database.fasta bmi_database_processed.fasta --no-indels -e 3 --revcomp --quiet --minimum-length 50 --maximum-length 548 --discard-untrimmed --action retain
The goal is to discard any sequences that have more than 3 mismatches on either the forward or reverse primer.

Unexpected Behaviour
When we test the same command with a file that only contains sequences with more than 6 combined mismatches, none are filtered out:
cutadapt -g "GGWACWGGWTGAACWGTWTAYCCYCC;min_overlap=26...TGRTTYTTYGGICAYCCIGARGTITA;min_overlap=26" --output amplicon/morethan6seqsonly.fasta sequences_with_more_than_six_combined_mismatches_only.fasta --no-indels -e 3 --revcomp --quiet --minimum-length 50 --maximum-length 548 --discard-untrimmed --action retain
These sequences should be discarded, but they are retained.

How we're calculating mismatches
We are computing mismatches using a custom function that takes IUPAC codes into account. It compares each base in the primer to the corresponding base in the template and counts a mismatch if there is no overlap in possible base representations.

For each sequence, we calculate:

full_fwd_mismatches = calculate_iupac_mismatches(primer_seq_fwd, pbs_fwd_seq)
full_rev_mismatches = calculate_iupac_mismatches(rev_comp_primer_seq_rev, pbs_rev_seq)
full_len_mismatch_sum = full_fwd_mismatches + full_rev_mismatches

All sequences in the test set have more than 6 mismatches combined, as confirmed by this function. These values are recorded in mismatches_per_seq_id.tsv (link below).

We expected sequences with more than 3 mismatches on either the forward or reverse primer should be discarded. This works correctly for 8 other primer pairs we are testing, but only this one shows the issue. We're using version 4.9 from cutadapt, installed using pip from PyPI inside a conda environment.

All relevant input/output files mentioned are available at https://drive.google.com/drive/folders/14o33K_VXZVbOSufDm_5-sqAuIBDjuwnp?usp=drive_link.


We'd appreciate any help in better understanding what might be happening with this specific case. In the meantime, we’ll be applying a custom mismatch filter function after Cutadapt to ensure the output meet our criteria.

Thanks in advance!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions