Strange results #15

@andrewjmc

Hello,

Thanks for this amazing tool. I am using fastv in a perhaps unusual way: I'm looking to detect the presence or absence of homologous gene clusters in metagenomic data. I started with ~17 million ORFs from our contigs and clustered them (mmseqs2, 95% identity) into almost 4 million clusters. I want to detect each cluster by detecting one or more unique kmers from its representative sequence.
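To make "unique kmer" concrete, here is a toy-scale sketch of what I mean: a kmer counts as unique if it occurs in exactly one representative sequence. The function name `unique_kmers` and the in-memory approach are illustrative assumptions only (this would not scale to millions of sequences), and it looks at the forward strand only.

```python
from collections import defaultdict


def unique_kmers(seqs, k):
    """Map each sequence name to the kmers found in that sequence and no other.

    Hypothetical helper for illustration; forward strand only.
    A kmer repeated within a single sequence still counts as unique to it.
    """
    owners = defaultdict(set)  # kmer -> names of the sequences containing it
    for name, seq in seqs.items():
        for i in range(len(seq) - k + 1):
            owners[seq[i:i + k]].add(name)

    result = {name: [] for name in seqs}
    for kmer, names in owners.items():
        if len(names) == 1:
            result[next(iter(names))].append(kmer)
    return result


if __name__ == "__main__":
    toy = {"rep_a": "ACGTA", "rep_b": "CGTAC"}
    print(unique_kmers(toy, 3))
```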

I initially used unique_kmer to identify unique 24-mers, but this took a long time, generated millions of files, and found no unique 24-mers for the majority of sequences. I fell back on jellyfish with 32-mers, followed by a convoluted pipeline: aligning the kmers against the cluster representatives, then filtering the sorted SAM file to keep only non-overlapping kmers. This way, the vast majority of cluster representatives had one or more unique kmers (a mean of 3, and up to hundreds).
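The non-overlapping filter is roughly the greedy selection sketched below. This is a simplified illustration, not my actual pipeline: `pick_non_overlapping` is a hypothetical name, and it assumes the start positions of the kmers aligned to one representative have already been extracted from the SAM file.

```python
def pick_non_overlapping(starts, k):
    """Greedily keep kmer start positions so that no two kept kmers overlap.

    `starts` are non-negative alignment start coordinates on one
    representative sequence; every kmer has length k.
    """
    kept = []
    next_free = 0  # first coordinate not covered by an already kept kmer
    for start in sorted(starts):
        if start >= next_free:
            kept.append(start)
            next_free = start + k
    return kept


if __name__ == "__main__":
    # Positions 10 and 40 overlap an earlier kept 32-mer and are dropped.
    print(pick_non_overlapping([0, 10, 32, 40, 64], 32))
```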

I applied fastv with minimal filtering and the lowest thresholds (-A -G -Q -L -p 0.001 -d 0.001), but only ~100k of the ~3.4 million cluster representatives are ever identified across all >200 samples.

I tested further by extracting only unique kmers from one sample and testing them against the sample reads: no hits!

Yet, when I search with seqkit, I find that the sequence file does indeed contain this kmer three times: "ATGAAATTCCATGGAATGGAATGGAATGGAAA"

e.g.
@K00150:405:HGVM5BBXY:6:2119:16741:38381/1
ATGAAATTCCATGGAATGGAATGGAATGGAATGGAATGGAATGGAATGGAATGGAATGGA*ATGAAATTCCATGGAATGGAATGGAATGGAAA*AGAATGGAATGGAATGAAATTCCATGGAATGGAATGGAATGGAATGGAATGAAATTCC
+
AAAFFJJJJJFAJJJJFJAFAAFJJFAFFJJAJJFF-FJAJ<JFJFJFFJFJJJFFAJJJFFJJJJJJJJFFJFFAAFJJFAFJFA-FJJFFJJJJFFFJFFJJJJFJFJJFJJJJFJJJF7FJFFFJJJF7FJJAAAFJF7AFJJFFAF
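As an independent check that does not rely on seqkit, a kmer can be counted in a read on both strands with a few lines of Python. This is only a sketch with hypothetical helper names, and it says nothing about how fastv itself matches kmers. Note that the asterisks in the read above only mark the kmer and are not part of the sequence.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def count_overlapping(seq, kmer):
    """Count occurrences of kmer in seq, including overlapping ones."""
    count = 0
    pos = seq.find(kmer)
    while pos != -1:
        count += 1
        pos = seq.find(kmer, pos + 1)
    return count


def count_both_strands(seq, kmer):
    """Return (forward, reverse) hit counts of kmer in seq.

    A kmer equal to its own reverse complement is counted on the
    forward strand only, to avoid reporting the same hits twice.
    """
    forward = count_overlapping(seq, kmer)
    rc = reverse_complement(kmer)
    reverse = 0 if rc == kmer else count_overlapping(seq, rc)
    return forward, reverse


if __name__ == "__main__":
    kmer = "ATGAAATTCCATGGAATGGAATGGAATGGAAA"
    # Short fragment of the read above, around the marked kmer.
    fragment = "GGA" + kmer + "AGA"
    print(count_both_strands(fragment, kmer))
```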

Can you advise why fastv seems not to be detecting it?

Thanks!
