Description
Hello,
Thanks for this amazing tool. I am using fastv in a perhaps unusual way: I want to detect the presence or absence of homologous gene clusters in metagenomic data. I started with ~17 million ORFs from our contigs and clustered them (mmseqs2, 95% identity) into almost 4 million clusters. I want to detect each cluster by detecting one or more unique kmers from its representative sequence.
I initially used unique_kmer to identify unique 24-mers, but this took a long time, generated millions of files, and found no unique 24-mers for the majority of sequences. I fell back on jellyfish with 32-mers and a convoluted pipeline: align the kmers against the cluster representatives, then filter the sorted SAM file to keep only non-overlapping kmers. This way, the vast majority of cluster representatives had one or more unique kmers (a mean of 3, and up to hundreds).
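To make the non-overlapping filter concrete, here is a minimal sketch of the idea rather than my actual pipeline. It assumes the kmer start positions on one cluster representative have already been pulled out of the sorted SAM file; the function name `select_non_overlapping` and the example positions are made up for illustration.

```python
def select_non_overlapping(starts, k=32):
    """Greedily keep kmer start positions whose kmers do not overlap a kept kmer.

    starts: 0-based start positions of candidate kmers on one representative.
    k: kmer length (32 here, matching the jellyfish run).
    """
    kept = []
    next_free = 0  # first position not covered by an already-kept kmer
    for start in sorted(starts):
        if start >= next_free:
            kept.append(start)
            next_free = start + k
    return kept


# Hypothetical start positions for one cluster representative.
example_starts = [0, 5, 32, 40, 64, 100]
print(select_non_overlapping(example_starts))
```

Scanning left to right and always keeping the earliest kmer that fits is a simple way to get a non-overlapping set; it is not necessarily what other tools do.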
I applied fastv with minimal filtering and lowest thresholds (-A -G -Q -L -p 0.001 -d 0.001), but only ~100k of ~3.4 million cluster representatives are ever identified across all >200 samples.
I tested further by extracting only unique kmers from one sample and testing them against the sample reads: no hits!
Yet, when I search with seqkit, I find that the sequence file does indeed contain this kmer three times: "ATGAAATTCCATGGAATGGAATGGAATGGAAA"
e.g.
@K00150:405:HGVM5BBXY:6:2119:16741:38381/1
ATGAAATTCCATGGAATGGAATGGAATGGAATGGAATGGAATGGAATGGAATGGAATGGA*ATGAAATTCCATGGAATGGAATGGAATGGAAA*AGAATGGAATGGAATGAAATTCCATGGAATGGAATGGAATGGAATGGAATGAAATTCC
+
AAAFFJJJJJFAJJJJFJAFAAFJJFAFFJJAJJFF-FJAJ<JFJFJFFJFJJJFFAJJJFFJJJJJJJJFFJFFAAFJJFAFJFA-FJJFFJJJJFFFJFFJJJJFJFJJFJJJJFJJJF7FJFFFJJJF7FJJAAAFJF7AFJJFFAF
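For anyone who wants to reproduce the check without seqkit, the following is a small sketch of an exact-match search on both strands. It assumes the asterisks in the read above are only highlighting and strips them; the helper names `reverse_complement`, `find_kmer` and `hits_on_both_strands` are mine and are not part of fastv or seqkit.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def find_kmer(read, kmer):
    """Return all 0-based start positions of kmer in read, overlaps included."""
    positions = []
    i = read.find(kmer)
    while i != -1:
        positions.append(i)
        i = read.find(kmer, i + 1)
    return positions


def hits_on_both_strands(read, kmer):
    """Exact matches of the kmer and of its reverse complement in the read."""
    return {
        "forward": find_kmer(read, kmer),
        "reverse": find_kmer(read, reverse_complement(kmer)),
    }


kmer = "ATGAAATTCCATGGAATGGAATGGAATGGAAA"
# Read as pasted above; the asterisks are assumed to be highlighting only.
pasted_read = (
    "ATGAAATTCCATGGAATGGAATGGAATGGAATGGAATGGAATGGAATGGAATGGAATGGA"
    "*ATGAAATTCCATGGAATGGAATGGAATGGAAA*"
    "AGAATGGAATGGAATGAAATTCCATGGAATGGAATGGAATGGAATGGAATGAAATTCC"
)
read = pasted_read.replace("*", "")
print(hits_on_both_strands(read, kmer))
```

This only looks for exact matches, so it says nothing about how fastv itself scans reads.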
Can you advise why fastv seems not to be detecting it?
Thanks!