Skip to content

Virfinder AWK Filtering Maybe not Correct #27

@benyoung93

Description

@benyoung93

Hi there :)

First of all, a wonderful collection of tools and a pipeline !! I have had trouble with some if the installation but I am just chopping and cutting the pipeline to fit my needs.

A quick query for the Virfinder step. So I know you want to do q-value < 0.01 and length > 1000bp. I have run the virfinder rscript file and got the output folder. I just want to double check that the filtering awk script is correct.

head P10_virfinder.tsv

"name"	"length"	"score"	"pvalue"	"qvalue"
"1"	"k141_270856 flag=1 multi=2.0000 len=320"	320	0.138513157806992	0.588997020854022	0.56810885275558
"2"	"k141_33860 flag=1 multi=2.0000 len=361"	361	0.248927602838187	0.42085402184707	0.525288351341074
"3"	"k141_237000 flag=1 multi=6.0000 len=478"	478	0.392398453460999	0.271737835153923	0.470800520085438
"4"	"k141_93112 flag=1 multi=5.0000 len=419"	419	0.316555183433318	0.343932472691162	0.499862326324076
"5"	"k141_177752 flag=1 multi=2.0000 len=398"	398	0.461420194784674	0.219920556107249	0.450421369675625
"6"	"k141_321640 flag=1 multi=4.0000 len=383"	383	0.456895550455814	0.222641509433962	0.450905940743768
"7"	"k141_211608 flag=1 multi=2.0000 len=477"	477	0.152712635250371	0.562542204568024	0.561554065996342
"8"	"k141_16930 flag=1 multi=2.0000 len=426"	426	0.165548747858572	0.539602780536246	0.555654697140864
"9"	"k141_8465 flag=1 multi=3.0000 len=431"	431	0.855015816724245	0.0269116186693148	0.268291393791584

The subsequent awk script is then this obviously (ignore all mu '"'"'" this is for a loop script that generates jobs for all of my samples).

cat '"$PALPAL"'_virfinder.tsv | awk -F'\t' '"'"'{ if ($4 <= 0.01) print }'"'"' | awk -F'_' '"'"'{ if ($4 >= 1000) print }'"'"' | cut -f2 | sed "s/\"//g" > '"$PALPAL"'_virfinder_filtered_data.txt

So this awk script is currently firstly using awk on the fourth \t column which is p-value ($4), should this not be $5 as the <0.01 filtering is for q-value ?

Secondly, second awk is extracting the fourth instance after underscore, looking at the output this doesn't seem to be correct. If you are wanting length >1000 would it not be a better idea to do awk -F'\t' '"'"'{ if ($2 >= 1000) print }'"'"'

Looking at the virfinder github it seems the original results format would work with your awk script, but I think a new update may of switched this ?

I used the mamba install in the mudoger install scripts but edited a wee bit (mamba create -n virfinder_env -c bioconda r-virfinder) so I believe both my local and the mudogoer install versions would be the same.

Happy to provide more info if needed, and I apologise in advance if I am barking up the wrong tree, but I was doing some QC and testing and noticed that the awk was not approproate for the output file generated by virfinder :).

Ben

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions