fastq/bam header parsing when unexpected format #15

@lnblum

I was running some external data through the pipeline from fastqs, and the function below assigned "1172 length=111" as the barcode from fastq records that look like this:
@SRR20318439.1 A00536:248:HFHTKDSX3:1:1101:2736:1000 length=111
NCACNAATTNAAACCATTACAACNATNAACACTNTATNATAAATANNCCNANNTNCCANANTATAAAAAACNCATTANTACANTCATTTAAATTATATTAAATTTATACCT
+
#FFF#FFFF#FFFFF:FFFFFFF#FF#FFFFFF#FFF#FFFFFFF##FF#F##F#FFF#F#FFFFFFFFFF#FFFFF#FFFF#FFFFFFFFFFFFFFFFFFFFFFFFFFFF

The external data has no barcode in the header, so the function does not produce a meaningful result. It returned an arbitrary string containing a space, and downstream processes then failed because that space ended up in every command that uses the "barcode" variable. Perhaps some checks could be added for when the fastq/bam headers are not formatted as expected, for cases like external data or external users. I could not find any barcodes in the metadata, so I could not add them to the original data before running.

barcodes_from_fastq () {

    set +o pipefail

    zcat -f $1 \
    | head -n10000 \
    | awk '{
        if (NR%4==1) {
            split($0, parts, ":");
            arr[ parts[ length(parts) ] ]++
        }} END { for (i in arr) {print arr[i]"\\t"i} }' \
    | sort -k1nr | head -n1 | cut -f2
    # | tr -c "[ACGTN]" "\\t"

    set -o pipefail

}
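
One possible check, as a rough sketch rather than a tested fix: validate the extracted string before using it as a barcode. This assumes a real barcode consists only of A/C/G/T/N, optionally as two indexes joined by "+" or "-" (as in Illumina-style headers ending in something like 1:N:0:ACGTACGT+TGCATGCA). The names is_valid_barcode and barcodes_from_fastq_checked and the "NOBARCODE" placeholder are hypothetical and not part of the pipeline.

```shell
# Hypothetical helper: succeeds only if the candidate looks like an index
# sequence, e.g. "ACGTNACG" or the dual-index form "ACGTACGT+TGCATGCA".
is_valid_barcode () {
    printf '%s\n' "$1" | grep -Eq '^[ACGTN]+([+-][ACGTN]+)?$'
}

# Sketch of a checked variant of barcodes_from_fastq. Prints the most common
# last ":"-separated header field among the first 2500 records if it looks
# like a barcode; otherwise warns on stderr and prints the placeholder
# "NOBARCODE" (an assumption, chosen because it contains no spaces).
barcodes_from_fastq_checked () {
    local candidate
    # "|| true": head closes the pipe early, which can make the pipeline
    # report failure when the caller has pipefail enabled.
    candidate=$(
        zcat -f "$1" \
        | head -n10000 \
        | awk 'NR % 4 == 1 {
                   n = split($0, parts, ":")
                   counts[parts[n]]++
               }
               END { for (bc in counts) print counts[bc] "\t" bc }' \
        | sort -k1,1nr | head -n1 | cut -f2
    ) || true

    if is_valid_barcode "$candidate"; then
        printf '%s\n' "$candidate"
    else
        echo "WARNING: no barcode found in headers of $1 (got '$candidate')" >&2
        printf 'NOBARCODE\n'
    fi
}
```

With the record above, the last ":"-separated field is "1000 length=111", which fails the check, so the function would emit the space-free placeholder instead of passing a string with a space downstream. Whether to use a placeholder or to stop with an error is a design choice for the maintainers.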
