-
Notifications
You must be signed in to change notification settings - Fork 0
Bioinformatics_log_book
We have the following file:
grep ">" -m5 PROKKA_04032020.faa
>IAGIMNNI_00001 hypothetical protein
>IAGIMNNI_00002 hypothetical protein
>IAGIMNNI_00003 hypothetical protein
>IAGIMNNI_00004 DNA double-strand break repair Rad50 ATPase
>IAGIMNNI_00005 (R)-citramalate synthase CimA
We want to obtain the following structure:
IAGIMNNI_00001,IAGIMNNI
IAGIMNNI_00002,IAGIMNNI
IAGIMNNI_00003,IAGIMNNI
IAGIMNNI_00004,IAGIMNNI
IAGIMNNI_00005,IAGIMNNI`
For that we do the following command
(prokka) [rafaellp@ocean Prokka]$ grep -m5 '>' ../Prokka/*.faa | sed 's/\(>.*\)\(_[0-9]*\)\( .*\)/\1\2,\1/' | sed 's/>//g'
IAGIMNNI_00001,IAGIMNNI
IAGIMNNI_00002,IAGIMNNI
IAGIMNNI_00003,IAGIMNNI
IAGIMNNI_00004,IAGIMNNI
IAGIMNNI_00005,IAGIMNNI
From the second sed we have the following explanation
s/ # substitute
\(>.*\) # indicates the fist buffer memory. You indicate with \(YOUR CONTENT). In our case is > and everything after that
\(_[0-9]*\) # indicates the second buffer memory. sed can have until 9
\( .*\) # third buffer expression
/\1\2,\1/' #how substitution works: First the first buffer memory \1, then the 2nd one\2, then a,and finally the first one again\1`
Annotation of Pescadero bins
pwd
/export/data1/projects/rlperez/Pescadero/Metagenomes/Bins_to_analyze
[rafaellp@ocean Genomes]$ ls *.fasta > List_genomes
[rafaellp@ocean Genomes]$ sed -i 's/\.fasta//' List_genomes
[rafaellp@ocean Annotation]$ for i in `cat ../Genomes/List_genomes` ; do mkdir $i ; done
I need first to have some databases prepared like COGs. Download COGsoft from here and the COG database.
For the COG database some processing is required
(prokka) [rafaellp@ocean COG2014]$ pwd
/export/data1/projects/rlperez/Resources/database/COG2014
(prokka) [rafaellp@ocean Blast_database]$ makeblastdb -in ../prot2003-2014.fa -dbtype prot -out COG_blast_db -parse_seqids
Building a new DB, current time: 04/03/2020 09:12:15
New DB name: /export/data1/projects/rlperez/Resources/database/COG2014/Blast_database/COG_blast_db
New DB title: ../prot2003-2014.fa
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 1785722 sequences in 138.585 seconds.
(prokka) [rafaellp@ocean Blast_database]$ ls
COG_blast_db.phr COG_blast_db.pin COG_blast_db.pnd COG_blast_db.pni COG_blast_db.pog COG_blast_db.psd COG_blast_db.psi COG_blast_db.psq
(prokka) [rafaellp@ocean COG2014]$ cut -d ',' -f-1,-2 cog2003-2014.csv > cog2003-2014_only_names.csv
After that I run the following script
The script worked well except the COG part. I discovered that with the prokka environment the blast version is 2.9.0+ and we need a lower one. Therefore I modify to get a lower version blast/2.2.29+. Script is now updated
Annotation of Pescadero bins, archaea
Establish an arCOGs database
(prokka) [rafaellp@ocean arCOG]$ makeblastdb -in ar14.fa -dbtype prot -out BLAST_database/arCOGs_BLAST_database -parse_seqids
Building a new DB, current time: 04/06/2020 07:53:21
New DB name: BLAST_database/arCOGs_BLAST_database
New DB title: ar14.fa
Sequence type: Protein
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
You need a file arcCOG.p2o.csv that has: 'prot-ID,Organism-ID' from the arCOG database.
(prokka) [rafaellp@ocean arCOG]$ awk -F ',' '{ print $3,$2 }' ar14.arCOG.csv | sed 's/ /,/' > arcCOG.p2o.csv
(prokka) [rafaellp@ocean arCOG]$ head arcCOG.p2o.csv
332798015,Acidianus_hospitalis_W1_uid66875
302348728,Acidilobus_saccharovorans_345_15_uid51395
302348796,Acidilobus_saccharovorans_345_15_uid51395
302348989,Acidilobus_saccharovorans_345_15_uid51395
432328170,Aciduliprofundum_MAR08_339_uid184407
432328361,Aciduliprofundum_MAR08_339_uid184407
289596272,Aciduliprofundum_boonei_T469_uid43333
289596494,Aciduliprofundum_boonei_T469_uid43333
14600381,Aeropyrum_pernix_K1_uid57757
11499137,Archaeoglobus_fulgidus_DSM_4304_uid57717
(prokka) [rafaellp@ocean arCOG]$ tail arcCOG.p2o.csv
490715806,nanoarchaeote_Nst1
490715807,nanoarchaeote_Nst1
490715808,nanoarchaeote_Nst1
490715809,nanoarchaeote_Nst1
490715810,nanoarchaeote_Nst1
490715814,nanoarchaeote_Nst1
490715817,nanoarchaeote_Nst1
490715818,nanoarchaeote_Nst1
490715819,nanoarchaeote_Nst1
490715820,nanoarchaeote_Nst1
I see that the error is because the database file ar14.arCOG.csv is not 100% correct. At the tail of the file, there are some proteins with non assigned arCOGs #Example
rperez@aphros:/scratch2/rperez/databases/arCOGs$ tail ar14.arCOG.csv 490715806,nanoarchaeote_Nst1,490715806,537,1,537 490715807,nanoarchaeote_Nst1,490715807,402,1,402 490715808,nanoarchaeote_Nst1,490715808,196,1,196 490715809,nanoarchaeote_Nst1,490715809,43,1,43 490715810,nanoarchaeote_Nst1,490715810,97,1,97 490715814,nanoarchaeote_Nst1,490715814,128,1,128 490715817,nanoarchaeote_Nst1,490715817,345,1,345 490715818,nanoarchaeote_Nst1,490715818,215,1,215 490715819,nanoarchaeote_Nst1,490715819,412,1,412 490715820,nanoarchaeote_Nst1,490715820,257,1,257
There is an error in the file ar14.arCOG.csv since at the tail of the file, there are some proteins with non assigned arCOGs. In older versions of this database, when proteins had no COG assigned, they had additional ,,,, indicating that some columns were blanck. I formatted the database like this, where ! indicates #opposite of the string search
(prokka) [rafaellp@ocean arCOG]$ sed '/arCOG/! s/$/,,,,/' ar14.arCOG.csv > ar14.arCOG_modified.csv #BEFORE (prokka) [rafaellp@ocean arCOG]$ tail ar14.arCOG.csv 490715806,nanoarchaeote_Nst1,490715806,537,1,537 490715807,nanoarchaeote_Nst1,490715807,402,1,402 490715808,nanoarchaeote_Nst1,490715808,196,1,196 490715809,nanoarchaeote_Nst1,490715809,43,1,43 490715810,nanoarchaeote_Nst1,490715810,97,1,97 490715814,nanoarchaeote_Nst1,490715814,128,1,128 490715817,nanoarchaeote_Nst1,490715817,345,1,345 490715818,nanoarchaeote_Nst1,490715818,215,1,215 490715819,nanoarchaeote_Nst1,490715819,412,1,412 490715820,nanoarchaeote_Nst1,490715820,257,1,257
#AFTER (prokka) [rafaellp@ocean arCOG]$ tail ar14.arCOG_modified.csv 490715806,nanoarchaeote_Nst1,490715806,537,1,537,,,, 490715807,nanoarchaeote_Nst1,490715807,402,1,402,,,, 490715808,nanoarchaeote_Nst1,490715808,196,1,196,,,, 490715809,nanoarchaeote_Nst1,490715809,43,1,43,,,, 490715810,nanoarchaeote_Nst1,490715810,97,1,97,,,, 490715814,nanoarchaeote_Nst1,490715814,128,1,128,,,, 490715817,nanoarchaeote_Nst1,490715817,345,1,345,,,, 490715818,nanoarchaeote_Nst1,490715818,215,1,215,,,, 490715819,nanoarchaeote_Nst1,490715819,412,1,412,,,, 490715820,nanoarchaeote_Nst1,490715820,257,1,257,,,,
### 07.04.2020
Aphros: Gunter project
### 08.04.2020
Check the ANME bins from the Costa Rica project. Do quick Prokka annotation and import to Geneious. Comparison of ANI with the HMMV reference genome.