Skip to content

Bioinformatics_log_book

Rafael Laso Pérez edited this page Apr 8, 2020 · 8 revisions

Useful commands

Using the buffer memory from sed

We have the following file:

grep ">" -m5 PROKKA_04032020.faa
>IAGIMNNI_00001 hypothetical protein
>IAGIMNNI_00002 hypothetical protein
>IAGIMNNI_00003 hypothetical protein
>IAGIMNNI_00004 DNA double-strand break repair Rad50 ATPase
>IAGIMNNI_00005 (R)-citramalate synthase CimA

We want to obtain the following structure:

IAGIMNNI_00001,IAGIMNNI
IAGIMNNI_00002,IAGIMNNI
IAGIMNNI_00003,IAGIMNNI
IAGIMNNI_00004,IAGIMNNI
IAGIMNNI_00005,IAGIMNNI`

For that we do the following command

(prokka) [rafaellp@ocean Prokka]$ grep -m5 '>' ../Prokka/*.faa | sed 's/\(>.*\)\(_[0-9]*\)\( .*\)/\1\2,\1/' | sed 's/>//g'  
IAGIMNNI_00001,IAGIMNNI
IAGIMNNI_00002,IAGIMNNI
IAGIMNNI_00003,IAGIMNNI
IAGIMNNI_00004,IAGIMNNI
IAGIMNNI_00005,IAGIMNNI

From the second sed we have the following explanation

s/ # substitute \(>.*\) # indicates the fist buffer memory. You indicate with \(YOUR CONTENT). In our case is > and everything after that \(_[0-9]*\) # indicates the second buffer memory. sed can have until 9 \( .*\) # third buffer expression /\1\2,\1/' #how substitution works: First the first buffer memory \1, then the 2nd one\2, then a,and finally the first one again\1`

Logbook

03.04.2020

Annotation of Pescadero bins

pwd
/export/data1/projects/rlperez/Pescadero/Metagenomes/Bins_to_analyze

[rafaellp@ocean Genomes]$ ls *.fasta > List_genomes
[rafaellp@ocean Genomes]$ sed -i 's/\.fasta//' List_genomes 

[rafaellp@ocean Annotation]$ for i in `cat ../Genomes/List_genomes` ; do mkdir $i ; done

I need first to have some databases prepared like COGs. Download COGsoft from here and the COG database.

For the COG database some processing is required

(prokka) [rafaellp@ocean COG2014]$ pwd
/export/data1/projects/rlperez/Resources/database/COG2014

(prokka) [rafaellp@ocean Blast_database]$ makeblastdb -in ../prot2003-2014.fa -dbtype prot -out COG_blast_db -parse_seqids

Building a new DB, current time: 04/03/2020 09:12:15
New DB name:   /export/data1/projects/rlperez/Resources/database/COG2014/Blast_database/COG_blast_db
New DB title:  ../prot2003-2014.fa
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 1785722 sequences in 138.585 seconds.
(prokka) [rafaellp@ocean Blast_database]$ ls
COG_blast_db.phr  COG_blast_db.pin  COG_blast_db.pnd  COG_blast_db.pni  COG_blast_db.pog  COG_blast_db.psd  COG_blast_db.psi  COG_blast_db.psq

(prokka) [rafaellp@ocean COG2014]$ cut -d ',' -f-1,-2 cog2003-2014.csv > cog2003-2014_only_names.csv


After that I run the following script

The script worked well except the COG part. I discovered that with the prokka environment the blast version is 2.9.0+ and we need a lower one. Therefore I modify to get a lower version blast/2.2.29+. Script is now updated

06.04.2020

Annotation of Pescadero bins, archaea

Establish an arCOGs database

(prokka) [rafaellp@ocean arCOG]$ makeblastdb -in ar14.fa -dbtype prot -out BLAST_database/arCOGs_BLAST_database -parse_seqids


Building a new DB, current time: 04/06/2020 07:53:21
New DB name:   BLAST_database/arCOGs_BLAST_database
New DB title:  ar14.fa
Sequence type: Protein
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B

You need a file arcCOG.p2o.csv that has: 'prot-ID,Organism-ID' from the arCOG database.

(prokka) [rafaellp@ocean arCOG]$ awk -F ',' '{ print $3,$2 }' ar14.arCOG.csv | sed 's/ /,/' > arcCOG.p2o.csv
(prokka) [rafaellp@ocean arCOG]$ head arcCOG.p2o.csv 
332798015,Acidianus_hospitalis_W1_uid66875
302348728,Acidilobus_saccharovorans_345_15_uid51395
302348796,Acidilobus_saccharovorans_345_15_uid51395
302348989,Acidilobus_saccharovorans_345_15_uid51395
432328170,Aciduliprofundum_MAR08_339_uid184407
432328361,Aciduliprofundum_MAR08_339_uid184407
289596272,Aciduliprofundum_boonei_T469_uid43333
289596494,Aciduliprofundum_boonei_T469_uid43333
14600381,Aeropyrum_pernix_K1_uid57757
11499137,Archaeoglobus_fulgidus_DSM_4304_uid57717
(prokka) [rafaellp@ocean arCOG]$ tail arcCOG.p2o.csv 
490715806,nanoarchaeote_Nst1
490715807,nanoarchaeote_Nst1
490715808,nanoarchaeote_Nst1
490715809,nanoarchaeote_Nst1
490715810,nanoarchaeote_Nst1
490715814,nanoarchaeote_Nst1
490715817,nanoarchaeote_Nst1
490715818,nanoarchaeote_Nst1
490715819,nanoarchaeote_Nst1
490715820,nanoarchaeote_Nst1

I see that the error is because the database file ar14.arCOG.csv is not 100% correct. At the tail of the file, there are some proteins with non assigned arCOGs #Example

rperez@aphros:/scratch2/rperez/databases/arCOGs$ tail ar14.arCOG.csv 490715806,nanoarchaeote_Nst1,490715806,537,1,537 490715807,nanoarchaeote_Nst1,490715807,402,1,402 490715808,nanoarchaeote_Nst1,490715808,196,1,196 490715809,nanoarchaeote_Nst1,490715809,43,1,43 490715810,nanoarchaeote_Nst1,490715810,97,1,97 490715814,nanoarchaeote_Nst1,490715814,128,1,128 490715817,nanoarchaeote_Nst1,490715817,345,1,345 490715818,nanoarchaeote_Nst1,490715818,215,1,215 490715819,nanoarchaeote_Nst1,490715819,412,1,412 490715820,nanoarchaeote_Nst1,490715820,257,1,257


There is an error in the file ar14.arCOG.csv since at the tail of the file, there are some proteins with non assigned arCOGs. In older versions of this database, when proteins had no COG assigned, they had additional ,,,, indicating that some columns were blanck. I formatted the database like this, where ! indicates #opposite of the string search

(prokka) [rafaellp@ocean arCOG]$ sed '/arCOG/! s/$/,,,,/' ar14.arCOG.csv > ar14.arCOG_modified.csv #BEFORE (prokka) [rafaellp@ocean arCOG]$ tail ar14.arCOG.csv 490715806,nanoarchaeote_Nst1,490715806,537,1,537 490715807,nanoarchaeote_Nst1,490715807,402,1,402 490715808,nanoarchaeote_Nst1,490715808,196,1,196 490715809,nanoarchaeote_Nst1,490715809,43,1,43 490715810,nanoarchaeote_Nst1,490715810,97,1,97 490715814,nanoarchaeote_Nst1,490715814,128,1,128 490715817,nanoarchaeote_Nst1,490715817,345,1,345 490715818,nanoarchaeote_Nst1,490715818,215,1,215 490715819,nanoarchaeote_Nst1,490715819,412,1,412 490715820,nanoarchaeote_Nst1,490715820,257,1,257

#AFTER (prokka) [rafaellp@ocean arCOG]$ tail ar14.arCOG_modified.csv 490715806,nanoarchaeote_Nst1,490715806,537,1,537,,,, 490715807,nanoarchaeote_Nst1,490715807,402,1,402,,,, 490715808,nanoarchaeote_Nst1,490715808,196,1,196,,,, 490715809,nanoarchaeote_Nst1,490715809,43,1,43,,,, 490715810,nanoarchaeote_Nst1,490715810,97,1,97,,,, 490715814,nanoarchaeote_Nst1,490715814,128,1,128,,,, 490715817,nanoarchaeote_Nst1,490715817,345,1,345,,,, 490715818,nanoarchaeote_Nst1,490715818,215,1,215,,,, 490715819,nanoarchaeote_Nst1,490715819,412,1,412,,,, 490715820,nanoarchaeote_Nst1,490715820,257,1,257,,,,


### 07.04.2020

Aphros: Gunter project

### 08.04.2020  

Check the ANME bins from the Costa Rica project. Do quick Prokka annotation and import to Geneious. Comparison of ANI with the HMMV reference genome.

Clone this wiki locally