21. DESMAN
We apply the DESMAN pipeline to a set of Metagenome Assembled Genomes (MAGs) of interest. Code is adapted from the DESMAN tutorial.
Notice: If the AddLengths.pl script returns a wrongly formatted file (i.e., lines not of the form `contig_x 1 1293`), run `dos2unix *` on the input files.
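A quick sanity check is possible with standard tools. The helper below is a hypothetical sketch (`check_len_format` is not part of the pipeline): it assumes the expected format is three whitespace-separated fields per line, like `contig_x 1 1293`, and also flags carriage returns, which are what `dos2unix` removes.

```shell
# check_len_format FILE -- sanity-check a .len / _contigs.tsv file.
# Warns about carriage returns (fixed by dos2unix) and about lines that
# do not have exactly three whitespace-separated fields.
check_len_format() {
    local f=$1
    if grep -q $'\r' "$f"; then
        echo "WARN: $f contains carriage returns; run dos2unix on it"
    fi
    awk 'NF != 3 { printf "WARN: %s line %d has %d fields\n", FILENAME, NR, NF }' "$f"
}
```

Run it on each `*-simplified.len` and `*-simplified_contigs.tsv` before moving on to Step 2.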
These are the modules/software you will need to load:
module load prodigal ncbi-blast python-anaconda2/201607 biopython/1.60 samtools/1.5 R gsl
The code below assumes that all your MAG fasta files have the .fna extension and are located in the folder where you will run the shell scripts. In principle you can run the scripts one after the other without having to change or manually adjust anything. Step 6, where DESMAN itself is run, is the most time-intensive and can take several days per MAG, scaling roughly with the number of variant positions. To keep this manageable, run many DESMAN runs in parallel and reduce the number of strains to test for (e.g., 5 instead of 12). Each replicate run can take as little as 20 minutes or as long as 2 hours.
---- STEP 1 ---
#!/bin/bash
for bin in `ls *.fna`; do
MAG=${bin%.fna}
# Start by removing everything after first space in header of fasta file.
awk '{print $1}' ${MAG}.fna > ${MAG}-simplified.fna
# prodigal/2.6.3
prodigal -i ${MAG}-simplified.fna -a ./Gene_calls/${MAG}-simplified.faa -d ./Gene_calls/${MAG}-simplified.genes.fna -f gff -p single -o ./Gene_calls/${MAG}-simplified.gff > p.out
# Have ncbi-blast and parallel loaded
export COGSDB_DIR=/scratch/vdenef_fluxm/rprops/DESMAN/DESMAN_MAG8/rpsblast_cog_db
/home/rprops/CONCOCT/scripts/RPSBLAST.sh -f ./Gene_calls/${MAG}-simplified.faa -p -c 32 -r 1
# Now we assign COGs and genes to contigs using one of these scripts:
# Annotate genes
python /home/rprops/DESMAN/scripts/ExtractCogs.py -b ./Gene_calls/${MAG}-simplified.out --cdd_cog_file /home/rprops/CONCOCT/scgs/cdd_to_cog.tsv -g ./Gene_calls/${MAG}-simplified.gff > ./Annotate/${MAG}-simplified.cogs
python /home/rprops/DESMAN/scripts/ExtractGenes.py -g ./Gene_calls/${MAG}-simplified.gff > ./Annotate/${MAG}-simplified.genes
# Get contig lengths
python /home/rprops/DESMAN/scripts/Lengths.py -i ${MAG}-simplified.fna > ${MAG}-simplified.len
# Split bam files
grep ">" ${MAG}-simplified.fna | sed 's/>//g' > ${MAG}-simplified_contigs.txt
dos2unix ${MAG}-simplified_contigs.txt
dos2unix ${MAG}-simplified.len
/home/rprops/StrainMetaSim/scripts/AddLengths.pl ${MAG}-simplified.len < ${MAG}-simplified_contigs.txt > ${MAG}-simplified_contigs.tsv
done
mkdir -p SplitBam
mkdir -p ReadcountFilter
---- STEP 2 ---
#!/bin/bash
set -e
for contig_list in `ls *-simplified_contigs.tsv`
do
MAG=${contig_list%-simplified_contigs.tsv}
for sample_bam in `ls /scratch/vdenef_fluxm/rprops/DESMAN/metaG/vizbin_rebin_anvio_v230/map/*mapped.sorted.bam`
do
stub1=${sample_bam##*/}
stub=${stub1%.sorted.bam}
echo $stub
samtools view -bhL ${MAG}-simplified_contigs.tsv $sample_bam > ./SplitBam/${stub}.${MAG}.bam
done
done
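If a MAG's contig list matched no reads in a sample, the corresponding split BAM will be (nearly) empty. The helper below is an illustrative sketch, not part of the original pipeline (`flag_small_files` is a made-up name): it flags files under a directory that fall below a size threshold.

```shell
# flag_small_files DIR MIN_BYTES -- list regular files in DIR smaller
# than MIN_BYTES; useful for spotting split BAMs with no matching reads.
flag_small_files() {
    local dir=$1 min_bytes=$2 f
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        if [ "$(wc -c < "$f")" -lt "$min_bytes" ]; then
            echo "WARN: $f is smaller than ${min_bytes} bytes"
        fi
    done
}
# Usage: flag_small_files ./SplitBam 1000
```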
---- STEP 3 ---
#!/bin/bash
set -e
for contig_list in `ls *-simplified_contigs.tsv`
do
MAG=${contig_list%-simplified_contigs.tsv}
if [ ! -d "./ReadcountFilter/${MAG}" ]; then mkdir ./ReadcountFilter/${MAG}; fi
echo $MAG
for bamfile in `ls ./SplitBam/*.${MAG}.bam`
do
stub1=${bamfile##*/}
stub=${stub1%.bam}
echo $stub
(samtools index $bamfile; bam-readcount -w 1 -q 20 -l "./${MAG}-simplified_contigs.tsv" -f ./${MAG}-simplified.fna $bamfile 2> ./ReadcountFilter/${MAG}/${stub}.err > ./ReadcountFilter/${MAG}/${stub}.cnt)&
done
# Wait for all background bam-readcount jobs to finish before compressing
wait
# Don't forget to zip the .cnt files
gzip ./ReadcountFilter/${MAG}/*cnt
done
---- STEP 4 ---
# Get core COGs
#!/bin/bash
set -e
for contig_list in `ls *-simplified_contigs.tsv`
do
MAG=${contig_list%-simplified_contigs.tsv}
/home/rprops/DESMAN/scripts/SelectContigsPos.pl /home/rprops/StrainMetaSim/config/cogs.txt < ./Annotate/${MAG}-simplified.cogs > ./Annotate/${MAG}_core.cogs
done
---- STEP 5 ---
#!/bin/bash
set -e
if [ ! -d "./Variants" ];
then
mkdir ./Variants;
fi
for contig_list in `ls *-simplified_contigs.tsv`
do
MAG=${contig_list%-simplified_contigs.tsv}
if [ ! -d "./Variants/${MAG}" ];
then
mkdir ./Variants/${MAG};
fi
# Get frequency counts
python /home/rprops/DESMAN/scripts/ExtractCountFreqGenes.py ./Annotate/${MAG}_core.cogs ./ReadcountFilter/${MAG} --output_file ./Variants/${MAG}/${MAG}.freq > ./Variants/${MAG}/${MAG}.log.txt
# Variant detection
python /home/rprops/DESMAN/desman/Variant_Filter.py ./Variants/${MAG}/${MAG}.freq -p -o ./Variants/${MAG}/${MAG}_scg.freq -m 1.0 -f 25.0 -c -sf 0.80 -t 2.5 > ./Variants/${MAG}/${MAG}_scg.freq.vout
done
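Since Step 6's runtime scales roughly with the number of variant positions, it is worth counting them per MAG before launching the full grid of runs. The sketch below assumes the `sel_var.csv` produced above has one header line plus one line per variant position; verify against your own output before relying on it.

```shell
# count_variants CSV -- number of variant positions in a sel_var.csv,
# assuming one header row plus one row per variant position.
count_variants() {
    local csv=$1
    if [ ! -s "$csv" ]; then
        echo 0
        return
    fi
    echo $(( $(wc -l < "$csv") - 1 ))
}
# Usage: count_variants ./Variants/MAG8/MAG8_scg.freqsel_var.csv
```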
---- STEP 6 ---
* Make sure to have `gsl` and `ncbi-blast` loaded.
#!/bin/bash
set -e
for contig_list in `ls *-simplified_contigs.tsv`
do
MAG=${contig_list%-simplified_contigs.tsv}
echo $MAG
varFile=./Variants/${MAG}/${MAG}_scg.freqsel_var.csv
eFile=./Variants/${MAG}/${MAG}_scg.freqtran_df.csv
for g in 1 2 3 4 5 6 7 8 9 10 11 12
do
for r in 0 1 2 3 4 5 6 7 8 9
do
echo $g
desman $varFile -e $eFile -o ./Variants/${MAG}/${MAG}_${g}_${r} -g $g -s $r -m 1.0 > ./Variants/${MAG}/${MAG}_${g}_${r}.out
done
done
done
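As noted in the introduction, Step 6 benefits greatly from running replicates in parallel. The sketch below shows one hypothetical way to throttle the g x r grid to a fixed number of concurrent jobs using plain bash background jobs; `run_grid` and `run_one` are made-up helper names, and the `desman` invocation mirrors the loop above.

```shell
# run_grid CMD G_MAX R_MAX MAXJOBS -- call "CMD g r" for every strain count
# g in 1..G_MAX and seed r in 0..R_MAX, keeping at most MAXJOBS jobs running.
run_grid() {
    local cmd=$1 g_max=$2 r_max=$3 maxjobs=$4 g r
    for g in $(seq 1 "$g_max"); do
        for r in $(seq 0 "$r_max"); do
            "$cmd" "$g" "$r" &
            # throttle: pause while MAXJOBS jobs are already running
            while [ "$(jobs -rp | wc -l)" -ge "$maxjobs" ]; do sleep 1; done
        done
    done
    wait   # block until every replicate has finished
}

# Wrapper for a single DESMAN replicate, with paths as in the loop above:
run_one() {
    desman "$varFile" -e "$eFile" -o "./Variants/${MAG}/${MAG}_$1_$2" \
           -g "$1" -s "$2" -m 1.0 > "./Variants/${MAG}/${MAG}_$1_$2.out"
}
# Usage: run_grid run_one 5 9 8   # g = 1..5, r = 0..9, 8 jobs at a time
```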
---- STEP 7 ---
Format the output files for each replicate run and determine the optimal number of strains using the resolvenhap.py script.
#!/bin/bash
set -e
for contig_list in `ls *-simplified_contigs.tsv`
do
MAG=${contig_list%-simplified_contigs.tsv}
echo $MAG
cat ./Variants/${MAG}/*/fit.txt | cut -d"," -f2- > ./Variants/${MAG}/${MAG}_Dev.csv
sed -i '1iH,G,LP,Dev' ./Variants/${MAG}/${MAG}_Dev.csv
#/home/rprops/DESMAN/scripts/PlotDev.R -l ./Variants/${MAG}/${MAG}_Dev.csv -o ./Variants/${MAG}/${MAG}_Dev.pdf
cd ./Variants/${MAG}
python /home/rprops/DESMAN/scripts/resolvenhap.py ${MAG} > ${MAG}_resolvedNhaplotypes.csv
cd -
done
Optionally, plot the deviance:
/home/rprops/DESMAN/scripts/PlotDev.R -l ./Variants/${MAG}/${MAG}_Dev.csv -o ./Variants/${MAG}/${MAG}_Dev.pdf
Put the names of all your best run folders in a file called best_desman_runs.txt in the main directory, one per line. For example:
MAG10_3_2
MAG9_3_2
MAG8_5_2
MAG5_6_2
MAG2_3_2
...
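These run names follow the pattern MAG_G_r (MAG name, strain count g, replicate seed r); the final script recovers the MAG with `${best_run%%_*}`. For reference, here is a sketch of splitting out all three fields with bash parameter expansion, assuming the MAG name itself contains no underscores.

```shell
# parse_run NAME -- split a best-run name like MAG8_5_2 into the MAG,
# strain count (g) and replicate seed (r). Assumes no '_' in the MAG name.
parse_run() {
    local run=$1
    local mag=${run%%_*}   # text before the first underscore
    local rest=${run#*_}   # "g_r"
    echo "$mag ${rest%%_*} ${rest##*_}"
}
# Usage: parse_run MAG8_5_2   -> prints "MAG8 5 2"
```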
#!/bin/bash
set -e
if [ ! -d "./VariantsAll" ];
then
mkdir ./VariantsAll;
fi
if [ ! -d "./Gene_calls_step8" ];
then
mkdir ./Gene_calls_step8;
fi
for contig_list in `ls *-simplified_contigs.tsv`
do
MAG=${contig_list%-simplified_contigs.tsv}
echo $MAG
# Then we collate the count files together filtering to genes greater than 500bp:
python /home/rprops/DESMAN/scripts/ExtractCountFreqGenes.py -g ./Annotate/${MAG}-simplified.genes ./ReadcountFilter/${MAG} --output_file ./Gene_calls_step8/${MAG}_cluster_esc3.freq
# The -g flag here tells the script to expect gene positions in a slightly different format to the cog file used above.
# Now we find variants again, this time insisting on a minimum frequency of 3% and not filtering on sample coverage:
python /home/rprops/DESMAN/desman/Variant_Filter.py ./Gene_calls_step8/${MAG}_cluster_esc3.freq -o ./VariantsAll/${MAG}_ -m 0.0 -v 0.03
# To assign contigs we also need individual gene coverages, for consistency we generate these from the aggregated count files:
python /home/rprops/DESMAN/scripts/CalcGeneCov.py ./Gene_calls_step8/${MAG}_cluster_esc3.freq > ./Gene_calls_step8/${MAG}_cluster_esc3_cov.csv
# Get list of core COGs:
cut -d"," -f5 ./Annotate/${MAG}_core.cogs > ./Annotate/${MAG}_core_genes.txt
# Calculate coverage on core genes:
python /home/rprops/DESMAN/scripts/CalcDelta.py ./Gene_calls_step8/${MAG}_cluster_esc3_cov.csv ./Annotate/${MAG}_core_genes.txt ${MAG}_core_cov.txt
done
#!/bin/bash
set -e
if [ ! -d "./DESMAN_gene_assignment" ];
then
mkdir ./DESMAN_gene_assignment;
fi
for best_run in `cat best_desman_runs.txt`
do
MAG=${best_run%%_*}
echo $MAG
# Then we run the gene/contig assignment algorithm.
export SEL_RUN=./Variants/${MAG}/${best_run}
# Run gene assignment
python /home/rprops/DESMAN/desman/GeneAssign.py ./${MAG}_core_cov.txtmean_sd_df.csv $SEL_RUN/Gamma_star.csv ./Gene_calls_step8/${MAG}_cluster_esc3_cov.csv $SEL_RUN/Eta_star.csv -m 20 -v ./VariantsAll/${MAG}_sel_var.csv -o ./DESMAN_gene_assignment/${MAG}_ --assign_tau > ./DESMAN_gene_assignment/${MAG}.cout
done
This should generate the following output files:
*_log_file.txt: a log file.
*eta_df.csv: the raw (non-discretised) gene assignments from NMF, useful for identifying multicopy genes.
*etaD_df.csv: as above, but with the NMF predictions discretised.
*etaS_df.csv: predictions from the Gibbs sampler, selecting the run with the maximum log posterior.
*etaM_df.csv: mean posterior predictions from the Gibbs sampler.
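To get a quick overview of the assignments, you can tally how many genes each strain received. The awk helper below is a sketch that assumes a layout of one header row followed by one row per gene, with the gene identifier first and one 0/1 column per strain; check the header of your own etaD_df.csv before trusting the counts.

```shell
# genes_per_strain ETA_CSV -- per-strain gene counts from a discretised
# eta file, assuming: header row, then "gene,<0/1 per strain>,..." rows.
genes_per_strain() {
    awk -F, 'NR > 1 { n = NF; for (i = 2; i <= NF; i++) sum[i] += $i }
             END { for (i = 2; i <= n; i++)
                       printf "strain_%d\t%d\n", i - 2, sum[i] }' "$1"
}
# Usage: genes_per_strain ./DESMAN_gene_assignment/MAG8_etaD_df.csv
```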