Skip to content

21. DESMAN

rprops edited this page Oct 24, 2019 · 6 revisions

Information to run on flux

We apply the DESMAN pipeline to a set of Metagenome Assembled Genome (MAG) of interest. Code is adapted from the DESMAN tutorial.

Notice: If the AddLengths.pl script returns wrongly formatted file e.g., not something like contig_x 1 1293 then run dos2unix *

These are the modules/software you will need to load:

module load prodigal ncbi-blast python-anaconda2/201607 biopython/1.60 samtools/1.5 R gsl

Running the DESMAN workflow

The below code assumes that all your MAGs fasta files have the .fna extension and are located in the folder where you will run the below shell scripts. In principle you can run all scripts after each other, without having to change or manually adjust anything. Step 6, where DESMAN is run is the most time intensive and can take several days per MAG (~number of variant positions). To avoid this try to run in parallel many DESMAN runs and reduce the number of strains to test for (e.g. 5 instead of 12). Time for each replicate run can take up to 2 hours, or as little as 20 mins.

---- STEP 1 ---

#!/bin/bash

for bin in `ls *.fna`; do

MAG=${bin%.fna}

# Start by removing everything after first space in header of fasta file.
awk '{print $1}' ${MAG}.fna > ${MAG}-simplified.fna

# prodigal/2.6.3
prodigal -i ${MAG}-simplified.fna -a ./Gene_calls/${MAG}-simplified.faa -d ./Gene_calls/${MAG}-simplified.genes.fna  -f gff -p single -o ./Gene_calls/${MAG}-simplified.gff > p.out

# Have ncbi-blast and parallel loaded
export COGSDB_DIR=/scratch/vdenef_fluxm/rprops/DESMAN/DESMAN_MAG8/rpsblast_cog_db
/home/rprops/CONCOCT/scripts/RPSBLAST.sh -f ./Gene_calls/${MAG}-simplified.faa -p -c 32 -r 1

# Now we assign COGs and genes to contigs using one of these scripts:
# Annotate genes
python /home/rprops/DESMAN/scripts/ExtractCogs.py -b ./Gene_calls/${MAG}-simplified.out --cdd_cog_file /home/rprops/CONCOCT/scgs/cdd_to_cog.tsv -g ./Gene_calls/${MAG}-simplified.gff > ./Annotate/${MAG}-simplified.cogs
python /home/rprops/DESMAN/scripts/ExtractGenes.py -g ./Gene_calls/${MAG}-simplified.gff > ./Annotate/${MAG}-simplified.genes

# Get contig lengths
python /home/rprops/DESMAN/scripts/Lengths.py -i ${MAG}-simplified.fna > ${MAG}-simplified.len

# Split bam files
grep ">" ${MAG}-simplified.fna | sed 's/>//g' > ${MAG}-simplified_contigs.txt
dos2unix ${MAG}-simplified_contigs.txt
dos2unix ${MAG}-simplified.len
/home/rprops/StrainMetaSim/scripts/AddLengths.pl ${MAG}-simplified.len < ${MAG}-simplified_contigs.txt > ${MAG}-simplified_contigs.tsv

done
mkdir SplitBam
mkdir ReadcountFilter

---- STEP 2 ---

#!/bin/bash
set -e

for contig_list in `ls *-simplified_contigs.tsv`
do
MAG=${contig_list%-simplified_contigs.tsv}
for sample_bam in `ls /scratch/vdenef_fluxm/rprops/DESMAN/metaG/vizbin_rebin_anvio_v230/map/*mapped.sorted.bam`
do
        stub1=${sample_bam##*/}
        stub=${stub1%.sorted.bam}
        echo $stub
        samtools view -bhL ${MAG}-simplified_contigs.tsv $sample_bam > ./SplitBam/${stub}.${MAG}.bam
		done
done

---- STEP 3 ---

#!/bin/bash
set -e

for contig_list in `ls *-simplified_contigs.tsv`
do
MAG=${contig_list%-simplified_contigs.tsv}
if [ ! -d "./ReadcountFilter/${MAG}"]; then mkdir ./ReadcountFilter/${MAG}; fi
echo $MAG
for bamfile in `ls ./SplitBam/*.${MAG}.bam`
        do
        stub1=${bamfile##*/}
        stub=${stub1%.bam}
        echo $stub
        (samtools index $bamfile; bam-readcount -w 1 -q 20 -l "./${MAG}-simplified_contigs.tsv" -f ./${MAG}-simplified.fna $bamfile 2> ./ReadcountFilter/${MAG}/${stub}.err > ./ReadcountFilter/${MAG}/${stub}.cnt)&

done
# Don't forget to zip .cnt files
gzip ./SplitBam/ReadcountFilter/${MAG}/*cnt
done

---- STEP 4 ---

# Get core COGs
#!/bin/bash
set -e

for contig_list in `ls *-simplified_contigs.tsv`
do
MAG=${contig_list%-simplified_contigs.tsv}
/home/rprops/DESMAN/scripts/SelectContigsPos.pl /home/rprops/StrainMetaSim/config/cogs.txt < ./Annotate/${MAG}-simplified.cogs > ./Annotate/${MAG}_core.cogs
done

---- STEP 5 ---

#!/bin/bash
set -e

if [ ! -d "./Variants" ];
then
        mkdir ./Variants;
fi
for contig_list in `ls *-simplified_contigs.tsv`
do
        MAG=${contig_list%-simplified_contigs.tsv}

        if [ ! -d "./Variants/{MAG}" ];
        then
        mkdir ./Variants/${MAG};
        fi

        # Get frequency counts
        python /home/rprops/DESMAN/scripts/ExtractCountFreqGenes.py ./Annotate/${MAG}-simplified_core.cogs ./ReadcountFilter/${MAG} --output_file ./Variants/${MAG}/${MAG}.freq > ./Variants/${MAG}/${MAG}.log.txt

        # Variant detection
        python /home/rprops/DESMAN/desman/Variant_Filter.py ./Variants/${MAG}/${MAG}.freq -p -o ./Variants/${MAG}/${MAG}_scg.freq -m 1.0 -f 25.0 -c -sf 0.80 -t 2.5 > ./Variants/${MAG}/${MAG}_scg.freq.vout

done

---- STEP 6 ---

* Make sure to have `gsl` and `ncbi-blast` loaded.
#!/bin/bash
set -e

for contig_list in `ls *-simplified_contigs.tsv`
do

MAG=${contig_list%-simplified_contigs.tsv}
echo $MAG
varFile=./Variants/${MAG}/${MAG}_scg.freqsel_var.csv
eFile=./Variants/${MAG}/${MAG}_scg.freqtran_df.csv
	for g in 1 2 3 4 5 6 7 8 9 10 11 12
		do
			for r in 0 1 2 3 4 5 6 7 8 9
				do
				echo $g
                desman $varFile -e $eFile -o ./Variants/${MAG}/${MAG}_${g}_${r} -g $g -s $r -m 1.0 > ./Variants/${MAG}/${MAG}_${g}_${r}.out
            done
	done
done

---- STEP 7 ---

Format output files for each replicate run and determine optimal number of strains using the resolvenhap.py script.

#!/bin/bash
set -e

for contig_list in `ls *-simplified_contigs.tsv`
do
        MAG=${contig_list%-simplified_contigs.tsv}
        echo $MAG
        cat ./Variants/${MAG}/*/fit.txt | cut -d"," -f2- > ./Variants/${MAG}/${MAG}_Dev.csv
        sed -i '1iH,G,LP,Dev' ./Variants/${MAG}/${MAG}_Dev.csv
        #/home/rprops/DESMAN/scripts/PlotDev.R -l ./Variants/${MAG}/${MAG}_Dev.csv  -o ./Variants/${MAG}/${MAG}_Dev.pdf
        cd ./Variants/${MAG}
        python /home/rprops/DESMAN/scripts/resolvenhap.py ${MAG} > ${MAG}_resolvedNhaplotypes.csv
        cd -
done

Optional

plot posterior deviance vs. number of strains

/home/rprops/DESMAN/scripts/PlotDev.R -l ./Variants/${MAG}/${MAG}_Dev.csv -o ./Variants/${MAG}/${MAG}_Dev.pdf

Infer accessory genomes

Put all your best run folders in one file called best_desman_runs.txt in the main directory. For example:

MAG10_3_2
MAG9_3_2
MAG8_5_2
MAG5_6_2
MAG2_3_2
...
#!/bin/bash
set -e

if [ ! -d "./VariantsAll" ];
then
mkdir ./VariantsAll;
fi

if [ ! -d "./Gene_calls_step8" ];
then
mkdir ./Gene_calls_step8;
fi

for contig_list in `ls *-simplified_contigs.tsv`
do
        MAG=${contig_list%-simplified_contigs.tsv}
        echo $MAG
	# Then we collate the count files together filtering to genes greater than 500bp:
	python /home/rprops/DESMAN/scripts/ExtractCountFreqGenes.py -g ./Annotate/${MAG}-simplified.genes ./ReadcountFilter/${MAG} --output_file ./Gene_calls_step8/${MAG}_cluster_esc3.freq 

	# The -g flag here tells the script to expect gene positions in a slightly different format to the cog file used above. 
	# Now we find variants again, this time insisting on a minimum frequency of 3% and not filtering on sample coverage:
	python /home/rprops/DESMAN/desman/Variant_Filter.py ./Gene_calls_step8/${MAG}_cluster_esc3.freq -o ./VariantsAll/${MAG}_ -m 0.0 -v 0.03
		
	# To assign contigs we also need individual gene coverages, for consistency we generate these from the aggregated count files:
	python /home/rprops/DESMAN/scripts/CalcGeneCov.py ./Gene_calls_step8/${MAG}_cluster_esc3.freq  > ./Gene_calls_step8/${MAG}_cluster_esc3_cov.csv
		
	# Get list of core COGs:
	cut -d"," -f5 ./Annotate/${MAG}-simplified_core.cogs > ./Annotate/${MAG}_core_genes.txt

	# Calculate coverage on core genes:
	python /home/rprops/DESMAN/scripts/CalcDelta.py ./Gene_calls_step8/${MAG}_cluster_esc3_cov.csv ./Annotate/${MAG}_core_genes.txt ${MAG}_core_cov.txt
		
done


#!/bin/bash
set -e

if [ ! -d "./DESMAN_gene_assignment" ];
then
mkdir ./DESMAN_gene_assignment;
fi

for best_run in `cat best_desman_runs.txt`
do
        MAG=${best_run%%_*}
        echo $MAG 
	# Then we run the gene/contig assignment algorithm.
	export SEL_RUN=./Variants/${MAG}/${best_run}
	# Run gene assignment
	python /home/rprops/DESMAN/desman/GeneAssign.py ./${MAG}_core_cov.txtmean_sd_df.csv $SEL_RUN/Gamma_star.csv ./Gene_calls_step8/${MAG}_cluster_esc3_cov.csv $SEL_RUN/Eta_star.csv -m 20 -v ./VariantsAll/${MAG}_sel_var.csv -o ./DESMAN_gene_assignment/${MAG}_ --assign_tau > ./DESMAN_gene_assignment/${MAG}.cout
done

This should generate the following output files.

*_log_file.txt: A log file

*eta_df.csv: The assignments from NMF unmanipulated useful for identifying multicopy genes.

*etaD_df.csv: As above but discretised NMF predictions.

*etaS_df.csv: Predictions from the Gibbs sampler selecting run with maximum log posterior.

*etaM_df.csv: Mean log posterior predictions from Gibbs sampler.

Clone this wiki locally