-
- 1.1. Software
- 1.2. Set up directory structure
-
- 2.1. Basecalling
- 2.2. Quality control
- 2.3. Merge the simplex and duplex reads
- 2.4. Demultiplexing
- 2.5. Create consensus sequences
- 2.6. Aggregate consensus sequences
-
- 3.1. Acquire raw data
- 3.2. Assemble genome
- 3.3. Evaluate genome
- 3.4. Annotate genome
-
- 6.1. Compare ITS lengths
- 6.1.1. Download datasets and run ITSx
- 6.1.2. Run Infernal
- 6.1.3. Visualize data
- 6.2. ITS intragenomic variation
- 6.3. Global distribution of Sorochytrium
- 6.1. Compare ITS lengths
Throughout this document, code in brackets should be replaced with your own commands.
The following software need to be installed and findable in $PATH.
Published software used in this pipeline
antismash7.1.0ASTRAL-III5.15.5AvP1.0.2BBTools38.96Bedtools2.30.0BLAST+2.12.0BUSCO5.3.0BUSCO_phylogenomics.pyDIAMOND2.0.12Duplex Tools0.3.1EDTA2.2.2Entrez DirectextractITS.pyFastTree2.1.10Flye2.9.2Funannotate1.8.14 and dependenciesGenomeScopeGuppy6.4.6Hapo-G1.3.6ITSx1.1.3IQ-TREE3.0.1Jellyfish2.3.0MiniBar0.21MinIONQC1.4.1MUSCLE3.8.31NCBI Datasets16.40.1NextPolish1.4.1NGSpeciesID0.1.3pycoQC2.5.2Python3.10.4QUAST5.2.0R4.2.0Rpackagesdata.table,futile.logger,ggplot2,ggtree,optparse,plyr,readr,reshape2,scales,tidyverse,viridis,yamlSingularity4.1.3Seqtk1.3SPAdes3.15.5sratoolkit3.1.1trimAl1.2rev59
The directory containing this README.md file will be the project directory. Change to the parent directory and run the code below to clone the GitHub directory sorochytrium.
git clone https://github.com/aldendirks/sorochytrium.git
export project_dir="$PWD/sorochytrium"
cd $project_dir
mkdir -p data figures its meta phylo results scripts seqs
This code is modified primarily from the Nanopore amplicon pipeline developed by Steve Russell with contributions from previous work done by Rabern Simmons, Katelyn McKindles, and Michelle Orozco-Quime.
Basecalling is the process of converting the raw data in FAST5 format (nanopore electrical signals) to FASTQ files containing nucleotide sequences and quality information. Sorochytrium was only one sample in a multiplexed run. The raw FAST5 files are not public but can be made available upon request.
NOTE: In FASTQ files, nucleotide quality scores are represented by unique symbols.
A number of basecalling software exist, but the current "official" one is Guppy. Guppy basecalling is the most computationally intensive part of this pipeline. Running Guppy with CPUs is possible but takes way more time and resources than running it with GPUs.
sbatch $project_dir/scripts/guppy.sh
This script first converts the FAST5 files to FASTQ format. Then, Duplex Tools is used to identify duplex reads. Around 20-40% of the reads are duplex reads, composed of multiple different sequences fused together. These are filtered and rerun with guppy_basecaller_duplex.
MinIONQC produces high-quality graphs from the Guppy summary files.
Rscript $project_dir/scripts/MinIONQC.R -i $project_dir/seqs/rdna/guppy/simplex_calls/sequencing_summary.txt -o $project_dir/seqs/rdna/QC_reports
Rscript $project_dir/scripts/MinIONQC.R -i $project_dir/seqs/rdna/guppy/duplex_calls/sequencing_summary.txt -o $project_dir/seqs/rdna/QC_reports
mkdir $project_dir/seqs/rdna/guppy/merged_calls
{ sed 's/ /\n/' $project_dir/seqs/rdna/guppy/simplex_calls/pairs/pair_ids_filtered.txt" | seqkit grep -v -f - "$project_dir/seqs/rdna/guppy/simplex_calls/pass/basecall.fastq" ; cat $project_dir/seqs/rdna/guppy/duplex_calls/pass/*.fastq ; } > $project_dir/seqs/rdna/guppy/merged_calls/merged.fastq
cat $project_dir/seqs/rdna/guppy/merged_calls/merged.fastq | wc -l | awk '{print $1/4}' > $project_dir/seqs/rdna/summary/merged_reads.txt
Use minibar.py to pull individual sample files from the merged data file with reference to a metadata file.
mkdir $project_dir/seqs/rdna/NGSpeciesID
cd $project_dir/seqs/rdna/NGSpeciesID
$project_dir/scripts/minibar.py -F $project_dir/data/metadata.txt $project_dir/seqs/rdna/guppy/merged_calls/merged.fastq
conda activate NGSpeciesID
cd $project_dir/seqs/rdna/NGSpeciesID
for file in *.fastq
do
bn=`basename $file .fastq`
NGSpeciesID --ont --consensus --sample_size 1000 --m 5000 --s 2000 --medaka --primer_file $project_dir/data/primers.txt --fastq $file --outfolder ${bn}
done
# Modify the command for files that fall outside the default range, e.g. Sorochytrium (large rDNA)
for file in sample_s1865.1*.fastq
do
bn=`basename $file .fastq`
NGSpeciesID --ont --consensus --sample_size 1000 --m 10000 --s 2000 --medaka --primer_file $project_dir/data/primers.txt --fastq $file --outfolder ${bn}
done
conda deactivate
> $project_dir/seqs/rdna/summary/consensus_seqs.fasta
cd $project_dir/seqs/rdna/NGSpeciesID
for file in *.fastq
do
bn=`basename $file .fastq`
for file in $(ls $bn/consensus_reference_*.fasta)
do
sed 's%^>\(.*\)%>'$bn'_\1%I' $file >> $project_dir/seqs/rdna/summary/consensus_seqs.fasta
done
done
cd $project_dir/seqs/rdna/summary
The Nanopore and Illumina raw reads are available in NCBI SRA under BioProject PRJNA1015119.
- Nanopore: run SRR26060970
- Illumina: run SRR26060971
mkdir -p $project_dir/seqs/genome/raw
cd $project_dir/seqs/genome/raw
for run in SRR26060970 SRR26060971
do
prefetch $run
fasterq-dump --gzip $run
done
Assemble the Nanopore reads.
sbatch $project_dir/scripts/flye.sh
Filter and trim the Illumina reads.
sbatch $project_dir/scripts/check_read_quality_and_trim.sh
Polish and phase the Nanopore assembly with the filtered and trimmed Illumina reads.
sbatch $project_dir/scripts/polish_and_phase.sh
Evaluate genome assembly with quast.
sbatch $project_dir/scripts/quast.sh
Run GenomeScope to estimate genome size and heterozygosity.
sbatch $project_dir/scripts/genomescope.sh
Create a kmer histogram for visualization in R.
sbatch $project_dir/scripts/calculate_kmer_count.sh
cd "$project_dir"/seqs/genome/kmers
Rscript $project_dir/scripts/kmers_histogram.R
Annotate the genome with Funannotate.
sbatch $project_dir/scripts/funannotate_predict.sh
sbatch $project_dir/scripts/antismash7.sh
sbatch $project_dir/scripts/funannotate_annotate.sh
Identify transposable elements with EDTA.
sbatch $project_dir/scripts/edta.sh
Prepare files for phylogenetic analysis.
bash $project_dir/scripts/phylo_prep.sh
Run IQ-TREE.
sbatch $project_dir/scripts/phylo_iqtree.sh
Download genomes from GenBank.
bash phylogenomics_prep.sh
Run BUSCO on genomes to identify conserved single-copy orthologs.
mkdir -p $project_dir/phylo/phylogenomics/busco
cd $project_dir/phylo/phylogenomics/busco
busco --download fungi_odb10
for fasta in $project_dir/seqs/phylogenomics/*.fasta
do
sbatch $project_dir/scripts/phylogenomics_busco.sh $fasta
done
The BUSCO phylogenomics pipeline is used for coalescence and to prepare the dataset for IQTREE concatenation analysis. You will need to define the path to the ASTRAL library files and ASTRAL jar file, e.g., "/home/adirks/apps/ASTRAL/Astral/lib/" and "/home/adirks/apps/ASTRAL/Astral/astral.5.15.5.jar".
# Consider running the following code to find the paths if available, although find can take a while
#find / -name "astral.5.15.5.jar"
export ASTRAL_LIB_PATH="/path/to/Astral/lib/"
export ASTRAL_JAR_PATH="/path/to/astral.5.15.5.jar"
sbatch $project_dir/scripts/phylogenomics_trees.sh
Rscript $project_dir/scripts/phylogenomics_visualize.R
Download the UNITE and ROD ITS datasets. ITSoneDB needs to be downloaded locally. ITSx identifies the SSU, 5.8S, and LSU bounds and reports the lengths of these different regions for the ITSoneDB, ROD, and UNITE datasets. The script below runs ITSx and parses the output files to extract length data.
sbatch $project_dir/scripts/itsx.sh
Process the longest sequences with Infernal to check the ITSx results. Make sure to designate the path to your locally downloaded Rfam database.
export RFAM="/path/to/Rfam.cm"
sbatch $project_dir/scripts/infernal.sh
Open the document $project_dir/scripts/its_lengths.R in RSudio and work through the code.
BBMap is used to map the trimmed Illumina reads to the assembled genome and determine coverage. Coverage is calculated for the genome, 10 single-copy orthologs (to identify the coverage signature of a single-copy gene), and the rDNA sequences.
sbatch $project_dir/scripts/calculate_coverage.sh
sbatch $project_dir/scripts/extract_rdna.sh
Download Parmelia saxatilis lichen Illumina reads (metagenome reads of sample from which Sorochytrium milnesiophthora was isolated).
lichen_dir="$project_dir/seqs/parmelia"
mkdir -p $lichen_dir/raw
cd $lichen_dir/raw
prefetch SRR34964407
fasterq-dump --gzip SRR34964407
Trim Illumina reads, BLAST against Sorochytrium genome, subset matching reads, and assemble into a Sorochytrium metagenome-derived assembly.
sbatch $project_dir/scripts/metagenomics_parmelia.sh
Pebblescout showed Sorochytrium milnesiophthora 5.8S to have a single hit to a rhizosphere metagenome belonging to Barbacenia macrantha from Brazil. Download the raw reads.
barb_dir="$project_dir/seqs/parmelia"
mkdir -p $barb_dir/raw
cd $barb_dir/raw
prefetch SRR34964407
fasterq-dump --gzip SRR34964407
Extract Sorochytrium from the rhizosphere metagenome.
sbatch $project_dir/scripts/metagenomics_barbacenia.sh
This code is formatted for University of Michigan Advanced Genomics Core (AGC).
Install code for the Nanopore amplicon pipeline.
module load Bioinformatics
module load python
pip install --user duplex_tools
NOTE: Modules are preinstalled software on the AGC computing cluster. Whenever possible, use a module because the software already works. Installing your own software locally is doable but is trickier the more dependencies involved. Search for the existence of a module with
module spider [name]. You need to load the python module to use pip, a python package management command. The flag--useris necessary so that the software is installed in your home directory rather than the default higher level directory, which requires admin permissions.
NOTE: The default version for a module may change over time, which can impact the pipeline. For example, if you install
duplex_toolsat one point in time with the most recent (default) python module set to version 3.10.4, later on the default version may change to a more recent one like 3.12.1.duplex_toolswould need to be reinstalled with this newer version of python or the corresponding version of 'python' would need to be specified in the module command (e.g.,module load python/3.10.4)
R packages need to be installed to use MinIONQC in the Nanopore amplicon pipeline. Launch R with R after loading the R module and in the R prompt install the following packages. More information on installing R packages on AGC can be found here.
module load R
R
install.packages(c("data.table","futile.logger","ggplot2","optparse","plyr","readr","reshape2","scales","viridis","yaml"))
quit()
Install NGSpeciesID with Conda (this takes a little while).
conda create -n NGSpeciesID python=3.6 pip
conda activate NGSpeciesID
conda install --yes -c conda-forge -c bioconda medaka==0.11.5 openblas==0.3.3 spoa racon minimap2
pip install --user NGSpeciesID
#NGSpeciesID --help