Skip to content
xonq edited this page Aug 19, 2024 · 17 revisions

Mycotools

OVERVIEW

Mycotools is a comparative genomics software suite centered around a curated database (MycotoolsDB/MTDB) of genomic data. This guide references Mycotools scripts that enable high throughput pipelining and routine genomic analysis. Mycotools is HEAVILY dependent on the MycotoolsDB, so there is no guarantee that these scripts will work with external files.

Why the dependency on MycotoolsDB? Because bioinformatics file formats need systematic, uniform curation. See here for more



Analysis_examples





GETTING STARTED

Following installation, determine if you are a user seeking to use an existing MycotoolsDB or if you seek to administer a new MycotoolsDB. If you are a user, you can link with a preexisting central MycotoolsDB interfacing with it. If you are an administrator, proceed to initialization.



MYCOTOOLSDB

The following scripts interface with and manipulate MycotoolsDB (MTDB) .mtdb files. To learn more about MycotoolsDB and the .mtdb format standard, refer to this guide.



Initialization

mtdb update is for initializing and building the primary MTDB; to interface with an established MTDB, see interfacing

De novo initialization

FUNGI

To initialize a curated database of all NCBI and MycoCosm (JGI) fungal genomes; note, JGI downloading is limited to one file per minute, so initialization will take multiple days and can be resumed with -r YYYYmmdd:

mtdb update -i <INIT DIRECTORY>

BACTERIA, PLANTS, ANIMALS, ARCHAEA

To initialize a curated database of NCBI genomes of other kingdoms/superkingdoms; please note, there are 100,000s bacterial genomes to download, plant and animal genomes of orders of magnitude larger, and this may take several days-weeks):

mtdb update -i <INIT_DIRECTORY> --kingdom <KINGDOM>

HPC Usage

If you are using a job submission protocol, such as SLURM, you can encrypt you NCBI and JGI login information and pass the encryption password when submitting a job as noted below.

SPECIFIC LINEAGES

To initialize a curated database of specific lineages of Fungi append --lineage and --rank to your intialization command. Make sure to include the appropriate --kingdom argument if not Fungi.

mtdb update -i <INIT_DIRECTORY> --lineage "Pezizomycotina,Agaricomycotina" \
--rank "subphylum,subphylum"

Reference initialization

A MycotoolsDB can be initialized referencing an external .mtdb file, e.g. to reproduce an analysis using another dataset. Please note this will only assimilate NCBI and JGI genomes from the reference .mtdb.

If you are currently linked to an existing MycotoolsDB, unlink via:

mtdb -u

To initialize a MycotoolsDB from a reference, appending necessary arguments:

mtdb update -i <INIT_DIR> -r <REF.mtdb>

If successful, a new MycotoolsDB will be initialized in <INIT_DIR>; to link back to any previously established MycotoolsDBs see how to interface.


Local initialization

A MycotoolsDB can be initialized referencing a predb.tsv file by filling it out with the metadata of your local genomes. Mycotools requires an assembly and annotation for all genomes in the database.

mtdb predb2mtdb > predb.tsv # generate a blank spreadsheet

Once filled, you can initialize your primary MTDB referencing your metadata spreadsheet:

mtdb udpate -i <INIT_DIR> --predb predb.tsv



Updating

To update MycotoolsDB:

mtdb update -u

To update the taxonomy of the primary MTDB:

mtdb update -t



Interfacing

mtdb is the MycotoolsDB central utility. It initializes interfacing with an established primary database or just prints the path of the primary database. MycotoolsDBs are labelled YYYYmmdd.mtdb.

mtdb
/home/xonq/mtdb/mtdb/20210125.mtdb

To add interfacing with a primary MTDB:

mtdb -i <DATABASE_BASE_DIRECTORY>



Managing

To encrypt a local copy of your NCBI and JGI passwords for fast access, run:

mtdb manage -p

This is necessary to submit jobs for scripts that require passwords. Such scripts (like mtdb update) will need to receive a password from stdin e.g.:

read -s PASSWORD
<type password>
printf '{"username":"myname","password":"%s"}' $PASSWORD | mtdb update -u



Querying

mtdb can query ome codes. To obtain the whole row for an ome code, simply:

mtdb <OME>

To obtain the specific file path for an ome file, append .fna (assembly), .faa (proteome), or .gff3 (gene coordinates) to the ome:

mtdb <OME>.<EXTENSION>

Because you can obtain PATHs using mtdb, you can use basic bash functionality to work with the output, e.g. to open in a text editor or to grep the file:

vim $(mtdb)
grep 'Psilocybe' $(mtdb)

NOTE: grep may yield non-specific taxonomy results, e.g. grep Athelia $(mtdb) will not only yield the genus Athelia, but also all members of the family Atheliaceae. See mtdb extract to extract based on taxonomy.



Creating modular databases

mtdb extract subsets lineages of interest from the primary MTDB. By default, this script will ignore use-restricted data - add -n to include use-restricted data if you are aware of and will respect the limitations of use-restricted data. Run mtdb extract -h to see all options.

e.g. grab a database of a taxonomic order:

mtdb extract -l Atheliales > atheliales.mtdb

grab all NCBI Aspergilli accessions:

mtdb extract -s ncbi -l aspergillus > aspergillus.mtdb_ncbi

grab a list of orders from a file:

mtdb extract -ll <TAX_FILE> > taxa.mtdb



Acquiring database files

Inputs a .mtdb file (by default uses the primary database), then symlinks the selected file types, hard copies the files, or prints their PATHs. A symlink is simply creating a placeholder file that links to the database file... this way it does not take up additional storage space like a hard copy does. However, editing symlinks will edit the original file, so only hard copy --hard if you need to edit the files.

Let's say you want protein data from organisms in one family. First, you should extract a database of organisms you want:

mtdb extract -l Atheliaceae > atheliaceae.mtdb

Then, run db2files to copy the protein fastas into the current directory (call -h to see all options):

db2files -d atheliaceae.mtdb -p 

Alternatively, if you just need the paths (links) to these files, simply run:

db2files -d atheliaceae.mtdb -p --print



Adding local genomes

To add in-house genomes and annotations (required) mtdb predb2mtdb will input your genome and metadata, curate, and prepare a database file to add to the database. The administrator will then take your database and add it to the primary MTDB.

First, generate a predb spreadsheet:

mtdb predb2mtdb > predb.tsv

The resulting predb.tsv can be filled in via spreadsheet software and exported as a tab delimited .tsv. Alternatively, use a plain text editor and separate by tabs. De novo annotations produced by Funannotate/Orthofiller must be filled in as "new" for the genomeSource column; *annotations directly derived from NCBI/JGI data need to be specified in genomeSource. Updates to existing database entries may not be automatically detected using the metadata, so please explicitly enter the current MTDB ome code you are updating in the previous_ome column.

Next, generate a MycotoolsDB file from your completed predb, and notify your database administrator that it is ready for integration:

mtdb predb2mtdb <PREDB.TSV>

Finally, inform your database administrator that the database is ready for submission to the primary MTDB. Administrators will execute:

mtdb update -a <PREDB2DB.mtdb>



Converting genome code names to strain names

Substitutes MycotoolsDB organism code names (e.g. fusgra1) for taxonomic information (e.g. Fusarium_graminearum_var._XYZ).

e.g. to substitute ome for genus species and strain: ome2name <INPUT> oa

ome2name -h
Usage: ome2name <INPUTFILE> | ome2name <INPUTFILE> [MYCOTOOLSDB] asvg*&
DEFAULTS: primary db, see script for default forbidden characters
Input file to regex sub omes with their name.
optional MycotoolsDB, string of forbidden characters
"o" no ome | "g" no genus | "s" no species | "v" no strain | "a" no alternative ome




SEQUENCE DATA TOOLS

Sequence data statistics

assemblyStats <ASSEMBLY.fa>
annotationStats <ANNOTATION.gff3>

To obtain a table of annotation statistics, create a mycotoolsDB file with the organisms of interest and run:

assemblyStats <MYCOTOOLSDB.mtdb>
annotationStats <MYCOTOOLSDB.mtdb>

If you want to route the output to a file, simply redirect output by appending > <OUTPUTFILE> to the command, or add an output file as the second argument



Downloading JGI/NCBI files

These scripts input a MycotoolsDB or can be manually made as shown at the bottom of this section. Upon completion, they will output predb.tsv files that can be used to as a standard local genome metadata spreadsheet for adding local genomes to the primary MTDB.

Say you want to grab transcript information from a genus, Aspergillus. First, extract entries in the database that are within Aspergillus:

mtdb extract -l aspergillus > aspergillus.mtdb_ncbi

If there are organisms you do not want in the extracted .mtdbs, delete their line(s) in the file. Next call jgiDwnld -h or ncbiDwnld -h to find the flags necessary to download the files you want. To download transcript data (and EST data for JGI) in your current directory:

jgiDwnld -i aspergillus.mtdb_jgi -t -e
ncbiDwnld -i aspergillus.mtdb_ncbi -t

To unzip all the files, run gunzip <FILETYPE>/*.gz. To submit as a job (not recommended), you must create an encrypted MycotoolsDB passkey using mtdb manage and pass the password to stdin to these scripts.


These scripts can input assembly accessions (NCBI) or genome codes (JGI)

jgiGenomeCodes.txt

Abobi1
Absrep1
Acain1

ncbiBioSamples.txt

SAMN15352002
SAMN06205226
SAMN02744098

You can download NCBI SRA's after downloading NCBI's SRA tools and making sure fastq-dump is in your PATH:

ncbiDwnld --sra -i <REFERENCE>

You can create a file with SRA ID's or BioProject, etc. Basically any query that is unique and sufficient to acquire the SRRs of interest. For paired-end reads, include -pe.



Acquiring accessions in fasta, gff, and gbk format

All Mycotools accessions - assembly or protein - are <ome>_<acc> where "ome" is the MTDB genome codename and "acc" is the accession. Let's say you want to query Panaeolus cyanescens' (ome: pancya1) PsiD and the NCBI accession is "PPQ80975.1"

Prepend the ome codename to "_" and the accession and grab it from the MTDB:

acc2gff -a pancya1_PPQ80975.1
acc2fa -a pancya1_PPQ80975.1
acc2gbk -a pancya1_PPQ80975.1

To output to file, simply append > <FILENAME> to your command.

If you have a list of accessions, create an input file with the accessions separated by new lines then run:

acc2gff -i <INPUTFILE>
acc2fa -i <INPUTFILE>
acc2gbk -i <INPUTFILE>

Alternatively, input accessions from stdin:

echo "pancya1_PPQ80975.1" | acc2gbk -a -

Full genome GenBanks

Full genome GenBanks (GBKs) are not available by default to save space. To retrieve a full genome GBK, simply append --full to the acc2gbk command and change -a from an accession to an ome code.

acc2gbk -f -a athter2.2 > athter2.2.full.gbk

Acquiring loci

Grab loci the same as above within a set number of genes or nucleotides (-n) plus or minus:

list proximal +/- 5Kb from accession to standard out; NOTE -n for nucleotide:

acc2locus -a fibpsy1_906341 -p 5000 -n
fibpsy1_809145
fibpsy1_923701
fibpsy1_771516
fibpsy1_906341
fibpsy1_719531
fibpsy1_846242
fibpsy1_138

list proximal +/- 5 genes to standard out; NOTE no -n:

acc2locus -a fibpsy1_906341 -p 5
fibpsy1_880711
fibpsy1_846234
fibpsy1_809145
fibpsy1_923701
fibpsy1_771516
fibpsy1_906341
fibpsy1_719531
fibpsy1_846242
fibpsy1_138
fibpsy1_942299
fibpsy1_906343

grab genes between two accessions; NOTE -b for between:

acc2locus -a "psicub1_30114 psicub1_87205" -b
psicub1_30114
psicub1_72370
psicub1_30121
psicub1_30049
psicub1_72373
psicub1_87205

Generate a genbank from the locus (useful for clinker):

acc2locus -a <OME>_<ACC> -p 5 | acc2gbk -a -

Output GFFs and protein fastas for the locus:

acc2locus -a <OME>_<ACC> -p 5 --ome



Extracting coordinates from fasta

coords2fa -h

Input nucleotide fasta/tsv input, extract coordinates
coords2fa <FA> <SEQID> <START_COORD> <END_COORD> <STRAND_SENSE>
Extract full sequence from sense strand: coords2fa test.fna scaffold_20 0 -1
Extract coordinates from antisense strand: coords2fa test.fna scaffold_20 69 420 -


Bulk extraction tab delimitted row format:
#fasta_path   sequence_id     start_coordinate        end_coordinate  strand_sense

coords2fa coords.tsv



Extracting sequences from GFF coordinates

gff2seq will extract the nucleotide or amino acid sequences associated with a gene coordinates gff file. Accepts optional flanking nucleotide plus/minus to extract from flanks of coordinates. gff2seq will extract flanks independently for each sequence ID (column 1) within the gff. Coding or noncoding regions can be specified. If --intergenic is called then only the first and last gene for each fasta sequence are considered.

e.g. extract nucleotide sequences and 1 kilobase flanks and noncoding regions within the following genes:

gff2seq -g <.GFF3> -a <.FNA> -nc -pm 1000 -n



Adding corrected/new gene models to database

This script will add new and corrected gene models to a full genome gff. Overlapping coordinates can be removed and an update file for the primary MTDB is optionally generated.

To add corrected genes from an exonerate-derived gff to a MycotoolsDB gff and prepare a database update:

add2gff -i <EXONERATE_GFF> -a $(mtdb <OME>.gff3) -u

NOTE: database administrators will need to finalize the updates to propagate the data to the primary database via:

mtdb update -a <ADD2GFF.mtdb>



Visualizing Loci

This script will input a .gff3, or new line-delimited list of .gff3 paths, and for each contig output an SVG of the locus annotated by function retrieved from the 'product=' field.


e.g. make an SVG from a GFF3: gff2svg -g <MY.gff3>

make SVGs for all GFF3s in a new line delimited list with width set to 20:

gff2svg -i <LISTOFGFF3.nsv> -o <OUTPUT_DIR> -w 20




EVOLUTIONARY ANALYSIS TOOLS

Homolog search MycotoolsDB

db2search will execute hmmer, blast, diamond, or mmseqs, query an input fasta, and output a results fasta for each accession in the query.



Homology group, single-copy ortholog, and pangenome circumscription

db2hgs can circumscribe homology groups using mmseqs cluster to cluster protein sequences via amino acid similarity. This algorithm will output a crude pangenome circumscription, homology groups that contain gene families in single-copy abundance across the dataset, and optionally near single-copy homology groups that meet criteria for median genes/HG in each genome (-e) and standard deviation (-s). This script is useful for large-scale homology group circumscription, and a crude single-copy ortholog inference module.



Phylogenetic tree building

fa2tree will input a fasta file or directory of fasta files, trim, and generate trees.

Information on a complete phylogenetic pipeline is elaborated below.


e.g. Swiftly construct a fastree from a fasta

fa2tree -i <FASTA>.fa --fast

Prepare a robust IQ-TREE for slurm job submission

fa2tree -i <FASTA> -s -A PAS1046

Construct a multigene phylogeny with independent evolutionary models for each gene

fa2tree -i <FASTA_DIR> -p

The final IQ-TREE file you want is *.contree; fastree is a .treefile View your trees using FigTree.



Microsynteny tree building

A microsynteny tree recapitulates the divergence in gene order between species. Mycotools implements a maximum-likelihood microsynteny tree reconstruction algorithm from a binary trait alignment of the presence-absence of combinations of gene families in focal neighborhoods. db2microsyntree attempts to determine these focal neighborhoods around gene families that exist in single-copy across the dataset, though it is ideal to manually input a set of focal genes via -f. The neighborhoods used in the presence-absence alignment are derived from pairs of gene families that include a focal gene family within an 11 gene sliding window, which can be adjusted with -w. The tree is then built from this binary alignment via IQ-TREE2 using a general time-rate reversible model with ascertainment bias correction.

db2microsyntree -d <MTDB> -f <FILE_OF_SCOs> -t <PHYLOGENOMIC_NEWICK_CONSTRAINT>



Sequence similarity clustering

Some gene families (e.g. P450s) have many highly identical homologs, which is problematic for conducting phylogenetic analysis and manipulating these large datasets. fa2clus invokes sequence clustering algorithms to systematically truncate your dataset without constructing a phylogeny. fa2clus optionally implements an automated iterative approach to obtaining a cluster of minimum - maximum size with the gene of interest.

For hierarchical agglomerative clustering: fa2clus will either take a fasta and generate a distance matrix using usearch calc_distmx by default or the % identity of diamond alignments. Then, cluster sequences via hierarchical agglomerative clustering and output a .clus file of cluster assignments and .newick dendrogram.

Please note that sequence similarity clustering is using a heuristic, sequence similarity, as a proxy of relatedness. Therefore, sequence similarity clustering may remove closely related sequences in cases where the inputted number of genes is much greater than the maximum cluster constraint and/or if there are not highly similar sequences around the gene of interest

To run mmseqs cluster on a faster with minimum 20% query coverage and minimum 30% AA identity

fa2clus -f <FASTA>.fa -m 0.2 -x 0.3

Iteratively cluster until a cluster size of 50-200 genes is achieved:

fa2clus -f <FASTA> -m 0.2 -x 0.3 --iterative <FOCAL_GENE> --minseq 50 --maxseq 200



Gene cluster phylogenetic and synteny analysis

Extracted clade of CRAP pipeline

CRAP, adopted and expanded from Slot & Rokas implementation, reconstructs and visualizes gene cluster phylogenies to study gene cluster evolution on a gene-by-gene basis. CRAP will:

  1. input a cluster query and use a search algorithm (BLAST/mmseqs/Diamond) or orthogroup-based approach to find homologs in the MycotoolsDB

  2. implement sequence similarity clustering to truncate the sequence set and detect outgroups. NOTE: sequence similarity clustering can sometimes remove close homologs if there are too many sequences analyzed relative to the set maximum sequences.

  3. construct phylogenies of each query sequence

  4. map locus synteny diagrams onto the tips of the phylogenies.

crap can operate on a query of MycotoolsDB accessions or a standalone multifasta input of external accessions. Following homolog acquisition, crap will submit each set of hits for tree building or sequence similarity clustering if the number of sequences exceeds the inputted maximum (-m).

By default, crap will construct trees using fasttree. Alternatively, CRAP can construct a robust IQ-TREE with 1000 boostrap iteration support values by specifying -i/--iqtree.


To search an extracted sub-MycotoolsDB using blastp and create phylogenies with fasttree:

mtdb extract --lineage Basidiomycota > basi.mtdb
crap -q <QUERYGENES> -d basi.mtdb -s blastp --bitscore 40 --cpu 12




MYCOTOOLS PIPELINING

Shell Pipelining with Mycotools

Mycotools is designed to enable pipelining in Linux shells as well as python.

For scripts such as acc2fa/acc2gbk/acc2gff/acc2locus, the input is an accession or set of accessions, and can be piped in via standard input. All scripts that accept standard input (stdin) will require "-" as the input argument. Stdin access allows you can to chain commands and swiftly generate the necessary input for downstream analysis.

Say you want to see if an accession of interest is part of a gene cluster. Run a CRAP around accessions within 20 kb +/- your gene of interest:

acc2locus -a athter2_3 -n -p 20000 | crap -q - <ARGS>

Create a file of query genes from the CRAP output, extract locus accessions for each gene, and output a genbank for each locus:

for acc in $(cat <INPUT_FILE>)
do 
  acc2locus -a $acc -p 20000 -n | acc2gbk -a - > $acc.locus.gbk 
done

Phylogenetic analysis

Despite the benefits of increased sampling, there are two prominent problems large samples create in phylogenetic analysis: 1) large alignments lose resolution because they must consider many sequences and 2) analysis time increases with sample size. For a small set of genes, such as ITS, one can usually assume that the gene family is conserved, so the dataset can be cut down to closely related organisms. For most other genes, it is not valid to assume the gene family is conserved because horizontal transfer can lead to unexpected distributions.

It is thus sometimes necessary to balance alignment resolution and computational tractability by systematically truncating the dataset into subset of gene homologs. This is accomplished in Mycotools by iteratively constructing phylogenies, identifying and extracting clades of homologs, and repeating until a manageable tree is obtained. On its own, this analysis requires elaborate integration of multiple independent softwares, but Mycotools takes care of most of this.


For example, acquire a set of homologs for a gene family of interest by BLASTing a representative query protein sequence.

  1. extract a database of published sequences, or use other arguments to extract other organisms of interest
mtdb extract > pub.mtdb

  1. obtain gene homologs using db2search with an e-value threshold of 10-2:
db2search -d pub.mtdb -e 2 -q <QUERY>
  1. If there are more than 1000 homologs, truncate the resulting homologs around a gene of interest between 50 and 500 genes:
fa2clus -f <FASTA>.fa --min_seq 50 --max_seq 500 -i <FOCAL_GENE>
  1. If there are less than 1000 homologs, construct a fastree, -f, if there are many samples, or remove that parameter for a robust IQ-TREE.
fa2tree -i <FASTA>.fa 
  1. If the phylogeny can be improved, extract a highly supported node from 4 and reconstruct, or systematically truncate the dataset with 3 and reconstruct.

Phylogenomic analysis

Phylogenomic tree from single-copy genes

Identify single-copy, or near single-copy, gene families from a MycotoolsDB (omit -d if using the full database):

db2hgs -d <MTDB_file>

Build a multi-gene partition phylogenomic tree from the inferred single-copy orthologs:

fa2tree -p -i db2hgs_<YYYmmdd>/single_copy_genes/ -c <CPUS>

Phylogenomic tree from near single-copy genes

Sometimes single-copy gene families do not exist because of copy number variation, or absence in some genomes. This is particularly relevant for datasets that include a broad taxonomic sample. Identify near single-copy gene families by specifying criteria for the maximum median genes per homology group, standard deviation, and minimum percent of genomes the homology group must be present in. A full list of parameters for this option is available from the db2hgs -h help menu.

db2hgs -d <MTDB_file> <NEAR_SINGLE-COPY_PARAMETERS> --hmm

Use the HMMs of the near single-copy genes to identify the best hits for these genes in your inputted genomes

db2search -a hmmsearch -q db2hgs_<YYYYmmdd>/hmm/*hmm -d <MTDB_file> -m 1

Build a phylogenomic tree from the best hits output

fa2tree -i db2search_<YYYYmmdd>/fastas -p -c <CPUS>

Phylogenomic tree from predetermined near single-copy genes

Download the HMM that includes 14 near single-copy protein models found in Fungi and search for the best hit of these proteins in your dataset.

wget https://raw.githubusercontent.com/xonq/mycotools/master/test/scos.hmm
db2search -d <MTDB_file> -a hmmsearch -q scos.hmm -m 1
fa2tree -i db2search_<YYYYmmdd>/fastas -p -c <CPUS>










TODO

  • add ncbiAcc2fa
  • separate better/organize into wiki
  • document clinker pipeline
  • add images of pipeline output
  • fa2hmmer2fa annotation
  • update crap information
  • eggnog to synteny diagram
  • overview of mycotools scripts functions, e.g. resume on -o, output standards
  • update phylogenetic pipeline with fa2clus renovations
  • add partition analysis tutorial
  • discuss output in crap
  • add bioreform
  • convert to wiki
  • add caveats to outputs of fa2clus, crap, etc

Clone this wiki locally