-
Notifications
You must be signed in to change notification settings - Fork 4
Usage
Mycotools is a comparative genomics software suite centered around a curated database (MycotoolsDB/MTDB) of genomic data. This guide references Mycotools scripts that enable high throughput pipelining and routine genomic analysis. Mycotools is HEAVILY dependent on the MycotoolsDB, so there is no guarantee that these scripts will work with external files.
Why the dependency on MycotoolsDB? Because bioinformatics file formats need systematic, uniform curation. See here for more

Following installation, determine if you are a user seeking to use an existing MycotoolsDB or if you seek to administer a new MycotoolsDB. If you are a user, you can link with a preexisting central MycotoolsDB interfacing with it. If you are an administrator, proceed to initialization.
The following scripts interface with and manipulate MycotoolsDB (MTDB) .mtdb
files. To learn more about MycotoolsDB and the .mtdb format standard, refer to
this guide.
mtdb update is for initializing and building the primary MTDB; to interface with
an established MTDB, see
interfacing
To initialize a curated database of all NCBI and MycoCosm (JGI) fungal genomes; note,
JGI downloading is limited to one file per minute, so initialization will take multiple days
and can be resumed with -r YYYYmmdd:
mtdb update -i <INIT DIRECTORY>To initialize a curated database of NCBI genomes of other kingdoms/superkingdoms; please note, there are 100,000s bacterial genomes to download, plant and animal genomes of orders of magnitude larger, and this may take several days-weeks):
mtdb update -i <INIT_DIRECTORY> --kingdom <KINGDOM>If you are using a job submission protocol, such as SLURM, you can encrypt you NCBI and JGI login information and pass the encryption password when submitting a job as noted below.
To initialize a curated database of specific lineages of Fungi append
--lineage and --rank to your intialization command. Make sure to include
the appropriate --kingdom argument if not Fungi.
mtdb update -i <INIT_DIRECTORY> --lineage "Pezizomycotina,Agaricomycotina" \
--rank "subphylum,subphylum"A MycotoolsDB can be initialized referencing an external .mtdb file, e.g. to reproduce an analysis using another dataset. Please note this will only assimilate NCBI and JGI genomes from the reference .mtdb.
If you are currently linked to an existing MycotoolsDB, unlink via:
mtdb -uTo initialize a MycotoolsDB from a reference, appending necessary arguments:
mtdb update -i <INIT_DIR> -r <REF.mtdb>If successful, a new MycotoolsDB will be initialized in <INIT_DIR>; to link back to any previously established MycotoolsDBs see how to interface.
A MycotoolsDB can be initialized referencing a predb.tsv file by filling it
out with the metadata of your local genomes. Mycotools requires an assembly
and annotation for all genomes in the database.
mtdb predb2mtdb > predb.tsv # generate a blank spreadsheetOnce filled, you can initialize your primary MTDB referencing your metadata spreadsheet:
mtdb udpate -i <INIT_DIR> --predb predb.tsvTo update MycotoolsDB:
mtdb update -uTo update the taxonomy of the primary MTDB:
mtdb update -tmtdb is the MycotoolsDB central utility. It initializes interfacing with an established primary
database or just prints the path of the primary database. MycotoolsDBs are labelled YYYYmmdd.mtdb.
mtdb
/home/xonq/mtdb/mtdb/20210125.mtdbTo add interfacing with a primary MTDB:
mtdb -i <DATABASE_BASE_DIRECTORY>To encrypt a local copy of your NCBI and JGI passwords for fast access, run:
mtdb manage -pThis is necessary to submit jobs for scripts that require passwords. Such scripts
(like mtdb update) will need to receive a password from stdin e.g.:
read -s PASSWORD
<type password>
printf '{"username":"myname","password":"%s"}' $PASSWORD | mtdb update -umtdb can query ome codes. To obtain the whole row for an ome code, simply:
mtdb <OME>To obtain the specific file path for an ome file, append .fna (assembly),
.faa (proteome), or .gff3 (gene coordinates) to the ome:
mtdb <OME>.<EXTENSION>Because you can obtain PATHs using mtdb, you can use basic bash functionality to work with the output,
e.g. to open in a text editor or to grep the file:
vim $(mtdb)
grep 'Psilocybe' $(mtdb)NOTE: grep may yield non-specific taxonomy results, e.g. grep Athelia $(mtdb) will not only yield the genus Athelia, but also all members of the
family Atheliaceae. See
mtdb extract
to extract based on taxonomy.
mtdb extract subsets lineages of interest from the primary MTDB. By default, this script will ignore use-restricted data - add -n to include use-restricted data if you are aware of and will respect the limitations of use-restricted data. Run mtdb extract -h to see all options.
e.g. grab a database of a taxonomic order:
mtdb extract -l Atheliales > atheliales.mtdbgrab all NCBI Aspergilli accessions:
mtdb extract -s ncbi -l aspergillus > aspergillus.mtdb_ncbigrab a list of orders from a file:
mtdb extract -ll <TAX_FILE> > taxa.mtdbInputs a .mtdb file (by default uses the primary database), then symlinks the selected file types, hard copies the files, or prints their PATHs. A symlink is simply creating a placeholder file that links to the database file... this way it does not take up additional storage space like a hard copy does. However, editing symlinks will edit the original file, so only hard copy --hard if you need to edit the files.
Let's say you want protein data from organisms in one family. First, you should extract a database of organisms you want:
mtdb extract -l Atheliaceae > atheliaceae.mtdbThen, run db2files to copy the protein fastas into the current directory (call -h to see all options):
db2files -d atheliaceae.mtdb -p Alternatively, if you just need the paths (links) to these files, simply run:
db2files -d atheliaceae.mtdb -p --printTo add in-house genomes and annotations (required) mtdb predb2mtdb will input your genome and
metadata, curate, and prepare a database file to add to the database. The
administrator will then take your database and add it to the primary MTDB.
First, generate a predb spreadsheet:
mtdb predb2mtdb > predb.tsvThe resulting predb.tsv can be filled in via spreadsheet software and
exported as a tab delimited .tsv. Alternatively, use a plain text editor and
separate by tabs. De novo annotations produced by Funannotate/Orthofiller must
be filled in as "new" for the genomeSource column; *annotations directly
derived from NCBI/JGI data need to be specified in genomeSource. Updates to
existing database entries may not be automatically detected using the metadata,
so please explicitly enter the current MTDB ome code you are updating in the
previous_ome column.
Next, generate a MycotoolsDB file from your completed predb, and notify your database administrator that it is ready for integration:
mtdb predb2mtdb <PREDB.TSV>Finally, inform your database administrator that the database is ready for submission to the primary MTDB. Administrators will execute:
mtdb update -a <PREDB2DB.mtdb>Substitutes MycotoolsDB organism code names (e.g. fusgra1) for taxonomic information (e.g. Fusarium_graminearum_var._XYZ).
e.g. to substitute ome for genus species and strain: ome2name <INPUT> oa
ome2name -h
Usage: ome2name <INPUTFILE> | ome2name <INPUTFILE> [MYCOTOOLSDB] asvg*&
DEFAULTS: primary db, see script for default forbidden characters
Input file to regex sub omes with their name.
optional MycotoolsDB, string of forbidden characters
"o" no ome | "g" no genus | "s" no species | "v" no strain | "a" no alternative omeassemblyStats <ASSEMBLY.fa>
annotationStats <ANNOTATION.gff3>To obtain a table of annotation statistics, create a mycotoolsDB file with the organisms of interest and run:
assemblyStats <MYCOTOOLSDB.mtdb>
annotationStats <MYCOTOOLSDB.mtdb>If you want to route the output to a file, simply redirect output by appending > <OUTPUTFILE> to the command, or add an output file as the second argument
These scripts input a MycotoolsDB or can be manually made as shown at the
bottom of this section. Upon completion, they will output predb.tsv files
that can be used to as a standard local genome metadata spreadsheet for
adding local genomes to the primary MTDB.
Say you want to grab transcript information from a genus, Aspergillus. First, extract entries in the database that are within Aspergillus:
mtdb extract -l aspergillus > aspergillus.mtdb_ncbiIf there are organisms you do not want in the extracted .mtdbs, delete their line(s) in the file. Next call jgiDwnld -h or ncbiDwnld -h to find the flags necessary to download the files you want. To download transcript data (and EST data for JGI) in your current directory:
jgiDwnld -i aspergillus.mtdb_jgi -t -e
ncbiDwnld -i aspergillus.mtdb_ncbi -tTo unzip all the files, run gunzip <FILETYPE>/*.gz.
To submit as a job (not recommended), you must create an encrypted MycotoolsDB passkey using mtdb manage and pass the password to stdin to these scripts.
These scripts can input assembly accessions (NCBI) or genome codes (JGI)
jgiGenomeCodes.txt
Abobi1
Absrep1
Acain1
ncbiBioSamples.txt
SAMN15352002
SAMN06205226
SAMN02744098
You can download NCBI SRA's after downloading NCBI's SRA tools and making sure fastq-dump is in your PATH:
ncbiDwnld --sra -i <REFERENCE>You can create a file with SRA ID's or BioProject, etc. Basically any query that is unique and sufficient to acquire the SRRs of interest. For paired-end reads, include -pe.
All Mycotools accessions - assembly or protein - are <ome>_<acc> where "ome"
is the MTDB genome codename and "acc" is the accession.
Let's say you want to query Panaeolus cyanescens' (ome: pancya1) PsiD and the NCBI accession is "PPQ80975.1"
Prepend the ome codename to "_" and the accession and grab it from the MTDB:
acc2gff -a pancya1_PPQ80975.1
acc2fa -a pancya1_PPQ80975.1
acc2gbk -a pancya1_PPQ80975.1To output to file, simply append > <FILENAME> to your command.
If you have a list of accessions, create an input file with the accessions separated by new lines then run:
acc2gff -i <INPUTFILE>
acc2fa -i <INPUTFILE>
acc2gbk -i <INPUTFILE>Alternatively, input accessions from stdin:
echo "pancya1_PPQ80975.1" | acc2gbk -a -Full genome GenBanks (GBKs) are not available by default to save space.
To retrieve a full genome GBK, simply append --full to the acc2gbk
command and change -a from an accession to an ome code.
acc2gbk -f -a athter2.2 > athter2.2.full.gbkGrab loci the same as above within a set number of genes or nucleotides (-n) plus or minus:
list proximal +/- 5Kb from accession to standard out; NOTE -n for nucleotide:
acc2locus -a fibpsy1_906341 -p 5000 -n
fibpsy1_809145
fibpsy1_923701
fibpsy1_771516
fibpsy1_906341
fibpsy1_719531
fibpsy1_846242
fibpsy1_138list proximal +/- 5 genes to standard out; NOTE no -n:
acc2locus -a fibpsy1_906341 -p 5
fibpsy1_880711
fibpsy1_846234
fibpsy1_809145
fibpsy1_923701
fibpsy1_771516
fibpsy1_906341
fibpsy1_719531
fibpsy1_846242
fibpsy1_138
fibpsy1_942299
fibpsy1_906343grab genes between two accessions; NOTE -b for between:
acc2locus -a "psicub1_30114 psicub1_87205" -b
psicub1_30114
psicub1_72370
psicub1_30121
psicub1_30049
psicub1_72373
psicub1_87205Generate a genbank from the locus (useful for clinker):
acc2locus -a <OME>_<ACC> -p 5 | acc2gbk -a -Output GFFs and protein fastas for the locus:
acc2locus -a <OME>_<ACC> -p 5 --omecoords2fa -h
Input nucleotide fasta/tsv input, extract coordinates
coords2fa <FA> <SEQID> <START_COORD> <END_COORD> <STRAND_SENSE>
Extract full sequence from sense strand: coords2fa test.fna scaffold_20 0 -1
Extract coordinates from antisense strand: coords2fa test.fna scaffold_20 69 420 -
Bulk extraction tab delimitted row format:
#fasta_path sequence_id start_coordinate end_coordinate strand_sense
coords2fa coords.tsvgff2seq will extract the nucleotide or amino acid sequences associated with a gene coordinates gff file. Accepts optional flanking nucleotide plus/minus to extract from flanks of coordinates. gff2seq will extract flanks independently for each sequence ID (column 1) within the gff. Coding or noncoding regions can be specified. If --intergenic is called then only the first and last gene for each fasta
sequence are considered.
e.g. extract nucleotide sequences and 1 kilobase flanks and noncoding regions within the following genes:
gff2seq -g <.GFF3> -a <.FNA> -nc -pm 1000 -nThis script will add new and corrected gene models to a full genome gff. Overlapping coordinates can be removed and an update file for the primary MTDB is optionally generated.
To add corrected genes from an exonerate-derived gff to a MycotoolsDB gff and prepare a database update:
add2gff -i <EXONERATE_GFF> -a $(mtdb <OME>.gff3) -u
NOTE: database administrators will need to finalize the updates to propagate the data to the primary database via:
mtdb update -a <ADD2GFF.mtdb>This script will input a .gff3, or new line-delimited list of .gff3 paths, and for each contig output an SVG of the locus annotated by function retrieved from the 'product=' field.
e.g. make an SVG from a GFF3: gff2svg -g <MY.gff3>
make SVGs for all GFF3s in a new line delimited list with width set to 20:
gff2svg -i <LISTOFGFF3.nsv> -o <OUTPUT_DIR> -w 20db2search will execute hmmer, blast, diamond, or mmseqs, query an input fasta, and output a results fasta for each accession in the query.
db2hgs can circumscribe homology groups using mmseqs cluster to cluster
protein sequences via amino acid similarity. This algorithm will output a crude
pangenome circumscription, homology groups that contain gene families in
single-copy abundance across the dataset, and optionally near single-copy
homology groups that meet criteria for median genes/HG in each genome (-e)
and standard deviation (-s). This script is useful for large-scale homology
group circumscription, and a crude single-copy ortholog
inference module.
fa2tree will input a fasta file or directory of fasta files, trim, and
generate trees.
Information on a complete phylogenetic pipeline is elaborated below.
e.g. Swiftly construct a fastree from a fasta
fa2tree -i <FASTA>.fa --fastPrepare a robust IQ-TREE for slurm job submission
fa2tree -i <FASTA> -s -A PAS1046Construct a multigene phylogeny with independent evolutionary models for each gene
fa2tree -i <FASTA_DIR> -pThe final IQ-TREE file you want is *.contree; fastree is a .treefile
View your trees using FigTree.
A microsynteny tree recapitulates the divergence in gene order between species.
Mycotools implements a maximum-likelihood microsynteny tree reconstruction
algorithm from a binary trait alignment of the presence-absence of combinations
of gene families in focal neighborhoods. db2microsyntree attempts to
determine these focal neighborhoods around gene families that exist in
single-copy across the dataset, though it is ideal to manually input a set of
focal genes via -f. The neighborhoods used in the presence-absence alignment
are derived from pairs of gene families that include a focal gene family within
an 11 gene sliding window, which can be adjusted with -w. The tree is then
built from this binary alignment via IQ-TREE2 using a general time-rate reversible model with ascertainment bias
correction.
db2microsyntree -d <MTDB> -f <FILE_OF_SCOs> -t <PHYLOGENOMIC_NEWICK_CONSTRAINT>
Some gene families (e.g. P450s) have many highly identical homologs, which
is problematic for conducting phylogenetic analysis and manipulating these large datasets.
fa2clus invokes sequence clustering algorithms to systematically truncate your dataset
without constructing a phylogeny. fa2clus optionally implements an automated iterative
approach to obtaining a cluster of minimum - maximum size with the gene of interest.
For hierarchical agglomerative clustering: fa2clus will either take a fasta and generate a distance matrix using
usearch calc_distmx by default or the % identity of diamond alignments.
Then, cluster sequences via hierarchical agglomerative clustering and output a
.clus file of cluster assignments and .newick dendrogram.
Please note that sequence similarity clustering is using a heuristic, sequence similarity, as a proxy of relatedness. Therefore, sequence similarity clustering may remove closely related sequences in cases where the inputted number of genes is much greater than the maximum cluster constraint and/or if there are not highly similar sequences around the gene of interest
To run mmseqs cluster on a faster with minimum 20% query coverage and minimum 30% AA identity
fa2clus -f <FASTA>.fa -m 0.2 -x 0.3Iteratively cluster until a cluster size of 50-200 genes is achieved:
fa2clus -f <FASTA> -m 0.2 -x 0.3 --iterative <FOCAL_GENE> --minseq 50 --maxseq 200
CRAP, adopted and expanded from Slot & Rokas implementation, reconstructs and visualizes gene cluster phylogenies to study gene cluster evolution on a gene-by-gene basis. CRAP will:
-
input a cluster query and use a search algorithm (BLAST/mmseqs/Diamond) or orthogroup-based approach to find homologs in the MycotoolsDB
-
implement sequence similarity clustering to truncate the sequence set and detect outgroups. NOTE: sequence similarity clustering can sometimes remove close homologs if there are too many sequences analyzed relative to the set maximum sequences.
-
construct phylogenies of each query sequence
-
map locus synteny diagrams onto the tips of the phylogenies.
crap can operate on a query of MycotoolsDB accessions or a standalone
multifasta input of external accessions. Following homolog acquisition,
crap will submit each set of hits for tree building or sequence similarity
clustering if the number of sequences exceeds the inputted maximum
(-m).
By default, crap will construct trees using fasttree.
Alternatively, CRAP can construct a robust IQ-TREE with 1000 boostrap iteration support values
by specifying -i/--iqtree.
To search an extracted sub-MycotoolsDB using blastp and create phylogenies with fasttree:
mtdb extract --lineage Basidiomycota > basi.mtdb
crap -q <QUERYGENES> -d basi.mtdb -s blastp --bitscore 40 --cpu 12Mycotools is designed to enable pipelining in Linux shells as well as python.
For scripts such as acc2fa/acc2gbk/acc2gff/acc2locus, the input is an accession or set of
accessions, and can be piped in via standard input. All scripts that accept
standard input (stdin) will require "-" as the input argument.
Stdin access allows you can to chain commands and swiftly generate the necessary
input for downstream analysis.
Say you want to see if an accession of interest is part of a gene cluster. Run a CRAP around accessions within 20 kb +/- your gene of interest:
acc2locus -a athter2_3 -n -p 20000 | crap -q - <ARGS>
Create a file of query genes from the CRAP output, extract locus accessions for each gene, and output a genbank for each locus:
for acc in $(cat <INPUT_FILE>)
do
acc2locus -a $acc -p 20000 -n | acc2gbk -a - > $acc.locus.gbk
doneDespite the benefits of increased sampling, there are two prominent problems large samples create in phylogenetic analysis: 1) large alignments lose resolution because they must consider many sequences and 2) analysis time increases with sample size. For a small set of genes, such as ITS, one can usually assume that the gene family is conserved, so the dataset can be cut down to closely related organisms. For most other genes, it is not valid to assume the gene family is conserved because horizontal transfer can lead to unexpected distributions.
It is thus sometimes necessary to balance alignment resolution and computational tractability by systematically truncating the dataset into subset of gene homologs. This is accomplished in Mycotools by iteratively constructing phylogenies, identifying and extracting clades of homologs, and repeating until a manageable tree is obtained. On its own, this analysis requires elaborate integration of multiple independent softwares, but Mycotools takes care of most of this.
For example, acquire a set of homologs for a gene family of interest by BLASTing a representative query protein sequence.
- extract a database of published sequences, or use other arguments to extract other organisms of interest
mtdb extract > pub.mtdb- obtain gene homologs using
db2searchwith an e-value threshold of 10-2:
db2search -d pub.mtdb -e 2 -q <QUERY>- If there are more than 1000 homologs, truncate the resulting homologs around a gene of interest between 50 and 500 genes:
fa2clus -f <FASTA>.fa --min_seq 50 --max_seq 500 -i <FOCAL_GENE>- If there are less than 1000 homologs, construct a fastree,
-f, if there are many samples, or remove that parameter for a robust IQ-TREE.
fa2tree -i <FASTA>.fa - If the phylogeny can be improved, extract a highly supported node from 4 and reconstruct, or systematically truncate the dataset with 3 and reconstruct.
Identify single-copy, or near single-copy, gene families from a MycotoolsDB
(omit -d if using the full database):
db2hgs -d <MTDB_file>
Build a multi-gene partition phylogenomic tree from the inferred single-copy orthologs:
fa2tree -p -i db2hgs_<YYYmmdd>/single_copy_genes/ -c <CPUS>
Sometimes single-copy gene families do not exist because of copy number
variation, or absence in some genomes. This is particularly relevant for
datasets that include a broad taxonomic sample. Identify near single-copy gene families by specifying criteria for the maximum
median genes per homology group, standard deviation, and minimum percent of genomes the
homology group must be present in. A full list of parameters for this option is
available from the db2hgs -h help menu.
db2hgs -d <MTDB_file> <NEAR_SINGLE-COPY_PARAMETERS> --hmm
Use the HMMs of the near single-copy genes to identify the best hits for these genes in your inputted genomes
db2search -a hmmsearch -q db2hgs_<YYYYmmdd>/hmm/*hmm -d <MTDB_file> -m 1
Build a phylogenomic tree from the best hits output
fa2tree -i db2search_<YYYYmmdd>/fastas -p -c <CPUS>
Download the HMM that includes 14 near single-copy protein models found in Fungi and search for the best hit of these proteins in your dataset.
wget https://raw.githubusercontent.com/xonq/mycotools/master/test/scos.hmm
db2search -d <MTDB_file> -a hmmsearch -q scos.hmm -m 1
fa2tree -i db2search_<YYYYmmdd>/fastas -p -c <CPUS>
- add ncbiAcc2fa
- separate better/organize into wiki
- document clinker pipeline
- add images of pipeline output
- fa2hmmer2fa annotation
- update crap information
- eggnog to synteny diagram
- overview of mycotools scripts functions, e.g. resume on -o, output standards
- update phylogenetic pipeline with fa2clus renovations
- add partition analysis tutorial
- discuss output in crap
- add bioreform
- convert to wiki
- add caveats to outputs of fa2clus, crap, etc
