MycotoolsDB

MycotoolsDBs (MTDBs) are locally assimilated, uniformly curated databases of local genomes,
JGI MycoCosm, and NCBI fungal or prokaryotic genomic data. MTDBs are represented as tab-delimited .mtdb reference files,
which serve as the scalable input to Mycotools scripts.
Herein are the objectives, standards, and expectations of MTDB and associated files.
Enable broadscale comparative genomics via a systematically curated, automatically assembled/updated, scalable genomes database. MTDB primarily seeks to resolve several outstanding problems in comparative genomics:
- Uniformly curate genome data within and across multiple databases, i.e. resolve the inconsistency of gene coordinates (gff) files
- Promote ease-of-use for scalable and large-scale analyses, e.g. transitioning between datasets in a phylogenetic analysis
- Keep up with the accelerating deposition of public genome data via automatic updates
- Implement a modular comparative genomic analyses toolkit to enable automated pipelining and make routine comparative genomic analyses accessible
Because MTDBs are essentially tab-delimited spreadsheets, the database can be
scaled by extracting rows of interest using bash or mtdb extract
and feeding the scaled MTDB into Mycotools
scripts. Master MTDB files are
labelled YYYYmmdd.mtdb based on the date the update began; the most recent
mtdb in the primary MTDB folder will be used as the primary database.
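As a rough illustration of scaling an MTDB by hand, the sketch below picks the most recent master file by its YYYYmmdd name and keeps only rows of a chosen genus. It is not how mtdb extract is implemented; the function names `latest_mtdb` and `extract_rows` are hypothetical, and the genus is assumed to sit in the second tab-delimited column, as in the example header.

```python
import re
from pathlib import Path

# Master MTDB files are named YYYYmmdd.mtdb
MTDB_NAME = re.compile(r"^(\d{8})\.mtdb$")


def latest_mtdb(filenames):
    """Return the filename with the most recent YYYYmmdd prefix, or None."""
    dated = []
    for name in filenames:
        match = MTDB_NAME.match(Path(name).name)
        if match:
            dated.append((match.group(1), name))
    # YYYYmmdd strings sort chronologically when compared as text
    return max(dated)[1] if dated else None


def extract_rows(lines, genera):
    """Keep '#' header lines and rows whose genus column is in `genera`."""
    kept = []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        # Assumption: genus is the second column (index 1)
        if len(fields) > 1 and fields[1] in genera:
            kept.append(line)
    return kept


if __name__ == "__main__":
    print(latest_mtdb(["20210414.mtdb", "20230101.mtdb", "notes.txt"]))
```

The filtered lines can be written to a new .mtdb file and passed to Mycotools scripts in place of the full database.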
Example row:
#ome genus species strain taxonomy version source biosample assembly_acc acquisition_date published fna faa gff3
aaoarx1 Aaosphaeria arxii CBS17579 {"clade": "dothideomyceta", "kingdom": "Fungi", "phylum": "Ascomycota", "subphylum": "Pezizomycotina", "class": "Dothideomycetes", "subclass": "Pleosporomycetidae", "order": "Pleosporales", "family": "", "subfamily": "", "genus": "Aaosphaeria", "species": "Aaosphaeria arxii"} v1.0 jgi Aaoar1 20210414 Haridas S et al.,2020
Tab-delimited file, with one row per genome and ordered columns:
- ome: MTDB accession "ome" - first three letters of genus, first three letters of species (or "sp."), unique database number, and optional MTDB version tag '.\d+', e.g. psicub1/cryneo24.1; [a-zA-Z0-9\.]
- genus: Genus name; [a-zA-Z]
- species: Species name; [a-zA-Z]
- strain: Strain name; [a-zA-Z0-9\-\.]
- taxonomy: NCBI taxonomy JSON object derived from genus
- version: MycoCosm version/NCBI modification date
- source: Genome source, e.g. 'ncbi'/'jgi'/'lab'; [a-z0-9\.]
- biosample: optional NCBI BioSample accession
- assembly_acc: NCBI GenBank/RefSeq assembly accession or MycoCosm portal
- acquisition_date: Date of input into primary database; YYYYmmdd
- published: Publication metadata or binary publication response; 0/None/'' are use-restricted - all others are presumed to be open access by Mycotools scripts (see below)
- fna: assembly .fna, required when not $MYCOFNA/fna/<ome>.fna; PATH
- faa: proteome .faa, required when not $MYCOFAA/faa/<ome>.faa; PATH
- gff3: gene coordinate .gff3, required when not $MYCOGFF3/gff3/<ome>.gff3; PATH
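The column layout can be made concrete with a small parser. This is a minimal sketch, not Mycotools code: the column order is taken from the example header row, `parse_row` and `OME_PATTERN` are hypothetical names, and the regular expression is only one reading of the ome description above.

```python
import json
import re

# Column order as given in the example header row
COLUMNS = [
    "ome", "genus", "species", "strain", "taxonomy", "version", "source",
    "biosample", "assembly_acc", "acquisition_date", "published",
    "fna", "faa", "gff3",
]

# Assumption: three genus letters, three species letters (or "sp."),
# a database number, and an optional ".<version>" tag
OME_PATTERN = re.compile(r"^[a-zA-Z]{3}(?:[a-zA-Z]{3}|sp\.)\d+(?:\.\d+)?$")


def parse_row(line):
    """Split one MTDB row into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    # Trailing optional columns (fna, faa, gff3) may be absent
    fields += [""] * (len(COLUMNS) - len(fields))
    entry = dict(zip(COLUMNS, fields))
    # The taxonomy column holds a JSON object
    entry["taxonomy"] = json.loads(entry["taxonomy"]) if entry["taxonomy"] else {}
    return entry


if __name__ == "__main__":
    row = "\t".join([
        "aaoarx1", "Aaosphaeria", "arxii", "CBS17579",
        '{"kingdom": "Fungi", "genus": "Aaosphaeria"}',
        "v1.0", "jgi", "", "Aaoar1", "20210414", "Haridas S et al.,2020",
    ])
    entry = parse_row(row)
    print(entry["ome"], entry["taxonomy"]["kingdom"], entry["assembly_acc"])
```

Empty strings stand in for optional columns that are not filled, such as biosample in the row above.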
If headers are included, the line must begin with '#'; generally, lines beginning with '#' are ignored.
MTDB requires an assembly and gene coordinates gff3 for ALL GENOMES.
Proteomes will be generated by referencing the assembly and gff3.
All MTDB aliases will be formatted as <ome>_<acc> where ome is ome in the
MTDB and acc is the retrieved accession for both assemblies and proteomes.
MTDB alias accessions are directly connected to the MTDB by slicing to the
underscore.
For JGI, MTDB accessions will pull from the protein_id field in the gene coordinates file.
For NCBI, MTDB accessions will pull from the product_id field in the gene coordinates file.
For entries without a detected protein ID, an alias will be assigned with the
prefix 'mtdb'. Pseudogene, tRNA, and rRNA aliases will be formatted as
<ome>_<type><type_count>.
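The alias conventions above can be sketched in a few lines. The helper names and the example aliases are hypothetical; the only rules taken from the text are the `<ome>_<acc>` and `<ome>_<type><type_count>` formats and that the ome is recovered by slicing up to the underscore. Splitting on the first underscore is assumed here because the ome character set does not include underscores, while accessions may.

```python
def ome_from_alias(alias):
    """Return the ome: everything before the first underscore."""
    return alias[:alias.index("_")]


def acc_from_alias(alias):
    """Return the accession: everything after the first underscore."""
    return alias[alias.index("_") + 1:]


def noncoding_alias(ome, entry_type, type_count):
    """Format an alias as <ome>_<type><type_count>."""
    return f"{ome}_{entry_type}{type_count}"


if __name__ == "__main__":
    # Hypothetical aliases for illustration only
    print(ome_from_alias("psicub1_12345"))
    print(acc_from_alias("cryneo24.1_XP_000001.1"))
    print(noncoding_alias("psicub1", "tRNA", 3))
```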
MTDB attempts to curate, assimilate, and modernize MycoCosm and NCBI
legacy data. All entries will contain an MTDB alias in the attributes field.
All attribute fields will contain ID=[^;]+, Alias=[^;]+;
Non-gene entries will have a parent field Parent=[^;]+ that relates the entry
to its parent RNA and each RNA to their parent gene. For non-gene terminal entries
(when the highest entry in a hierarchy is not a gene/pseudogene), these entries will
be assigned an Alias that corresponds to their type field.
On the occasion that GFF entries are not given an Alias, assume that these are
ignored by Mycotools; while curation is fairly robust for JGI and NCBI GFFs,
other GFFs may have cryptic formatting discrepancies. CDSs without an alias will
not be translated into the proteome faa fasta.
- gene, pseudogene: contains the terminal ID of descendant entries and alias (Alias=.*;) that contains all MTDB aliases derived from the gene, separated by |; ID=gene_<ACC>
- mRNA, tRNA, rRNA, RNA: RNA is synonymous with transcript and represents ambiguous RNA, which may be interpreted as mRNA in downstream software or ignored; ID=<RNA>_<ACC>
- exon: parent will be an RNA ID; introns will be curated to exons; ID=exon_<ACC>_<EXON#>
- CDS: CDS ID parent will be an RNA ID; typically contains a protein_id/product_id field; ID=CDS_<ACC>_<CDS#>
- three_prime_UTR, five_prime_UTR
MTDB recognizes several attribute fields, separated by a semi-colon and optionally contained within single/double quotes. MTDB permits non-recognized fields.
- Alias=<ome>_<acc>: MTDB accession; REQUIRED
- ID=[^;]+: entry ID; REQUIRED
- Parent=[^;]+: the ID this row is descended from, i.e. gene->mRNA->CDS/exon
- [protein_id=[^;]+|proteinId=[^;]+]: protein ID field
- product=[^;]+: functional annotation
- [transcriptId=[^;]+|transcript_id=[^;]+]: transcript ID field
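The attribute conventions listed above can be illustrated with a small parser. This is a sketch under stated assumptions rather than the Mycotools parser: `parse_attributes` and `missing_required` are hypothetical names, quotes are assumed to wrap individual values, and semicolons inside quoted values are not handled.

```python
REQUIRED_FIELDS = ("ID", "Alias")


def parse_attributes(field):
    """Parse a GFF3 attributes column into a dict of key -> value."""
    attrs = {}
    for item in field.strip().split(";"):
        item = item.strip()
        if not item or "=" not in item:
            continue
        key, value = item.split("=", 1)
        # Assumption: values may be wrapped in single or double quotes
        attrs[key.strip()] = value.strip().strip("'\"")
    return attrs


def missing_required(attrs):
    """List the required MTDB fields absent from a parsed attribute dict."""
    return [key for key in REQUIRED_FIELDS if key not in attrs]


if __name__ == "__main__":
    # Hypothetical CDS attributes for illustration only
    column = "ID=CDS_12345_1;Parent=mRNA_12345;Alias=psicub1_12345;proteinId=12345;"
    attrs = parse_attributes(column)
    print(attrs["Alias"], missing_required(attrs))
```

Non-recognized fields are kept in the dict as-is, which mirrors the statement that MTDB permits them.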
Alternatively spliced genes are accounted for in curation. Genes with alternatively spliced descendants will have multiple aliases, separated by '|'. mRNAs and their children will all have unique aliases. A group of CDS coordinates tied to one mRNA that exactly matches another group will be removed, because such duplicates are commonly annotation errors and/or yield an identical protein sequence.
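One way to picture the handling of alternative splicing is shown below. This is an interpretation, not the curation code itself: it assumes duplicates are compared between mRNAs of the same gene, that the first mRNA encountered is the one kept, and the names `dedupe_cds_groups` and `gene_alias` as well as the aliases are hypothetical.

```python
def dedupe_cds_groups(mrna_to_cds):
    """Drop mRNAs whose CDS coordinate group duplicates an earlier mRNA's."""
    seen = set()
    kept = {}
    for mrna_alias, coords in mrna_to_cds.items():
        # Sort so that the same coordinates in a different order still match
        key = tuple(sorted(coords))
        if key in seen:
            continue
        seen.add(key)
        kept[mrna_alias] = coords
    return kept


def gene_alias(mrna_aliases):
    """Join the aliases of a gene's transcripts with '|'."""
    return "|".join(mrna_aliases)


if __name__ == "__main__":
    transcripts = {
        "psicub1_1001": [(100, 250), (300, 420)],
        "psicub1_1002": [(100, 250), (300, 480)],
        "psicub1_1003": [(300, 420), (100, 250)],  # same CDS group as 1001
    }
    kept = dedupe_cds_groups(transcripts)
    print(gene_alias(kept))  # psicub1_1001|psicub1_1002
```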
Proteomes will be generated on the fly when updating the database by
referencing an MTDB-curated gff3 and assembly. Proteins are generated from
CDS coordinates that can be tied to an mRNA with a gene parent.
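To show what generating a protein from CDS coordinates involves, here is a simplified sketch. It assumes the standard genetic code, 1-based inclusive GFF coordinates, and a CDS that starts in frame (the phase column is ignored); `translate_cds` is a hypothetical name and this is not the Mycotools implementation.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def translate_cds(sequence, cds_coords, strand):
    """Splice CDS segments from a contig sequence and translate them.

    cds_coords holds 1-based inclusive (start, end) pairs as in GFF3.
    """
    spliced = "".join(sequence[start - 1:end] for start, end in sorted(cds_coords))
    if strand == "-":
        # Reverse complement the spliced minus-strand sequence
        spliced = spliced.translate(COMPLEMENT)[::-1]
    spliced = spliced.upper()
    codons = (spliced[i:i + 3] for i in range(0, len(spliced) - 2, 3))
    # Codons with ambiguous bases are reported as X
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


if __name__ == "__main__":
    # Two CDS segments separated by a two-base intron
    print(translate_cds("ATGGTTCCTAA", [(1, 4), (7, 11)], "+"))  # MA*
```

A trailing partial codon is dropped rather than translated.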
mtdb update will prioritize MycoCosm (JGI) genomes over NCBI by referencing
the submitter field in NCBI assembly metadata. NCBI genomes that lack a strain
will be excluded if there is a JGI genome of the same species. For JGI
downloading, each unique Portal (genome accession) is retrieved from the
MycoCosm primary table.
Use restriction metadata is applied from the associated field in the MycoCosm primary table.
By default, use-restricted JGI data is excluded and must be specifically
requested via --nonpublished in mtdb update.
IT IS USER RESPONSIBILITY TO VERIFY THE VALIDITY OF AUTOMATICALLY APPLIED USE-RESTRICTION LABELS.
Please review JGI policy on use-restricted data.
NCBI genomes will be retrieved by referencing the primary eukaryotes.txt/prokaryotes.txt;
each unique assembly accession that was not submitted by JGI is retrieved.
Version checking operates on the Modify Date field.
All NCBI entries are assumed to be "published" for non-restricted use. "However, some submitters may claim patent, copyright, or other intellectual property rights in all or a portion of the data they have submitted." - NCBI. It is the user's responsibility to determine which, if any, genome data have use-restriction policies.
There may be edge-case examples of use-restricted NCBI data; however, NCBI cannot provide oversight for their particular restrictions, and thus MTDB cannot determine what is use-restricted. A git issue can be raised for isolated examples, which can then be incorporated into a manually curated exceptions file; for local handling, simply empty the publication field for the associated row. MTDBs are user-assimilated databases, and Mycotools makes no guarantee that it comprehensively addresses use-restriction. It is ultimately the user's responsibility to ensure that any sensitive, published data is available for use.
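The published-field convention (0/None/'' are use-restricted, everything else is presumed open access) can be expressed as a simple check. The sketch below works on parsed rows represented as dicts; `is_use_restricted` and `open_access` are hypothetical names, not Mycotools functions, and the second ome in the example is invented for illustration.

```python
# Values of the published column that mark a genome as use-restricted
RESTRICTED_VALUES = {"", "0", "None"}


def is_use_restricted(published):
    """Return True when the published field marks the row as use-restricted."""
    return published.strip() in RESTRICTED_VALUES


def open_access(entries):
    """Keep only entries presumed to be open access."""
    return [e for e in entries if not is_use_restricted(e.get("published", ""))]


if __name__ == "__main__":
    entries = [
        {"ome": "aaoarx1", "published": "Haridas S et al.,2020"},
        {"ome": "exaspe1", "published": ""},  # emptied field: treated as restricted
    ]
    print([e["ome"] for e in open_access(entries)])  # ['aaoarx1']
```

Emptying the published field for a row, as suggested above for local handling, makes that row fail this check.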
Locally annotated genomes can be added to the database by filling out and
submitting a .predb file using mtdb predb2db. mtdb predb2db will curate the
inputted data and output into the current directory. Once complete,
mtdb update --add <PREDB_RESULT> will add the .mtdb generated to the
primary database.
All taxonomy metadata is acquired by querying the NCBI taxonomy with the genus name. Therefore, taxonomy is subject to errors in NCBI taxonomy.
The primary MTDB should be generated by one user, and privileges should be
distributed using chmod; e.g. chmod -R 755. Note that the primary user/group are the only ones with
privileges to update and merge manually curated predb files into the primary database.
Users should refrain from making edits to database files as unexpected errors may result with downstream scripts.