MycotoolsDB

xonq edited this page Jul 3, 2024 · 6 revisions

OVERVIEW

MycotoolsDBs (MTDBs) are locally assimilated, uniformly curated databases of local genomes, JGI MycoCosm, and NCBI fungal or prokaryotic genomic data. Each MTDB is represented by a tab-delimited .mtdb reference file, which serves as the scalable input to Mycotools scripts.

Herein are the objectives, standards, and expectations of MTDB and associated files.




OBJECTIVE

Enable broadscale comparative genomics via a systematically curated, automatically assembled/updated, scalable genomes database. MTDB primarily seeks to resolve several outstanding problems in comparative genomics:

  1. Uniformly curate genome data within and across multiple databases, e.g. resolve inconsistencies in gene coordinate (gff) files
  2. Promote ease of use for scalable, large-scale analyses, e.g. transitioning between datasets in a phylogenetic analysis
  3. Keep up with the accelerating deposition of public genome data via automatic updates
  4. Implement a modular comparative genomic analyses toolkit to enable automated pipelining and make routine comparative genomic analyses accessible




FORMATTING

.mtdb file format standard

Because MTDBs are essentially tab-delimited spreadsheets, the database can be scaled down by extracting rows of interest using bash or mtdb extract and feeding the resulting MTDB into Mycotools scripts. Master MTDB files are named YYYYmmdd.mtdb after the date the update began; the most recent .mtdb in the primary MTDB folder is used as the primary database.
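As an illustration of the naming convention, the following minimal Python sketch picks the most recent master file from a folder. The function name latest_master_mtdb and the folder argument are hypothetical and not part of Mycotools; the sketch only assumes that master files are named YYYYmmdd.mtdb as described above.

```python
import re
from pathlib import Path

# Master files are named YYYYmmdd.mtdb; anything else in the folder is skipped.
MASTER_NAME = re.compile(r"^(\d{8})\.mtdb$")


def latest_master_mtdb(folder):
    """Return the Path of the most recent YYYYmmdd.mtdb in folder, or None.

    Hypothetical helper for illustration; not the Mycotools implementation.
    """
    dated = []
    for path in Path(folder).iterdir():
        match = MASTER_NAME.match(path.name)
        if match:
            dated.append((match.group(1), path))
    if not dated:
        return None
    # YYYYmmdd strings sort chronologically when compared as text.
    return max(dated, key=lambda item: item[0])[1]
```

Comparing the date strings as text avoids parsing them, which is one reason a zero-padded YYYYmmdd prefix is convenient for file names.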

Example row:

#ome    genus   species strain  taxonomy        version source  biosample   assembly_acc    acquisition_date        published       fna     faa     gff3
aaoarx1 Aaosphaeria     arxii   CBS17579        {"clade": "dothideomyceta", "kingdom": "Fungi", "phylum": "Ascomycota", "subphylum": "Pezizomycotina", "class": "Dothideomycetes", "subclass": "Pleosporomycetidae", "order": "Pleosporales", "family": "", "subfamily": "", "genus": "Aaosphaeria", "species": "Aaosphaeria arxii"}    v1.0    jgi             Aaoar1  20210414 Haridas S et al.,2020

Tab-delimited file, with one row per genome and ordered columns:

  • ome: MTDB accession "ome" - first three letters of genus, first three letters of species (or "sp."), unique database number, and optional MTDB version tag '.\d+', e.g. psicub1/cryneo24.1 [a-zA-Z0-9\.]
  • genus: Genus name; [a-zA-Z]
  • species: Species name; [a-zA-Z]
  • strain: Strain name; [a-zA-Z0-9\-\.]
  • taxonomy: NCBI taxonomy JSON object derived from genus
  • version: MycoCosm version/NCBI modification date
  • source: Genome source, e.g. 'ncbi'/'jgi'/'lab'; [a-z0-9\.]
  • biosample: optional NCBI BioSample accession
  • assembly_acc: NCBI GenBank/RefSeq assembly accession or MycoCosm portal
  • acquisition_date: Date of input into primary database; YYYYmmdd
  • published: Publication metadata or binary publication response; 0/None/'' are treated as use-restricted - all other values are presumed to be open access by Mycotools scripts (see below)
  • fna: assembly .fna, required when not $MYCOFNA/fna/<ome>.fna; PATH
  • faa: proteome .faa, required when not $MYCOFAA/faa/<ome>.faa; PATH
  • gff3: gene coordinate .gff3, required when not $MYCOGFF3/gff3/<ome>.gff3; PATH

If headers are included, the line must begin with '#'; generally, lines beginning with '#' are ignored.
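The format above can be read with a few lines of Python. This is a minimal sketch rather than the Mycotools parser: the names MTDB_COLUMNS and parse_mtdb are hypothetical, the column order follows the header row in the example above, and trailing columns that are absent from a row (such as fna, faa, and gff3) are assumed to be empty.

```python
import json

# Column order taken from the example header row.
MTDB_COLUMNS = [
    "ome", "genus", "species", "strain", "taxonomy", "version", "source",
    "biosample", "assembly_acc", "acquisition_date", "published",
    "fna", "faa", "gff3",
]


def parse_mtdb(lines):
    """Parse .mtdb lines into a list of dicts, one per genome.

    Hypothetical helper for illustration; not the Mycotools implementation.
    """
    rows = []
    for line in lines:
        line = line.rstrip("\r\n")
        # Header and comment lines begin with '#' and are ignored.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Assumption: missing trailing columns are treated as empty strings.
        fields += [""] * (len(MTDB_COLUMNS) - len(fields))
        row = dict(zip(MTDB_COLUMNS, fields))
        # The taxonomy column holds a JSON object.
        row["taxonomy"] = json.loads(row["taxonomy"]) if row["taxonomy"] else {}
        rows.append(row)
    return rows
```

With rows in this form, scaling the database is a simple filter, for example [r for r in rows if r["taxonomy"].get("order") == "Pleosporales"].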

MTDB requires an assembly and gene coordinates gff3 for ALL GENOMES. Proteomes will be generated by referencing the assembly and gff3.



Accession formatting

All MTDB aliases will be formatted as <ome>_<acc>, where ome is the ome in the MTDB and acc is the retrieved accession; the same format is used for both assemblies and proteomes. An alias is tied directly back to its MTDB row by slicing up to the underscore. For JGI, MTDB accessions are pulled from the protein_id field in the gene coordinates file. For NCBI, MTDB accessions are pulled from the product_id field in the gene coordinates file. Entries without a detected protein ID are assigned an alias with the prefix 'mtdb'. Pseudogene, tRNA, and rRNA aliases are formatted as <ome>_<type><type_count>.
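A short Python sketch of how an alias maps back to its ome. The names split_alias and OME_PATTERN are hypothetical; the split is made at the first underscore on the assumption that omes never contain an underscore while accessions may. The pattern is only a rough reading of the ome description in this document (three genus letters, three species letters or "sp.", a number, and an optional version tag), not a Mycotools definition.

```python
import re

# Assumed shape of an ome, e.g. psicub1 or cryneo24.1 (illustrative only).
OME_PATTERN = re.compile(r"^[a-zA-Z]{3}(?:[a-zA-Z]{3}|sp\.)\d+(?:\.\d+)?$")


def split_alias(alias):
    """Split an MTDB alias '<ome>_<acc>' into (ome, acc).

    Splits at the first underscore, so accessions that themselves
    contain underscores are kept whole.
    """
    ome, sep, acc = alias.partition("_")
    if not sep or not ome or not acc:
        raise ValueError(f"not an <ome>_<acc> alias: {alias!r}")
    return ome, acc
```

For example, split_alias("psicub1_XP_123.1") yields ("psicub1", "XP_123.1"), and the ome can then be used to look up the corresponding MTDB row.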



GFF formatting

MTDB attempts to curate, assimilate, and modernize MycoCosm and NCBI legacy data. All entries will contain an MTDB alias in the attributes field, and all attribute fields will contain ID=[^;]+ and Alias=[^;]+. Non-gene entries will have a parent field, Parent=[^;]+, that relates the entry to its parent RNA and each RNA to its parent gene. Non-gene terminal entries (when the highest entry in a hierarchy is not a gene/pseudogene) will be assigned an Alias that corresponds to their type field.

When GFF entries are not given an Alias, assume that they are ignored by Mycotools; while curation is fairly robust for JGI and NCBI GFFs, other GFFs may have cryptic formatting discrepancies. CDSs without an alias will not be translated into the proteome .faa fasta.

Supported features

  • gene, pseudogene: contains the terminal ID of descendant entries and alias (Alias=.*;) that contains all MTDB aliases derived from the gene, separated by |. ID=gene_<ACC>
  • mRNA, tRNA, rRNA, RNA: RNA is synonymous with transcript and represents ambiguous RNA, which may be interpreted as mRNA in downstream software or ignored. ID=<RNA>_<ACC>
  • exon: parent will be an RNA ID; introns will be curated to exons. ID=exon_<ACC>_<EXON#>
  • CDS: CDS ID parent will be an RNA ID; typically contains a protein_id/product_id field. ID=CDS_<ACC>_<CDS#>
  • three_prime_UTR, five_prime_UTR

Attributes column formatting

MTDB recognizes several attribute fields, separated by a semi-colon and optionally contained within single/double quotes. MTDB permits non-recognized fields.

  • Alias=<ome>_<acc>: MTDB accession; REQUIRED
  • ID=[^;]+: entry ID; REQUIRED
  • Parent=[^;]+: the ID this row is descended from, i.e. gene->mRNA->CDS/exon
  • [protein_id=[^;]+|proteinId=[^;]+]: protein ID field
  • product=[^;]+: functional annotation
  • [transcriptId=[^;]+|transcript_id=[^;]+]: transcript ID field
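The attribute fields listed above can be split into a dictionary with a short Python sketch. The names parse_attributes and missing_required are hypothetical; the sketch assumes key=value fields, strips optional single or double quotes around values, and does not handle semicolons inside quoted values.

```python
def parse_attributes(column):
    """Parse a GFF3 attributes column into a dict of key -> value.

    Illustrative sketch only; unrecognized fields are kept, mirroring
    the statement that MTDB permits non-recognized fields.
    """
    attrs = {}
    for field in column.strip().split(";"):
        field = field.strip()
        if not field or "=" not in field:
            continue
        key, value = field.split("=", 1)
        # Values may optionally be wrapped in single or double quotes.
        attrs[key.strip()] = value.strip().strip("'\"")
    return attrs


def missing_required(attrs):
    """Return the REQUIRED fields (ID, Alias) that are absent or empty."""
    return [key for key in ("ID", "Alias") if not attrs.get(key)]
```

An entry for which missing_required returns a non-empty list would, per the text above, be expected to be ignored by Mycotools.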

Alternate splicing

Alternately spliced genes are accounted for in curation. Genes with alternately spliced descendants will have multiple aliases, separated by '|'. mRNAs and their children will all have unique aliases. When the group of CDS coordinates tied to one mRNA is exactly the same as that of another mRNA, the duplicate will be removed, because such duplicates are commonly annotation errors and the resulting protein sequence would not differ.
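The duplicate removal described above could look roughly like the following Python sketch. It is not the Mycotools curation code: the function names and the input layout (a dict from mRNA ID to a list of (seqid, start, end, strand) tuples for the mRNAs of a single gene) are assumptions made for illustration.

```python
def drop_duplicate_cds_groups(cds_by_mrna):
    """Keep the first mRNA for each distinct set of CDS coordinates.

    cds_by_mrna maps mRNA ID -> list of (seqid, start, end, strand)
    tuples; the mapping is assumed to cover the mRNAs of one gene.
    """
    seen = set()
    kept = {}
    for mrna_id, coords in cds_by_mrna.items():
        # Sort so that the order of CDS rows does not affect comparison.
        key = tuple(sorted(coords))
        if key in seen:
            continue
        seen.add(key)
        kept[mrna_id] = coords
    return kept


def gene_alias(mrna_aliases):
    """Join the aliases derived from a gene with '|', as in the Alias field."""
    return "|".join(mrna_aliases)
```

For instance, gene_alias(["psicub1_1", "psicub1_2"]) gives "psicub1_1|psicub1_2", the multi-alias form described for alternately spliced genes.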



Proteome faa

Proteomes will be generated on the fly when updating the database by referencing an MTDB-curated gff3 and assembly. Proteins are generated from CDS coordinates that can be tied to an mRNA with a gene parent.




DATA ACQUISITION

JGI

mtdb update will prioritize MycoCosm (JGI) genomes over NCBI by referencing the submitter field in NCBI assembly metadata. NCBI genomes that lack a strain will be excluded if there is a JGI genome of the same species. For JGI downloading, each unique Portal (genome accession) is retrieved from the MycoCosm primary table.

Use-restriction metadata is applied from the associated field in the MycoCosm primary table. By default, use-restricted JGI data is excluded and must be specifically requested via --nonpublished in mtdb update. IT IS THE USER'S RESPONSIBILITY TO VERIFY THE VALIDITY OF AUTOMATICALLY APPLIED USE-RESTRICTION LABELS. Please review JGI policy on use-restricted data.

NCBI

NCBI genomes are identified from the primary eukaryotes.txt/prokaryotes.txt tables, and each unique assembly accession that was not submitted by JGI is retrieved. Version checking operates on the Modify Date field.

All NCBI entries are assumed to be "published" for non-restricted use. "However, some submitters may claim patent, copyright, or other intellectual property rights in all or a portion of the data they have submitted." - NCBI. It is the user's responsibility to determine which, if any, genome data have use-restriction policies.

There may be edge cases of use-restricted NCBI data; however, NCBI cannot provide oversight for their particular restrictions, and thus MTDB cannot determine what is use-restricted. A git issue can be raised for isolated examples, which can then be incorporated into a manually curated exceptions file; for local handling, simply empty the published field for the associated row. MTDBs are user-assimilated databases, and Mycotools makes no guarantee that it comprehensively addresses use-restriction. It is ultimately the user's responsibility to ensure that any sensitive data treated as published is actually available for use.
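The convention that an empty published field marks a row as use-restricted can be sketched in Python as follows. The name is_use_restricted is hypothetical, and treating the literal strings '0' and 'None' as restricted is an assumption based on the description of the published column in this document; it is not taken from the Mycotools source.

```python
# Values described in this document as use-restricted: 0/None/''.
RESTRICTED_VALUES = {"", "0", "None"}


def is_use_restricted(published):
    """Return True if a published field value marks a use-restricted row.

    Illustrative sketch; any other value is presumed open access.
    """
    if published is None:
        return True
    return str(published).strip() in RESTRICTED_VALUES
```

Emptying the published field of a row therefore flips it to use-restricted under this reading, which matches the local handling suggested above.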

Local genomes

Locally annotated genomes can be added to the database by filling out and submitting a .predb file using mtdb predb2db. mtdb predb2db will curate the inputted data and output into the current directory. Once complete, mtdb update --add <PREDB_RESULT> will add the .mtdb generated to the primary database.

Taxonomy

All taxonomy metadata is acquired by querying the NCBI taxonomy with the genus name. Therefore, taxonomy is subject to errors in NCBI taxonomy.




ADMINISTRATION

The primary MTDB should be generated by a single user, and privileges should be distributed using chmod, e.g. chmod -R 755. Note that only the primary user/group has the privileges to update the primary database and merge manually curated predb files into it.

Users should refrain from editing database files directly, as unexpected errors may result in downstream scripts.









