ZoliQua/Orthologs-Databases

Orthologs Databases

Archive of bioinformatics databases and processing pipelines used in cross-species ortholog analysis. This project integrates data from multiple orthology, protein interaction, and functional annotation databases to study conserved proteins across model organisms.

Author: Zoltan Dul, PhD
Period: 2012-2016
Language: PHP 5.x
License: GNU GPL v2

Organisms

| Code | Species | Taxon ID |
|------|---------|----------|
| AT | Arabidopsis thaliana | 3702 |
| CE | Caenorhabditis elegans | 6239 |
| DM | Drosophila melanogaster | 7227 |
| DR | Danio rerio | 7955 |
| HS | Homo sapiens | 9606 |
| SC | Saccharomyces cerevisiae | 559292 |
| SP | Schizosaccharomyces pombe | 4896 |

Databases

Ortholog Databases

| Directory | Database | Description | Species | Size |
|-----------|----------|-------------|---------|------|
| homologene/ | HomoloGene | NCBI orthologous gene groups with Entrez-to-UniProt mapping | 8 | 43 MB |
| inparanoid/ | InParanoid | Pairwise orthologs (v8) with confidence scores | 7 | 22 MB |
| orthomcl/ | OrthoMCL | MCL-based ortholog/paralog clustering (v5) | 8 | 436 MB |
| eggnog/ | eggNOG | Evolutionary orthologous groups (euNOG, NOGv4/v4.5) | 9 | 750 MB |
| cog/ | COG/KOG | Clusters of Orthologous Groups for eukaryotes | 5 | 73 MB |
| pombase/ | PomBase | S. pombe ortholog mappings + cell-size mutant data | 5 | 5.5 MB |
| isobase/ | IsoBase | Multi-species isolog clusters with GO annotations | 5 | 1.1 MB |

Protein-Protein Interaction Databases

| Directory | Database | Description | Species | Size |
|-----------|----------|-------------|---------|------|
| biogrid/ | BioGRID | Physical/genetic interactions (v3.4.125), Cytoscape networks | 7 | 413 MB |
| intact/ | IntAct | EBI protein interactions, PSI-MI format (2014-2015 snapshots) | 7 | 331 MB |
| dip/ | DIP | Experimentally validated interactions (Apr 2014) | multi | 25 MB |
| string/ | STRING | Interaction networks (v9.1) with detailed evidence scores | 5 | 1.1 GB |

Pathway & Functional Annotation Databases

| Directory | Database | Description | Species | Size |
|-----------|----------|-------------|---------|------|
| gene_ontology/ | Gene Ontology | GO-to-UniProt mapping pipelines (2014 simple + 2016 GOSlim) | 5-7 | 720 MB |
| kegg/ | KEGG | Metabolic/regulatory pathway-to-UniProt mappings | 5 | 20 MB |
| reactome/ | Reactome | Biological pathway annotations with Reactome links | 5 | 130 MB |
| signalink/ | SignaLink | Signal transduction pathway data (SQL dump, Oct 2012) | multi | 54 MB |
| complexes/ | GO Complexes | Protein complex membership from GO annotations | 5 | 28 MB |

ID Mapping & Reference Databases

| Directory | Database | Description | Species | Size |
|-----------|----------|-------------|---------|------|
| biomart/ | Ensembl BioMart | Gene-to-UniProt mappings + GO annotations (June 2014) | 5 | 33 MB |
| flybase/ | FlyBase | FBgn/FBtr/FBpp-to-UniProt 5-step mapping pipeline | 1 (DM) | 192 MB |
| ipi-database/ | IPI | International Protein Index human cross-references (discontinued 2011) | 1 (HS) | 16 MB |
| mitocheck/ | MitoCheck | Human mitochondrial protein annotations | 1 (HS) | 500 KB |
| jorgensen-list/ | Jorgensen/Moretto | S. cerevisiae cell-size gene lists with functional annotations | 1 (SC) | 1.3 MB |

Common Identifier

All databases use UniProtKB accession numbers as the common protein identifier. Species-specific IDs (Entrez, TAIR, FlyBase FBgn, SGD, PomBase, WormBase, Ensembl) are mapped to UniProt accessions by dedicated conversion pipelines in each database directory.
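The ID conversion step can be illustrated with a short sketch. It is written in Python for brevity and is not taken from the repository's PHP scripts; the two-column tab-separated layout, the function names `load_mapping` and `to_uniprot`, and the placeholder IDs in the usage note are all assumptions made for illustration. One source ID may map to several accessions, so the sketch keeps a list per ID and reports IDs that could not be mapped.

```python
import csv
import io


def load_mapping(handle, delimiter="\t"):
    """Read 'source_id <TAB> accession' rows into a dict of accession lists.

    The two-column layout is an assumption; real mapping files may differ.
    """
    mapping = {}
    for row in csv.reader(handle, delimiter=delimiter):
        # Skip short rows and comment/header rows starting with '#'.
        if len(row) < 2 or row[0].startswith("#"):
            continue
        source_id, accession = row[0].strip(), row[1].strip()
        if not source_id or not accession:
            continue
        accessions = mapping.setdefault(source_id, [])
        if accession not in accessions:
            accessions.append(accession)
    return mapping


def to_uniprot(source_ids, mapping):
    """Split source IDs into a mapped dict and a list of unmapped IDs."""
    mapped, unmapped = {}, []
    for source_id in source_ids:
        if source_id in mapping:
            mapped[source_id] = mapping[source_id]
        else:
            unmapped.append(source_id)
    return mapped, unmapped
```

For example, with placeholder data, `load_mapping(io.StringIO("geneA\tACC001\n"))` yields a dictionary that `to_uniprot(["geneA", "geneC"], ...)` splits into one mapped ID and one unmapped ID. Keeping the unmapped list makes it easy to see how much data is lost at the conversion step.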

Directory Structure

Each database directory typically contains:

database_name/
  ├── *.php              # Processing script(s)
  ├── INFO.txt           # Data source documentation
  ├── source/            # Input/intermediate data files
  ├── original/          # Raw downloaded data
  ├── mappings/          # ID conversion files
  └── output/            # Processed results

Shared Configuration

framework.php — Common PHP settings shared across scripts (error reporting, unlimited memory/execution time).

Processing Pipelines

Most databases follow a similar processing pattern:

  1. Download raw data from the source database
  2. Parse and filter for target species
  3. Map database-specific IDs to UniProt accessions
  4. Generate pairwise ortholog relationships or annotation mappings
  5. Output standardized CSV/TSV files
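Steps 2 to 4 can be sketched as follows for a group-based source such as an ortholog cluster file. This is an illustrative Python sketch under stated assumptions, not the repository's PHP implementation: the input shape (a dict of group ID to `(species_code, source_id)` members), the one-to-one `id_to_uniprot` dict, and the function name `pairwise_orthologs` are hypothetical. Only the seven species codes come from the Organisms table.

```python
from itertools import combinations

# Species codes from the Organisms table.
TARGET_SPECIES = {"AT", "CE", "DM", "DR", "HS", "SC", "SP"}


def pairwise_orthologs(groups, id_to_uniprot, evidence):
    """Expand ortholog groups into cross-species UniProt pairs.

    groups        : dict of group_id -> list of (species_code, source_id)
    id_to_uniprot : dict of source_id -> UniProt accession (assumed 1:1 here)
    evidence      : label written into the last output column
    """
    rows = []
    for group_id in sorted(groups):
        # Step 2 and 3: keep target species only, then map IDs to accessions.
        mapped = sorted({
            (species, id_to_uniprot[source_id])
            for species, source_id in groups[group_id]
            if species in TARGET_SPECIES and source_id in id_to_uniprot
        })
        # Step 4: every cross-species pair within the group is one row.
        for (sp1, acc1), (sp2, acc2) in combinations(mapped, 2):
            if sp1 != sp2:
                rows.append((sp1, acc1, sp2, acc2, evidence))
    return rows
```

Same-species pairs (paralogs within a group) are dropped in this sketch; a pipeline that also tracks paralogs would keep them and flag them separately.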

Output format is typically semicolon-delimited CSV:

Species1;UniProtID1;Species2;UniProtID2;DatabaseEvidence
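A minimal Python sketch for reading and writing this five-field layout is shown below. The field order follows the line above; the `OrthologPair` type, the function names, and the choice to let the evidence field absorb any extra semicolons are assumptions, since the README does not specify how the evidence column is encoded.

```python
from collections import namedtuple

OrthologPair = namedtuple(
    "OrthologPair", "species1 uniprot1 species2 uniprot2 evidence"
)


def parse_line(line):
    """Parse one semicolon-delimited output line into an OrthologPair."""
    # Split at most four times so the evidence field may contain ';'.
    fields = line.rstrip("\r\n").split(";", 4)
    if len(fields) != 5:
        raise ValueError("expected 5 fields, got %d: %r" % (len(fields), line))
    return OrthologPair(*fields)


def format_pair(pair):
    """Serialize an OrthologPair back to the semicolon-delimited form."""
    return ";".join(pair)
```

Because the output files vary between databases ("typically" semicolon-delimited), it is worth checking a few lines of each file before relying on a parser like this one.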

Git LFS

Large data files (>50 MB) are stored using Git Large File Storage. These include compressed database archives, large merged TSV/CSV outputs, and raw interaction network files. Run git lfs pull after cloning to download these files.
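Before running a pipeline it can help to confirm that the large files were actually downloaded. Until they are pulled, LFS-tracked files are small text pointers whose first line is `version https://git-lfs.github.com/spec/v1`. The Python sketch below checks for that prefix; the function names are hypothetical and the check is a heuristic, not part of the repository.

```python
import os

# First line of every Git LFS pointer file, per the Git LFS pointer format.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file still looks like an un-pulled LFS pointer."""
    with open(path, "rb") as handle:
        head = handle.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX


def find_unpulled(root):
    """List files under root that are still LFS pointers, skipping .git."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        if ".git" in dirnames:
            dirnames.remove(".git")  # do not descend into repository metadata
        for name in filenames:
            path = os.path.join(dirpath, name)
            if is_lfs_pointer(path):
                hits.append(path)
    return sorted(hits)
```

An empty result from `find_unpulled(".")` in the repository root suggests that `git lfs pull` has already fetched the data.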

Data Vintage

All data was collected between 2012 and 2016. Database versions and download dates are documented in each directory's INFO.txt file. Some external APIs referenced in the scripts (e.g., Gene Ontology SOLR, AmiGO MySQL) are no longer available, so those scripts cannot be rerun against their original sources.
