Archive of bioinformatics databases and processing pipelines used in cross-species ortholog analysis. This project integrates data from multiple orthology, protein interaction, and functional annotation databases to study conserved proteins across model organisms.
- Author: Zoltan Dul, PhD
- Period: 2012-2016
- Language: PHP 5.x
- License: GNU GPL v2

| Code | Species | Taxon ID |
|---|---|---|
| AT | Arabidopsis thaliana | 3702 |
| CE | Caenorhabditis elegans | 6239 |
| DM | Drosophila melanogaster | 7227 |
| DR | Danio rerio | 7955 |
| HS | Homo sapiens | 9606 |
| SC | Saccharomyces cerevisiae | 559292 |
| SP | Schizosaccharomyces pombe | 4896 |
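
For scripts that consume these codes, the table above can be held as a small lookup. The following is a minimal sketch in Python (chosen for illustration; the archive's own scripts are PHP). The names `SPECIES` and `taxon_id` are hypothetical and do not appear in the archive.

```python
# Two-letter species codes and NCBI taxon IDs, copied from the table above.
# SPECIES and taxon_id are illustrative names, not part of the archive.
SPECIES = {
    "AT": ("Arabidopsis thaliana", 3702),
    "CE": ("Caenorhabditis elegans", 6239),
    "DM": ("Drosophila melanogaster", 7227),
    "DR": ("Danio rerio", 7955),
    "HS": ("Homo sapiens", 9606),
    "SC": ("Saccharomyces cerevisiae", 559292),
    "SP": ("Schizosaccharomyces pombe", 4896),
}


def taxon_id(code):
    """Return the taxon ID for a two-letter species code (case-insensitive)."""
    _name, taxon = SPECIES[code.upper()]
    return taxon
```

An unknown code raises `KeyError`, which is usually preferable to silently passing an unmapped species downstream.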

| Directory | Database | Description | Species | Size |
|---|---|---|---|---|
| homologene/ | HomoloGene | NCBI orthologous gene groups with Entrez-to-UniProt mapping | 8 | 43 MB |
| inparanoid/ | InParanoid | Pairwise orthologs (v8) with confidence scores | 7 | 22 MB |
| orthomcl/ | OrthoMCL | MCL-based ortholog/paralog clustering (v5) | 8 | 436 MB |
| eggnog/ | eggNOG | Evolutionary orthologous groups (euNOG, NOGv4/v4.5) | 9 | 750 MB |
| cog/ | COG/KOG | Clusters of Orthologous Groups for eukaryotes | 5 | 73 MB |
| pombase/ | PomBase | S. pombe ortholog mappings + cell-size mutant data | 5 | 5.5 MB |
| isobase/ | IsoBase | Multi-species isolog clusters with GO annotations | 5 | 1.1 MB |

| Directory | Database | Description | Species | Size |
|---|---|---|---|---|
| biogrid/ | BioGRID | Physical/genetic interactions (v3.4.125), Cytoscape networks | 7 | 413 MB |
| intact/ | IntAct | EBI protein interactions, PSI-MI format (2014-2015 snapshots) | 7 | 331 MB |
| dip/ | DIP | Experimentally validated interactions (Apr 2014) | multi | 25 MB |
| string/ | STRING | Interaction networks (v9.1) with detailed evidence scores | 5 | 1.1 GB |

| Directory | Database | Description | Species | Size |
|---|---|---|---|---|
| gene_ontology/ | Gene Ontology | GO-to-UniProt mapping pipelines (2014 simple + 2016 GOSlim) | 5-7 | 720 MB |
| kegg/ | KEGG | Metabolic/regulatory pathway-to-UniProt mappings | 5 | 20 MB |
| reactome/ | Reactome | Biological pathway annotations with Reactome links | 5 | 130 MB |
| signalink/ | SignaLink | Signal transduction pathway data (SQL dump, Oct 2012) | multi | 54 MB |
| complexes/ | GO Complexes | Protein complex membership from GO annotations | 5 | 28 MB |

| Directory | Database | Description | Species | Size |
|---|---|---|---|---|
| biomart/ | Ensembl BioMart | Gene-to-UniProt mappings + GO annotations (June 2014) | 5 | 33 MB |
| flybase/ | FlyBase | FBgn/FBtr/FBpp-to-UniProt 5-step mapping pipeline | 1 (DM) | 192 MB |
| ipi-database/ | IPI | International Protein Index human cross-references (discontinued 2011) | 1 (HS) | 16 MB |
| mitocheck/ | MitoCheck | Human mitochondrial protein annotations | 1 (HS) | 500 KB |
| jorgensen-list/ | Jorgensen/Moretto | S. cerevisiae cell-size gene lists with functional annotations | 1 (SC) | 1.3 MB |

All databases use UniProt KB accession numbers as the common protein identifier. Species-specific IDs (Entrez, TAIR, FlyBase FBgn, SGD, PomBase, WormBase, Ensembl) are mapped to UniProt through dedicated conversion pipelines in each database directory.
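
As an illustration of the ID conversion step, here is a minimal Python sketch (not the archive's PHP code). It assumes a two-column, tab-separated mapping file of source ID and UniProt accession. That layout, the function names `load_mapping` and `to_uniprot`, and the placeholder IDs are all assumptions; the real files under each `mappings/` directory may be laid out differently.

```python
import csv
import io


def load_mapping(handle, delimiter="\t"):
    """Read (source ID, UniProt accession) rows into a dict of lists.

    One source ID may map to several accessions, so values are lists.
    The two-column layout is an assumption made for this sketch.
    """
    mapping = {}
    for row in csv.reader(handle, delimiter=delimiter):
        if len(row) < 2 or not row[0] or not row[1]:
            continue  # skip blank or incomplete lines
        mapping.setdefault(row[0], []).append(row[1])
    return mapping


def to_uniprot(source_ids, mapping):
    """Split source IDs into mapped (ID -> accessions) and unmapped."""
    mapped, unmapped = {}, []
    for source_id in source_ids:
        if source_id in mapping:
            mapped[source_id] = mapping[source_id]
        else:
            unmapped.append(source_id)
    return mapped, unmapped


# Placeholder IDs only; these are not real gene or protein identifiers.
example_file = io.StringIO("GENE1\tACC1\nGENE1\tACC2\nGENE2\tACC3\n")
example_mapping = load_mapping(example_file)
example_mapped, example_unmapped = to_uniprot(["GENE1", "GENE9"], example_mapping)
```

Returning the unmapped IDs separately makes it easy to see how much of a source database is lost in conversion.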
Each database directory typically contains:
database_name/
├── *.php # Processing script(s)
├── INFO.txt # Data source documentation
├── source/ # Input/intermediate data files
├── original/ # Raw downloaded data
├── mappings/ # ID conversion files
└── output/ # Processed results
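
Because the layout is only typical, it can help to check which parts a given directory actually has. A small Python sketch follows; `describe_layout` and `EXPECTED_DIRS` are hypothetical names, and a missing subdirectory is not necessarily an error.

```python
from pathlib import Path

# Subdirectories named in the layout above; not every database has all of them.
EXPECTED_DIRS = ("source", "original", "mappings", "output")


def describe_layout(db_dir):
    """Summarize which parts of the typical layout exist in one database directory."""
    db_dir = Path(db_dir)
    return {
        "scripts": sorted(p.name for p in db_dir.glob("*.php")),
        "has_info": (db_dir / "INFO.txt").is_file(),
        "missing_dirs": [d for d in EXPECTED_DIRS if not (db_dir / d).is_dir()],
    }
```

For example, `describe_layout("biogrid")` returns the PHP script names, whether `INFO.txt` is present, and any of the four subdirectories that are absent.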
framework.php — Common PHP settings shared across scripts (error reporting, unlimited memory/execution time).
Most databases follow a similar processing pattern:
- Download raw data from the source database
- Parse and filter for target species
- Map database-specific IDs to UniProt accessions
- Generate pairwise ortholog relationships or annotation mappings
- Output standardized CSV/TSV files
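
The steps above can be sketched roughly as follows, in Python rather than the archive's PHP. This is an illustration under stated assumptions, not the author's implementation: the input is taken to be `(group_id, species_code, source_id)` records, each source ID maps to one accession, and only cross-species pairs are emitted. All function and variable names are hypothetical.

```python
import csv
import io
from itertools import combinations

# Species codes from the table at the top of this README.
TARGET_SPECIES = {"AT", "CE", "DM", "DR", "HS", "SC", "SP"}


def pairwise_orthologs(records, id_to_uniprot, evidence):
    """Turn grouped records into cross-species pairs of UniProt accessions.

    records: iterable of (group_id, species_code, source_id) tuples
    (an assumed input shape; each real database has its own format).
    """
    groups = {}
    for group_id, species, source_id in records:
        if species not in TARGET_SPECIES:
            continue  # filter for target species
        accession = id_to_uniprot.get(source_id)
        if accession is None:
            continue  # drop IDs with no UniProt mapping
        groups.setdefault(group_id, []).append((species, accession))

    rows = []
    for members in groups.values():
        for (sp1, acc1), (sp2, acc2) in combinations(sorted(set(members)), 2):
            if sp1 != sp2:  # assumption: same-species pairs are skipped
                rows.append((sp1, acc1, sp2, acc2, evidence))
    return rows


def write_rows(rows, handle):
    """Write rows as semicolon-delimited text."""
    writer = csv.writer(handle, delimiter=";", lineterminator="\n")
    writer.writerows(rows)
```

Sorting the members before pairing keeps the output order stable between runs, which makes results easier to diff.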
Output format is typically semicolon-delimited CSV:
Species1;UniProtID1;Species2;UniProtID2;DatabaseEvidence
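
A file in this five-column layout can be read with Python's standard `csv` module. The sketch below is illustrative only: `OrthologPair` and `read_pairs` are hypothetical names. Whether a given output file starts with a header row is not stated here, so the reader skips one if present. Files with a different number of columns would need their own reader.

```python
import csv
import io
from collections import namedtuple

OrthologPair = namedtuple(
    "OrthologPair", "species1 uniprot1 species2 uniprot2 evidence"
)


def read_pairs(handle):
    """Yield one OrthologPair per well-formed five-column line."""
    for row in csv.reader(handle, delimiter=";"):
        if len(row) != 5:
            continue  # skip blank or non-conforming lines
        if row[0] == "Species1":
            continue  # skip a header row, if the file has one
        yield OrthologPair(*row)


# Placeholder accessions and database name; not real data.
sample = io.StringIO("HS;ACC1;SC;ACC2;ExampleDB\n")
sample_pairs = list(read_pairs(sample))
```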
Large data files (>50 MB) are stored using Git Large File Storage. These include compressed database archives, large merged TSV/CSV outputs, and raw interaction network files. Run `git lfs pull` after cloning to download these files.
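
Until the LFS files are pulled, each one is a small text pointer whose first line begins with `version https://git-lfs.github.com/spec/v1`. Scripts that read such a pointer as data will fail or produce empty results. The following Python sketch lists files that are still pointers; the function names are hypothetical.

```python
from pathlib import Path

# First bytes of a Git LFS pointer file.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file starts with the Git LFS pointer header."""
    with open(path, "rb") as handle:
        return handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def unfetched_files(root):
    """List files under root (relative paths) that are still LFS pointers."""
    root = Path(root)
    found = []
    for path in root.rglob("*"):
        rel = path.relative_to(root)
        if ".git" in rel.parts or not path.is_file():
            continue  # ignore Git internals and directories
        if is_lfs_pointer(path):
            found.append(rel.as_posix())
    return sorted(found)
```

An empty list from `unfetched_files(".")` suggests that all large files have been downloaded.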
All data was collected between 2012 and 2016. Database versions and download dates are documented in each directory's INFO.txt file. Some external APIs referenced in scripts (e.g., Gene Ontology SOLR, AmiGO MySQL) are no longer available.
- orthologs-python — Python tools for ortholog analysis