Orthologs Databases

Archive of bioinformatics databases and processing pipelines used in cross-species ortholog analysis. This project integrates data from multiple orthology, protein interaction, and functional annotation databases to study conserved proteins across model organisms.

Author: Zoltan Dul, PhD
Period: 2012-2016
Language: PHP 5.x
License: GNU GPL v2

Organisms

| Code | Species | Taxon ID |
|------|---------|----------|
| AT | Arabidopsis thaliana | 3702 |
| CE | Caenorhabditis elegans | 6239 |
| DM | Drosophila melanogaster | 7227 |
| DR | Danio rerio | 7955 |
| HS | Homo sapiens | 9606 |
| SC | Saccharomyces cerevisiae | 559292 |
| SP | Schizosaccharomyces pombe | 4896 |

Databases

Ortholog Databases

| Directory | Database | Description | Species | Size |
|-----------|----------|-------------|---------|------|
| `homologene/` | HomoloGene | NCBI orthologous gene groups with Entrez-to-UniProt mapping | 8 | 43 MB |
| `inparanoid/` | InParanoid | Pairwise orthologs (v8) with confidence scores | 7 | 22 MB |
| `orthomcl/` | OrthoMCL | MCL-based ortholog/paralog clustering (v5) | 8 | 436 MB |
| `eggnog/` | eggNOG | Evolutionary orthologous groups (euNOG, NOGv4/v4.5) | 9 | 750 MB |
| `cog/` | COG/KOG | Clusters of Orthologous Groups for eukaryotes | 5 | 73 MB |
| `pombase/` | PomBase | S. pombe ortholog mappings + cell-size mutant data | 5 | 5.5 MB |
| `isobase/` | IsoBase | Multi-species isolog clusters with GO annotations | 5 | 1.1 MB |

Protein-Protein Interaction Databases

| Directory | Database | Description | Species | Size |
|-----------|----------|-------------|---------|------|
| `biogrid/` | BioGRID | Physical/genetic interactions (v3.4.125), Cytoscape networks | 7 | 413 MB |
| `intact/` | IntAct | EBI protein interactions, PSI-MI format (2014-2015 snapshots) | 7 | 331 MB |
| `dip/` | DIP | Experimentally validated interactions (Apr 2014) | multi | 25 MB |
| `string/` | STRING | Interaction networks (v9.1) with detailed evidence scores | 5 | 1.1 GB |

Pathway & Functional Annotation Databases

| Directory | Database | Description | Species | Size |
|-----------|----------|-------------|---------|------|
| `gene_ontology/` | Gene Ontology | GO-to-UniProt mapping pipelines (2014 simple + 2016 GOSlim) | 5-7 | 720 MB |
| `kegg/` | KEGG | Metabolic/regulatory pathway-to-UniProt mappings | 5 | 20 MB |
| `reactome/` | Reactome | Biological pathway annotations with Reactome links | 5 | 130 MB |
| `signalink/` | SignaLink | Signal transduction pathway data (SQL dump, Oct 2012) | multi | 54 MB |
| `complexes/` | GO Complexes | Protein complex membership from GO annotations | 5 | 28 MB |

ID Mapping & Reference Databases

| Directory | Database | Description | Species | Size |
|-----------|----------|-------------|---------|------|
| `biomart/` | Ensembl BioMart | Gene-to-UniProt mappings + GO annotations (June 2014) | 5 | 33 MB |
| `flybase/` | FlyBase | FBgn/FBtr/FBpp-to-UniProt 5-step mapping pipeline | 1 (DM) | 192 MB |
| `ipi-database/` | IPI | International Protein Index human cross-references (discontinued 2011) | 1 (HS) | 16 MB |
| `mitocheck/` | MitoCheck | Human mitochondrial protein annotations | 1 (HS) | 500 KB |
| `jorgensen-list/` | Jorgensen/Moretto | S. cerevisiae cell-size gene lists with functional annotations | 1 (SC) | 1.3 MB |

Common Identifier

All databases use UniProt KB accession numbers as the common protein identifier. Species-specific IDs (Entrez, TAIR, FlyBase FBgn, SGD, PomBase, WormBase, Ensembl) are mapped to UniProt through dedicated conversion pipelines in each database directory.
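
As a rough illustration of what such a conversion step does, the sketch below loads a mapping table into a dictionary and applies it to a list of source IDs. It is written in Python rather than the repository's PHP and is not taken from the actual scripts. The two-column tab-separated layout, the function names `load_mapping` and `to_uniprot`, and the sample identifiers are all assumptions made for the example.

```python
import csv
import io


def load_mapping(handle):
    """Read 'source_id<TAB>uniprot_accession' rows into a dict (assumed layout)."""
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2 or not row[0] or not row[1]:
            continue  # skip blank or incomplete rows
        # If a source ID appears more than once, keep the first accession seen.
        mapping.setdefault(row[0], row[1])
    return mapping


def to_uniprot(source_ids, mapping):
    """Split source IDs into (mapped accessions, IDs with no mapping)."""
    mapped, unmapped = [], []
    for source_id in source_ids:
        if source_id in mapping:
            mapped.append(mapping[source_id])
        else:
            unmapped.append(source_id)
    return mapped, unmapped


# Placeholder identifiers, not real database entries.
sample = io.StringIO("FBgn0000001\tP00001\nFBgn0000002\tP00002\n")
mapping = load_mapping(sample)
mapped, unmapped = to_uniprot(["FBgn0000001", "FBgn9999999"], mapping)
```

Keeping the unmapped IDs in a separate list makes it easy to see how much of a source database is lost at the mapping step.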

Directory Structure

Each database directory typically contains:

database_name/
  ├── *.php              # Processing script(s)
  ├── INFO.txt           # Data source documentation
  ├── source/            # Input/intermediate data files
  ├── original/          # Raw downloaded data
  ├── mappings/          # ID conversion files
  └── output/            # Processed results
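
Because the layout is only typical, it can help to check which pieces a given directory actually has. The following Python sketch reports that for one directory; the function name `describe_database_dir` and the shape of the returned dictionary are hypothetical, and only the file and folder names come from the layout above.

```python
from pathlib import Path

# Subdirectories named in the typical layout above.
EXPECTED_SUBDIRS = ("source", "original", "mappings", "output")


def describe_database_dir(path):
    """Summarise which parts of the typical layout exist under `path`."""
    root = Path(path)
    return {
        "scripts": sorted(p.name for p in root.glob("*.php")),
        "has_info": (root / "INFO.txt").is_file(),
        "missing_subdirs": [d for d in EXPECTED_SUBDIRS if not (root / d).is_dir()],
    }
```

Calling `describe_database_dir("biogrid")` from the repository root would, under these assumptions, list that directory's PHP scripts and any subdirectories it lacks.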

Shared Configuration

`framework.php`: common PHP settings shared across scripts (error reporting, unlimited memory and execution time).

Processing Pipelines

Most databases follow a similar processing pattern:

1. Download raw data from the source database
2. Parse and filter for target species
3. Map database-specific IDs to UniProt accessions
4. Generate pairwise ortholog relationships or annotation mappings
5. Output standardized CSV/TSV files
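
Steps 2 to 5 can be sketched in a few lines of Python. This is an illustration of the general pattern, not a port of any script in the repository. The input record shape `(group_id, species_code, source_id)`, the function names, the choice to emit only cross-species pairs, and the sample identifiers are all assumptions.

```python
import csv
import io
from itertools import combinations

# Organism codes from the Organisms table.
TARGET_SPECIES = {"AT", "CE", "DM", "DR", "HS", "SC", "SP"}


def ortholog_pairs(records, id_to_uniprot, evidence):
    """Yield cross-species UniProt pairs from (group_id, species, source_id) records."""
    groups = {}
    for group_id, species, source_id in records:
        if species not in TARGET_SPECIES:
            continue  # step 2: keep target species only
        accession = id_to_uniprot.get(source_id)
        if accession is None:
            continue  # step 3: drop IDs that have no UniProt mapping
        groups.setdefault(group_id, []).append((species, accession))
    for members in groups.values():
        # step 4: every unordered pair within a group, different species only
        for (sp1, acc1), (sp2, acc2) in combinations(sorted(set(members)), 2):
            if sp1 != sp2:
                yield sp1, acc1, sp2, acc2, evidence


def write_pairs(pairs, handle):
    """Step 5: write the pairs as semicolon-delimited rows."""
    writer = csv.writer(handle, delimiter=";", lineterminator="\n")
    writer.writerows(pairs)


# Placeholder records and accessions, not real database entries.
records = [("G1", "HS", "h1"), ("G1", "SC", "s1"), ("G1", "XX", "x1"), ("G2", "DM", "d1")]
id_to_uniprot = {"h1": "P11111", "s1": "P22222", "d1": "P33333"}
out = io.StringIO()
write_pairs(ortholog_pairs(records, id_to_uniprot, "ExampleDB"), out)
# out.getvalue() is now "HS;P11111;SC;P22222;ExampleDB\n"
```

Group `G2` produces no row because it has a single member, and the `XX` record is dropped by the species filter.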

Output format is typically semicolon-delimited CSV:

Species1;UniProtID1;Species2;UniProtID2;DatabaseEvidence
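
A file in this format can be read back with a standard CSV parser. The Python sketch below is one possible reader, not part of the repository. The `OrthologPair` tuple and `read_pairs` function are hypothetical names, and because it is not stated whether output files carry a header row, the reader skips one if present.

```python
import csv
import io
from collections import namedtuple

OrthologPair = namedtuple(
    "OrthologPair", "species1 uniprot1 species2 uniprot2 evidence"
)


def read_pairs(handle):
    """Yield one OrthologPair per well-formed five-field row."""
    for row in csv.reader(handle, delimiter=";"):
        if len(row) != 5:
            continue  # skip blank or malformed lines
        if row[0] == "Species1":
            continue  # skip a header row, if the file has one
        yield OrthologPair(*(field.strip() for field in row))


# Placeholder accessions, not real database entries.
sample = io.StringIO(
    "Species1;UniProtID1;Species2;UniProtID2;DatabaseEvidence\n"
    "HS;P11111;SC;P22222;ExampleDB\n"
    "\n"
)
pairs = list(read_pairs(sample))
```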

Git LFS

Large data files (>50 MB) are stored using Git Large File Storage. These include compressed database archives, large merged TSV/CSV outputs, and raw interaction network files. Run `git lfs pull` after cloning to download these files.

Data Vintage

All data was collected between 2012 and 2016. Database versions and download dates are documented in each directory's `INFO.txt` file. Some external APIs referenced in scripts (e.g., Gene Ontology SOLR, AmiGO MySQL) are no longer available.

Related Repositories

orthologs-python — Python tools for ortholog analysis
