Predict protein solubility in E. coli using deep learning
Installation • Usage • Performance • How It Works • Citation
ORACLE is a deep learning tool that predicts whether your protein will express as soluble or form inclusion bodies in E. coli. It helps researchers prioritize which proteins to express experimentally, saving time and resources.
- Save weeks of lab work - Know beforehand which proteins are likely to fail
- State-of-the-art - Uses ESM-2 protein language model embeddings
- Beautiful visualizations - Understand why your protein may or may not express
- Actionable recommendations - Get suggestions to improve expression
pip install git+https://github.com/61-Keys/oracle-solubility.git

- Python 3.8+
- PyTorch 2.0+
- ~500MB disk space (for ESM-2 model)
from oracle import Oracle
# Initialize predictor (downloads ESM-2 model on first run)
predictor = Oracle()
# Predict solubility
sequence = "MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLK..."
result = predictor.predict(sequence)
# Access results
print(result.soluble) # True or False
print(result.confidence) # 0.0 to 1.0
print(result.solubility_score) # Probability of being soluble
print(result.recommendations) # List of suggestions
# Get detailed summary
print(result.summary())
# Visualize results
result.visualize() # Full dashboard
result.visualize("minimal")  # Compact card view

# Basic prediction
oracle predict MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLK...
# With visualization
oracle predict MSKGEELFTGVVPILVELDGDVNGHK... --visualize
# Minimal visualization
oracle predict MSKGEELFTGVVPILVELDGDVNGHK... --visualize --minimal
# Model information
oracle info

| Metric | Value |
|---|---|
| Test Set Accuracy | 66.8% |
| F1 Score | 0.67 |
| AUC-ROC | 0.73 |
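
As a reminder of how accuracy and F1 relate to raw prediction counts, here is a minimal sketch. The function name `binary_metrics` and the example counts are illustrative assumptions; they are not ORACLE's evaluation code or its actual confusion matrix. AUC-ROC is left out because it needs the predicted scores rather than counts.

```python
def binary_metrics(tp, fp, tn, fn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative counts only (NOT ORACLE's real test-set results).
example = binary_metrics(tp=70, fp=30, tn=60, fn=40)
print(example)
```

On a balanced test set, accuracy and F1 tend to land close together, which is consistent with the values reported in the table.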
| Protein | Expected | Predicted | Correct |
|---|---|---|---|
| GFP | Soluble | Soluble (60%) | ✓ |
| Ubiquitin | Soluble | Soluble (75%) | ✓ |
| Thioredoxin | Soluble | Soluble (86%) | ✓ |
| SUMO | Soluble | Soluble (88%) | ✓ |
| MBP | Soluble | Soluble (71%) | ✓ |
| GPCR (membrane) | Insoluble | Insoluble (74%) | ✓ |
| Amyloid Beta | Insoluble | Insoluble (82%) | ✓ |
| Alpha-Synuclein | Insoluble | Insoluble (78%) | ✓ |
| Spider Silk | Insoluble | Insoluble (91%) | ✓ |
| Prion Protein | Insoluble | Insoluble (85%) | ✓ |
Overall: 12/14 correct
ORACLE combines two types of features:

ESM-2 embeddings (320 dimensions):
- Uses Facebook's ESM-2 protein language model
- Captures evolutionary and structural information
- Pre-trained on millions of protein sequences

Sequence-derived features (14 in total), covering:
- Sequence length
- Hydrophobicity
- Charge distribution
- Disorder propensity
- Aggregation propensity
- Amino acid composition
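
To make the sequence-derived features more concrete, the sketch below computes a few of them in plain Python. It is an illustration under assumptions, not ORACLE's feature code: the function name `sequence_features`, the dictionary keys, the use of the Kyte-Doolittle scale for hydrophobicity, and the simple K/R minus D/E charge estimate are all choices made for this example. Disorder and aggregation propensity need dedicated predictors and are omitted.

```python
from collections import Counter

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def sequence_features(sequence):
    """Compute a few simple, hypothetical sequence features."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    length = len(seq)
    # GRAVY: mean hydropathy over all residues (unknown letters count as 0).
    gravy = sum(KYTE_DOOLITTLE.get(aa, 0.0) for aa in seq) / length
    # Crude net charge: ignores histidine, termini and pH.
    net_charge = (counts["K"] + counts["R"]) - (counts["D"] + counts["E"])
    return {
        "length": length,
        "gravy": gravy,
        "net_charge": net_charge,
        "cysteine_count": counts["C"],
        "composition": {aa: counts[aa] / length for aa in KYTE_DOOLITTLE},
    }


features = sequence_features("AKDE")
print(features["length"], features["gravy"], features["net_charge"])
```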
                 Input Sequence
                        │
    ┌─────────────────┐  ┌───────────────────┐
    │   ESM-2 Model   │  │ Sequence Features │
    │  (320-dim emb)  │  │   (14 features)   │
    └────────┬────────┘  └─────────┬─────────┘
             │                     │
             └──────────┬──────────┘
                        │
                ┌───────────────┐
                │  Neural Net   │
                │ 512→256→128→2 │
                └───────┬───────┘
                        │
               Soluble / Insoluble
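
The layer sizes in the diagram can be turned into a quick back-of-the-envelope calculation. The sketch below assumes that the two feature blocks are concatenated into one input vector, that 512 is the width of the first hidden layer, and that every layer is fully connected with a bias term; none of this is confirmed by the diagram, and the helper names are hypothetical.

```python
EMBEDDING_DIM = 320            # ESM-2 embedding size from the diagram
NUM_SEQUENCE_FEATURES = 14     # sequence feature count from the diagram
LAYER_WIDTHS = [512, 256, 128, 2]  # assumed: hidden widths, then 2 output classes


def layer_shapes(input_dim, widths):
    """Return (inputs, outputs) for each fully connected layer."""
    shapes = []
    previous = input_dim
    for width in widths:
        shapes.append((previous, width))
        previous = width
    return shapes


def parameter_count(shapes):
    """Count weights plus biases across all layers."""
    return sum(inputs * outputs + outputs for inputs, outputs in shapes)


input_dim = EMBEDDING_DIM + NUM_SEQUENCE_FEATURES  # assumed concatenation
shapes = layer_shapes(input_dim, LAYER_WIDTHS)
total_parameters = parameter_count(shapes)
print(input_dim, shapes, total_parameters)
```

Under these assumptions the classifier head is small compared with the ESM-2 model that feeds it.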
ORACLE was trained on 60,000 proteins from TargetTrack (PSI:Biology):
- 35 research centers worldwide
- Real experimental outcomes (not computational predictions)
- Balanced dataset (30,000 soluble, 30,000 insoluble)
- 17 years of structural genomics data
| Status | Label | Meaning |
|---|---|---|
| In PDB | Soluble | Successfully crystallized |
| Purified | Soluble | Expressed and purified |
| Soluble | Soluble | Confirmed soluble expression |
| Work Stopped | Insoluble | Failed to express/purify |
| Cloned | Insoluble | Never made it past cloning |
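
The labeling rule in this table amounts to a lookup. A minimal sketch follows; the name `label_for_status` is hypothetical, and the exact status strings in the TargetTrack files may differ from the display names used in the table.

```python
# Status-to-label mapping copied from the table above.
STATUS_TO_LABEL = {
    "in pdb": "soluble",
    "purified": "soluble",
    "soluble": "soluble",
    "work stopped": "insoluble",
    "cloned": "insoluble",
}


def label_for_status(status):
    """Return 'soluble' or 'insoluble', or None for a status not in the table."""
    return STATUS_TO_LABEL.get(status.strip().lower())


print(label_for_status("In PDB"))        # status listed as soluble
print(label_for_status("Work Stopped"))  # status listed as insoluble
```

Returning `None` for unlisted statuses keeps ambiguous targets out of the training set rather than guessing a label for them.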
Initialize the predictor.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `device` | `str` | `"auto"` | Device to use: `"auto"`, `"cpu"`, `"mps"`, `"cuda"` |
| `verbose` | `bool` | `True` | Print loading messages |
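
How `"auto"` is resolved is not documented here. One plausible approach, shown as a sketch rather than ORACLE's actual logic, is to prefer CUDA, then Apple's MPS backend, then CPU. The function name `resolve_device` is hypothetical; `torch.cuda.is_available()` and `torch.backends.mps.is_available()` are standard PyTorch calls.

```python
def resolve_device(device="auto"):
    """Pick a device string; explicit choices are returned unchanged."""
    if device != "auto":
        return device
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to probe, so fall back to CPU.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # The MPS backend only exists in newer PyTorch builds.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(resolve_device())
```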
Predict solubility for a protein sequence.
| Parameter | Type | Description |
|---|---|---|
| `sequence` | `str` | Protein sequence (single letter amino acids) |
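
Sequences pasted from FASTA files or spreadsheets often carry whitespace or lowercase letters, so it can help to normalize them before prediction. The sketch below is a hypothetical pre-processing helper, not part of ORACLE; it assumes only the 20 standard amino acid letters are acceptable, and whether ORACLE itself tolerates ambiguity codes such as `X` is not stated here.

```python
VALID_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def clean_sequence(raw):
    """Strip whitespace, uppercase, and reject non-standard residue letters."""
    seq = "".join(raw.split()).upper()
    if not seq:
        raise ValueError("sequence is empty")
    invalid = sorted(set(seq) - VALID_AMINO_ACIDS)
    if invalid:
        raise ValueError("unexpected characters: " + ", ".join(invalid))
    return seq


cleaned = clean_sequence("mskg ee\nlftg")
print(cleaned)
```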
| Attribute | Type | Description |
|---|---|---|
| `.soluble` | `bool` | True if predicted soluble |
| `.confidence` | `float` | Prediction confidence (0-1) |
| `.solubility_score` | `float` | Probability of being soluble |
| `.features` | `dict` | Computed sequence features |
| `.recommendations` | `list` | Suggestions for improving expression |
| `.visualize(style)` | method | Show visualization (`"full"` or `"minimal"`) |
| `.summary()` | method | Get text summary |
| `.to_dict()` | method | Convert to dictionary |
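
Because every result exposes `.solubility_score` and `.soluble`, ranking a set of candidate constructs is a short loop. The sketch below uses a stand-in predictor so that it runs without the model download: `FakePredictor`, `FakeResult`, the 0.5 cutoff, and the charged-residue scoring are placeholders with no biological meaning, and `rank_by_solubility` is a hypothetical helper rather than part of ORACLE. With the real package, an `Oracle()` instance would be passed in instead.

```python
from dataclasses import dataclass


@dataclass
class FakeResult:
    """Stand-in exposing two of the documented result attributes."""
    solubility_score: float

    @property
    def soluble(self):
        return self.solubility_score >= 0.5  # assumed cutoff for the stand-in


class FakePredictor:
    """Placeholder for Oracle(); scores by fraction of charged residues."""

    def predict(self, sequence):
        charged = sum(sequence.count(aa) for aa in "DEKR")
        return FakeResult(charged / len(sequence))


def rank_by_solubility(predictor, named_sequences):
    """Return (name, score, soluble) tuples, most soluble first."""
    scored = [(name, predictor.predict(seq)) for name, seq in named_sequences.items()]
    scored.sort(key=lambda item: item[1].solubility_score, reverse=True)
    return [(name, res.solubility_score, res.soluble) for name, res in scored]


ranking = rank_by_solubility(FakePredictor(), {"b": "AAAAAK", "a": "DEKRAA"})
for name, score, soluble in ranking:
    print(name, round(score, 2), soluble)
```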
ORACLE provides actionable recommendations based on sequence analysis:
| Issue | Recommendation |
|---|---|
| High hydrophobicity | Try MBP or SUMO fusion tags |
| Aggregation-prone | Express at 16-18°C |
| Contains cysteines | Use Origami/SHuffle strains |
| Large protein (>500 aa) | Consider domain truncations |
| Disordered regions | May need binding partner |
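
The table reads naturally as a set of independent rules. The sketch below shows one way such rules could be applied to a feature dictionary. The feature key names and the hydrophobicity and disorder thresholds are assumptions made for this example; only the 500-residue cutoff and the recommendation texts come from the table.

```python
HYDROPHOBICITY_THRESHOLD = 0.0  # assumed GRAVY cutoff
DISORDER_THRESHOLD = 0.3        # assumed fraction of disordered residues
LARGE_PROTEIN_LENGTH = 500      # from the table


def suggest(features):
    """Return recommendations in table order for a hypothetical feature dict."""
    suggestions = []
    if features.get("gravy", 0.0) > HYDROPHOBICITY_THRESHOLD:
        suggestions.append("Try MBP or SUMO fusion tags")
    if features.get("aggregation_prone", False):
        suggestions.append("Express at 16-18°C")
    if features.get("cysteine_count", 0) > 0:
        suggestions.append("Use Origami/SHuffle strains")
    if features.get("length", 0) > LARGE_PROTEIN_LENGTH:
        suggestions.append("Consider domain truncations")
    if features.get("disordered_fraction", 0.0) > DISORDER_THRESHOLD:
        suggestions.append("May need binding partner")
    return suggestions


print(suggest({"length": 600, "cysteine_count": 2}))
```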
If you use ORACLE in your research, please cite:
@software{oracle2024,
title={ORACLE: Protein Solubility Predictor},
author={Rath, Asutosh},
year={2024},
url={https://github.com/61-Keys/oracle-solubility}
}

@article{targettrack,
title={TargetTrack: A resource for tracking targets in structural genomics},
journal={Zenodo},
doi={10.5281/zenodo.821654}
}

MIT License - see LICENSE for details.
Made with 🧬 by Asutosh Rath