A computational drug discovery pipeline that automates the retrieval, processing, and analysis of potential drug candidates. This project compares Natural Substrates vs. Synthetic Inhibitors to analyze trends in simulated binding affinity.
- Automated Data Retrieval: Fetches real-time chemical data (SMILES, Molecular Weight) from the PubChem PUG REST API.
- Cheminformatics: Calculates key molecular descriptors like LogP (Lipophilicity) using RDKit.
- Database Management: Stores structured data in a local SQLite database (results.db).
- Virtual Screening Simulation: Simulates docking scores to model binding affinity.
- Visualization: Generates a professional dashboard comparing Natural vs. Synthetic compounds.
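
As a rough illustration of the data retrieval step, the sketch below builds a PubChem PUG REST property request and parses the JSON reply. The helper names (`build_property_url`, `parse_properties`, `fetch_properties`) are hypothetical and are not taken from main.py. The reply is assumed to use PubChem's `PropertyTable` layout, and the SMILES key is matched loosely because PubChem's SMILES property names have varied.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

PUG_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_property_url(name, properties=("MolecularWeight", "CanonicalSMILES")):
    """Build a PUG REST URL that looks a compound up by name."""
    return f"{PUG_BASE}/compound/name/{quote(name)}/property/{','.join(properties)}/JSON"


def parse_properties(payload):
    """Extract CID, molecular weight, and SMILES from a PropertyTable reply."""
    record = payload["PropertyTable"]["Properties"][0]
    # Assumption: accept any key ending in "SMILES", since the key name may vary.
    smiles = next((v for k, v in record.items() if k.endswith("SMILES")), None)
    return {
        "cid": record.get("CID"),
        # The weight may arrive as a string, so convert it explicitly.
        "molecular_weight": float(record["MolecularWeight"]),
        "smiles": smiles,
    }


def fetch_properties(name, timeout=10):
    """Fetch and parse properties for one compound (needs network access)."""
    with urlopen(build_property_url(name), timeout=timeout) as response:
        return parse_properties(json.load(response))
```

With network access, `fetch_properties("Methotrexate")` would return a dictionary with `cid`, `molecular_weight`, and `smiles` keys. Keeping URL construction and parsing separate from the network call makes both easy to check offline.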
- Python 3
- RDKit: For molecular descriptor calculation.
- Pandas & Matplotlib: For data analysis and visualization.
- SQLite: For lightweight relational database storage.
- PubChem API: For chemical data sourcing.
- Clone the repository
  git clone https://github.com/yourusername/virtual-screening-pipeline.git
  cd virtual-screening-pipeline
- Install dependencies
  pip install -r requirements.txt
Execute the main script to fetch data, calculate properties, and populate the database.
python3 main.py
Output: Creates results.db and exports natvssynt.csv.
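
The storage and export step might look roughly like the sketch below, which uses only the standard library. The table name `screening_results` and the `name` and `score` columns come from the query shown in this README; the `type` column and the function names `save_results` and `results_as_csv` are assumptions made for illustration, and the real schema in main.py may hold more fields.

```python
import csv
import io
import sqlite3

# Only name and score are confirmed by this README; type is an assumed column.
COLUMNS = ("name", "type", "score")


def save_results(conn, rows):
    """Create the table if needed and upsert (name, type, score) rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS screening_results ("
        "name TEXT PRIMARY KEY, type TEXT, score REAL)"
    )
    # INSERT OR REPLACE keeps repeated runs from duplicating compounds.
    conn.executemany(
        "INSERT OR REPLACE INTO screening_results (name, type, score) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()


def results_as_csv(conn):
    """Return the whole table as CSV text, sorted by compound name."""
    cursor = conn.execute(
        "SELECT name, type, score FROM screening_results ORDER BY name"
    )
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(cursor)
    return buffer.getvalue()
```

A caller would open the database with `sqlite3.connect("results.db")`, pass the connection to `save_results`, and write the text from `results_as_csv` to natvssynt.csv.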
Run the visualization script to analyze the results.
python3 visualization.py
Output: Generates dashboard.png.
The pipeline analyzes two distinct groups:
- Natural Substrates (e.g., Folic Acid, Dihydrofolate)
- Synthetic Inhibitors (e.g., Methotrexate, Pemetrexed)
Hypothesis: Synthetic inhibitors are designed to bind more tightly (lower score) than natural substrates. Check the "Avg Score by Type" graph to verify!
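
The same check can be made numerically. The sketch below averages scores per group and tests whether the synthetic group has the lower mean. The function names and the group labels "Natural" and "Synthetic" are assumptions for illustration; the labels actually stored by the pipeline may differ.

```python
from collections import defaultdict
from statistics import mean


def average_score_by_type(records):
    """Return {type: mean score} for an iterable of (type, score) pairs."""
    grouped = defaultdict(list)
    for compound_type, score in records:
        grouped[compound_type].append(score)
    return {compound_type: mean(scores) for compound_type, scores in grouped.items()}


def hypothesis_holds(averages, natural="Natural", synthetic="Synthetic"):
    """True when the synthetic group has the lower (tighter-binding) mean score."""
    return averages[synthetic] < averages[natural]
```

Feeding it `(type, score)` pairs read from results.db gives the same comparison the "Avg Score by Type" graph shows. Because the docking scores are simulated, the result says nothing about real binding affinity.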
Want to query the database directly?
-- Find the top 3 strongest binders
SELECT name, score FROM screening_results ORDER BY score ASC LIMIT 3;
(See sql_queries.md for more examples.)
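
The same query can also be run from Python with the standard `sqlite3` module. This is a minimal sketch; the helper name `top_binders` is hypothetical, and it assumes `screening_results` already exists in the database the connection points to.

```python
import sqlite3


def top_binders(conn, limit=3):
    """Return the `limit` lowest-scoring (strongest) binders as (name, score) tuples."""
    # A bound parameter sets the limit, avoiding string formatting in the SQL.
    query = "SELECT name, score FROM screening_results ORDER BY score ASC LIMIT ?"
    return conn.execute(query, (limit,)).fetchall()
```

For example, `top_binders(sqlite3.connect("results.db"))` would return up to three rows once main.py has populated the database.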

