chitrakulkarni2830/docking-data-pipeline

🧬 Virtual Screening Data Pipeline


A computational drug discovery pipeline that automates the retrieval, processing, and analysis of potential drug candidates. It compares natural substrates with synthetic inhibitors to analyze trends in binding affinity.


🚀 Key Features

  • Automated Data Retrieval: Fetches real-time chemical data (SMILES, Molecular Weight) from the PubChem PUG REST API.
  • Cheminformatics: Calculates key molecular descriptors like LogP (Lipophilicity) using RDKit.
  • Database Management: Stores structured data in a local SQLite database (results.db).
  • Virtual Screening Simulation: Simulates docking scores to model binding affinity.
  • Visualization: Generates a professional dashboard comparing Natural vs. Synthetic compounds.
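As a rough illustration of the data retrieval feature, the sketch below builds a PubChem PUG REST property URL and parses the response. The function names are hypothetical and not taken from this repository, and the exact property names returned by the API are an assumption, so the parser matches the SMILES key loosely.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_property_url(name, properties=("MolecularWeight", "CanonicalSMILES")):
    """Build a PUG REST URL that looks up a compound by name."""
    return (
        f"{PUG_REST}/compound/name/{quote(name)}"
        f"/property/{','.join(properties)}/JSON"
    )


def parse_properties(payload):
    """Pull the first record out of a PUG REST PropertyTable response."""
    record = payload["PropertyTable"]["Properties"][0]
    # Assumption: the key holding the SMILES string may not match the
    # requested property name exactly, so match any key containing "SMILES".
    smiles = next((v for k, v in record.items() if "SMILES" in k), None)
    return {
        "cid": record.get("CID"),
        "molecular_weight": float(record["MolecularWeight"]),
        "smiles": smiles,
    }


def fetch_compound(name, timeout=10):
    """Fetch and parse properties for one compound (requires network access)."""
    with urlopen(build_property_url(name), timeout=timeout) as response:
        return parse_properties(json.load(response))


# Offline demonstration with a hand-written payload in the PropertyTable shape.
sample = {"PropertyTable": {"Properties": [
    {"CID": 702, "MolecularWeight": "46.07", "CanonicalSMILES": "CCO"}
]}}
parsed = parse_properties(sample)
```

Calling `fetch_compound("Methotrexate")` would perform the live request; the offline sample keeps the parsing logic checkable without network access.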

Dashboard Preview

🛠️ Tech Stack

  • Python 3
  • RDKit: For molecular descriptor calculation.
  • Pandas & Matplotlib: For data analysis and visualization.
  • SQLite: For lightweight relational database storage.
  • PubChem API: For chemical data sourcing.
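For the RDKit step, a minimal sketch of a LogP calculation might look like the following. It uses RDKit's Crippen estimator, which is one common choice and not necessarily the one this project uses. The helper name `compute_logp` is hypothetical, and the import guard is only there so the snippet degrades gracefully when RDKit is not installed.

```python
try:
    from rdkit import Chem
    from rdkit.Chem import Crippen
except ImportError:  # RDKit is treated as optional in this sketch
    Chem = None
    Crippen = None


def compute_logp(smiles):
    """Return the Crippen LogP estimate for a SMILES string, or None."""
    if Chem is None:
        return None  # RDKit is not installed
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None  # RDKit could not parse the SMILES string
    return Crippen.MolLogP(mol)


logp = compute_logp("CCO")  # ethanol
```

Returning `None` for unparseable input lets the caller skip a compound instead of aborting the whole run.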

📦 Installation

  1. Clone the repository

    git clone https://github.com/chitrakulkarni2830/docking-data-pipeline.git
    cd docking-data-pipeline
  2. Install dependencies

    pip install -r requirements.txt

⚡ Usage

1. Run the Pipeline

Execute the main script to fetch data, calculate properties, and populate the database.

python3 main.py

Output: Creates results.db and exports natvssynt.csv.
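The store-and-export step could be sketched roughly as below. Only the table name `screening_results` and the `name` and `score` columns come from this README; the `type` column, the function names, and the random score range are assumptions made for illustration. The demo uses an in-memory database and a string buffer so it writes no files.

```python
import csv
import io
import random
import sqlite3

COMPOUNDS = [
    ("Folic Acid", "Natural"),
    ("Dihydrofolate", "Natural"),
    ("Methotrexate", "Synthetic"),
    ("Pemetrexed", "Synthetic"),
]


def simulate_score(rng):
    """Hypothetical stand-in for a docking score; the range is arbitrary."""
    return round(rng.uniform(-12.0, -4.0), 2)


def run_pipeline(conn, rng):
    """Create the results table and insert one simulated score per compound."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS screening_results "
        "(name TEXT PRIMARY KEY, type TEXT, score REAL)"
    )
    rows = [(name, kind, simulate_score(rng)) for name, kind in COMPOUNDS]
    conn.executemany(
        "INSERT OR REPLACE INTO screening_results VALUES (?, ?, ?)", rows
    )
    conn.commit()
    return rows


def export_csv(conn, handle):
    """Write the table to an open text handle with a header row."""
    cursor = conn.execute(
        "SELECT name, type, score FROM screening_results ORDER BY name"
    )
    writer = csv.writer(handle)
    writer.writerow([col[0] for col in cursor.description])
    writer.writerows(cursor)


conn = sqlite3.connect(":memory:")  # pass a file path to persist instead
run_pipeline(conn, random.Random(42))
buffer = io.StringIO()
export_csv(conn, buffer)
csv_text = buffer.getvalue()
```

Seeding the random generator keeps the simulated scores reproducible between runs.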

2. Generate the Dashboard

Create the visualization suite to analyze the results.

python3 visualization.py

Output: Generates dashboard.png.

📊 Data Analysis

The pipeline analyzes two distinct groups:

  1. Natural Substrates (e.g., Folic Acid, Dihydrofolate)
  2. Synthetic Inhibitors (e.g., Methotrexate, Pemetrexed)

Hypothesis: Synthetic inhibitors are designed to bind more tightly than natural substrates, which corresponds to a lower docking score. Compare the two groups in the "Avg Score by Type" graph to check whether the simulated scores follow this trend.
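The comparison behind the hypothesis amounts to averaging scores per group. A minimal sketch is shown here, with placeholder numbers that are not pipeline output and a hypothetical helper name:

```python
from collections import defaultdict
from statistics import mean

# Placeholder scores for illustration only; these are not pipeline results.
records = [
    ("Natural", -6.0),
    ("Natural", -7.0),
    ("Synthetic", -9.0),
    ("Synthetic", -10.0),
]


def average_by_type(rows):
    """Group (type, score) pairs by type and return the mean score per type."""
    groups = defaultdict(list)
    for kind, score in rows:
        groups[kind].append(score)
    return {kind: mean(scores) for kind, scores in groups.items()}


averages = average_by_type(records)
# A lower average score for the synthetic group would match the hypothesis.
synthetic_binds_tighter = averages["Synthetic"] < averages["Natural"]
```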

📝 SQL Exploration

Want to query the database directly?

-- Find the top 3 strongest binders
SELECT name, score FROM screening_results ORDER BY score ASC LIMIT 3;

(See sql_queries.md for more examples)
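The same query can also be run from Python with the standard `sqlite3` module. The sketch below uses an in-memory table with placeholder compounds and scores in place of results.db; the helper name `top_binders` is hypothetical.

```python
import sqlite3

TOP_BINDERS_SQL = (
    "SELECT name, score FROM screening_results ORDER BY score ASC LIMIT ?"
)


def top_binders(conn, limit=3):
    """Return the `limit` rows with the lowest (strongest) scores."""
    return conn.execute(TOP_BINDERS_SQL, (limit,)).fetchall()


# In-memory stand-in for results.db with placeholder compounds and scores.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE screening_results (name TEXT, score REAL)")
conn.executemany(
    "INSERT INTO screening_results VALUES (?, ?)",
    [
        ("compound_a", -5.1),
        ("compound_b", -9.3),
        ("compound_c", -7.8),
        ("compound_d", -6.4),
    ],
)
strongest = top_binders(conn)
```

To query the real database, replace `":memory:"` with the path to results.db and skip the table setup.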

About

An automated end-to-end bioinformatics ETL pipeline using Python, PubChem API, and SQL to simulate virtual screening of natural substrates vs. synthetic inhibitors.
