BGC-QDR: Biosynthetic Gene Cluster Detection & Analysis Pipeline

A complete full-stack application for detecting, analyzing, and ranking Biosynthetic Gene Clusters (BGCs) from environmental DNA samples.

🎯 Overview

BGC-QDR (Biosynthetic Gene Cluster - Quality Detection & Ranking) is an integrated pipeline that combines:

Frontend: Modern web interface with Palantir-style design
Backend: Flask REST API for pipeline orchestration
Pipeline: Python-based BGC detection and analysis tools

🏗️ Architecture

┌─────────────────────────────────────────────────────────┐
│                  Web Interface                          │
│  (Modern UI with DNA video background)                 │
│  - Sample upload                                        │
│  - Real-time progress tracking                          │
│  - Interactive results visualization                    │
└────────────────────┬────────────────────────────────────┘
                     │ REST API
┌────────────────────▼────────────────────────────────────┐
│              Flask Backend API                          │
│  - File upload handling                                 │
│  - Job management                                       │
│  - Pipeline orchestration                               │
└────────────────────┬────────────────────────────────────┘
                     │
┌────────────────────▼────────────────────────────────────┐
│           BGC-QDR Pipeline (6 Phases)                   │
│                                                          │
│  Phase 1-2: BGC Detection                               │
│    → ORF prediction (Prodigal)                          │
│    → Domain annotation                                  │
│    → BGC classification                                 │
│                                                          │
│  Phase 3: Graph Reconstruction                          │
│    → Build BGC similarity graphs                        │
│    → Identify virtual BGCs                              │
│                                                          │
│  Phase 4-5: Novelty Assessment                          │
│    → Compare against MIBiG database                     │
│    → Calculate novelty scores                           │
│                                                          │
│  Phase 6: VQC Ranking                                   │
│    → Virtual Quality Control scoring                    │
│    → Rank candidates by confidence                      │
└─────────────────────────────────────────────────────────┘

📊 Pipeline Workflow

Input

FASTA file containing genomic sequences (contigs/scaffolds)
Environmental DNA (eDNA) samples from metagenomic sequencing

Phase 1-2: BGC Detection

ORF Prediction (call_orfs.py)
- Uses Prodigal in metagenomic mode
- Predicts protein-coding genes
- Outputs: proteins.faa, nucleotides.fna, genes.gbk
BGC Classification (classify_bgcs.py)
- Applies biological rules to identify BGC types
- Classifies into: PKS, NRPS, RiPP, Terpene, etc.
- Filters candidates by confidence score

Phase 3: Graph Reconstruction

Builds similarity graphs between detected BGCs
Identifies "virtual BGCs" (consensus sequences)
Reduces redundancy in metagenomic data

Phase 4-5: Novelty Assessment

Compares BGCs against known clusters (MIBiG database)
Calculates novelty percentage
Identifies potentially novel natural products

Phase 6: VQC Ranking

Virtual Quality Control scoring
Ranks candidates by:
- Confidence score (0-1)
- Novelty percentage
- BGC class completeness
Outputs top candidates for further analysis

Output

JSON results with ranked BGC candidates
Detailed reports with confidence scores
Downloadable data for downstream analysis

🚀 Quick Start

Prerequisites

Python 3.8+
Prodigal (optional, for ORF prediction)
Modern web browser

Installation

Clone the repository:
```
git clone <repository-url>
cd web.dv
```
Install Python dependencies:
```
pip install -r backend_requirements.txt
```

Set up directories:

mkdir -p uploads results frontend/assets

Copy frontend assets:

cp "New folder/index.html" frontend/index.html
cp "New folder/DNA.mp4" frontend/assets/DNA.mp4

Running the Application

Option 1: Automated Startup (Windows)

.\start_fullstack.ps1

Option 2: Manual Startup

Terminal 1 - Backend:

python backend_api.py

Terminal 2 - Frontend:

cd frontend
python -m http.server 3000

Access the application:

Frontend: http://localhost:3000
Backend API: http://localhost:5000/api

📁 Project Structure

web.dv/
├── frontend/                    # Web interface
│   ├── index.html              # Main HTML page
│   ├── app.js                  # JavaScript with API integration
│   ├── styles.css              # Additional styles
│   └── assets/
│       └── DNA.mp4             # Background video
│
├── backend_api.py              # Flask REST API server
├── backend_requirements.txt    # Python dependencies
│
├── call_orfs.py               # ORF prediction wrapper
├── classify_bgcs.py           # BGC classification engine
├── benchmark_bgcqdr.py        # Performance benchmarking
├── compare_with_deepbgc.py    # Comparison with DeepBGC
│
├── uploads/                   # User-uploaded FASTA files
├── results/                   # Pipeline results (JSON)
├── edna_fasta/               # Sample eDNA datasets
├── benchmark_results/        # Benchmark data
│
├── README.md                 # This file
├── FULLSTACK_README.md       # Detailed setup guide
└── BACKEND_FIXES.md          # Backend implementation details

🔬 Pipeline Components

1. ORF Calling (`call_orfs.py`)

python call_orfs.py \
  --input regions.fasta \
  --output-dir orfs/ \
  --threads 4

Purpose: Predict protein-coding genes from DNA sequences

Tool: Prodigal (metagenomic mode)

Output:

regions_proteins.faa - Amino acid sequences
regions_nucleotides.fna - Nucleotide sequences
regions_genes.gbk - GenBank format

2. BGC Classification (`classify_bgcs.py`)

python classify_bgcs.py \
  --domain-table domains.csv \
  --output bgc_candidates.csv \
  --min-score 0.4 \
  --min-domains 2

Purpose: Classify and filter BGC candidates

Rules Engine:

PKS: Requires PKS + ACP domains
NRPS: Requires A + PCP domains
RiPP: Lanthipeptide, Thiopeptide markers
Terpene: Terpene synthase/cyclase
Hybrid: Multiple biosynthetic systems

Output:

Classified BGC candidates
Confidence scores (high/medium/low)
Domain composition

3. Benchmarking (`benchmark_bgcqdr.py`)

python benchmark_bgcqdr.py \
  --input-dir edna_fasta/ \
  --output-dir benchmark_results/

Purpose: Evaluate pipeline performance

Metrics:

Detection accuracy
False positive rate
Processing time
Comparison with DeepBGC

🌐 Web Interface

Features

Hero Section
- Animated DNA video background
- Text scramble effect
- Call-to-action buttons
Pipeline Visualization
- 4 interactive phases
- Hover effects with details
- SVG illustrations
Sample Upload
- Drag-and-drop interface
- FASTA file validation
- Sample data option
Results Display
- Interactive table
- Confidence score bars
- BGC class visualization
- Download functionality

API Endpoints

Endpoint	Method	Description
`/api/health`	GET	Health check
`/api/stats`	GET	Pipeline statistics
`/api/detect`	POST	BGC detection (Phase 1-2)
`/api/reconstruct`	POST	Graph reconstruction (Phase 3)
`/api/novelty`	POST	Novelty assessment (Phase 4-5)
`/api/rank`	POST	VQC ranking (Phase 6)
`/api/results/<job_id>`	GET	Download results

📊 Example Results

Input

Sample: GCA_000205625.1.fasta
Sequences: 3 contigs
Size: 45.6 KB

Output

{
  "job_id": "job_1715097201",
  "bgc_count": 3,
  "virtual_bgc_count": 1,
  "novel_count": 1,
  "vqc_accuracy": 0.823,
  "top_candidates": [
    {
      "bgc_id": "VBGC_0000",
      "score": 0.891,
      "bgc_class": "NRPS",
      "novelty": 24.56
    }
  ]
}

🧪 Testing

Test Backend API

python test_backend.py

Test Pipeline Components

# Test ORF calling
python call_orfs.py --input edna_fasta/GCA_000205625.1.fasta --output-dir test_orfs/

# Test BGC classification
python classify_bgcs.py --domain-table test_domains.csv --output test_bgcs.csv

Test Web Interface

Open http://localhost:3000
Click "Analyse a Sample"
Choose "Use Sample Data"
Verify pipeline executes all phases
Check results display correctly

📈 Performance

Benchmarks (on test dataset)

Metric	Value
Total BGCs detected	68
Virtual BGCs	14
Novel BGCs	11 (78.6%)
VQC Accuracy	80.4%
Processing time	~2 minutes

See benchmark_results/benchmark_report.txt for detailed metrics.

🔧 Configuration

Backend Configuration

Edit backend_api.py:

# Server settings
app.run(debug=True, host='0.0.0.0', port=5000)

# File paths
UPLOAD_FOLDER = Path('uploads')
RESULTS_FOLDER = Path('results')
SAMPLE_FASTA = Path('edna_fasta/GCA_000205625.1.fasta')

Frontend Configuration

Edit frontend/app.js:

// API base URL
const API_BASE_URL = 'http://localhost:5000/api';

🐛 Troubleshooting

Video Not Playing

Verify frontend/assets/DNA.mp4 exists
Check browser console for errors
Try different browser

Buttons Not Working

Verify app.js and styles.css are linked in HTML
Check browser console (F12)
Hard refresh: Ctrl+F5

Backend Errors

Check Python version: python --version (need 3.8+)
Install dependencies: pip install -r backend_requirements.txt
Verify port 5000 is available

Pipeline Errors

Install Prodigal: conda install -c bioconda prodigal
Check input file format (valid FASTA)
Verify file permissions

📚 Documentation

FULLSTACK_README.md - Complete setup guide
FULLSTACK_INTEGRATION.md - Technical architecture
BACKEND_FIXES.md - Backend implementation details
TEST_INTEGRATION.md - Testing procedures
CURRENT_STATUS.md - Project status and roadmap

🤝 Contributing

Fork the repository
Create a feature branch
Make your changes
Test thoroughly
Submit a pull request

📄 License

This project is part of the BGC-QDR pipeline research.

🙏 Acknowledgments

Frontend Design: Modern web design principles
Fonts: DM Sans, DM Serif Display, DM Mono (Google Fonts)
Pipeline: BGC-QDR research project
Tools: Prodigal, BioPython, Flask

📞 Support

For issues or questions:

Check the troubleshooting section
Review documentation files
Check browser/backend console for errors
Verify all dependencies are installed

Version: 2.1.0
Last Updated: 2026-05-12
Status: ✅ Production Ready (with 9 Priority Bug Fixes)

🆕 What's New in v2.1.0

9 Priority Bug Fixes Implemented ✅

All critical bug fixes have been successfully implemented and tested:

✅ Input QC Module - Robust quality control with BioPython
✅ Novelty Caching - Intelligent caching for performance
✅ Domain Completeness Scoring - Accurate BGC scoring
✅ Per-Contig Logging - Comprehensive detection logging
✅ VQC Score Distribution - Statistical analysis of scores
✅ Sequence QC in Output - Quality metrics in results
✅ API Cache Middleware - Fast repeated queries
✅ Frontend QC Warnings - Visual quality indicators
✅ Unified Pipeline Runner - One-command execution
✅ Synthetic Sequence Detection - Prevents inflated BGC counts
✅ antiSMASH Validation - Validated against gold standard

Test Results: 14/14 tests passing (100% success rate)

Validation: ✅ 100% agreement with antiSMASH (gold standard)

Documentation:

QUICK_START.md - User guide for new features
BUGFIX_SUMMARY.md - Technical implementation details
TEST_RESULTS.md - Comprehensive test results
COMPLETION_SUMMARY.md - Project completion summary
SYNTHETIC_DETECTION.md - Synthetic sequence handling
ANTISMASH_VALIDATION_RESULTS.md - Gold standard validation

Quick Start with New Features

# Run complete pipeline with QC and synthetic exclusion
python scripts/run_pipeline.py --input sample.fasta --output results/ --exclude-synthetic

# Run input QC only
python scripts/input_qc.py --input sample.fasta --output filtered.fasta --report qc_report.json --exclude-synthetic

# Validate against antiSMASH
python test_antismash_validation.py

# Run tests
python test_bugfixes.py        # Unit tests
python test_integration.py     # Integration tests

Version: 2.1.0
Last Updated: 2026-05-12
Status: ✅ Production Ready

Quick Links:

Frontend: http://localhost:3000
Backend API: http://localhost:5000/api
Health Check: http://localhost:5000/api/health

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
ablation_results		ablation_results
backend		backend
benchmarking		benchmarking
cache		cache
docker		docker
docs		docs
edna_fasta		edna_fasta
frontend		frontend
pipeline		pipeline
pipeline_real_bgcs		pipeline_real_bgcs
pipeline_run_output10		pipeline_run_output10
pipeline_run_output2		pipeline_run_output2
pipeline_run_output3		pipeline_run_output3
pipeline_run_output4		pipeline_run_output4
pipeline_run_output5		pipeline_run_output5
pipeline_run_output6		pipeline_run_output6
pipeline_run_output7		pipeline_run_output7
pipeline_run_output8		pipeline_run_output8
pipeline_run_output9		pipeline_run_output9
pipeline_vqc_run1		pipeline_vqc_run1
qc_reports		qc_reports
scripts		scripts
setup		setup
stage2		stage2
tests		tests
validation		validation
website		website
.gitattributes		.gitattributes
.gitignore		.gitignore
ANTISMASH_VALIDATION_GUIDE.md		ANTISMASH_VALIDATION_GUIDE.md
ANTISMASH_VALIDATION_RESULTS.md		ANTISMASH_VALIDATION_RESULTS.md
ANTISMASH_VALIDATION_SUMMARY.md		ANTISMASH_VALIDATION_SUMMARY.md
BUGFIX_SUMMARY.md		BUGFIX_SUMMARY.md
COMPLETION_SUMMARY.md		COMPLETION_SUMMARY.md
FINAL_CHECKLIST.md		FINAL_CHECKLIST.md
FINAL_IMPLEMENTATION_SUMMARY.md		FINAL_IMPLEMENTATION_SUMMARY.md
IMPLEMENTATION_COMPLETE.md		IMPLEMENTATION_COMPLETE.md
NEXT_STEPS_VALIDATION.md		NEXT_STEPS_VALIDATION.md
QUICK_START.md		QUICK_START.md
README.md		README.md
SYNTHETIC_CONCERN_RESOLVED.md		SYNTHETIC_CONCERN_RESOLVED.md
SYNTHETIC_DETECTION.md		SYNTHETIC_DETECTION.md
TEST_RESULTS.md		TEST_RESULTS.md
VALIDATION_STATUS.md		VALIDATION_STATUS.md
antismash_validation_results.json		antismash_validation_results.json
check_hmmer.py		check_hmmer.py
check_pfam.py		check_pfam.py
edna_sample2_bgc.fasta		edna_sample2_bgc.fasta
env_only.fasta		env_only.fasta
env_qc_report.json		env_qc_report.json
filtered.fasta		filtered.fasta
organize_files.ps1		organize_files.ps1
package-lock.json		package-lock.json
qc_report.json		qc_report.json
real_bgcs.fasta		real_bgcs.fasta
real_bgcs2.fasta		real_bgcs2.fasta
run_final.py		run_final.py
run_validation_test.py		run_validation_test.py
sample.fasta		sample.fasta
start_fullstack.ps1		start_fullstack.ps1
start_servers.ps1		start_servers.ps1
start_website.ps1		start_website.ps1
test_antismash_validation.py		test_antismash_validation.py
test_backend_integration.py		test_backend_integration.py
test_bugfixes.py		test_bugfixes.py
test_integration.py		test_integration.py
test_out.fasta		test_out.fasta
test_pipeline_qc.fasta		test_pipeline_qc.fasta
test_pipeline_qc.json		test_pipeline_qc.json
test_qc_output.fasta		test_qc_output.fasta
test_qc_report.json		test_qc_report.json
test_report.json		test_report.json
wait_and_run.py		wait_and_run.py

Folders and files

Latest commit

History

Repository files navigation

BGC-QDR: Biosynthetic Gene Cluster Detection & Analysis Pipeline

🎯 Overview

🏗️ Architecture

📊 Pipeline Workflow

Input

Phase 1-2: BGC Detection

Phase 3: Graph Reconstruction

Phase 4-5: Novelty Assessment

Phase 6: VQC Ranking

Output

🚀 Quick Start

Prerequisites

Installation

Running the Application

📁 Project Structure

🔬 Pipeline Components

1. ORF Calling (call_orfs.py)

2. BGC Classification (classify_bgcs.py)

3. Benchmarking (benchmark_bgcqdr.py)

🌐 Web Interface

Features

API Endpoints

📊 Example Results

Input

Output

🧪 Testing

Test Backend API

Test Pipeline Components

Test Web Interface

📈 Performance

Benchmarks (on test dataset)

🔧 Configuration

Backend Configuration

Frontend Configuration

🐛 Troubleshooting

Video Not Playing

Buttons Not Working

Backend Errors

Pipeline Errors

📚 Documentation

🤝 Contributing

📄 License

🙏 Acknowledgments

📞 Support

🆕 What's New in v2.1.0

9 Priority Bug Fixes Implemented ✅

Quick Start with New Features

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

1. ORF Calling (`call_orfs.py`)

2. BGC Classification (`classify_bgcs.py`)

3. Benchmarking (`benchmark_bgcqdr.py`)

Packages