A complete full-stack application for detecting, analyzing, and ranking Biosynthetic Gene Clusters (BGCs) from environmental DNA samples.
BGC-QDR (Biosynthetic Gene Cluster - Quality Detection & Ranking) is an integrated pipeline that combines:
- Frontend: Modern web interface with Palantir-style design
- Backend: Flask REST API for pipeline orchestration
- Pipeline: Python-based BGC detection and analysis tools
┌─────────────────────────────────────────────────────────┐
│  Web Interface                                          │
│  (Modern UI with DNA video background)                  │
│  - Sample upload                                        │
│  - Real-time progress tracking                          │
│  - Interactive results visualization                    │
└─────────────────────┬───────────────────────────────────┘
                      │ REST API
┌─────────────────────▼───────────────────────────────────┐
│  Flask Backend API                                      │
│  - File upload handling                                 │
│  - Job management                                       │
│  - Pipeline orchestration                               │
└─────────────────────┬───────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────┐
│  BGC-QDR Pipeline (6 Phases)                            │
│                                                         │
│  Phase 1-2: BGC Detection                               │
│    - ORF prediction (Prodigal)                          │
│    - Domain annotation                                  │
│    - BGC classification                                 │
│                                                         │
│  Phase 3: Graph Reconstruction                          │
│    - Build BGC similarity graphs                        │
│    - Identify virtual BGCs                              │
│                                                         │
│  Phase 4-5: Novelty Assessment                          │
│    - Compare against MIBiG database                     │
│    - Calculate novelty scores                           │
│                                                         │
│  Phase 6: VQC Ranking                                   │
│    - Virtual Quality Control scoring                    │
│    - Rank candidates by confidence                      │
└─────────────────────────────────────────────────────────┘
- FASTA file containing genomic sequences (contigs/scaffolds)
- Environmental DNA (eDNA) samples from metagenomic sequencing
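A minimal sketch of a pre-upload sanity check for the FASTA input described above. This is not the project's own validation code; the function name `check_fasta` and the accepted alphabet (IUPAC nucleotide codes) are assumptions made for illustration.

```python
from typing import Iterable, Tuple

# Assumption: inputs are nucleotide FASTA, so only IUPAC nucleotide codes are accepted.
VALID_NUCLEOTIDES = set("ACGTUNRYKMSWBDHV")


def check_fasta(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (record_count, total_bases); raise ValueError on malformed input."""
    records = 0
    bases = 0
    current_len = None  # None until the first '>' header is seen
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines
        if line.startswith(">"):
            if current_len == 0:
                raise ValueError(f"line {lineno}: previous record has no sequence")
            records += 1
            current_len = 0
            continue
        if current_len is None:
            raise ValueError(f"line {lineno}: sequence data before first '>' header")
        bad = set(line.upper()) - VALID_NUCLEOTIDES
        if bad:
            raise ValueError(f"line {lineno}: unexpected characters {sorted(bad)}")
        current_len += len(line)
        bases += len(line)
    if records == 0:
        raise ValueError("no FASTA records found")
    if current_len == 0:
        raise ValueError("last record has no sequence")
    return records, bases
```

To check a file on disk, pass the open file object directly, for example `check_fasta(open("sample.fasta"))`.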
- ORF Prediction (call_orfs.py)
  - Uses Prodigal in metagenomic mode
  - Predicts protein-coding genes
  - Outputs: proteins.faa, nucleotides.fna, genes.gbk
- BGC Classification (classify_bgcs.py)
  - Applies biological rules to identify BGC types
  - Classifies into: PKS, NRPS, RiPP, Terpene, etc.
  - Filters candidates by confidence score
- Builds similarity graphs between detected BGCs
- Identifies "virtual BGCs" (consensus sequences)
- Reduces redundancy in metagenomic data
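One way to picture the redundancy reduction above is to link BGCs whose domain sets overlap and treat each connected group as a single unit. The sketch below is an illustration only: the README does not say how BGC-QDR measures similarity or builds virtual BGCs, so the Jaccard measure, the 0.5 threshold, and the function names are all assumptions.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard similarity of two domain sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_similar_bgcs(
    domains: Dict[str, FrozenSet[str]], threshold: float = 0.5
) -> List[Set[str]]:
    """Group BGC ids into connected components of the similarity graph."""
    parent = {bgc_id: bgc_id for bgc_id in domains}

    def find(x: str) -> str:
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Add an edge (merge two components) whenever similarity meets the threshold.
    for a, b in combinations(domains, 2):
        if jaccard(domains[a], domains[b]) >= threshold:
            parent[find(a)] = find(b)

    groups: Dict[str, Set[str]] = {}
    for bgc_id in domains:
        groups.setdefault(find(bgc_id), set()).add(bgc_id)
    return sorted(groups.values(), key=lambda g: sorted(g))
```

Each returned group would correspond to one candidate "virtual BGC" under these assumptions; how a consensus is derived from the group members is outside this sketch.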
- Compares BGCs against known clusters (MIBiG database)
- Calculates novelty percentage
- Identifies potentially novel natural products
- Virtual Quality Control scoring
- Ranks candidates by:
  - Confidence score (0-1)
  - Novelty percentage
  - BGC class completeness
- Outputs top candidates for further analysis
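The three ranking criteria above can be combined into a single sort key. The following is a sketch under stated assumptions, not the VQC scoring formula itself: the weights, the linear combination, and the 0-1 scale for completeness are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    bgc_id: str
    confidence: float    # 0-1, as described above
    novelty_pct: float   # 0-100
    completeness: float  # 0-1 (hypothetical scale for class completeness)


# Hypothetical weights for confidence, novelty, and completeness.
WEIGHTS = (0.5, 0.3, 0.2)


def composite_score(c: Candidate) -> float:
    """Weighted sum of the three criteria, with novelty rescaled to 0-1."""
    w_conf, w_nov, w_comp = WEIGHTS
    return w_conf * c.confidence + w_nov * (c.novelty_pct / 100.0) + w_comp * c.completeness


def rank_candidates(candidates: List[Candidate], top_n: int = 10) -> List[Candidate]:
    """Return the top_n candidates, highest composite score first."""
    return sorted(candidates, key=composite_score, reverse=True)[:top_n]
```

A weighted sum keeps the trade-off between confidence and novelty explicit and easy to tune; other schemes, such as sorting by confidence and breaking ties on novelty, would be equally compatible with the description.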
- JSON results with ranked BGC candidates
- Detailed reports with confidence scores
- Downloadable data for downstream analysis
- Python 3.8+
- Prodigal (optional, for ORF prediction)
- Modern web browser
- Clone the repository:
  git clone <repository-url>
  cd web.dv
- Install Python dependencies:
  pip install -r backend_requirements.txt
- Set up directories:
  mkdir -p uploads results frontend/assets
- Copy frontend assets:
  cp "New folder/index.html" frontend/index.html
  cp "New folder/DNA.mp4" frontend/assets/DNA.mp4
Option 1: Automated Startup (Windows)
.\start_fullstack.ps1
Option 2: Manual Startup
Terminal 1 - Backend:
python backend_api.py
Terminal 2 - Frontend:
cd frontend
python -m http.server 3000
Access the application:
- Frontend: http://localhost:3000
- Backend API: http://localhost:5000/api
web.dv/
├── frontend/                    # Web interface
│   ├── index.html               # Main HTML page
│   ├── app.js                   # JavaScript with API integration
│   ├── styles.css               # Additional styles
│   └── assets/
│       └── DNA.mp4              # Background video
│
├── backend_api.py               # Flask REST API server
├── backend_requirements.txt     # Python dependencies
│
├── call_orfs.py                 # ORF prediction wrapper
├── classify_bgcs.py             # BGC classification engine
├── benchmark_bgcqdr.py          # Performance benchmarking
├── compare_with_deepbgc.py      # Comparison with DeepBGC
│
├── uploads/                     # User-uploaded FASTA files
├── results/                     # Pipeline results (JSON)
├── edna_fasta/                  # Sample eDNA datasets
├── benchmark_results/           # Benchmark data
│
├── README.md                    # This file
├── FULLSTACK_README.md          # Detailed setup guide
└── BACKEND_FIXES.md             # Backend implementation details
python call_orfs.py \
--input regions.fasta \
--output-dir orfs/ \
--threads 4
Purpose: Predict protein-coding genes from DNA sequences
Tool: Prodigal (metagenomic mode)
Output:
- regions_proteins.faa - Amino acid sequences
- regions_nucleotides.fna - Nucleotide sequences
- regions_genes.gbk - GenBank format
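For orientation, a wrapper of this kind typically reduces to one Prodigal invocation per input file. The sketch below is an assumption about what such a wrapper might do, not the contents of call_orfs.py; the function names are hypothetical. The Prodigal flags used (`-i`, `-a`, `-d`, `-o`, `-f gbk`, `-p meta`) are standard Prodigal options.

```python
import subprocess
from pathlib import Path
from typing import List


def build_prodigal_command(fasta: Path, out_dir: Path) -> List[str]:
    """Build a Prodigal command producing the three output files listed above."""
    stem = fasta.stem  # e.g. "regions" for regions.fasta
    return [
        "prodigal",
        "-i", str(fasta),                               # input contigs
        "-a", str(out_dir / f"{stem}_proteins.faa"),    # protein translations
        "-d", str(out_dir / f"{stem}_nucleotides.fna"), # gene nucleotide sequences
        "-o", str(out_dir / f"{stem}_genes.gbk"),       # gene coordinates
        "-f", "gbk",                                    # GenBank-like output format
        "-p", "meta",                                   # metagenomic mode
    ]


def run_prodigal(fasta: Path, out_dir: Path) -> None:
    """Run Prodigal; requires the prodigal executable on PATH."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_prodigal_command(fasta, out_dir), check=True)
```

Keeping command construction separate from execution makes the command easy to inspect or log before anything is run.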
python classify_bgcs.py \
--domain-table domains.csv \
--output bgc_candidates.csv \
--min-score 0.4 \
--min-domains 2
Purpose: Classify and filter BGC candidates
Rules Engine:
- PKS: Requires PKS + ACP domains
- NRPS: Requires A + PCP domains
- RiPP: Lanthipeptide, Thiopeptide markers
- Terpene: Terpene synthase/cyclase
- Hybrid: Multiple biosynthetic systems
Output:
- Classified BGC candidates
- Confidence scores (high/medium/low)
- Domain composition
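The rules listed above map naturally onto set operations over a candidate's domain labels. This is a simplified sketch, not the logic in classify_bgcs.py: the domain label strings, the "Unclassified" fallback, and the reading of `--min-domains` as a minimum count of distinct domains are assumptions.

```python
from typing import Callable, Dict, Set

# Each rule takes the set of domain labels found in a candidate region.
# Label strings are hypothetical; the real domain table may use other names.
RULES: Dict[str, Callable[[Set[str]], bool]] = {
    "PKS": lambda d: {"PKS", "ACP"} <= d,            # requires both domains
    "NRPS": lambda d: {"A", "PCP"} <= d,             # requires both domains
    "RiPP": lambda d: bool(d & {"Lanthipeptide", "Thiopeptide"}),
    "Terpene": lambda d: bool(d & {"Terpene_synthase", "Terpene_cyclase"}),
}


def classify(domains: Set[str], min_domains: int = 2) -> str:
    """Return a BGC class name, "Hybrid" if several rules match, else "Unclassified"."""
    if len(domains) < min_domains:
        return "Unclassified"
    matched = [name for name, rule in RULES.items() if rule(domains)]
    if len(matched) > 1:
        return "Hybrid"  # multiple biosynthetic systems in one region
    return matched[0] if matched else "Unclassified"
```

For example, `classify({"A", "PCP", "C"})` yields `"NRPS"`, while a region carrying both PKS and NRPS domain pairs is reported as `"Hybrid"`.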
python benchmark_bgcqdr.py \
--input-dir edna_fasta/ \
--output-dir benchmark_results/
Purpose: Evaluate pipeline performance
Metrics:
- Detection accuracy
- False positive rate
- Processing time
- Comparison with DeepBGC
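To make the first two metrics concrete, here is a sketch of how detection accuracy and false positive rate could be computed from sets of region identifiers. The README does not define these metrics precisely, so the region-level definitions and the function name are assumptions, not the benchmark script's method.

```python
from typing import Dict, Set


def detection_metrics(
    all_regions: Set[str], predicted: Set[str], truth: Set[str]
) -> Dict[str, float]:
    """Compute accuracy and false positive rate.

    Assumes predicted and truth are both subsets of all_regions.
    """
    tp = len(predicted & truth)                # predicted and real
    fp = len(predicted - truth)                # predicted but not real
    fn = len(truth - predicted)                # real but missed
    tn = len(all_regions - predicted - truth)  # correctly left unflagged
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }
```

The same function can be applied to the output of a second tool on the same regions to give a like-for-like comparison.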
- Hero Section
  - Animated DNA video background
  - Text scramble effect
  - Call-to-action buttons
- Pipeline Visualization
  - 4 interactive phases
  - Hover effects with details
  - SVG illustrations
- Sample Upload
  - Drag-and-drop interface
  - FASTA file validation
  - Sample data option
- Results Display
  - Interactive table
  - Confidence score bars
  - BGC class visualization
  - Download functionality
| Endpoint | Method | Description |
|---|---|---|
| /api/health | GET | Health check |
| /api/stats | GET | Pipeline statistics |
| /api/detect | POST | BGC detection (Phase 1-2) |
| /api/reconstruct | POST | Graph reconstruction (Phase 3) |
| /api/novelty | POST | Novelty assessment (Phase 4-5) |
| /api/rank | POST | VQC ranking (Phase 6) |
| /api/results/<job_id> | GET | Download results |
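The endpoints above can be called from a script as well as from the web interface. The sketch below only builds the requests with the standard library; the JSON request body and the `job_id` field are assumptions, since the README does not document what each POST endpoint expects (file uploads in particular may use multipart form data instead).

```python
import json
import urllib.request
from typing import Optional

API_BASE_URL = "http://localhost:5000/api"


def build_request(endpoint: str, payload: Optional[dict] = None) -> urllib.request.Request:
    """Build a GET request, or a POST with a JSON body when payload is given."""
    url = f"{API_BASE_URL}/{endpoint.lstrip('/')}"
    if payload is None:
        return urllib.request.Request(url, method="GET")
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},  # assumed body format
        method="POST",
    )
```

With the backend running, a health check would then be `urllib.request.urlopen(build_request("health"))`, and the response body can be decoded with `json.load`.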
Sample: GCA_000205625.1.fasta
Sequences: 3 contigs
Size: 45.6 KB
{
"job_id": "job_1715097201",
"bgc_count": 3,
"virtual_bgc_count": 1,
"novel_count": 1,
"vqc_accuracy": 0.823,
"top_candidates": [
{
"bgc_id": "VBGC_0000",
"score": 0.891,
"bgc_class": "NRPS",
"novelty": 24.56
}
]
}
python test_backend.py
# Test ORF calling
python call_orfs.py --input edna_fasta/GCA_000205625.1.fasta --output-dir test_orfs/
# Test BGC classification
python classify_bgcs.py --domain-table test_domains.csv --output test_bgcs.csv
- Open http://localhost:3000
- Click "Analyse a Sample"
- Choose "Use Sample Data"
- Verify pipeline executes all phases
- Check results display correctly
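The last manual check can be complemented by a small script that inspects a downloaded result file. This is a sketch assuming the result JSON has the shape of the example output shown in this README; the function name `summarize_result` is hypothetical.

```python
import json
from typing import List


def summarize_result(raw: str) -> List[str]:
    """Validate a result document and return one summary line per top candidate."""
    result = json.loads(raw)
    for key in ("job_id", "bgc_count", "top_candidates"):
        if key not in result:
            raise ValueError(f"missing field: {key}")
    lines = []
    for cand in result["top_candidates"]:
        # Scores are described as confidence values in the range 0-1.
        if not 0.0 <= cand["score"] <= 1.0:
            raise ValueError(f"{cand['bgc_id']}: score out of range")
        lines.append(
            f"{cand['bgc_id']}\t{cand['bgc_class']}\t"
            f"score={cand['score']:.3f}\tnovelty={cand['novelty']:.2f}%"
        )
    return lines
```

Running it on a file saved from the results endpoint, for example `summarize_result(open("result.json").read())`, gives a quick text view of the ranked candidates.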
| Metric | Value |
|---|---|
| Total BGCs detected | 68 |
| Virtual BGCs | 14 |
| Novel BGCs | 11 (78.6%) |
| VQC Accuracy | 80.4% |
| Processing time | ~2 minutes |
See benchmark_results/benchmark_report.txt for detailed metrics.
Edit backend_api.py:
# Server settings
app.run(debug=True, host='0.0.0.0', port=5000)
# File paths
UPLOAD_FOLDER = Path('uploads')
RESULTS_FOLDER = Path('results')
SAMPLE_FASTA = Path('edna_fasta/GCA_000205625.1.fasta')
Edit frontend/app.js:
// API base URL
const API_BASE_URL = 'http://localhost:5000/api';
If the background video does not play:
- Verify frontend/assets/DNA.mp4 exists
- Check browser console for errors
- Try a different browser
If styles or scripts do not load:
- Verify app.js and styles.css are linked in HTML
- Check browser console (F12)
- Hard refresh: Ctrl+F5
If the backend does not start:
- Check Python version: python --version (need 3.8+)
- Install dependencies: pip install -r backend_requirements.txt
- Verify port 5000 is available
If the pipeline fails:
- Install Prodigal: conda install -c bioconda prodigal
- Check input file format (valid FASTA)
- Verify file permissions
- FULLSTACK_README.md - Complete setup guide
- FULLSTACK_INTEGRATION.md - Technical architecture
- BACKEND_FIXES.md - Backend implementation details
- TEST_INTEGRATION.md - Testing procedures
- CURRENT_STATUS.md - Project status and roadmap
- Fork the repository
- Create a feature branch
- Make your changes
- Test thoroughly
- Submit a pull request
This project is part of the BGC-QDR pipeline research.
- Frontend Design: Modern web design principles
- Fonts: DM Sans, DM Serif Display, DM Mono (Google Fonts)
- Pipeline: BGC-QDR research project
- Tools: Prodigal, BioPython, Flask
For issues or questions:
- Check the troubleshooting section
- Review documentation files
- Check browser/backend console for errors
- Verify all dependencies are installed
Version: 2.1.0
Last Updated: 2026-05-12
Status: ✅ Production Ready (with 9 Priority Bug Fixes)
All critical bug fixes have been successfully implemented and tested:
- ✓ Input QC Module - Robust quality control with BioPython
- ✓ Novelty Caching - Intelligent caching for performance
- ✓ Domain Completeness Scoring - Accurate BGC scoring
- ✓ Per-Contig Logging - Comprehensive detection logging
- ✓ VQC Score Distribution - Statistical analysis of scores
- ✓ Sequence QC in Output - Quality metrics in results
- ✓ API Cache Middleware - Fast repeated queries
- ✓ Frontend QC Warnings - Visual quality indicators
- ✓ Unified Pipeline Runner - One-command execution
- ✓ Synthetic Sequence Detection - Prevents inflated BGC counts
- ✓ antiSMASH Validation - Validated against gold standard
Test Results: 14/14 tests passing (100% success rate)
Validation: ✓ 100% agreement with antiSMASH (gold standard)
Documentation:
- QUICK_START.md - User guide for new features
- BUGFIX_SUMMARY.md - Technical implementation details
- TEST_RESULTS.md - Comprehensive test results
- COMPLETION_SUMMARY.md - Project completion summary
- SYNTHETIC_DETECTION.md - Synthetic sequence handling
- ANTISMASH_VALIDATION_RESULTS.md - Gold standard validation
# Run complete pipeline with QC and synthetic exclusion
python scripts/run_pipeline.py --input sample.fasta --output results/ --exclude-synthetic
# Run input QC only
python scripts/input_qc.py --input sample.fasta --output filtered.fasta --report qc_report.json --exclude-synthetic
# Validate against antiSMASH
python test_antismash_validation.py
# Run tests
python test_bugfixes.py # Unit tests
python test_integration.py # Integration tests
Quick Links:
- Frontend: http://localhost:3000
- Backend API: http://localhost:5000/api
- Health Check: http://localhost:5000/api/health