Skip to content

ML-dev-crypto/BGC-PIPELINE

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

16 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

BGC-QDR: Biosynthetic Gene Cluster Detection & Analysis Pipeline

A complete full-stack application for detecting, analyzing, and ranking Biosynthetic Gene Clusters (BGCs) from environmental DNA samples.

🎯 Overview

BGC-QDR (Biosynthetic Gene Cluster - Quality Detection & Ranking) is an integrated pipeline that combines:

  • Frontend: Modern web interface with Palantir-style design
  • Backend: Flask REST API for pipeline orchestration
  • Pipeline: Python-based BGC detection and analysis tools

πŸ—οΈ Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                  Web Interface                          β”‚
β”‚  (Modern UI with DNA video background)                 β”‚
β”‚  - Sample upload                                        β”‚
β”‚  - Real-time progress tracking                          β”‚
β”‚  - Interactive results visualization                    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                     β”‚ REST API
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              Flask Backend API                          β”‚
β”‚  - File upload handling                                 β”‚
β”‚  - Job management                                       β”‚
β”‚  - Pipeline orchestration                               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                     β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚           BGC-QDR Pipeline (6 Phases)                   β”‚
β”‚                                                          β”‚
β”‚  Phase 1-2: BGC Detection                               β”‚
β”‚    β†’ ORF prediction (Prodigal)                          β”‚
β”‚    β†’ Domain annotation                                  β”‚
β”‚    β†’ BGC classification                                 β”‚
β”‚                                                          β”‚
β”‚  Phase 3: Graph Reconstruction                          β”‚
β”‚    β†’ Build BGC similarity graphs                        β”‚
β”‚    β†’ Identify virtual BGCs                              β”‚
β”‚                                                          β”‚
β”‚  Phase 4-5: Novelty Assessment                          β”‚
β”‚    β†’ Compare against MIBiG database                     β”‚
β”‚    β†’ Calculate novelty scores                           β”‚
β”‚                                                          β”‚
β”‚  Phase 6: VQC Ranking                                   β”‚
β”‚    β†’ Virtual Quality Control scoring                    β”‚
β”‚    β†’ Rank candidates by confidence                      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ“Š Pipeline Workflow

Input

  • FASTA file containing genomic sequences (contigs/scaffolds)
  • Environmental DNA (eDNA) samples from metagenomic sequencing

Phase 1-2: BGC Detection

  1. ORF Prediction (call_orfs.py)

    • Uses Prodigal in metagenomic mode
    • Predicts protein-coding genes
    • Outputs: proteins.faa, nucleotides.fna, genes.gbk
  2. BGC Classification (classify_bgcs.py)

    • Applies biological rules to identify BGC types
    • Classifies into: PKS, NRPS, RiPP, Terpene, etc.
    • Filters candidates by confidence score

Phase 3: Graph Reconstruction

  • Builds similarity graphs between detected BGCs
  • Identifies "virtual BGCs" (consensus sequences)
  • Reduces redundancy in metagenomic data

Phase 4-5: Novelty Assessment

  • Compares BGCs against known clusters (MIBiG database)
  • Calculates novelty percentage
  • Identifies potentially novel natural products

Phase 6: VQC Ranking

  • Virtual Quality Control scoring
  • Ranks candidates by:
    • Confidence score (0-1)
    • Novelty percentage
    • BGC class completeness
  • Outputs top candidates for further analysis

Output

  • JSON results with ranked BGC candidates
  • Detailed reports with confidence scores
  • Downloadable data for downstream analysis

πŸš€ Quick Start

Prerequisites

  • Python 3.8+
  • Prodigal (optional, for ORF prediction)
  • Modern web browser

Installation

  1. Clone the repository:

    git clone <repository-url>
    cd web.dv
  2. Install Python dependencies:

    pip install -r backend_requirements.txt
  3. Set up directories:

    mkdir -p uploads results frontend/assets
  4. Copy frontend assets:

    cp "New folder/index.html" frontend/index.html
    cp "New folder/DNA.mp4" frontend/assets/DNA.mp4

Running the Application

Option 1: Automated Startup (Windows)

.\start_fullstack.ps1

Option 2: Manual Startup

Terminal 1 - Backend:

python backend_api.py

Terminal 2 - Frontend:

cd frontend
python -m http.server 3000

Access the application:

πŸ“ Project Structure

web.dv/
β”œβ”€β”€ frontend/                    # Web interface
β”‚   β”œβ”€β”€ index.html              # Main HTML page
β”‚   β”œβ”€β”€ app.js                  # JavaScript with API integration
β”‚   β”œβ”€β”€ styles.css              # Additional styles
β”‚   └── assets/
β”‚       └── DNA.mp4             # Background video
β”‚
β”œβ”€β”€ backend_api.py              # Flask REST API server
β”œβ”€β”€ backend_requirements.txt    # Python dependencies
β”‚
β”œβ”€β”€ call_orfs.py               # ORF prediction wrapper
β”œβ”€β”€ classify_bgcs.py           # BGC classification engine
β”œβ”€β”€ benchmark_bgcqdr.py        # Performance benchmarking
β”œβ”€β”€ compare_with_deepbgc.py    # Comparison with DeepBGC
β”‚
β”œβ”€β”€ uploads/                   # User-uploaded FASTA files
β”œβ”€β”€ results/                   # Pipeline results (JSON)
β”œβ”€β”€ edna_fasta/               # Sample eDNA datasets
β”œβ”€β”€ benchmark_results/        # Benchmark data
β”‚
β”œβ”€β”€ README.md                 # This file
β”œβ”€β”€ FULLSTACK_README.md       # Detailed setup guide
└── BACKEND_FIXES.md          # Backend implementation details

πŸ”¬ Pipeline Components

1. ORF Calling (call_orfs.py)

python call_orfs.py \
  --input regions.fasta \
  --output-dir orfs/ \
  --threads 4

Purpose: Predict protein-coding genes from DNA sequences

Tool: Prodigal (metagenomic mode)

Output:

  • regions_proteins.faa - Amino acid sequences
  • regions_nucleotides.fna - Nucleotide sequences
  • regions_genes.gbk - GenBank format

2. BGC Classification (classify_bgcs.py)

python classify_bgcs.py \
  --domain-table domains.csv \
  --output bgc_candidates.csv \
  --min-score 0.4 \
  --min-domains 2

Purpose: Classify and filter BGC candidates

Rules Engine:

  • PKS: Requires PKS + ACP domains
  • NRPS: Requires A + PCP domains
  • RiPP: Lanthipeptide, Thiopeptide markers
  • Terpene: Terpene synthase/cyclase
  • Hybrid: Multiple biosynthetic systems

Output:

  • Classified BGC candidates
  • Confidence scores (high/medium/low)
  • Domain composition

3. Benchmarking (benchmark_bgcqdr.py)

python benchmark_bgcqdr.py \
  --input-dir edna_fasta/ \
  --output-dir benchmark_results/

Purpose: Evaluate pipeline performance

Metrics:

  • Detection accuracy
  • False positive rate
  • Processing time
  • Comparison with DeepBGC

🌐 Web Interface

Features

  1. Hero Section

    • Animated DNA video background
    • Text scramble effect
    • Call-to-action buttons
  2. Pipeline Visualization

    • 4 interactive phases
    • Hover effects with details
    • SVG illustrations
  3. Sample Upload

    • Drag-and-drop interface
    • FASTA file validation
    • Sample data option
  4. Results Display

    • Interactive table
    • Confidence score bars
    • BGC class visualization
    • Download functionality

API Endpoints

Endpoint Method Description
/api/health GET Health check
/api/stats GET Pipeline statistics
/api/detect POST BGC detection (Phase 1-2)
/api/reconstruct POST Graph reconstruction (Phase 3)
/api/novelty POST Novelty assessment (Phase 4-5)
/api/rank POST VQC ranking (Phase 6)
/api/results/<job_id> GET Download results

πŸ“Š Example Results

Input

Sample: GCA_000205625.1.fasta
Sequences: 3 contigs
Size: 45.6 KB

Output

{
  "job_id": "job_1715097201",
  "bgc_count": 3,
  "virtual_bgc_count": 1,
  "novel_count": 1,
  "vqc_accuracy": 0.823,
  "top_candidates": [
    {
      "bgc_id": "VBGC_0000",
      "score": 0.891,
      "bgc_class": "NRPS",
      "novelty": 24.56
    }
  ]
}

πŸ§ͺ Testing

Test Backend API

python test_backend.py

Test Pipeline Components

# Test ORF calling
python call_orfs.py --input edna_fasta/GCA_000205625.1.fasta --output-dir test_orfs/

# Test BGC classification
python classify_bgcs.py --domain-table test_domains.csv --output test_bgcs.csv

Test Web Interface

  1. Open http://localhost:3000
  2. Click "Analyse a Sample"
  3. Choose "Use Sample Data"
  4. Verify pipeline executes all phases
  5. Check results display correctly

πŸ“ˆ Performance

Benchmarks (on test dataset)

Metric Value
Total BGCs detected 68
Virtual BGCs 14
Novel BGCs 11 (78.6%)
VQC Accuracy 80.4%
Processing time ~2 minutes

See benchmark_results/benchmark_report.txt for detailed metrics.

πŸ”§ Configuration

Backend Configuration

Edit backend_api.py:

# Server settings
app.run(debug=True, host='0.0.0.0', port=5000)

# File paths
UPLOAD_FOLDER = Path('uploads')
RESULTS_FOLDER = Path('results')
SAMPLE_FASTA = Path('edna_fasta/GCA_000205625.1.fasta')

Frontend Configuration

Edit frontend/app.js:

// API base URL
const API_BASE_URL = 'http://localhost:5000/api';

πŸ› Troubleshooting

Video Not Playing

  • Verify frontend/assets/DNA.mp4 exists
  • Check browser console for errors
  • Try different browser

Buttons Not Working

  • Verify app.js and styles.css are linked in HTML
  • Check browser console (F12)
  • Hard refresh: Ctrl+F5

Backend Errors

  • Check Python version: python --version (need 3.8+)
  • Install dependencies: pip install -r backend_requirements.txt
  • Verify port 5000 is available

Pipeline Errors

  • Install Prodigal: conda install -c bioconda prodigal
  • Check input file format (valid FASTA)
  • Verify file permissions

πŸ“š Documentation

  • FULLSTACK_README.md - Complete setup guide
  • FULLSTACK_INTEGRATION.md - Technical architecture
  • BACKEND_FIXES.md - Backend implementation details
  • TEST_INTEGRATION.md - Testing procedures
  • CURRENT_STATUS.md - Project status and roadmap

🀝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Test thoroughly
  5. Submit a pull request

πŸ“„ License

This project is part of the BGC-QDR pipeline research.

πŸ™ Acknowledgments

  • Frontend Design: Modern web design principles
  • Fonts: DM Sans, DM Serif Display, DM Mono (Google Fonts)
  • Pipeline: BGC-QDR research project
  • Tools: Prodigal, BioPython, Flask

πŸ“ž Support

For issues or questions:

  1. Check the troubleshooting section
  2. Review documentation files
  3. Check browser/backend console for errors
  4. Verify all dependencies are installed

Version: 2.1.0
Last Updated: 2026-05-12
Status: βœ… Production Ready (with 9 Priority Bug Fixes)

πŸ†• What's New in v2.1.0

9 Priority Bug Fixes Implemented βœ…

All critical bug fixes have been successfully implemented and tested:

  1. βœ… Input QC Module - Robust quality control with BioPython
  2. βœ… Novelty Caching - Intelligent caching for performance
  3. βœ… Domain Completeness Scoring - Accurate BGC scoring
  4. βœ… Per-Contig Logging - Comprehensive detection logging
  5. βœ… VQC Score Distribution - Statistical analysis of scores
  6. βœ… Sequence QC in Output - Quality metrics in results
  7. βœ… API Cache Middleware - Fast repeated queries
  8. βœ… Frontend QC Warnings - Visual quality indicators
  9. βœ… Unified Pipeline Runner - One-command execution
  10. βœ… Synthetic Sequence Detection - Prevents inflated BGC counts
  11. βœ… antiSMASH Validation - Validated against gold standard

Test Results: 14/14 tests passing (100% success rate)

Validation: βœ… 100% agreement with antiSMASH (gold standard)

Documentation:

  • QUICK_START.md - User guide for new features
  • BUGFIX_SUMMARY.md - Technical implementation details
  • TEST_RESULTS.md - Comprehensive test results
  • COMPLETION_SUMMARY.md - Project completion summary
  • SYNTHETIC_DETECTION.md - Synthetic sequence handling
  • ANTISMASH_VALIDATION_RESULTS.md - Gold standard validation

Quick Start with New Features

# Run complete pipeline with QC and synthetic exclusion
python scripts/run_pipeline.py --input sample.fasta --output results/ --exclude-synthetic

# Run input QC only
python scripts/input_qc.py --input sample.fasta --output filtered.fasta --report qc_report.json --exclude-synthetic

# Validate against antiSMASH
python test_antismash_validation.py

# Run tests
python test_bugfixes.py        # Unit tests
python test_integration.py     # Integration tests

Version: 2.1.0
Last Updated: 2026-05-12
Status: βœ… Production Ready

Quick Links:

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors