LAAVA is a comprehensive internal pipeline for detecting, characterizing, and visualizing recombinant AAV (rAAV) integration sites using Oxford Nanopore long-read sequencing data.
⚠️ This repository is a summary only. The full pipeline code is not publicly available due to licensing restrictions and internal use.
LAAVA supports scalable rAAV integration analysis from raw FASTQ reads through to annotated visual summaries across multiple gene therapy research projects.
- Primary alignment to project-specific AAV transgene or plasmid sequences
- Secondary alignment to full reference genome (e.g., mm39)
- Assembly of target-mapped reads using Canu
- Integration site detection and chromosomal distribution analysis
- Visualization of mapping rates, read characteristics, and coverage profiles
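The alignment and assembly steps above can be sketched as a list of shell commands built in Python. This is only an illustration, not the pipeline's actual code: the function name, output layout, thread count, and `canu_genome_size` value are hypothetical, and feeding only target-mapped reads into the genome alignment is an assumption about how the two stages connect. The `minimap2`, `samtools`, and `canu` options used are standard ones (Canu 2.x syntax is assumed for `-nanopore`).

```python
import shlex
from pathlib import PurePosixPath


def alignment_commands(sample, fastq, target_fa, genome_fa,
                       out_dir="results", threads=8, canu_genome_size="5k"):
    """Return shell command strings for one sample; nothing is executed here.

    All file names and defaults are illustrative assumptions.
    """
    out = PurePosixPath(out_dir) / sample
    target_bam = out / f"{sample}.target.bam"
    hits_fastq = out / f"{sample}.target_hits.fastq"
    genome_bam = out / f"{sample}.genome.bam"

    def q(path):
        # Quote every path so spaces or shell metacharacters are safe.
        return shlex.quote(str(path))

    return [
        # 1. Primary alignment to the AAV transgene or plasmid reference.
        f"minimap2 -ax map-ont -t {threads} {q(target_fa)} {q(fastq)}"
        f" | samtools sort -@ {threads} -o {q(target_bam)} -",
        f"samtools index {q(target_bam)}",
        # 2. Keep reads with a primary alignment to the target
        #    (0x904 = unmapped + secondary + supplementary, all excluded).
        f"samtools fastq -F 0x904 {q(target_bam)} > {q(hits_fastq)}",
        # 3. Secondary alignment of those reads to the full genome.
        f"minimap2 -ax map-ont -t {threads} {q(genome_fa)} {q(hits_fastq)}"
        f" | samtools sort -@ {threads} -o {q(genome_bam)} -",
        f"samtools index {q(genome_bam)}",
        # 4. Assembly of the target-mapped reads with Canu.
        f"canu -p {q(sample)} -d {q(out / 'canu')}"
        f" genomeSize={canu_genome_size} -nanopore {q(hits_fastq)}",
    ]


if __name__ == "__main__":
    for cmd in alignment_commands("mouse01", "fasta_files/mouse01.fastq",
                                  "refs/transgene.fa", "refs/mm39.fa"):
        print(cmd)
```

Returning command strings rather than running them keeps the sketch easy to inspect and to hand to a scheduler such as LSF.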
- Designed and implemented the pipeline core using Bash and R scripts
- Automated dual-alignment strategy with metadata-driven sample handling
- Built modular project structure to support different targets across studies
- Developed Docker-compatible workflow for HPC (LSF) execution
- Generated publication-ready plots and integration summaries in R
| Tool | Purpose |
|---|---|
| minimap2 | Long-read alignment to target and genome references |
| samtools | BAM processing and QC |
| Canu | Local assembly of high-coverage integration loci |
| R | Summary statistics and ggplot2 visualizations |
| Docker | Containerization for reproducible, portable execution |
| LSF | HPC job scheduling |
- Sample metadata CSV: defines sample names, references, and FASTQ paths
- FASTQ files: stored in a central `fasta_files/` directory
- Target reference FASTA files: e.g., rAAV, transgene, or plasmid
- Full genome FASTA: for secondary genome-wide integration alignment
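As a rough illustration of metadata-driven sample handling, the sketch below reads a sample sheet and validates it before any alignment would start. The column names (`sample`, `target_fasta`, `genome_fasta`, `fastq`) and the `Sample` class are hypothetical; the real metadata layout is not described here.

```python
import csv
import io
from dataclasses import dataclass

# Hypothetical column names; adjust to the actual sample sheet.
REQUIRED = ("sample", "target_fasta", "genome_fasta", "fastq")


@dataclass(frozen=True)
class Sample:
    name: str
    target_fasta: str
    genome_fasta: str
    fastq: str


def read_samples(handle):
    """Parse a metadata CSV from an open text handle into Sample records."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"metadata is missing columns: {', '.join(missing)}")

    samples, seen = [], set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        values = {c: (row.get(c) or "").strip() for c in REQUIRED}
        empty = [c for c, v in values.items() if not v]
        if empty:
            raise ValueError(f"line {line_no}: empty field(s): {', '.join(empty)}")
        if values["sample"] in seen:
            raise ValueError(f"line {line_no}: duplicate sample {values['sample']!r}")
        seen.add(values["sample"])
        samples.append(Sample(values["sample"], values["target_fasta"],
                              values["genome_fasta"], values["fastq"]))
    return samples


if __name__ == "__main__":
    sheet = io.StringIO(
        "sample,target_fasta,genome_fasta,fastq\n"
        "mouse01,refs/transgene.fa,refs/mm39.fa,fasta_files/mouse01.fastq\n"
    )
    for s in read_samples(sheet):
        print(s.name, s.target_fasta, s.fastq)
```

Failing early on missing columns or duplicate sample names avoids submitting HPC jobs that would only fail after alignment has begun.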
- BAMs for primary and secondary alignments
- BED files for predicted integration coordinates
- Coverage reports and per-sample summary statistics
- Plots:
  - `combined_capture_efficiency_plot.png`
  - `secondary_alignment_mapping_rates.png`
  - `combined_coverage_plot.png`
  - `violin_plot.png` (read length distributions)
- Hotspot identification from BED coordinates
- Chromosome-level distribution analysis
- Coverage threshold tracking at 25x, 50x, 100x, etc.
- Comparative read mapping efficiency across multiple samples
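The analyses listed above can be approximated with a few small functions. This is a minimal sketch under stated assumptions, not the pipeline's method: hotspots are defined here by counting sites in fixed-size windows, and the `window` and `min_sites` defaults are arbitrary illustrative values. The coordinates in the example are made up.

```python
from collections import Counter


def parse_bed(lines):
    """Yield (chrom, start, end) from BED lines, skipping headers and blanks."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        yield chrom, int(start), int(end)


def chromosome_counts(sites):
    """Count integration sites per chromosome."""
    return Counter(chrom for chrom, _, _ in sites)


def hotspots(sites, window=10_000, min_sites=3):
    """Return (chrom, bin_start, bin_end, n) for windows with >= min_sites sites."""
    bins = Counter((chrom, start // window) for chrom, start, _ in sites)
    return sorted(
        (chrom, idx * window, (idx + 1) * window, n)
        for (chrom, idx), n in bins.items()
        if n >= min_sites
    )


def fraction_at_depth(depths, thresholds=(25, 50, 100)):
    """Fraction of positions whose depth meets each coverage threshold."""
    depths = list(depths)
    if not depths:
        return {t: 0.0 for t in thresholds}
    return {t: sum(d >= t for d in depths) / len(depths) for t in thresholds}


if __name__ == "__main__":
    # Toy coordinates for illustration only.
    bed = [
        "chr1\t1000\t1001",
        "chr1\t4500\t4501",
        "chr1\t9000\t9001",
        "chr2\t50000\t50001",
    ]
    sites = list(parse_bed(bed))
    print(chromosome_counts(sites))            # Counter({'chr1': 3, 'chr2': 1})
    print(hotspots(sites))                     # [('chr1', 0, 10000, 3)]
    print(fraction_at_depth([10, 30, 60, 120]))  # {25: 0.75, 50: 0.5, 100: 0.25}
```

Fixed windows are simple but can split a cluster that straddles a bin boundary; a sliding window or distance-based clustering would be a reasonable alternative.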
| Observation | Interpretation |
|---|---|
| High target alignment + low genome mapping | High integration specificity |
| Widespread genome mapping + hotspots | Off-target insertion sites present |
| Low target alignment | Poor capture / sample quality |
If referencing this pipeline in your work:
Sophia DeGeorgia. LAAVA: Long-read AAV Integration Analysis Pipeline. Internal research tool. 2024.
Tools include: minimap2, samtools, Canu, ggplot2
Also cite key dependencies:
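- Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics (2018).
- Koren S. et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Research (2017).
- Li H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics (2009).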
This repository summarizes an internally developed bioinformatics pipeline.
The source code is not open-source and is not available for distribution.
- CANVAS-summary — CNV analysis from low-pass WGS