This repository contains workflows, scripts, validation examples, and representative results used for fragmentomics analysis of cell-free DNA (cfDNA) from Whole Genome Sequencing (WGS) datasets.
The project focuses on identifying fragmentation patterns, fragment-length distributions, and end-motif signatures associated with healthy and cancer samples. The workflow was developed and executed on the Washington University Compute1 HPC environment using large-scale sequencing datasets and cohort-level analyses.
Detailed pipeline diagram: https://github.com/Daark-Devil/Fragmentomics_Biomarker/blob/main/Doc/fragmentomics_detailed_pipeline.svg
Compact summary diagram: https://github.com/Daark-Devil/Fragmentomics_Biomarker/blob/main/Doc/fragmentomics_compact_summary.svg
Built FASTQ-to-BAM WGS processing workflow.
Extracted TLEN-based fragment lengths from BAM files.
Compared healthy and cancer fragment-length distributions.
Added 4-mer and 5-mer end-motif analysis.
Built EDTA-vs-Streck and healthy-vs-cancer comparison outputs.
Generated interactive dashboard with statistical summaries.
- Cancer and healthy cfDNA samples exhibited distinct fragmentation patterns and end-motif profiles.
- Cohort-level fragment-length distributions, compared using WGS-derived cfDNA data, showed significantly different patterns in cancer and healthy samples, and these patterns varied across cohorts.
- 4-mer and 5-mer end-motif analyses identified differences in terminal sequence composition between cohorts.
- Interactive dashboards were generated to visualize fragment lengths, motif frequencies, and statistical summaries.
- The workflow was validated using representative samples and scaled to cohort-level analyses on HPC infrastructure.
Cell-free DNA fragmentomics has emerged as a powerful approach for studying biological processes and disease-associated signatures. Differences in fragment lengths and end-motif patterns can provide information about chromatin organization, tissue origin, and disease status.
The primary goals of this project were:
- Process raw sequencing data from healthy and cancer cohorts
- Generate aligned BAM files from paired-end sequencing reads
- Extract fragment-length information
- Compare fragment-length distributions between cohorts
- Calculate cohort-level fragment statistics
- Perform 4-mer and 5-mer end-motif analysis
- Compare motif enrichment patterns between healthy and cancer samples
- Validate workflow reproducibility using representative test samples
- Generate publication-style visualizations and summary statistics
The repository includes an interactive EDTA WGS 5-mer fragmentomics dashboard for healthy-vs-cancer comparison. It summarizes motif abundance, fragment-length density, cancer-minus-healthy difference curves, fragment category shifts, Cohen’s d, binned KS distance, mean/median/Q1/Q3, and cohort-level motif comparisons.
Dashboard: https://github.com/Daark-Devil/Fragmentomics_Biomarker/blob/main/Doc/edta_wgs_5mer_fragmentomics_dashboard_v3.html
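Two of the dashboard statistics, Cohen's d and a binned KS distance, can be illustrated with a short sketch. This is not the dashboard's own code: the function names, the pooled-standard-deviation form of Cohen's d, and the 10 bp bin width are assumptions chosen for illustration.

```python
import math
from collections import Counter


def cohens_d(a, b):
    """Effect size of b relative to a, using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    var_a = sum((x - mean_a) ** 2 for x in a) / (na - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (nb - 1)
    pooled = math.sqrt(((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2))
    return (mean_b - mean_a) / pooled


def binned_ks(a, b, bin_width=10):
    """Largest absolute gap between the two cumulative distributions, evaluated per bin."""
    counts_a = Counter(x // bin_width for x in a)
    counts_b = Counter(x // bin_width for x in b)
    cum_a = cum_b = 0
    distance = 0.0
    for bin_index in sorted(set(counts_a) | set(counts_b)):
        cum_a += counts_a[bin_index]
        cum_b += counts_b[bin_index]
        distance = max(distance, abs(cum_a / len(a) - cum_b / len(b)))
    return distance


# Hypothetical fragment lengths (bp), for illustration only.
healthy = [100, 110, 120]
cancer = [200, 210, 220]
effect = cohens_d(healthy, cancer)
ks = binned_ks(healthy, cancer)
```

With real data, `a` and `b` would be the pooled fragment lengths of the healthy and cancer cohorts. A positive `cohens_d(healthy, cancer)` would then mean that cancer fragments are longer on average.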
Paired-End FASTQ Files
↓
BWA Alignment
↓
SAM Files
↓
BAM Conversion & Sorting
↓
Fragment-Length Extraction
↓
Fragment Binning & Summaries
↓
Healthy vs Cancer Comparisons
↓
End-Motif Analysis
(4-mer / 5-mer)
↓
Quality Control
↓
Visualization & Reporting
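The fragment-length extraction step in the workflow above can be sketched as follows. This is a minimal illustration, not the repository's script. It assumes SAM text as input, where FLAG is column 2 and TLEN is column 9 in the SAM format. The filters (properly paired, mapped, primary alignments only) and the `max_len` cutoff of 1000 bp are assumptions, and `fragment_lengths` is a hypothetical name.

```python
def fragment_lengths(sam_lines, max_len=1000):
    """Yield one TLEN-based fragment length per properly paired fragment."""
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        flag = int(fields[1])
        tlen = int(fields[8])
        if not flag & 0x2:  # keep properly paired reads only
            continue
        if flag & (0x4 | 0x100 | 0x800):  # drop unmapped, secondary, supplementary
            continue
        # The leftmost mate carries the positive TLEN, so requiring tlen > 0
        # counts each fragment once instead of once per mate.
        if 0 < tlen <= max_len:
            yield tlen


# Hypothetical records for illustration (flag, TLEN and sequence are made up).
def _record(name, flag, tlen):
    return "\t".join([name, str(flag), "chr1", "1000", "60", "8M", "=", "1100",
                      str(tlen), "ACGTACGT", "IIIIIIII"])


example = [
    "@HD\tVN:1.6\tSO:coordinate",
    _record("r1", 99, 167),    # first mate, proper pair, forward strand
    _record("r1", 147, -167),  # second mate of the same fragment
    _record("r2", 355, 180),   # secondary alignment
    _record("r3", 65, 300),    # paired but not properly paired
]
lengths = list(fragment_lengths(example))
```

On real data, the same generator could read the output of `samtools view` from standard input, one line per alignment.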
Fragmentomics-WGS/
├── README.md
├── scripts/
├── results/
├── samples_test/
├── qc_check/
└── Doc/
The scripts directory contains the analysis workflows used throughout the project.
Major components include:
- BWA alignment workflows
- SAM generation
- BAM conversion and sorting
- Batch processing of multiple samples
- Fragment-length extraction
- Length binning
- Cohort-level summary generation
- Healthy vs cancer comparisons
- 4-mer motif analysis
- 5-mer motif analysis
- Cohort-level motif enrichment studies
- Fragment-length distributions
- Healthy vs cancer comparisons
- End-motif enrichment plots
- Cohort summary figures
Separate workflows were developed to support analysis of EDTA and Streck tube sample collections.
The repository includes representative outputs generated during analysis.
Included result categories:
- Per-sample fragment-length distributions
- Cohort-level fragment-length summaries
- Binned fragment-length comparisons
- Mean fragment-size calculations
- 4-mer motif frequencies
- 5-mer motif frequencies
- Healthy vs cancer motif comparisons
- Differential motif enrichment analyses
- Workflow progress tracking
- Sample counts
- Validation summaries
- Normalized fragment-length distributions
- Healthy vs cancer comparisons
- Motif enrichment plots
- EDTA vs Streck comparisons
- Fragment-length summary figures
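The normalized fragment-length distributions and summary statistics listed above can be sketched as follows. The 10 bp bin width, the function names, and the example lengths are assumptions for illustration, and the repository's scripts may bin and summarize differently.

```python
import statistics
from collections import Counter


def normalized_bins(lengths, bin_width=10):
    """Map each bin start (bp) to the fraction of fragments falling in that bin."""
    counts = Counter((n // bin_width) * bin_width for n in lengths)
    total = len(lengths)
    return {start: counts[start] / total for start in sorted(counts)}


def summarize(lengths):
    """Mean, median and quartiles of a list of fragment lengths."""
    q1, median, q3 = statistics.quantiles(lengths, n=4)
    return {
        "n": len(lengths),
        "mean": statistics.fmean(lengths),
        "median": median,
        "q1": q1,
        "q3": q3,
    }


# Hypothetical fragment lengths (bp), for illustration only.
lengths = [145, 150, 155, 167, 167, 168, 172]
distribution = normalized_bins(lengths)
summary = summarize(lengths)
```

Normalizing by the total fragment count makes samples with different sequencing depths comparable on the same axis.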
The samples_test directory contains representative validation examples used to verify workflow reproducibility.
Examples include:
Validation sample:
seq003213
Used to compare regenerated fragment-length outputs against original results.
Included files:
- SAM alignment
- Sorted BAM
- BAM index
- Fragment-length outputs
- Binned fragment distributions
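A reproducibility check of this kind can be sketched as a comparison of two fragment-length count tables. The sketch assumes a two-column text layout (`length count`, whitespace-separated), which may not match the actual output files. `parse_counts` and `diff_counts` are hypothetical names.

```python
def parse_counts(lines):
    """Parse 'length count' lines into a dict, skipping headers and blank lines."""
    counts = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2 or not (parts[0].isdigit() and parts[1].isdigit()):
            continue
        counts[int(parts[0])] = int(parts[1])
    return counts


def diff_counts(original, regenerated):
    """Return {length: (original_count, regenerated_count)} for every mismatch."""
    mismatches = {}
    for length in sorted(set(original) | set(regenerated)):
        pair = (original.get(length, 0), regenerated.get(length, 0))
        if pair[0] != pair[1]:
            mismatches[length] = pair
    return mismatches


# Hypothetical table contents, for illustration only.
original = parse_counts(["length count", "166 120", "167 150", "168 140"])
regenerated = parse_counts(["166 120", "167 150", "168 139", "169 1"])
mismatches = diff_counts(original, regenerated)
```

An empty result means the regenerated output reproduces the original counts exactly.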
Representative motif-count outputs generated from individual WGS samples.
These examples were used to verify motif extraction and motif-frequency calculations prior to cohort-level processing.
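The motif extraction and frequency calculation being verified here can be sketched as follows. This is an illustration under stated assumptions, not the repository's implementation. It takes the motif from the read sequence as stored in the SAM/BAM record, where reverse-strand reads are stored reverse-complemented, so their 5' end is recovered from the last k bases. Motifs containing non-ACGT bases are skipped. The function names are hypothetical.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def end_motif(seq, is_reverse, k=4):
    """Return the 5' end k-mer of a read, or None if it is too short or ambiguous."""
    seq = seq.upper()
    if len(seq) < k:
        return None
    if is_reverse:
        # Stored sequence is the reverse complement of the original read.
        motif = seq[-k:].translate(_COMPLEMENT)[::-1]
    else:
        motif = seq[:k]
    return motif if set(motif) <= set("ACGT") else None


def motif_frequencies(reads, k=4):
    """Relative frequency of each end motif over (sequence, is_reverse) pairs."""
    counts = Counter()
    for seq, is_reverse in reads:
        motif = end_motif(seq, is_reverse, k)
        if motif is not None:
            counts[motif] += 1
    total = sum(counts.values())
    return {motif: n / total for motif, n in counts.items()} if total else {}


# Hypothetical reads, for illustration only.
reads = [("CCCAGGTT", False), ("CCCATTTT", False), ("ACGTGGTT", True), ("NNNNACGT", False)]
frequencies = motif_frequencies(reads, k=4)
```

Passing `k=5` gives the 5-mer variant. In a SAM record, the `is_reverse` value corresponds to FLAG bit 0x10.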
bash wgs_stage1_bwa_worker.sh
bash wgs_stage2_samtools_worker.sh
bash WGS_length_count_and_bins.sh
bash run_wgs_motifs_group_5mer_fixed.sh
python plot_01_normalized_line.py
python plot_pilot_healthy_vs_cancer.py
- BWA
- SAMtools
- Bash
- Python
- Pandas
- NumPy
- Matplotlib
- Seaborn
- Washington University Compute1 HPC
- Linux
- Docker-based batch jobs
Large sequencing datasets, FASTQ files, BAM files, and cohort-scale intermediate files are not included in this repository due to storage constraints.
Only representative scripts, validation examples, summary outputs, and visualization results are provided.
Many scripts contain Compute1-specific file paths and may require modification before running on a different system.
Devansh Pancholi
M.S. Bioinformatics and Computational Biology
Saint Louis University
Research Areas:
- Computational Genomics
- Fragmentomics
- Cancer Genomics
- Sequencing Data Analysis
- Bioinformatics Workflow Development
- Machine Learning for Biological Data