This repository contains core computational pipelines and analysis scripts used to generate the results and figures for the manuscript:
Chen, H. et al. GuFi phages represent the most prevalent viral family-level clusters in the human gut microbiome. bioRxiv. https://doi.org/10.64898/2026.01.26.701711
The pipelines cover:
- Identifying viral OTUs from metagenomic assemblies
- Constructing viral family-level clusters
- Estimating prevalence and abundance in metagenomic samples
- Predicting virus-host associations from Hi-C data
- Variant calling and pN/pS analysis
- notebooks: Python notebooks and R scripts to reproduce all figures in the manuscript (tested with Python v3.10.11 and R v4.5.0). To reproduce the figures, clone this repository and run the code from within this directory.
- data: Input data files. Before executing the code, also download the Supplementary Data files from Zenodo and save them in this directory.
- All Supplementary Data files, identified viral sequences, and representative vOTU sequences are available via Zenodo at DOI:10.5281/zenodo.18253939.
- The hybrid MAGs used for host association and read coverage analysis are available from the European Nucleotide Archive (ENA) under project accession PRJEB49168.
- Hi-C (n=84) and virus-like particle (VLP; n=64) metagenomic sequencing reads are available from ENA under project accession PRJEB106095. Illumina (n=109), Oxford Nanopore (n=109), and Hi-C (n=24) metagenomic sequencing reads are available from ENA under project accession PRJEB49168.
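The Zenodo download step described above could be scripted roughly as follows. This is a minimal sketch, not part of the repository: the function names are hypothetical, the `data` destination is assumed from the directory name above, and the code assumes that the record ID is the numeric suffix of the DOI and that Zenodo's public records API returns a `files` list whose entries carry `key` and `links.self` fields. Check the record page if the layout differs.

```python
import json
import re
import urllib.request
from pathlib import Path

ZENODO_DOI = "10.5281/zenodo.18253939"  # DOI quoted in this README


def record_id_from_doi(doi: str) -> str:
    """Extract the numeric record ID from a Zenodo DOI (10.5281/zenodo.<id>)."""
    match = re.fullmatch(r"10\.5281/zenodo\.(\d+)", doi.strip())
    if match is None:
        raise ValueError(f"not a Zenodo DOI: {doi!r}")
    return match.group(1)


def file_links(record: dict) -> dict[str, str]:
    """Map file name -> download URL from a Zenodo record JSON document.

    Assumption: each entry in record["files"] has a "key" (file name)
    and a links["self"] download URL.
    """
    return {entry["key"]: entry["links"]["self"] for entry in record.get("files", [])}


def download_record(doi: str = ZENODO_DOI, dest: str = "data") -> list[Path]:
    """Download every file of the record into dest (hypothetical helper)."""
    api_url = f"https://zenodo.org/api/records/{record_id_from_doi(doi)}"
    with urllib.request.urlopen(api_url) as response:
        record = json.load(response)

    dest_dir = Path(dest)
    dest_dir.mkdir(parents=True, exist_ok=True)

    saved = []
    for name, url in file_links(record).items():
        target = dest_dir / Path(name).name
        if not target.exists():  # skip files that were already downloaded
            urllib.request.urlretrieve(url, target)
        saved.append(target)
    return saved
```

Calling `download_record()` from the repository root would place the files under `data/`. Nothing is downloaded on import, so the helpers can be inspected or adapted first.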
For questions regarding the code and data in this repository, please contact Hanrong Chen and Niranjan Nagarajan.
If you use this repository, please cite:
Chen, H. et al. GuFi phages represent the most prevalent viral family-level clusters in the human gut microbiome. bioRxiv. https://doi.org/10.64898/2026.01.26.701711
Please also cite the lab's previous paper on the SPMP cohort:
Gounot, J.-S. et al. Genome-centric analysis of short and long read metagenomes reveals uncharacterized microbiome diversity in Southeast Asians. Nature Communications, 13, 6044 (2022).