Simple, rapid analysis of omics data
A comprehensive Bash pipeline for processing bulk ATAC-Seq data, from raw data to peak calling and visualization files.
This pipeline automates the processing of bulk ATAC-Seq data, handling various input formats and performing quality control, alignment, peak calling, and bigwig file generation. It's designed to be user-friendly while maintaining flexibility for advanced users.
- Multiple input formats: Supports SRA, FASTQ (single/paired-end), BAM, and SAM files
- Comprehensive quality control: Uses fastp for read trimming and quality filtering
- Efficient alignment: BWA for read alignment
- Duplicate removal: Sambamba for marking and removing PCR duplicates
- Peak calling: MACS3 for identifying accessible chromatin regions
- Visualization: BigWig file generation with deepTools
- Resume capability: Skip completed steps with the `-s` option
- Clean output: Optional removal of intermediate files
The pipeline requires the following software:
| Software | Purpose | Installation |
|---|---|---|
| sra-tools | SRA file conversion | conda install -c bioconda sra-tools |
| fastp | Quality control | conda install -c bioconda fastp |
| bwa | Read alignment | conda install -c bioconda bwa |
| samtools | SAM/BAM processing | conda install -c bioconda samtools |
| sambamba | Duplicate marking | conda install -c bioconda sambamba |
| macs3 | Peak calling | conda install -c bioconda macs3 |
| deeptools | BigWig generation | conda install -c bioconda deeptools |
# Clone the repository
git clone https://github.com/BioOmics/atacseq-pipeline.git
cd atacseq-pipeline
# Make the script executable
chmod +x atacseq-pipeline.sh
# Run the pipeline (single-end FASTQ example)
./atacseq-pipeline.sh -i sample.fq.gz -g genome.fa -a annotation.gtf -t 8

./atacseq-pipeline.sh -i INPUT -g GENOME -a ANNOTATION [OPTIONS]

| Option | Description |
|---|---|
| `-i`, `--input` | Input file (SRA, FASTQ, BAM, or SAM) |
| `-I`, `--input2` | Second input file for paired-end FASTQ |
| `-g`, `--genome` | Reference genome FASTA file |
| `-a`, `--annotation` | Annotation file (GTF or GFF3 format) |

| Option | Description | Default |
|---|---|---|
| `-q`, `--qualityBase` | Minimum base quality score | 20 |
| `-b`, `--binSize` | Generate bigwig with specified bin size | No bigwig |
| `-o`, `--output` | Output directory | `./[input]_pipeline_result` |
| `-t`, `--threads` | Number of CPU threads | 8 |
| `-f`, `--force` | Force overwrite output directory | No |
| `-s`, `--skip` | Skip completed steps | No |
| `-r`, `--remove` | Remove intermediate files | No |
| `-h`, `--help` | Show help message | - |
| `-v`, `--version` | Show version | - |
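
The option tables above can be illustrated with a small `getopts` loop. This is only a sketch of how such parsing might look, not the script's actual code: the variable names are hypothetical, only the short options are handled (`-h` and `-v` are left out), and building the default output directory from the input file's basename is an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: parse the short options listed in the tables above.
# Variable names are illustrative and not taken from atacseq-pipeline.sh.
parse_args() {
    # Defaults as documented in the options table
    INPUT=""; INPUT2=""; GENOME=""; ANNOTATION=""
    QUALITY=20; BIN_SIZE=""; OUTPUT=""; THREADS=8
    FORCE=0; SKIP=0; REMOVE=0

    OPTIND=1   # reset so the function can be called more than once
    while getopts ":i:I:g:a:q:b:o:t:fsr" opt; do
        case "$opt" in
            i) INPUT=$OPTARG ;;
            I) INPUT2=$OPTARG ;;
            g) GENOME=$OPTARG ;;
            a) ANNOTATION=$OPTARG ;;
            q) QUALITY=$OPTARG ;;
            b) BIN_SIZE=$OPTARG ;;
            o) OUTPUT=$OPTARG ;;
            t) THREADS=$OPTARG ;;
            f) FORCE=1 ;;
            s) SKIP=1 ;;
            r) REMOVE=1 ;;
            :) echo "Option -$OPTARG requires a value" >&2; return 1 ;;
            \?) echo "Unknown option: -$OPTARG" >&2; return 1 ;;
        esac
    done

    # -i, -g and -a are the required options
    if [ -z "$INPUT" ] || [ -z "$GENOME" ] || [ -z "$ANNOTATION" ]; then
        echo "Missing required option: -i, -g and -a are mandatory" >&2
        return 1
    fi

    # Assumption: [input] in the default output path is the input's basename
    if [ -z "$OUTPUT" ]; then
        OUTPUT="./$(basename "$INPUT")_pipeline_result"
    fi
}
```

Calling `parse_args "$@"` at the top of a script would leave the settings in plain shell variables for the later steps to read.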
[input]_pipeline_result/
├── fastq/     # Clean FASTQ files
├── bam/       # Alignment files (sorted, deduplicated)
├── peaks/     # MACS3 peak calling results
├── bigwig/    # BigWig files for visualization (if -b used)
└── report/    # Statistics and log files
./atacseq-pipeline.sh -i SRR12345678.sra -g hg38.fa -a hg38.gtf
./atacseq-pipeline.sh -i sample_R1.fq.gz -I sample_R2.fq.gz -g mm10.fa -a mm10.gtf -t 16
./atacseq-pipeline.sh -i sample.bam -g genome.fa -a genes.gtf -b 10 -o ./atac_results -t 8
./atacseq-pipeline.sh -i sample.fq.gz -g genome.fa -a genes.gtf -s -r

- Input conversion (if SRA): `fasterq-dump` → FASTQ
- Quality control: `fastp` → Clean FASTQ
- Alignment: `bwa mem` → SAM
- SAM to BAM: `samtools view` → BAM
- Sorting: `sambamba sort` → Sorted BAM
- Duplicate marking: `sambamba markdup` → Deduplicated BAM
- Indexing: `samtools index` → BAM index
- Peak calling: `macs3` → Narrow peaks
- BigWig generation (optional): `bamCoverage` → BigWig
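
To make the workflow steps above concrete, the following dry-run sketch prints one plausible command per step for a single-end FASTQ input instead of executing anything. The function name, the file names, the `hs` genome size, and the exact flags are assumptions for illustration; the real script may call the tools differently.

```bash
#!/usr/bin/env bash
# Hypothetical dry run: print one plausible command per workflow step.
# Only the tool names come from the workflow list; paths and flags are assumed.
atac_plan() {
    local sample=$1 genome=$2 threads=${3:-8} bin=${4:-}

    echo "fastp -i ${sample}.fq.gz -o ${sample}.clean.fq.gz -q 20 -w ${threads}"
    echo "bwa mem -t ${threads} ${genome} ${sample}.clean.fq.gz > ${sample}.sam"
    echo "samtools view -b -o ${sample}.bam ${sample}.sam"
    echo "sambamba sort -t ${threads} -o ${sample}.sorted.bam ${sample}.bam"
    echo "sambamba markdup -r -t ${threads} ${sample}.sorted.bam ${sample}.dedup.bam"
    echo "samtools index ${sample}.dedup.bam"
    echo "macs3 callpeak -t ${sample}.dedup.bam -f BAM -g hs -n ${sample} --outdir peaks"

    # The BigWig step is optional and only planned when a bin size is given
    if [ -n "$bin" ]; then
        echo "bamCoverage -b ${sample}.dedup.bam -o ${sample}.bw --binSize ${bin} -p ${threads}"
    fi
}
```

For example, `atac_plan sample genome.fa 8 10` lists eight commands, the last one being the `bamCoverage` call with a bin size of 10.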
The pipeline automatically checks for required software dependencies before execution. Modify the quality threshold or thread count according to your needs using the command-line options.
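
A dependency check of this kind could be sketched with `command -v`. The function name and the tool list variable below are hypothetical; the executable names follow the requirements table, with `fasterq-dump` standing in for sra-tools and `bamCoverage` for deepTools.

```bash
#!/usr/bin/env bash
# Print every tool that is not found on PATH; return non-zero if any is missing.
missing_tools() {
    local tool status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    return "$status"
}

# Executables assumed to correspond to the packages in the requirements table
ATAC_TOOLS="fasterq-dump fastp bwa samtools sambamba macs3 bamCoverage"

# Example use (commented out so that sourcing this file has no side effects):
# missing=$(missing_tools $ATAC_TOOLS) || { echo "Missing tools: $missing" >&2; exit 1; }
```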
- For paired-end data, both input files must be provided
- The genome FASTA file must be indexed with `bwa index` beforehand
- Annotation file format (GTF/GFF3) is required for MACS3
- Intermediate files can consume significant disk space; use `-r` to clean up
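
Whether the BWA index mentioned in the notes is already in place can be checked before a run. The helper below is a sketch with a hypothetical name; it relies on `bwa index` writing five files (`.amb`, `.ann`, `.bwt`, `.pac`, `.sa`) next to the FASTA.

```bash
#!/usr/bin/env bash
# Return 0 only if all five files written by `bwa index` sit next to the FASTA.
bwa_index_ready() {
    local fasta=$1 ext
    for ext in amb ann bwt pac sa; do
        [ -f "${fasta}.${ext}" ] || return 1
    done
    return 0
}

# Example use (commented out; it would start indexing if files are missing):
# bwa_index_ready genome.fa || bwa index genome.fa
```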
Contributions, issues, and feature requests are welcome! Feel free to check the issues page.
This project is licensed under the MIT License - see the LICENSE file for details.
BioOmics (haoyuchao@zju.edu.cn)
If you use this pipeline in your research, please cite:
@software{atacseq_pipeline,
author = {BioOmics},
title = {Bulk ATAC-Seq Analysis Pipeline},
year = {2024},
url = {https://github.com/BioOmics/atacseq-pipeline}
}
- The developers of all dependency tools
- Zhejiang University for support
A comprehensive Bash pipeline for processing bulk ChIP-Seq data, with integrated control sample processing for robust peak calling.
This pipeline automates the processing of bulk ChIP-Seq data, handling both IP and control samples simultaneously. It performs quality control, alignment, peak calling with control comparison, and visualization file generation. Designed for reproducibility and ease of use.
- Dual-sample processing: Simultaneously handles IP and control samples
- Multiple input formats: Supports SRA, FASTQ (single/paired-end), BAM, and SAM files
- Integrated control comparison: Proper peak calling with input/IgG control
- Comprehensive quality control: Fastp for read trimming and quality filtering
- Efficient alignment: BWA for read alignment
- Duplicate removal: Sambamba for marking and removing PCR duplicates
- Peak calling: MACS3 with experimental vs control comparison
- Visualization: BigWig file generation with deepTools
- Resume capability: Skip completed steps with the `-s` option
- Clean output: Optional removal of intermediate files
| Software | Purpose | Installation |
|---|---|---|
| sra-tools | SRA file conversion | conda install -c bioconda sra-tools |
| fastp | Quality control | conda install -c bioconda fastp |
| bwa | Read alignment | conda install -c bioconda bwa |
| samtools | SAM/BAM processing | conda install -c bioconda samtools |
| sambamba | Duplicate marking | conda install -c bioconda sambamba |
| macs3 | Peak calling | conda install -c bioconda macs3 |
| deeptools | BigWig generation | conda install -c bioconda deeptools |
# Clone the repository
git clone https://github.com/BioOmics/chipseq-pipeline.git
cd chipseq-pipeline
# Make the script executable
chmod +x chipseq-pipeline.sh
# Run the pipeline (single-end FASTQ example)
./chipseq-pipeline.sh -i IP_sample.fq.gz -c control.fq.gz -g genome.fa -a annotation.gtf -t 8

./chipseq-pipeline.sh -i INPUT -c CONTROL -g GENOME -a ANNOTATION [OPTIONS]

| Option | Description |
|---|---|
| `-i`, `--input` | IP sample file (SRA, FASTQ, BAM, or SAM) |
| `-I`, `--input2` | Second IP file for paired-end FASTQ |
| `-c`, `--control` | Control sample file (SRA, FASTQ, BAM, or SAM) |
| `-C`, `--control2` | Second control file for paired-end FASTQ |
| `-g`, `--genome` | Reference genome FASTA file |
| `-a`, `--annotation` | Annotation file (GTF or GFF3 format) |

| Option | Description | Default |
|---|---|---|
| `-q`, `--qualityBase` | Minimum base quality score | 20 |
| `-b`, `--binSize` | Generate bigwig with specified bin size | No bigwig |
| `-o`, `--output` | Output directory | `./[input]_pipeline_result` |
| `-t`, `--threads` | Number of CPU threads | 8 |
| `-f`, `--force` | Force overwrite output directory | No |
| `-s`, `--skip` | Skip completed steps | No |
| `-r`, `--remove` | Remove intermediate files | No |
| `-h`, `--help` | Show help message | - |
| `-v`, `--version` | Show version | - |
[input]_pipeline_result/
├── fastq/
│   ├── IP_clean/         # Clean IP FASTQ files
│   └── control_clean/    # Clean control FASTQ files
├── bam/
│   ├── IP/               # IP alignment files
│   └── control/          # Control alignment files
├── peaks/                # MACS3 peak calling results (IP vs control)
├── bigwig/               # BigWig files for visualization (if -b used)
└── report/               # Statistics and log files
./chipseq-pipeline.sh -i IP_sample.sra -c control.sra -g hg38.fa -a hg38.gtf
./chipseq-pipeline.sh -i IP_R1.fq.gz -I IP_R2.fq.gz -c control_R1.fq.gz -C control_R2.fq.gz -g mm10.fa -a mm10.gtf -t 16
./chipseq-pipeline.sh -i IP.bam -c control.bam -g genome.fa -a genes.gtf -b 10
./chipseq-pipeline.sh -i IP.fq.gz -c control.fq.gz -g genome.fa -a genes.gtf -s -r

- Input conversion (if SRA): `fasterq-dump` → FASTQ (both samples)
- Quality control: `fastp` → Clean FASTQ (both samples)
- Alignment: `bwa mem` → SAM (both samples)
- SAM to BAM: `samtools view` → BAM (both samples)
- Sorting: `sambamba sort` → Sorted BAM (both samples)
- Duplicate marking: `sambamba markdup` → Deduplicated BAM (both samples)
- Indexing: `samtools index` → BAM index (both samples)
- Peak calling: `macs3` with treatment vs control → Narrow peaks
- BigWig generation (optional): `bamCoverage` → BigWig (both samples)
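
The peak calling step compares the IP sample against its control. As a rough sketch, the helper below only prints a `macs3 callpeak` command with `-t` (treatment) and `-c` (control). The helper name, the BAM paths, the `hs` default genome size, and the `BAM`/`BAMPE` switch for paired-end data are assumptions, not options read from the pipeline script.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the macs3 callpeak command for an IP/control pair.
# Arguments: IP BAM, control BAM, sample name, paired-end (yes/no), genome size.
macs3_chip_cmd() {
    local ip_bam=$1 ctrl_bam=$2 name=$3 paired=${4:-no} gsize=${5:-hs}
    local fmt=BAM

    # Assumed choice: BAMPE lets MACS3 use the fragments of properly paired reads
    if [ "$paired" = "yes" ]; then
        fmt=BAMPE
    fi

    printf 'macs3 callpeak -t %s -c %s -f %s -g %s -n %s --outdir peaks\n' \
        "$ip_bam" "$ctrl_bam" "$fmt" "$gsize" "$name"
}
```

Printing the command first makes it easy to inspect before running it, for example with `eval "$(macs3_chip_cmd IP.bam control.bam sample)"`.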
- Control samples are essential: Always provide appropriate control (input/IgG) for proper peak calling
- Paired-end consistency: If using paired-end data for IP, control must also be paired-end
- Library compatibility: IP and control samples should be sequenced with the same library preparation method
- Peak calling parameters: MACS3 automatically adjusts for control sample size differences
- Both IP and control files must be in the same format (e.g., both SRA or both FASTQ)
- The genome FASTA file must be indexed with `bwa index` beforehand
- For paired-end data, both pairs must be provided for each sample
- Control sample processing follows the same QC and alignment steps as IP
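
The input rules above (same format for IP and control, and a paired-end control whenever the IP is paired-end) can be checked before any processing starts. The sketch below uses hypothetical function names and guesses the file type from the extension alone, which is an assumption about how inputs are named.

```bash
#!/usr/bin/env bash
# Guess the input type from the file extension (a trailing .gz is ignored).
file_kind() {
    case "${1%.gz}" in
        *.sra) echo sra ;;
        *.fq|*.fastq) echo fastq ;;
        *.bam) echo bam ;;
        *.sam) echo sam ;;
        *) echo unknown ;;
    esac
}

# Arguments mirror -i, -I, -c, -C; pass "" for an option that was not given.
check_inputs() {
    local ip1=$1 ip2=$2 ctrl1=$3 ctrl2=$4

    if [ -z "$ip1" ] || [ -z "$ctrl1" ]; then
        echo "Both an IP file (-i) and a control file (-c) are required" >&2
        return 1
    fi
    # IP and control must share one format
    if [ "$(file_kind "$ip1")" != "$(file_kind "$ctrl1")" ]; then
        echo "IP and control must be in the same format" >&2
        return 1
    fi
    # A paired-end IP sample needs a paired-end control
    if [ -n "$ip2" ] && [ -z "$ctrl2" ]; then
        echo "IP is paired-end, so the control must be paired-end too (-C)" >&2
        return 1
    fi
    return 0
}
```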
Contributions, issues, and feature requests are welcome! Feel free to check the issues page.
This project is licensed under the MIT License - see the LICENSE file for details.
BioOmics (haoyuchao@zju.edu.cn)
If you use this pipeline in your research, please cite:
@software{chipseq_pipeline,
author = {BioOmics},
title = {Bulk ChIP-Seq Analysis Pipeline},
year = {2024},
url = {https://github.com/BioOmics/chipseq-pipeline}
}
- The developers of all dependency tools
- Zhejiang University for support
A comprehensive Bash pipeline for processing bulk RNA-Seq data, from raw reads to gene expression quantification and visualization.
This pipeline automates the processing of bulk RNA-Seq data, handling various input formats and performing quality control, splice-aware alignment, transcript quantification, and bigwig file generation. It uses STAR for fast and accurate alignment and RSEM for precise transcript/gene-level quantification.
- Multiple input formats: Supports SRA, FASTQ (single/paired-end), BAM, and SAM files
- Splice-aware alignment: STAR for accurate mapping across splice junctions
- Transcript quantification: RSEM for gene and isoform-level expression estimates
- Comprehensive quality control: Fastp for read trimming and quality filtering
- Duplicate awareness: Sambamba for marking duplicates (optional)
- Visualization: BigWig file generation with deepTools
- Resume capability: Skip completed steps with the `-s` option
- Clean output: Optional removal of intermediate files
| Software | Purpose | Installation |
|---|---|---|
| sra-tools | SRA file conversion | conda install -c bioconda sra-tools |
| fastp | Quality control | conda install -c bioconda fastp |
| STAR | Splice-aware alignment | conda install -c bioconda star |
| rsem | Gene/transcript quantification | conda install -c bioconda rsem |
| sambamba | BAM processing | conda install -c bioconda sambamba |
| deeptools | BigWig generation | conda install -c bioconda deeptools |
# Clone the repository
git clone https://github.com/Haoyu-Chao/rnaseq-pipeline.git
cd rnaseq-pipeline
# Make the script executable
chmod +x rnaseq-pipeline.sh
# Run the pipeline (single-end FASTQ example)
./rnaseq-pipeline.sh -i sample.fq.gz -g genome.fa -a annotation.gtf -t 8

./rnaseq-pipeline.sh -i INPUT -g GENOME -a ANNOTATION [OPTIONS]

| Option | Description |
|---|---|
| `-i`, `--input` | Input file (SRA, FASTQ, BAM, or SAM) |
| `-I`, `--input2` | Second input file for paired-end FASTQ |
| `-g`, `--genome` | Reference genome FASTA file |
| `-a`, `--annotation` | Annotation file (GTF or GFF3 format) |

| Option | Description | Default |
|---|---|---|
| `-q`, `--qualityBase` | Minimum base quality score | 20 |
| `-b`, `--binSize` | Generate bigwig with specified bin size | No bigwig |
| `-o`, `--output` | Output directory | `./[input]_pipeline_result` |
| `-t`, `--threads` | Number of CPU threads | 8 |
| `-f`, `--force` | Force overwrite output directory | No |
| `-s`, `--skip` | Skip completed steps | No |
| `-r`, `--remove` | Remove intermediate files | No |
| `-h`, `--help` | Show help message | - |
| `-v`, `--version` | Show version | - |
[input]_pipeline_result/
├── fastp/             # Fastp quality control reports (HTML/JSON)
├── bam/               # Sorted BAM files with indexes
├── quantification/
│   ├── rsem/          # RSEM output files (genes/isoforms results)
│   └── counts/        # Raw count matrices
├── bigwig/            # BigWig files for visualization (if -b used)
└── logs/              # Pipeline execution logs
RSEM generates multiple output files:
- `.genes.results`: Gene-level expression estimates (TPM/expected counts)
- `.isoforms.results`: Transcript-level expression estimates
- `.transcript.bam`: Transcriptome-aligned BAM file
- `.stat`: Alignment statistics
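
As an illustration of working with the gene-level output, the sketch below lists genes whose TPM reaches a cutoff. The function name and the default cutoff of 1 are hypothetical; the code assumes the tab-separated `.genes.results` layout with a header row that contains `gene_id` and `TPM` columns.

```bash
#!/usr/bin/env bash
# Print gene_id and TPM for genes at or above a TPM cutoff.
# Columns are located by header name, so their order is not hard-coded.
expressed_genes() {
    local results=$1 min_tpm=${2:-1}
    awk -F'\t' -v min="$min_tpm" '
        NR == 1 {
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        $(col["TPM"]) + 0 >= min + 0 {
            print $(col["gene_id"]) "\t" $(col["TPM"])
        }
    ' "$results"
}
```

A call such as `expressed_genes sample.genes.results 5 > expressed.tsv` would keep genes with TPM of at least 5; the file name here is only an example.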
./rnaseq-pipeline.sh -i SRR12345678.sra -g hg38.fa -a hg38.gtf
./rnaseq-pipeline.sh -i sample_R1.fq.gz -I sample_R2.fq.gz -g mm10.fa -a mm10.gtf -t 16
./rnaseq-pipeline.sh -i sample.bam -g genome.fa -a genes.gtf -b 10 -o ./rnaseq_results
./rnaseq-pipeline.sh -i sample.fq.gz -g genome.fa -a genes.gtf -s -r

- Input conversion (if SRA): `fasterq-dump` → FASTQ
- Quality control: `fastp` → Clean FASTQ + HTML report
- Genome index preparation (if needed): `STAR --runMode genomeGenerate`
- Alignment: `STAR --runMode alignReads` → SAM
- BAM processing: `samtools view` → `sambamba sort` → `sambamba markdup`
- RSEM preparation: `rsem-prepare-reference` (if needed)
- Quantification: `rsem-calculate-expression` → Gene/transcript counts
- BigWig generation (optional): `bamCoverage` → Normalized coverage tracks
- Genome indexing: Both STAR and RSEM require indexed genomes. The pipeline will check and generate them if missing:

      # STAR index (required in genome directory)
      STAR --runMode genomeGenerate --genomeDir ./star_index --genomeFastaFiles genome.fa --sjdbGTFfile annotation.gtf
      # RSEM index (required in genome directory)
      rsem-prepare-reference --gtf annotation.gtf genome.fa ./rsem_index/reference

- Memory requirements: STAR alignment can be memory-intensive. For large genomes (e.g., human), ensure sufficient RAM (≥30 GB recommended)
- Stranded libraries: RSEM assumes an unstranded library by default; for stranded protocols, set `--strandedness` (`none`, `forward`, or `reverse`) in the `rsem-calculate-expression` call
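
A check for existing indices, as described in the genome indexing note, might look like the sketch below. The function names are hypothetical, and the file lists are assumptions about what a finished STAR index directory and RSEM reference prefix usually contain; they are not taken from the pipeline script.

```bash
#!/usr/bin/env bash
# Assumption: a finished STAR index directory holds at least these four files.
star_index_ready() {
    local dir=$1 f
    for f in Genome SA SAindex genomeParameters.txt; do
        [ -f "$dir/$f" ] || return 1
    done
    return 0
}

# Assumption: rsem-prepare-reference writes <prefix>.grp, .ti and .transcripts.fa.
rsem_ref_ready() {
    local prefix=$1 ext
    for ext in grp ti transcripts.fa; do
        [ -f "$prefix.$ext" ] || return 1
    done
    return 0
}

# Example use (commented out; paths follow the commands shown in the note above):
# star_index_ready ./star_index || echo "STAR index missing, run genomeGenerate"
# rsem_ref_ready ./rsem_index/reference || echo "RSEM reference missing"
```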
The pipeline generates several quality metrics:
- Fastp reports: Base quality, GC content, adapter contamination
- STAR logs: Mapping rates, unique/multi-mapping reads, splice junctions
- RSEM statistics: Alignment rates, gene detection rates
- BigWig files: Genome coverage tracks for visualization (IGV, UCSC)
- The genome FASTA and annotation GTF files must be compatible (same chromosome names, coordinates)
- For paired-end data, both files must be provided with `-i` and `-I`
- STAR and RSEM indices should be generated once per genome/annotation combination
- Consider using the `-r` flag to remove intermediate files (SAM, unsorted BAM) to save disk space
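
The requirement that FASTA and GTF use the same chromosome names can be verified up front. The sketch below, with a hypothetical function name, lists sequence names that occur in the annotation but not in the genome; it compares names only and does not check coordinates.

```bash
#!/usr/bin/env bash
# List sequence names used in the annotation but absent from the FASTA.
# Empty output means every annotated chromosome exists in the genome.
unmatched_chroms() {
    local fasta=$1 gtf=$2
    awk '
        NR == FNR {
            # First file (FASTA): remember the first word of each header line
            if (/^>/) {
                sub(/^>/, "", $1)
                seen[$1] = 1
            }
            next
        }
        /^#/ || NF == 0 { next }
        !($1 in seen) && !reported[$1]++ { print $1 }
    ' "$fasta" "$gtf"
}
```

A typical mismatch is a genome that uses `chr1` while the annotation uses `1`; in that case every annotation sequence name would be printed.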
Contributions, issues, and feature requests are welcome! Feel free to check the issues page.
This project is licensed under the MIT License - see the LICENSE file for details.
Haoyu Chao (haoyuchao@zju.edu.cn)
If you use this pipeline in your research, please cite:
@software{rnaseq_pipeline,
author = {Chao, Haoyu},
title = {Bulk RNA-Seq Quantification Pipeline},
year = {2024},
url = {https://github.com/Haoyu-Chao/rnaseq-pipeline}
}
- The developers of STAR, RSEM, and all dependency tools
- Zhejiang University for support
A comprehensive Bash pipeline for processing whole-genome bisulfite sequencing (WGBS/BS-Seq) data, from raw reads to methylation calls.
This pipeline automates the processing of bisulfite sequencing data, handling various input formats and performing quality control, bisulfite-aware alignment, methylation extraction, and annotation. It uses bwameth for accurate alignment of bisulfite-converted reads and MethylDackel for precise methylation calling.
- Multiple input formats: Supports SRA, FASTQ (single/paired-end), BAM, and SAM files
- Bisulfite-aware alignment: bwameth for accurate mapping of C→T converted reads
- Methylation extraction: MethylDackel for per-base methylation levels
- Quality control: Fastp for read trimming and quality filtering
- Duplicate marking: Sambamba for identifying PCR duplicates
- Region-specific analysis: Extract methylation levels for specific genomic regions
- Resume capability: Skip completed steps with checkpoint system
- Clean output: Optional removal of intermediate files
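
One common way to implement a resume feature like the one listed above is a marker file per completed step. The following is a minimal sketch under that assumption; the `run_step` name, the `.done` suffix, and the `CHECKPOINT_DIR` variable are hypothetical and may not match how the pipeline tracks progress.

```bash
#!/usr/bin/env bash
# Hypothetical checkpoint helper: run a step once and leave a marker on success.
CHECKPOINT_DIR=${CHECKPOINT_DIR:-./checkpoints}

run_step() {
    local name=$1
    shift
    local marker="$CHECKPOINT_DIR/$name.done"

    # A marker from an earlier run means the step already finished
    if [ -f "$marker" ]; then
        echo "[skip] $name"
        return 0
    fi

    mkdir -p "$CHECKPOINT_DIR"
    echo "[run] $name"
    # The marker is written only if the command succeeds
    "$@" && touch "$marker"
}

# Example use (commented out): the second invocation would print "[skip] index"
# run_step index samtools index sample.bam
```

Because the marker is created only after success, a step that failed or was interrupted runs again on the next invocation.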
| Software | Purpose | Installation |
|---|---|---|
| sra-tools | SRA file conversion | conda install -c bioconda sra-tools |
| fastp | Quality control | conda install -c bioconda fastp |
| bwameth | Bisulfite-aware alignment | conda install -c bioconda bwameth |
| sambamba | BAM processing | conda install -c bioconda sambamba |
| samtools | SAM/BAM manipulation | conda install -c bioconda samtools |
| MethylDackel | Methylation extraction | conda install -c bioconda methyldackel |
| bedtools | Intersection operations | conda install -c bioconda bedtools |
# Clone the repository
git clone https://github.com/Haoyu-Chao/bsseq-pipeline.git
cd bsseq-pipeline
# Make the script executable
chmod +x bsseq-pipeline.sh
# Run the pipeline (single-end FASTQ example)
./bsseq-pipeline.sh -i sample.fq.gz -g genome.fa -a annotation.gff3 -t 8

./bsseq-pipeline.sh -i INPUT -g GENOME -a ANNOTATION [OPTIONS]

| Option | Description |
|---|---|
| `-i`, `--input` | Input file (SRA, FASTQ, BAM, or SAM) |
| `-I`, `--input2` | Second input file for paired-end FASTQ |
| `-g`, `--genome` | Reference genome FASTA file |
| `-a`, `--annotation` | Annotation file (GFF3 or GTF format) |

| Option | Description | Default |
|---|---|---|
| `-q`, `--mapq` | Minimum mapping quality for BAM filtering | 10 |
| `-o`, `--output` | Output directory | `./[input]_pipeline_result` |
| `-t`, `--threads` | Number of CPU threads | 8 |
| `-f`, `--force` | Force overwrite output directory | No |
| `-r`, `--remove` | Remove intermediate files | No |
| `-h`, `--help` | Show help message | - |
| `-v`, `--version` | Show version | - |
[input]_pipeline_result/
├── fastq/             # Clean FASTQ files after QC
├── bam/
│   ├── alignment/     # Raw aligned BAM files
│   ├── sorted/        # Sorted BAM files
│   ├── dedup/         # Deduplicated BAM files
│   └── filtered/      # Filtered BAM files (by MAPQ)
├── methylation/
│   ├── perBase/       # Per-base methylation calls (BEDGraph)
│   ├── perRegion/     # Regional methylation levels
│   └── CGmap/         # CGmap format files
├── reports/           # QC reports and statistics
└── logs/              # Pipeline execution logs
MethylDackel generates several output formats:
- BEDGraph: Per-base methylation levels (`chr start end methylation_level depth`)
- CGmap: Complete Genomics methylation format
- perRegion: Aggregated methylation over genomic features (promoters, genes, CpG islands)
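
Per-base calls are usually filtered by coverage before downstream analysis. The sketch below assumes the five-column layout given above (`chr start end methylation_level depth`) and skips `track` and `#` header lines; the function name and the default minimum depth of 5 are illustrative choices. Files with a different column layout would need the depth column adjusted.

```bash
#!/usr/bin/env bash
# Keep sites whose depth (assumed to be column 5) is at least min_depth.
filter_by_depth() {
    local bedgraph=$1 min_depth=${2:-5}
    awk -v min="$min_depth" '
        /^track/ || /^#/ { next }   # skip header and comment lines
        $5 + 0 >= min + 0           # print lines that pass the depth cutoff
    ' "$bedgraph"
}
```

For example, `filter_by_depth sample_CpG.bedGraph 10 > sample_CpG.min10.bedGraph` would keep sites covered by at least 10 reads; both file names are placeholders.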
./bsseq-pipeline.sh -i SRR12345678.sra -g hg38.fa -a hg38.gff3
./bsseq-pipeline.sh -i sample_R1.fq.gz -I sample_R2.fq.gz -g mm10.fa -a mm10.gtf -t 16
./bsseq-pipeline.sh -i sample.fq.gz -g genome.fa -a genes.gff3 -q 20 -t 8
./bsseq-pipeline.sh -i sample.bam -g genome.fa -a annotation.gff3 -r

- Input conversion (if SRA): `fasterq-dump` → FASTQ
- Quality control: `fastp` → Clean FASTQ + HTML report
- Bisulfite alignment: `bwameth.py` → SAM
- BAM processing:
  - `samtools view` → BAM
  - `sambamba sort` → Sorted BAM
  - `sambamba markdup` → Deduplicated BAM
  - `samtools view -q` → Filtered BAM
- Methylation extraction: `MethylDackel extract` → Per-base methylation
- Regional methylation: `MethylDackel mbias` + bedtools → Region-level methylation
- Quality reports: Generate alignment statistics and methylation summary
Before running the pipeline, prepare the bwameth index:
# Convert reference genome for bwameth
bwameth.py index genome.fa

The pipeline automatically detects and handles:
- Directional libraries: Standard BS-Seq protocol
- Non-directional libraries: Uses appropriate parameters in MethylDackel
Methylation calls are generated for all contexts (CpG, CHG, CHH) and can be filtered in downstream analysis.
The pipeline generates comprehensive quality metrics:
- Bisulfite conversion rate: Estimated from non-CpG contexts or spike-ins
- Mapping statistics: Overall alignment rate, unique mapping rate
- Coverage depth: Per-base and regional coverage statistics
- Methylation bias: Position-specific methylation bias plots
- CpG coverage: Genome-wide CpG coverage distribution
- CpG methylation: Most relevant for gene regulation studies
- CHG/CHH methylation: Important in plants, rare in mammals
- Coverage thresholds: Typically require ≥5x coverage for reliable methylation calls
- Strand-specificity: MethylDackel maintains strand-specific information
- bwameth requires the reference genome to be indexed with `bwameth.py index`
- For large genomes (e.g., human), ensure sufficient disk space (≥100 GB for intermediate files)
- The pipeline maintains strand-specific methylation information throughout
- Consider using the `-r` flag for large datasets to manage disk usage
Contributions, issues, and feature requests are welcome! Feel free to check the issues page.
This project is licensed under the MIT License - see the LICENSE file for details.
Haoyu Chao (haoyuchao@zju.edu.cn)
If you use this pipeline in your research, please cite:
@software{bsseq_pipeline,
author = {Chao, Haoyu},
title = {Bulk BS-Seq (WGBS) Analysis Pipeline},
year = {2023},
url = {https://github.com/Haoyu-Chao/bsseq-pipeline}
}
- The developers of bwameth, MethylDackel, and all dependency tools
- Zhejiang University for support
Last updated: 2023-03-30
Last updated: 2024-11-01