Branch: modernize-2025
Base: pairend branch
Date: December 2025
Purpose: Update the ChIP-seq pipeline to work with modern tools and infrastructure
This document summarizes all changes made to modernize the pyflow-ChIPseq pipeline. The modernization ensures compatibility with current bioinformatics tools, Python 3, and modern cluster computing environments while maintaining the original functionality.
Old Approach:
- MACS1 v1.4.2 (Python 2.7)
- MACS2 v2.1.1 (Python 2.7)
- Required
source activate py27in shell commands
New Approach:
- MACS3 v3.0.1+ (Python 3.9+)
- Native Python 3 compatibility throughout
- No environment activation in shell commands (handled by conda directives)
Impact:
- ✅ Compatible with modern Python ecosystems
- ✅ Better maintained and actively developed tools
- ✅ Improved performance and bug fixes
| Tool | Old Version | New Version | Changes |
|---|---|---|---|
| samtools | 1.3.1 | 1.19+ | Updated command syntax (-Sb → -b, added -@ for threads) |
| deepTools | 2.3.3 | 3.5.5+ | --ratio subtract → --operation subtract, --normalizeUsingRPKM → --normalizeUsing RPKM |
| BWA | 0.7.x | 0.7.17+ | No major changes, but recommended update |
| samblaster | 0.1.22 | 0.1.26+ | Bug fixes and improvements |
| sambamba | < 1.0 | 1.0+ | Performance improvements |
| FastQC | 0.11.x | 0.12.1+ | Updated reports |
| MultiQC | 1.x | 1.19+ | Improved aggregation |
| bedtools | 2.x | 2.31.1+ | Bug fixes |
| ChromHMM | old | 1.25+ | Updated via conda |
Shell Prefix/Suffix (Removed)
# OLD
shell.prefix("set -eo pipefail; echo BEGIN at $(date); ")
shell.suffix("; exitstat=$?; echo END at $(date); exit $exitstat")
# NEW - removed, Snakemake handles this better nowModule Load Commands (Removed)
# OLD
shell: """
module load fastqc
fastqc ...
"""
# NEW - use conda directives
conda: "envs/qc.yaml"
shell: """
fastqc ...
"""cluster.json (Replaced)
# OLD
CLUSTER = json.load(open(config['CLUSTER_JSON']))
threads: CLUSTER["align"]["n"]
# NEW - use resources directive
threads: 10
resources:
mem_mb=32000,
runtime=720Conda Environment Integration
rule example:
conda: "envs/alignment.yaml" # Automatic environment management
shell: "..."Resource Specifications
resources:
mem_mb=16000, # Memory in MB
runtime=240 # Runtime in minutesNamed Outputs
# OLD
output: "03aln/{sample}.sorted.bam", "03aln/{sample}.sorted.bam.bai"
# NEW
output:
bam = "03aln/{sample}.sorted.bam",
bai = "03aln/{sample}.sorted.bam.bai"Created 7 isolated conda environments for different tool groups:
- alignment.yaml - BWA, samtools, samblaster, sambamba
- qc.yaml - FastQC, MultiQC
- peakcalling.yaml - MACS3
- bigwig.yaml - deepTools (bamCoverage, bamCompare)
- phantompeakqualtools.yaml - R, ChIPpeakAnno, required R packages
- bedtools.yaml - bedtools
- chromhmm.yaml - ChromHMM, OpenJDK
Benefits:
- ✅ Reproducible software versions
- ✅ Isolated dependencies (no conflicts)
- ✅ Easy installation (automatic via Snakemake)
- ✅ Portable across systems
Old Approach (cluster.json):
{
"align": {
"__default__": {
"mem": "32000",
"time": "12:00:00",
"cpus": "10"
}
}
}New Approach (Snakemake Profiles):
# profiles/slurm/config.yaml
jobs: 100
use-conda: True
conda-frontend: "mamba"
cluster: "sbatch --parsable --mem={resources.mem_mb} ..."
default-resources:
- mem_mb=8000
- runtime=120Usage:
# OLD
snakemake --cluster-config cluster.json --cluster '...' -j 99
# NEW
snakemake --profile profiles/slurmconfig.yaml changes:
- ✅ Removed hardcoded user-specific paths
- ✅ Added comprehensive comments and documentation
- ✅ Organized into logical sections
- ✅ Changed MACS p-values to q-values (FDR) for MACS3
- ✅ Updated default parameters for modern standards
- ✅ Made paths configurable with examples
Key Parameter Updates:
# OLD - p-value (not FDR corrected)
macs_pvalue: 1e-9
# NEW - q-value (FDR corrected, more appropriate)
macs_pvalue: 0.05MACS1/MACS2 → MACS3 Migration
Changes:
- Unified command:
macs3 callpeak(instead ofmacsandmacs2 callpeak) - Changed from
-p(p-value) to-q(q-value/FDR) as primary threshold - Updated output file extensions:
- MACS1:
.bed→ MACS3:.narrowPeak - MACS2:
.xls→ MACS3:.broadPeak+.xls
- MACS1:
- Removed separate "nomodel" calls (MACS3 handles this internally)
Output Directories:
08peak_macs1/→08peak_macs3/(narrow peaks)09peak_macs2/→09peak_macs3/(broad peaks)
Updated Commands:
# OLD - basic samtools without thread support
samtools sort -m 2G -@ 5 -T {output[0]}.tmp -o {output[0]}
# NEW - modern samtools with proper threading
samtools sort -m 2G -@ {threads} -T {output.bam}.tmp -o {output.bam}
samtools index -@ {threads} {output.bam}Improved deepTools:
# OLD
bamCompare --bamfile1 {input[1]} --bamfile2 {input[0]} --normalizeUsingRPKM --ratio subtract -p 5
# NEW
bamCompare --bamfile1 {input.case_bam} --bamfile2 {input.control_bam} \
--normalizeUsing RPKM --operation subtract \
--numberOfProcessors {threads}New Files:
- INSTALL.md - Comprehensive installation guide
- MODERNIZATION.md - This file, documenting all changes
- CLAUDE.md - Repository structure analysis
- profiles/slurm/README.md - SLURM profile documentation
Updated Files:
config.yaml- Extensive inline documentationREADME.md- Updated for modern usage
Named Inputs/Outputs:
# Easier to read and maintain
input:
control_bam = "...",
case_bam = "..."
output:
peaks = "...",
xls = "..."Consistent Threading:
# OLD - hardcoded
bwa mem -t 5 ...
# NEW - uses Snakemake threads
bwa mem -t {threads} ...Better Resource Management:
- All rules now have explicit resource requirements
- Default resources defined in profile
- Easier to tune for different cluster environments
| Old | New | Notes |
|---|---|---|
08peak_macs1/{sample}_macs1_peaks.bed |
08peak_macs3/{sample}_narrow_peaks.narrowPeak |
MACS3 narrow peaks |
08peak_macs1/{sample}_macs1_nomodel_peaks.bed |
Combined into above | MACS3 handles nomodel internally |
09peak_macs2/{sample}_macs2_peaks.xls |
09peak_macs3/{sample}_broad_peaks.broadPeak |
MACS3 broad peaks |
# Parameter name changes
macs_pvalue: 1e-8 → macs_pvalue: 0.05 # Now expects q-value
phantom_path: /path/to/run_spp.R → Removed # Auto-downloaded
chromHMM_path: /path/to/ChromHMM/ → Removed # Conda-managed# OLD
./pyflow-ChIPseq.sh
# NEW
snakemake --profile profiles/slurm
# or
snakemake --use-conda --cores 8- ✅ Sample metadata format (
meta.txt) - ✅
sample2json.pyscript (already Python 3) - ✅
sbatch_cluster.pyscript (already Python 3) - ✅ Directory structure for outputs
- ✅ Input file organization
- ✅ Core workflow logic
- ❌ Old conda environments (need Python 3)
- ❌ cluster.json (replaced by profiles)
- ❌ Execution scripts (use Snakemake directly)
- ❌ Hardcoded paths in config.yaml
-
Update config.yaml
# Review and update these settings - ref_fa: point to your genome - macs_pvalue: change from 1e-9 to 0.05 (or desired q-value) - Remove phantom_path, chromHMM_path (now conda-managed) -
Create Conda Environments
snakemake --use-conda --conda-create-envs-only --cores 1
-
Test with Dry Run
snakemake -n --use-conda
-
Run Pipeline
# Local snakemake --use-conda --cores 8 # SLURM cluster snakemake --profile profiles/slurm
- Using Mamba (faster conda alternative) reduces environment creation time
- Pre-creating environments recommended for cluster runs
- Better thread utilization with parameterized
{threads} - Scales with available resources
- More accurate resource requests reduce job failures
- Better cluster utilization
- ✅ Snakefile syntax validation
- ✅ Conda environment creation
- ✅ Dry run completion
- ✅ Tool version compatibility checks
- ✅ Configuration parsing
Before running on production data:
# 1. Dry run
snakemake -n --use-conda
# 2. Create environments
snakemake --use-conda --conda-create-envs-only --cores 1
# 3. Test on subset
# Limit to first few rules
snakemake --use-conda --cores 4 02fqc/test_sample_R1_fastqc.zip- Container Support - Add Docker/Singularity containers for full reproducibility
- Automated Testing - Add CI/CD with test datasets
- ROSE Alternative - Replace Python 2.7 dependent ROSE with modern alternative
- BWA-MEM2 - Switch to faster BWA-MEM2 aligner
- Benchmark Directives - Add runtime and memory benchmarking
- Report Generation - Automated MultiQC-style reports for peak calls
- Checkpoints - Dynamic rule execution based on results
- Module System - Modularize Snakefile for easier maintenance
- Parameter Optimization - Auto-tune parameters based on data quality
- Original pipeline: Ming Tang (https://github.com/crazyhottommy)
- Modernization: Claude Code (2025)
- Community: Snakemake developers and bioinformatics community
For questions or issues with the modernized version:
- Check INSTALL.md for setup help
- Review CLAUDE.md for architecture understanding
- Open an issue on GitHub
- Consult Snakemake documentation
Snakefile- Complete modernizationconfig.yaml- Reorganized and documented.gitignore- Add Snakemake artifacts (if needed)
envs/*.yaml- 7 conda environment filesprofiles/slurm/config.yaml- SLURM profileprofiles/slurm/README.md- Profile documentationINSTALL.md- Installation guideMODERNIZATION.md- This documentCLAUDE.md- Repository structure documentation
cluster.json- Use profiles insteadpyflow-ChIPseq.sh- Usesnakemake --profileinsteadpyflow-drmaa-ChIPseq.sh- Legacy DRMAA support
The pipeline is now ready for modern ChIP-seq analysis!