Skip to content

stewart-lab/ATAC-seq

Repository files navigation

ATAC-seq: From FASTQs to Peak–Gene Correlation

This repository holds the inputs, configs, and analysis scripts for the Stewart Lab ATAC-seq time-course experiments. The work runs in three stages:

  1. Alignment & peak calling — the ENCODE ATAC-seq pipeline (via caper/cromwell + Docker) takes raw FASTQs to QC'd, peak-called output.
  2. Differential accessibilityCOBRA and/or DiffBind call differentially accessible peaks across timepoints, annotating each with its nearby gene.
  3. Peak rankingpeak_ranking.py correlates each peak's accessibility time-course against matched RNA-seq TPM time-courses to rank peaks most likely to regulate genes of interest.

Experiments

Two ENCODE submissions are tracked here:

Submission Input dir Biology Timepoints
Sub 488 sub_488_input_ENCODE/ H1 hESC differentiation: mesendoderm (T⁺) → definitive endoderm (SOX17⁺). Expected T peak ≈ 30h, SOX17 peak ≈ 90h. 0, 30, 50, 70, 90h
Sub 790 sub_790_input_ENCODE/ hESC-derived presomitic mesoderm (PSM) maintained in PSM medium from day 3; hourly course capturing one HES7 oscillation cycle. 0–7 (hourly)

Environment

Two conda spec files are provided:

  • environment.yml — curated, portable top-level dependencies (Python 3.11, pandas, numpy, scikit-learn, matplotlib, pyyaml, bedtools, the Bioconductor DiffBind/DESeq2 stack, and caper/croo/qc2tsv via pip). Use this on a new machine.
  • atac_seq_env.yml — full conda env export lockfile with every build hash pinned, for an exact reproduction of the original MIR run.

On a MIR machine, activate the prebuilt env:

conda activate /isiseqruns/jfreeman_tmp_home/bin/conda_envs/atac_seq

Otherwise build it:

conda env create -f environment.yml   # or atac_seq_env.yml for the exact lock
conda activate atac_seq

Configuring paths (config.yaml)

Machine-specific paths used by the Python analysis (RNA-seq TPM tables, the COBRA differential-peak directory, gene-pathway lists, and the plot output dir) live in config.yaml, loaded by config.py. The committed values reproduce the original MIR run — edit them for your machine, or point ATAC_CONFIG at an edited copy:

ATAC_CONFIG=/path/to/my_config.yaml python peak_ranking.py

Note. The COBRA configs (comparison*_input_COBRA/config_*.yaml), the DiffBind .rmd, and the ENCODE driver scripts (sub_*.sh) still contain hard-coded absolute paths under /isiseqruns/...; edit those directly before re-running those stages.


Stage 1 — ENCODE pipeline (FASTQs → peaks)

Prerequisites

Confirm Docker works for your user:

docker run ubuntu:latest echo hello

If hello is echoed you're set; otherwise see the Docker docs.

The ENCODE pipeline uses caper, a Python wrapper for Cromwell (a workflow manager for bioinformatics). Initialize the local backend:

caper init local

Then set your output directory in ~/.caper/default.conf.

This pipeline requires Java > 18. On MIR a JDK 19 is available; add it to your PATH:

/isitools/jdk-19.0.1/bin/java

Clone the ENCODE pipeline and enter it:

git clone https://github.com/ENCODE-DCC/atac-seq-pipeline
cd atac-seq-pipeline

Input JSONs and driver scripts

Each timepoint has an input JSON describing genome, FASTQ paths, and run options — see sub_488_input_ENCODE/input_JSONs/time_point*.json and the sub_790_input_ENCODE/ equivalents. Use them as templates (genome is hg38; paired_end: true, multimapping: 4).

A driver script runs the pipeline over all timepoints in sequence — e.g. sub_488_input_ENCODE/script/sub_488.sh:

cd /isiseqruns/jfreeman_tmp_home/ENCODE/atac-seq-pipeline/
INPUT_JSON=.../input_JSONs/time_point0.json
caper run atac.wdl -i "${INPUT_JSON}" --docker
# ...repeated per timepoint

Edit line 1 (cd ...) to point at your cloned atac-seq-pipeline directory. Runs can take days — nohup the script. See the ENCODE input JSON guide.

Collecting output and QC

With the conda env active, cd into a per-timepoint output dir and organize the Cromwell output with croo:

croo /isiseqruns/jfreeman_tmp_home/CAPER/outdir/atac/<TIMESTAMP>/metadata.json

Repeat per sample, then funnel each sample's QC into a single TSV with qc2tsv:

qc2tsv .../time0/qc/qc.json .../time1/qc/qc.json ... > comparison1.tsv

Aggregated QC for these runs lives in ATAC_SEQ_QC.xlsx; cross-comparison summary slides are in ATAC-seq_comparisons.pptx.


Stage 2 — Differential accessibility

COBRA

Each comparison has a COBRA config (config_*.yaml) plus a metasheet defining the pairwise contrasts. COBRA aligns/quantifies peaks, runs DESeq2, and produces PCA, sample–sample and sample–feature plots, motif analysis, and per-contrast differential-peak tables annotated with the nearby gene (*.deseq.with.Nearby.Gene.csv).

Comparison Config / metasheet Contrasts
comparison1_input_COBRA/ config_1.yaml, comparison1_metasheet.csv Sub 790 PSM time-series — all pairwise contrasts across the 8 hourly timepoints
comparison2_input_COBRA/ config_2.yaml, comparison2_metasheet.csv Sub 488 time-course — pairwise contrasts among 0/30/50/70/90h
comparison3_input_COBRA/ config_3.yaml, comparison3_metasheet.csv Cross-submission: 790_0h vs 488_0h, 790_0h vs 488_30h, 488_0h vs 488_30h

Key config knobs (see config_1.yaml for the fully commented example): enhancer: all, quantile scaling (scale: q), Pearson correlation, k-means clustering on the sample–feature heatmap, DESeq cutoffs Padj: 0.05, motif analysis on, hg38 assembly.

Two preprocessing/diagnostic helpers live alongside comparison1:

  • PSM_peak_count_distribution.py — plots the rank-ordered per-sample peak count distribution (log–log) for each timepoint replicate, to inspect count distributions before normalization.
  • quantile_norm_PSM_peaks.py — quantile-normalizes the raw counts matrix to the 99th percentile and writes an updated counts CSV, with before/after column-sum bar charts.

DiffBind (alternative)

comparison1_input_COBRA/Diffbind_differential_accessability.rmd is an R Markdown workflow that runs the same differential-binding analysis with the DiffBind Bioconductor package. It reads Diffbind_metasheet.csv (one row per sample → BAM + narrowPeak + condition/replicate) and builds a dba() object. Update the setwd(...) path at the top before knitting.


Stage 3 — Peak ranking (peak_ranking.py)

Ranks differentially accessible peaks by how well their accessibility time-course tracks the mRNA time-course of a gene of interest. For each peak near a gene, it fits a linear regression of peak treat_mean vs gene TPM across timepoints, scores it by R², sorts peaks per gene, and plots those above a correlation threshold (saved to peak_ranking_outputs/<GENE>_peaks.png).

python peak_ranking.py --threshold 0.9 --gene_set rons --comparison 3 --min_tpm 5
Flag Default Meaning
--threshold 0.9 Minimum R² to report/plot a peak
--gene_set rons Gene set: rons, notch, wnt, or mapk
--comparison 3 2 (within Sub 488 time-course) or 3 (488/790 combined)
--min_tpm 5 Minimum TPM for a gene to be retained

Gene sets

  • rons — hard-coded developmental markers: TBXT, POU5F1, SOX2, SOX17, EOMES, NANOG.
  • notch, wnt, mapk — read from gene_pathways_of_interest/{notch,wnt,mapk}.txt (KEGG-style dumps; symbols are parsed by regex as the token preceding ;).

Data plumbing (get_peak_ranking_dfs.py)

This module assembles the two inputs the regression needs:

  • Peak time-courses — read from the COBRA *.deseq.with.Nearby.Gene.csv differential-peak tables (Sub 488 timepoints 30/50/70/90 via read_488_peak_data; Sub 790 via read_790_peak_data), filtered to the chosen gene set, cleaned, and scaled per peak.
  • mRNA TPM time-courses — read from GUP collation tables (tpm_files in config.yaml), filtered to the gene set, replicate-averaged, relabeled by timepoint, and scaled.

comparison == 3 merges the 488 and 790 peak/mRNA frames into a combined timeline.

Note: this module uses DataFrame.append and groupby(..., axis=1), both removed in pandas ≥ 2.0. It runs against the pinned environment (atac_seq_env.yml); newer pandas will require edits.

Retrieving TPM files (get_TPM_files_from_GUP.py)

Standalone helper that pulls RNA-seq samples for a given submission number from the GUP Mongo database (gup package) and runs an hg38 collation calc to produce the genes_NM.tpm.tab tables consumed above. The submission number comes from gup_submission in config.yaml.


Repository layout

sub_488_input_ENCODE/        ENCODE input JSONs + driver script (Sub 488)
sub_790_input_ENCODE/        ENCODE input JSONs + driver script (Sub 790)
comparison1_input_COBRA/     COBRA config + metasheet + DiffBind rmd + preprocessing scripts
comparison2_input_COBRA/     COBRA config + metasheet
comparison3_input_COBRA/     COBRA config + metasheet
gene_pathways_of_interest/   notch / wnt / mapk gene lists
peak_ranking.py              Stage 3: regression-based peak ranking + plots
get_peak_ranking_dfs.py      Stage 3: peak/mRNA dataframe assembly
get_TPM_files_from_GUP.py    Retrieve RNA-seq TPM tables from GUP
config.py / config.yaml      Machine-specific path configuration
environment.yml              Curated, portable conda environment
atac_seq_env.yml             Full conda env lockfile (exact build pins)
ATAC_SEQ_QC.xlsx             Aggregated QC metrics
ATAC-seq_comparisons.pptx    Cross-comparison summary slides

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors