ATAC-seq: From FASTQs to Peak–Gene Correlation

This repository holds the inputs, configs, and analysis scripts for the Stewart Lab ATAC-seq time-course experiments. The work runs in three stages:

Alignment & peak calling — the ENCODE ATAC-seq pipeline (via caper/cromwell + Docker) takes raw FASTQs to QC'd, peak-called output.
Differential accessibility — COBRA and/or DiffBind call differentially accessible peaks across timepoints, annotating each with its nearby gene.
Peak ranking — peak_ranking.py correlates each peak's accessibility time-course against matched RNA-seq TPM time-courses to rank peaks most likely to regulate genes of interest.

Experiments

Two ENCODE submissions are tracked here:

Submission	Input dir	Biology	Timepoints
Sub 488	`sub_488_input_ENCODE/`	H1 hESC differentiation: mesendoderm (T⁺) → definitive endoderm (SOX17⁺). Expected T peak ≈ 30h, SOX17 peak ≈ 90h.	0, 30, 50, 70, 90h
Sub 790	`sub_790_input_ENCODE/`	hESC-derived presomitic mesoderm (PSM) maintained in PSM medium from day 3; hourly course capturing one HES7 oscillation cycle.	0–7 (hourly)

Environment

Two conda spec files are provided:

environment.yml — curated, portable top-level dependencies (Python 3.11, pandas, numpy, scikit-learn, matplotlib, pyyaml, bedtools, the Bioconductor DiffBind/DESeq2 stack, and caper/croo/qc2tsv via pip). Use this on a new machine.
atac_seq_env.yml — full conda env export lockfile with every build hash pinned, for an exact reproduction of the original MIR run.

On a MIR machine, activate the prebuilt env:

conda activate /isiseqruns/jfreeman_tmp_home/bin/conda_envs/atac_seq

Otherwise build it:

conda env create -f environment.yml   # or atac_seq_env.yml for the exact lock
conda activate atac_seq

Configuring paths (`config.yaml`)

Machine-specific paths used by the Python analysis (RNA-seq TPM tables, the COBRA differential-peak directory, gene-pathway lists, and the plot output dir) live in config.yaml, loaded by config.py. The committed values reproduce the original MIR run — edit them for your machine, or point ATAC_CONFIG at an edited copy:

ATAC_CONFIG=/path/to/my_config.yaml python peak_ranking.py

Note. The COBRA configs (comparison*_input_COBRA/config_*.yaml), the DiffBind .rmd, and the ENCODE driver scripts (sub_*.sh) still contain hard-coded absolute paths under /isiseqruns/...; edit those directly before re-running those stages.

Stage 1 — ENCODE pipeline (FASTQs → peaks)

Prerequisites

Confirm Docker works for your user:

docker run ubuntu:latest echo hello

If hello is echoed you're set; otherwise see the Docker docs.

The ENCODE pipeline uses caper, a Python wrapper for Cromwell (a workflow manager for bioinformatics). Initialize the local backend:

caper init local

Then set your output directory in ~/.caper/default.conf.

This pipeline requires Java > 18. On MIR a JDK 19 is available; add it to your PATH:

/isitools/jdk-19.0.1/bin/java

Clone the ENCODE pipeline and enter it:

git clone https://github.com/ENCODE-DCC/atac-seq-pipeline
cd atac-seq-pipeline

Input JSONs and driver scripts

Each timepoint has an input JSON describing genome, FASTQ paths, and run options — see sub_488_input_ENCODE/input_JSONs/time_point*.json and the sub_790_input_ENCODE/ equivalents. Use them as templates (genome is hg38; paired_end: true, multimapping: 4).

A driver script runs the pipeline over all timepoints in sequence — e.g. sub_488_input_ENCODE/script/sub_488.sh:

cd /isiseqruns/jfreeman_tmp_home/ENCODE/atac-seq-pipeline/
INPUT_JSON=.../input_JSONs/time_point0.json
caper run atac.wdl -i "${INPUT_JSON}" --docker
# ...repeated per timepoint

Edit line 1 (cd ...) to point at your cloned atac-seq-pipeline directory. Runs can take days — nohup the script. See the ENCODE input JSON guide.

Collecting output and QC

With the conda env active, cd into a per-timepoint output dir and organize the Cromwell output with croo:

croo /isiseqruns/jfreeman_tmp_home/CAPER/outdir/atac/<TIMESTAMP>/metadata.json

Repeat per sample, then funnel each sample's QC into a single TSV with qc2tsv:

qc2tsv .../time0/qc/qc.json .../time1/qc/qc.json ... > comparison1.tsv

Aggregated QC for these runs lives in ATAC_SEQ_QC.xlsx; cross-comparison summary slides are in ATAC-seq_comparisons.pptx.

Stage 2 — Differential accessibility

COBRA

Each comparison has a COBRA config (config_*.yaml) plus a metasheet defining the pairwise contrasts. COBRA aligns/quantifies peaks, runs DESeq2, and produces PCA, sample–sample and sample–feature plots, motif analysis, and per-contrast differential-peak tables annotated with the nearby gene (*.deseq.with.Nearby.Gene.csv).

Comparison	Config / metasheet	Contrasts
`comparison1_input_COBRA/`	`config_1.yaml`, `comparison1_metasheet.csv`	Sub 790 PSM time-series — all pairwise contrasts across the 8 hourly timepoints
`comparison2_input_COBRA/`	`config_2.yaml`, `comparison2_metasheet.csv`	Sub 488 time-course — pairwise contrasts among 0/30/50/70/90h
`comparison3_input_COBRA/`	`config_3.yaml`, `comparison3_metasheet.csv`	Cross-submission: 790_0h vs 488_0h, 790_0h vs 488_30h, 488_0h vs 488_30h

Key config knobs (see config_1.yaml for the fully commented example): enhancer: all, quantile scaling (scale: q), Pearson correlation, k-means clustering on the sample–feature heatmap, DESeq cutoffs Padj: 0.05, motif analysis on, hg38 assembly.

Two preprocessing/diagnostic helpers live alongside comparison1:

PSM_peak_count_distribution.py — plots the rank-ordered per-sample peak count distribution (log–log) for each timepoint replicate, to inspect count distributions before normalization.
quantile_norm_PSM_peaks.py — quantile-normalizes the raw counts matrix to the 99th percentile and writes an updated counts CSV, with before/after column-sum bar charts.

DiffBind (alternative)

comparison1_input_COBRA/Diffbind_differential_accessability.rmd is an R Markdown workflow that runs the same differential-binding analysis with the DiffBind Bioconductor package. It reads Diffbind_metasheet.csv (one row per sample → BAM + narrowPeak + condition/replicate) and builds a dba() object. Update the setwd(...) path at the top before knitting.

Stage 3 — Peak ranking (`peak_ranking.py`)

Ranks differentially accessible peaks by how well their accessibility time-course tracks the mRNA time-course of a gene of interest. For each peak near a gene, it fits a linear regression of peak treat_mean vs gene TPM across timepoints, scores it by R², sorts peaks per gene, and plots those above a correlation threshold (saved to peak_ranking_outputs/<GENE>_peaks.png).

python peak_ranking.py --threshold 0.9 --gene_set rons --comparison 3 --min_tpm 5

Flag	Default	Meaning
`--threshold`	`0.9`	Minimum R² to report/plot a peak
`--gene_set`	`rons`	Gene set: `rons`, `notch`, `wnt`, or `mapk`
`--comparison`	`3`	`2` (within Sub 488 time-course) or `3` (488/790 combined)
`--min_tpm`	`5`	Minimum TPM for a gene to be retained

Gene sets

rons — hard-coded developmental markers: TBXT, POU5F1, SOX2, SOX17, EOMES, NANOG.
notch, wnt, mapk — read from gene_pathways_of_interest/{notch,wnt,mapk}.txt (KEGG-style dumps; symbols are parsed by regex as the token preceding ;).

Data plumbing (`get_peak_ranking_dfs.py`)

This module assembles the two inputs the regression needs:

Peak time-courses — read from the COBRA *.deseq.with.Nearby.Gene.csv differential-peak tables (Sub 488 timepoints 30/50/70/90 via read_488_peak_data; Sub 790 via read_790_peak_data), filtered to the chosen gene set, cleaned, and scaled per peak.
mRNA TPM time-courses — read from GUP collation tables (tpm_files in config.yaml), filtered to the gene set, replicate-averaged, relabeled by timepoint, and scaled.

comparison == 3 merges the 488 and 790 peak/mRNA frames into a combined timeline.

Note: this module uses DataFrame.append and groupby(..., axis=1), both removed in pandas ≥ 2.0. It runs against the pinned environment (atac_seq_env.yml); newer pandas will require edits.

Retrieving TPM files (`get_TPM_files_from_GUP.py`)

Standalone helper that pulls RNA-seq samples for a given submission number from the GUP Mongo database (gup package) and runs an hg38 collation calc to produce the genes_NM.tpm.tab tables consumed above. The submission number comes from gup_submission in config.yaml.

Repository layout

sub_488_input_ENCODE/        ENCODE input JSONs + driver script (Sub 488)
sub_790_input_ENCODE/        ENCODE input JSONs + driver script (Sub 790)
comparison1_input_COBRA/     COBRA config + metasheet + DiffBind rmd + preprocessing scripts
comparison2_input_COBRA/     COBRA config + metasheet
comparison3_input_COBRA/     COBRA config + metasheet
gene_pathways_of_interest/   notch / wnt / mapk gene lists
peak_ranking.py              Stage 3: regression-based peak ranking + plots
get_peak_ranking_dfs.py      Stage 3: peak/mRNA dataframe assembly
get_TPM_files_from_GUP.py    Retrieve RNA-seq TPM tables from GUP
config.py / config.yaml      Machine-specific path configuration
environment.yml              Curated, portable conda environment
atac_seq_env.yml             Full conda env lockfile (exact build pins)
ATAC_SEQ_QC.xlsx             Aggregated QC metrics
ATAC-seq_comparisons.pptx    Cross-comparison summary slides

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

ATAC-seq: From FASTQs to Peak–Gene Correlation

Experiments

Environment

Configuring paths (`config.yaml`)

Stage 1 — ENCODE pipeline (FASTQs → peaks)

Prerequisites

Input JSONs and driver scripts

Collecting output and QC

Stage 2 — Differential accessibility

COBRA

DiffBind (alternative)

Stage 3 — Peak ranking (`peak_ranking.py`)

Gene sets

Data plumbing (`get_peak_ranking_dfs.py`)

Retrieving TPM files (`get_TPM_files_from_GUP.py`)

Repository layout

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 37 Commits
comparison1_input_COBRA		comparison1_input_COBRA
comparison2_input_COBRA		comparison2_input_COBRA
comparison3_input_COBRA		comparison3_input_COBRA
gene_pathways_of_interest		gene_pathways_of_interest
sub_488_input_ENCODE		sub_488_input_ENCODE
sub_790_input_ENCODE		sub_790_input_ENCODE
.gitignore		.gitignore
ATAC-seq_comparisons.pptx		ATAC-seq_comparisons.pptx
ATAC_SEQ_QC.xlsx		ATAC_SEQ_QC.xlsx
README.md		README.md
atac_seq_env.yml		atac_seq_env.yml
config.py		config.py
config.yaml		config.yaml
environment.yml		environment.yml
get_TPM_files_from_GUP.py		get_TPM_files_from_GUP.py
get_peak_ranking_dfs.py		get_peak_ranking_dfs.py
peak_ranking.py		peak_ranking.py

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

ATAC-seq: From FASTQs to Peak–Gene Correlation

Experiments

Environment

Configuring paths (config.yaml)

Stage 1 — ENCODE pipeline (FASTQs → peaks)

Prerequisites

Input JSONs and driver scripts

Collecting output and QC

Stage 2 — Differential accessibility

COBRA

DiffBind (alternative)

Stage 3 — Peak ranking (peak_ranking.py)

Gene sets

Data plumbing (get_peak_ranking_dfs.py)

Retrieving TPM files (get_TPM_files_from_GUP.py)

Repository layout

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Configuring paths (`config.yaml`)

Stage 3 — Peak ranking (`peak_ranking.py`)

Data plumbing (`get_peak_ranking_dfs.py`)

Retrieving TPM files (`get_TPM_files_from_GUP.py`)

Packages