This repository holds the inputs, configs, and analysis scripts for the Stewart Lab ATAC-seq time-course experiments. The work runs in three stages:
- Alignment & peak calling — the ENCODE ATAC-seq pipeline (via
caper/cromwell+ Docker) takes raw FASTQs to QC'd, peak-called output. - Differential accessibility — COBRA and/or DiffBind call differentially accessible peaks across timepoints, annotating each with its nearby gene.
- Peak ranking —
peak_ranking.pycorrelates each peak's accessibility time-course against matched RNA-seq TPM time-courses to rank peaks most likely to regulate genes of interest.
Two ENCODE submissions are tracked here:
| Submission | Input dir | Biology | Timepoints |
|---|---|---|---|
| Sub 488 | sub_488_input_ENCODE/ |
H1 hESC differentiation: mesendoderm (T⁺) → definitive endoderm (SOX17⁺). Expected T peak ≈ 30h, SOX17 peak ≈ 90h. | 0, 30, 50, 70, 90h |
| Sub 790 | sub_790_input_ENCODE/ |
hESC-derived presomitic mesoderm (PSM) maintained in PSM medium from day 3; hourly course capturing one HES7 oscillation cycle. | 0–7 (hourly) |
Two conda spec files are provided:
environment.yml— curated, portable top-level dependencies (Python 3.11,pandas,numpy,scikit-learn,matplotlib,pyyaml,bedtools, the BioconductorDiffBind/DESeq2stack, andcaper/croo/qc2tsvvia pip). Use this on a new machine.atac_seq_env.yml— fullconda env exportlockfile with every build hash pinned, for an exact reproduction of the original MIR run.
On a MIR machine, activate the prebuilt env:
conda activate /isiseqruns/jfreeman_tmp_home/bin/conda_envs/atac_seq
Otherwise build it:
conda env create -f environment.yml # or atac_seq_env.yml for the exact lock
conda activate atac_seq
Machine-specific paths used by the Python analysis (RNA-seq TPM tables, the COBRA
differential-peak directory, gene-pathway lists, and the plot output dir) live in config.yaml,
loaded by config.py. The committed values reproduce the original MIR run — edit them for your
machine, or point ATAC_CONFIG at an edited copy:
ATAC_CONFIG=/path/to/my_config.yaml python peak_ranking.py
Note. The COBRA configs (
comparison*_input_COBRA/config_*.yaml), the DiffBind.rmd, and the ENCODE driver scripts (sub_*.sh) still contain hard-coded absolute paths under/isiseqruns/...; edit those directly before re-running those stages.
Confirm Docker works for your user:
docker run ubuntu:latest echo hello
If hello is echoed you're set; otherwise see the Docker docs.
The ENCODE pipeline uses caper, a Python wrapper for Cromwell (a workflow manager for bioinformatics). Initialize the local backend:
caper init local
Then set your output directory in ~/.caper/default.conf.
This pipeline requires Java > 18. On MIR a JDK 19 is available; add it to your PATH:
/isitools/jdk-19.0.1/bin/java
Clone the ENCODE pipeline and enter it:
git clone https://github.com/ENCODE-DCC/atac-seq-pipeline
cd atac-seq-pipeline
Each timepoint has an input JSON describing genome, FASTQ paths, and run options — see
sub_488_input_ENCODE/input_JSONs/time_point*.json and the sub_790_input_ENCODE/ equivalents.
Use them as templates (genome is hg38; paired_end: true, multimapping: 4).
A driver script runs the pipeline over all timepoints in sequence — e.g.
sub_488_input_ENCODE/script/sub_488.sh:
cd /isiseqruns/jfreeman_tmp_home/ENCODE/atac-seq-pipeline/
INPUT_JSON=.../input_JSONs/time_point0.json
caper run atac.wdl -i "${INPUT_JSON}" --docker
# ...repeated per timepointEdit line 1 (cd ...) to point at your cloned atac-seq-pipeline directory. Runs can take
days — nohup the script. See the ENCODE
input JSON guide.
With the conda env active, cd into a per-timepoint output dir and organize the Cromwell output
with croo:
croo /isiseqruns/jfreeman_tmp_home/CAPER/outdir/atac/<TIMESTAMP>/metadata.json
Repeat per sample, then funnel each sample's QC into a single TSV with qc2tsv:
qc2tsv .../time0/qc/qc.json .../time1/qc/qc.json ... > comparison1.tsv
Aggregated QC for these runs lives in ATAC_SEQ_QC.xlsx; cross-comparison summary slides are in
ATAC-seq_comparisons.pptx.
Each comparison has a COBRA config (config_*.yaml) plus a metasheet defining the pairwise
contrasts. COBRA aligns/quantifies peaks, runs DESeq2, and produces PCA, sample–sample and
sample–feature plots, motif analysis, and per-contrast differential-peak tables annotated with the
nearby gene (*.deseq.with.Nearby.Gene.csv).
| Comparison | Config / metasheet | Contrasts |
|---|---|---|
comparison1_input_COBRA/ |
config_1.yaml, comparison1_metasheet.csv |
Sub 790 PSM time-series — all pairwise contrasts across the 8 hourly timepoints |
comparison2_input_COBRA/ |
config_2.yaml, comparison2_metasheet.csv |
Sub 488 time-course — pairwise contrasts among 0/30/50/70/90h |
comparison3_input_COBRA/ |
config_3.yaml, comparison3_metasheet.csv |
Cross-submission: 790_0h vs 488_0h, 790_0h vs 488_30h, 488_0h vs 488_30h |
Key config knobs (see config_1.yaml for the fully commented example): enhancer: all,
quantile scaling (scale: q), Pearson correlation, k-means clustering on the sample–feature heatmap,
DESeq cutoffs Padj: 0.05, motif analysis on, hg38 assembly.
Two preprocessing/diagnostic helpers live alongside comparison1:
PSM_peak_count_distribution.py— plots the rank-ordered per-sample peak count distribution (log–log) for each timepoint replicate, to inspect count distributions before normalization.quantile_norm_PSM_peaks.py— quantile-normalizes the raw counts matrix to the 99th percentile and writes an updated counts CSV, with before/after column-sum bar charts.
comparison1_input_COBRA/Diffbind_differential_accessability.rmd is an R Markdown workflow that runs
the same differential-binding analysis with the DiffBind Bioconductor package. It reads
Diffbind_metasheet.csv (one row per sample → BAM + narrowPeak + condition/replicate) and builds a
dba() object. Update the setwd(...) path at the top before knitting.
Ranks differentially accessible peaks by how well their accessibility time-course tracks the mRNA
time-course of a gene of interest. For each peak near a gene, it fits a linear regression of peak
treat_mean vs gene TPM across timepoints, scores it by R², sorts peaks per gene, and plots those
above a correlation threshold (saved to peak_ranking_outputs/<GENE>_peaks.png).
python peak_ranking.py --threshold 0.9 --gene_set rons --comparison 3 --min_tpm 5
| Flag | Default | Meaning |
|---|---|---|
--threshold |
0.9 |
Minimum R² to report/plot a peak |
--gene_set |
rons |
Gene set: rons, notch, wnt, or mapk |
--comparison |
3 |
2 (within Sub 488 time-course) or 3 (488/790 combined) |
--min_tpm |
5 |
Minimum TPM for a gene to be retained |
rons— hard-coded developmental markers:TBXT, POU5F1, SOX2, SOX17, EOMES, NANOG.notch,wnt,mapk— read fromgene_pathways_of_interest/{notch,wnt,mapk}.txt(KEGG-style dumps; symbols are parsed by regex as the token preceding;).
This module assembles the two inputs the regression needs:
- Peak time-courses — read from the COBRA
*.deseq.with.Nearby.Gene.csvdifferential-peak tables (Sub 488 timepoints 30/50/70/90 viaread_488_peak_data; Sub 790 viaread_790_peak_data), filtered to the chosen gene set, cleaned, and scaled per peak. - mRNA TPM time-courses — read from GUP collation tables (
tpm_filesinconfig.yaml), filtered to the gene set, replicate-averaged, relabeled by timepoint, and scaled.
comparison == 3 merges the 488 and 790 peak/mRNA frames into a combined timeline.
Note: this module uses
DataFrame.appendandgroupby(..., axis=1), both removed in pandas ≥ 2.0. It runs against the pinned environment (atac_seq_env.yml); newer pandas will require edits.
Standalone helper that pulls RNA-seq samples for a given submission number from the GUP Mongo
database (gup package) and runs an hg38 collation calc to produce the genes_NM.tpm.tab tables
consumed above. The submission number comes from gup_submission in config.yaml.
sub_488_input_ENCODE/ ENCODE input JSONs + driver script (Sub 488)
sub_790_input_ENCODE/ ENCODE input JSONs + driver script (Sub 790)
comparison1_input_COBRA/ COBRA config + metasheet + DiffBind rmd + preprocessing scripts
comparison2_input_COBRA/ COBRA config + metasheet
comparison3_input_COBRA/ COBRA config + metasheet
gene_pathways_of_interest/ notch / wnt / mapk gene lists
peak_ranking.py Stage 3: regression-based peak ranking + plots
get_peak_ranking_dfs.py Stage 3: peak/mRNA dataframe assembly
get_TPM_files_from_GUP.py Retrieve RNA-seq TPM tables from GUP
config.py / config.yaml Machine-specific path configuration
environment.yml Curated, portable conda environment
atac_seq_env.yml Full conda env lockfile (exact build pins)
ATAC_SEQ_QC.xlsx Aggregated QC metrics
ATAC-seq_comparisons.pptx Cross-comparison summary slides