Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 7 additions & 0 deletions .github/workflows/ci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -3,8 +3,15 @@ name: CI
on:
push:
branches: [main]
paths:
- 'method/**'
- '.github/workflows/ci.yml'
pull_request:
branches: [main]
paths:
- 'method/**'
- '.github/workflows/ci.yml'
workflow_dispatch:

jobs:
rust:
Expand Down
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -26,3 +26,6 @@ workflow/Rmd/*_plots/

# Misc
*.cpg

# Local staging area, not part of the Snakemake workflow.
tmp/
43 changes: 24 additions & 19 deletions Makefile
Original file line number Diff line number Diff line change
@@ -1,48 +1,51 @@
## amet workflow entrypoints. Use these to run simulations + the three real
## datasets (Argelaguet, CRC, Ecker) whole-genome on a server with enough RAM.
## datasets (Argelaguet, CRC, Ecker) on a server with enough RAM.
##
## Not intended for laptops: the whole-genome runs allocate hundreds of GB of
## virtual memory across many parallel amet jobs. The recipes set ulimit -v
## Not intended for laptops: the runs allocate hundreds of GB of virtual
## memory across many parallel amet jobs. The recipes set ulimit -v
## 200 GB as a soft safeguard and let snakemake fan out across CORES cores.
##
## Usage:
## make argelaguet # whole-genome Argelaguet (4 Rmds)
## make crc # whole-genome CRC (6 Rmds)
## make ecker # whole-genome Ecker (4 Rmds)
## make simulations # simulations report
## make all # everything above
## make dryrun # snakemake -n for everything
## make unlock # release a stale snakemake lock
## make argelaguet # proto by default (results/argelaguet_proto/)
## make crc MODE=full # full grid (results/crc_full/)
## make ecker MODE=proto # explicit proto
## make simulations # simulations report (MODE-agnostic)
## make all MODE=full # simulations + 3 datasets in full mode
## make dryrun MODE=full # snakemake -n for everything (full)
## make unlock # release a stale snakemake lock
##
## Tunable variables (override on the command line):
## Variables (override on the command line):
## MODE proto | full which dataset config file to load
## (default: proto)
## CORES snakemake --cores value (default 16)
## ULIMIT_KB virtual memory cap in KB (default 209715200, i.e. 200 GB)
## CONDA_ENV name of the conda env that holds snakemake (default snakemake)
## CONDA_INIT path to the conda activation script
## (default ~/miniconda3/bin/activate)

MODE ?= proto
CORES ?= 16
ULIMIT_KB ?= 209715200
CONDA_ENV ?= snakemake
CONDA_INIT ?= $(HOME)/miniconda3/bin/activate

WORKFLOW_DIR := workflow
DATASETS_CONFIG := config/datasets_$(MODE).yaml

## Standard preamble: activate the snakemake conda env and apply the
## virtual-memory ulimit. snakemake's per-job shells inherit the ulimit, so
## individual amet jobs are bounded by the same cap.
ACTIVATE := source $(CONDA_INIT) && conda activate $(CONDA_ENV) && \
ulimit -v $(ULIMIT_KB)

SNAKEMAKE := snakemake --use-conda --cores $(CORES) -p
## The trailing `--` stops snakemake's option parsing so subsequent positional
## tokens are unambiguously targets, not extra --configfile values.
SNAKEMAKE := snakemake --use-conda --cores $(CORES) -p \
--configfile $(DATASETS_CONFIG) --

.PHONY: all simulations argelaguet crc ecker dryrun unlock clean help \
setup-barbara

## Set up symlinks from results/{dataset}/ to a pre-existing data tree on
## barbara so amet does not re-download or re-rsync anything. Run once before
## `make all`. See workflow/scripts/internal/setup_barbara_links.sh for the
## env vars it honors.
setup-barbara:
bash $(WORKFLOW_DIR)/scripts/internal/setup_barbara_links.sh

Expand All @@ -62,11 +65,13 @@ ecker:

dryrun:
cd $(WORKFLOW_DIR) && bash -c '$(ACTIVATE) && \
snakemake --cores $(CORES) -n simulations argelaguet crc ecker'
snakemake --cores $(CORES) --configfile $(DATASETS_CONFIG) \
-n -- simulations argelaguet crc ecker'

unlock:
cd $(WORKFLOW_DIR) && bash -c '$(ACTIVATE) && snakemake --unlock'

help:
@echo "Targets: all simulations argelaguet crc ecker dryrun unlock"
@echo "Variables: CORES=$(CORES) ULIMIT_KB=$(ULIMIT_KB) CONDA_ENV=$(CONDA_ENV)"
@echo "Targets: all simulations argelaguet crc ecker dryrun unlock setup-barbara"
@echo "Variables: MODE=$(MODE) CORES=$(CORES) ULIMIT_KB=$(ULIMIT_KB) CONDA_ENV=$(CONDA_ENV)"
@echo "Selected dataset config: $(DATASETS_CONFIG)"
30 changes: 18 additions & 12 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -26,7 +26,8 @@ amet/
Snakefile simulations + dispatch to the three dataset rule files
config/
sim.yaml simulation parameters
datasets.yaml dataset paths, prototype subsets, window sizes
datasets_proto.yaml dataset paths + small proto strata + cell cap
datasets_full.yaml same paths, full grid, larger cell cap
envs/ conda envs (rust, bedtools, r-tools, python)
rules/
common.smk build_amet, fetch FASTAs, build_cpg_reference
Expand Down Expand Up @@ -79,32 +80,36 @@ amet --build-cpg-only --genome mm10.fa
The `Makefile` is the top-level entry point. From the repo root:

```
make argelaguet # whole-genome Argelaguet (4 Rmds)
make crc # whole-genome CRC (6 Rmds)
make ecker # whole-genome Ecker (4 Rmds)
make simulations # simulation report
make all # everything above
make dryrun # snakemake -n for all targets
make unlock # release a stale snakemake lock
make help # target list
make argelaguet # proto by default (results/argelaguet_proto/)
make crc MODE=full # full grid (results/crc_full/)
make ecker MODE=proto # explicit proto
make simulations # simulation report (MODE-agnostic)
make all MODE=full # simulations + 3 datasets in full mode
make dryrun MODE=full # snakemake -n for everything (full)
make unlock # release a stale snakemake lock
make help # target list + active config
```

Tunable variables:

| Variable | Default | Description |
|---|---|---|
| `MODE` | `proto` | `proto` or `full`; picks `workflow/config/datasets_$(MODE).yaml` |
| `CORES` | 16 | Snakemake `--cores` value |
| `ULIMIT_KB` | 209715200 (200 GB) | Virtual-memory cap; inherited by every amet job |
| `CONDA_ENV` | `snakemake` | Conda env that holds snakemake |
| `CONDA_INIT` | `~/miniconda3/bin/activate` | Conda activation script |

Override on the command line, e.g. `make argelaguet CORES=32`.
Override on the command line, e.g. `make argelaguet MODE=full CORES=32`.

### Prototype vs full-run

`workflow/config/datasets.yaml` carries a `prototype:` block. With `prototype.enabled: true` (the default), each dataset's manifest builder caps cells per stratum and restricts to a small list of strata so the runs finish in minutes. Output goes to `results/<dataset>_proto/`.
Two config files in `workflow/config/`:

For full publication runs, set `prototype.enabled: false` and change each dataset's `run_name` (e.g., from `argelaguet_proto` to `argelaguet`) so the full outputs don't overwrite the prototype ones.
- **`datasets_proto.yaml`** -- restricts CRC to `CRC01` x `NC/PT/LN`, Ecker to a handful of MOp sub_types, and caps cells per per-combo stratum at 10. Outputs land in `results/<dataset>_proto/`. Picked when `MODE=proto` (the default).
- **`datasets_full.yaml`** -- runs every patient x location for CRC, every (sub_region, sub_type) in the configured region for Ecker, every (stage, lineage) for Argelaguet. Caps cells per per-combo stratum at 30, coverage-ranked and plate-balanced (Argelaguet, Ecker). Outputs land in `results/<dataset>_full/`. Picked when `MODE=full`.

The two modes use distinct output directories (`<dataset>_proto/` vs `<dataset>_full/`), so a full run does not clobber an earlier proto run, and vice versa.

### Window sizes

Expand Down Expand Up @@ -140,6 +145,7 @@ If you are in the Mark Robinson lab at UZH, `workflow/scripts/internal/setup_bar
| `--min-cpgs-per-feature` | `5` | A `(cell, feature)` is scored only if at least this many CpGs are covered. Below the threshold, scores are reported as `NA`. |
| `--min-cells-per-group` | `10` | A `(feature, group)` reports `jsd` only if at least this many cells pass the per-cell coverage filter. Otherwise `jsd` is `NA`. |
| `--i-max-lag` | `3` | Maximum CpG lag k for `I_total = sum_{k=1..max} I_k`. |
| `--max-pair-distance` | `0` (off) | Maximum nucleotide distance allowed between two CpGs of a pair. Pairs whose genomic distance exceeds this value are not counted, at any lag. `0` disables the cap. |
| `--threads` | `0` (all) | Number of threads. |

### Manifest (`--cells`)
Expand Down
11 changes: 11 additions & 0 deletions TODO.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
# amet TODO

## Reconcile `i_total_resid` vs `i_norm`

amet currently has two distinct ways of expressing a "methylation-decoupled" within-cell entropy, and the workflow uses both under different names:

- `i_norm` (in `workflow/scripts/eval_*.R` and `simulations_report.Rmd`): analytical normalization, defined as `i_total / (k_max * H(p_hat))`. Headline score used in the simulations and tool-comparison benchmarks.
- `i_total_resid` (in `workflow/Rmd/crc_windows_sce.Rmd`): empirical residuals from a per-window `lm(i_total ~ mean_meth + I(mean_meth^2))` fit. Used as the input to the SCE-based differential entropy testing and the per-cell embeddings.

These are different quantities computed by different math. Pick one canonical decoupling strategy (or document the regimes where each is preferred) and harmonize naming across the simulations, evals, dataset Rmds, and figure Rmds.

5 changes: 5 additions & 0 deletions method/src/cli.rs
Original file line number Diff line number Diff line change
Expand Up @@ -66,6 +66,11 @@ pub struct Cli {
#[arg(long, default_value_t = 3)]
pub i_max_lag: u32,

/// Maximum nucleotide distance allowed between paired CpGs. Pairs whose genomic
/// distance exceeds this value are not counted. 0 disables the cap.
#[arg(long, default_value_t = 0)]
pub max_pair_distance: u64,

/// Number of threads. 0 means all available.
#[arg(long, default_value_t = 0)]
pub threads: usize,
Expand Down
Loading
Loading