Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
15 changes: 9 additions & 6 deletions Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -2,8 +2,10 @@
## datasets (Argelaguet, CRC, Ecker) on a server with enough RAM.
##
## Not intended for laptops: the runs allocate hundreds of GB of virtual
## memory across many parallel amet jobs. The recipes set ulimit -v
## 200 GB as a soft safeguard and let snakemake fan out across CORES cores.
## memory across many parallel amet jobs. The recipes set ulimit -v as a
## soft safeguard (100 GB per process, see ULIMIT_KB) and let snakemake fan
## out across CORES cores. The ulimit is per-process: every job shell
## inherits the same cap, it is not a shared budget across rules.
##
## Usage:
## make argelaguet # proto by default (results/argelaguet_proto/)
Expand All @@ -17,15 +19,16 @@
## Variables (override on the command line):
## MODE proto | full which dataset config file to load
## (default: proto)
## CORES snakemake --cores value (default 16)
## ULIMIT_KB virtual memory cap in KB (default 209715200, i.e. 200 GB)
## CORES snakemake --cores value (default 40)
## ULIMIT_KB per-process virtual memory cap in KB, inherited by every
## job shell (default 104857600, i.e. 100 GB)
## CONDA_ENV name of the conda env that holds snakemake (default snakemake)
## CONDA_INIT path to the conda activation script
## (default ~/miniconda3/bin/activate)

MODE ?= proto
CORES ?= 16
ULIMIT_KB ?= 209715200
CORES ?= 40
ULIMIT_KB ?= 104857600
CONDA_ENV ?= snakemake
CONDA_INIT ?= $(HOME)/miniconda3/bin/activate

Expand Down
39 changes: 34 additions & 5 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -95,8 +95,8 @@ Tunable variables:
| Variable | Default | Description |
|---|---|---|
| `MODE` | `proto` | `proto` or `full`; picks `workflow/config/datasets_$(MODE).yaml` |
| `CORES` | 16 | Snakemake `--cores` value |
| `ULIMIT_KB` | 209715200 (200 GB) | Virtual-memory cap; inherited by every amet job |
| `CORES` | 40 | Snakemake `--cores` value |
| `ULIMIT_KB` | 104857600 (100 GB) | Per-process virtual-memory cap, inherited by every amet job shell. Not a shared budget: each of the `CORES` concurrent jobs is capped independently. |
| `CONDA_ENV` | `snakemake` | Conda env that holds snakemake |
| `CONDA_INIT` | `~/miniconda3/bin/activate` | Conda activation script |

Expand Down Expand Up @@ -125,7 +125,7 @@ The three dataset rule files expand over a fixed list of annotations defined at

### Server deployment

The Makefile is designed for a workstation with enough RAM for the whole-genome amet runs (hundreds of GB virtual memory under parallel jobs; the recipes apply `ulimit -v` as a soft cap). It is not designed for laptops.
The Makefile is designed for a workstation with enough RAM for the whole-genome amet runs. The recipes apply `ulimit -v` as a soft per-process cap (`ULIMIT_KB`, default 100 GB); it is inherited by every job shell and bounds each amet process independently, so peak machine memory is roughly `CORES` times one job's actual usage, not the cap. Each amet job scores all of a dataset's annotation BEDs in one pass, so its footprint grows with the total number of features across those BEDs. It is not designed for laptops.

If you are in the Mark Robinson lab at UZH, `workflow/scripts/internal/setup_barbara_links.sh` populates `results/<dataset>/{cells,raw,features,mm10,hg19}` as symlinks to the pre-staged data tree on `barbara`'s filesystem. See `workflow/scripts/internal/README.md`. Outside that lab, run the per-dataset download rules in each `.smk` instead.

Expand All @@ -136,15 +136,15 @@ If you are in the Mark Robinson lab at UZH, `workflow/scripts/internal/setup_bar
| `--genome` | (required, mutually exclusive with `--cpg-reference`) | FASTA of the reference genome. amet derives all CpG positions on first use and caches them to `<fasta>.cpg`. |
| `--cpg-reference` | (required, mutually exclusive with `--genome`) | Tab-separated `chrom\tpos` of every CpG, 0-based. Defines adjacency: an uncovered reference CpG breaks 2-mer pairing across it. |
| `--cells` | (required unless `--build-cpg-only`) | Manifest TSV (see below). |
| `--features` | (required unless `--build-cpg-only`) | BED of regions to score. Features must not overlap. |
| `--features` | (required unless `--build-cpg-only`) | BED of regions to score. Features within a BED must not overlap. Pass `--features` multiple times to score against several BEDs in one cell-read pass; see "Multiple feature sets" below. |
| `--output-prefix` | (required unless `--build-cpg-only`) | Prefix for the output files. |
| `--build-cpg-only` | off | Only materialise `<fasta>.cpg` and exit. Requires `--genome`. |
| `--group-column` | `group` | Manifest column to use as the group label. |
| `--meth-call-threshold` | `0.0` | Methylation fraction `m/t` above which a CpG is called methylated. `0.5` for majority rule; `0.1` calls any position with > 10% methylated reads as methylated. |
| `--min-reads-per-cpg` | `1` | A CpG is observed only if covered by at least this many reads. Bulk WGBS users typically set 5-10. |
| `--min-cpgs-per-feature` | `5` | A `(cell, feature)` is scored only if at least this many CpGs are covered. Below the threshold, scores are reported as `NA`. |
| `--min-cells-per-group` | `10` | A `(feature, group)` reports `jsd` only if at least this many cells pass the per-cell coverage filter. Otherwise `jsd` is `NA`. |
| `--i-max-lag` | `3` | Maximum CpG lag k for `I_total = sum_{k=1..max} I_k`. |
| `--i-max-lag` | `3` | Maximum CpG lag k for `I_total = sum_{k=1..max} I_k`. Must be at least 1; lag 1 underpins JSD. |
| `--max-pair-distance` | `0` (off) | Maximum nucleotide distance allowed between two CpGs of a pair. Pairs whose genomic distance exceeds this value are not counted, at any lag. `0` disables the cap. |
| `--threads` | `0` (all) | Number of threads. |

Expand Down Expand Up @@ -189,6 +189,35 @@ chr1 1000 2000 promoter_GENE1
chr1 5000 7000 cgi_chr1_5000
```

### Multiple feature sets

`--features` can be passed more than once to score the same cells against several BEDs in a single run, so each cell file is parsed only once regardless of how many feature sets are used. This is the recommended way to compare, for example, promoters, enhancers, and heterochromatin in one pass.

```
amet \
--genome mm10.fa \
--cells cells.tsv \
--features promoters.bed \
--features enhancers.bed \
--features heterochromatin.bed \
--output-prefix run1
```

With a single `--features` the output paths are exactly as documented above (`run1.cell_feature.tsv.gz`, etc.). With two or more, amet writes one output triplet per BED, keyed by the BED basename:

```
run1.promoters.cell_feature.tsv.gz
run1.promoters.feature.tsv.gz
run1.promoters.pair_counts.tsv.gz
run1.enhancers.cell_feature.tsv.gz
run1.enhancers.feature.tsv.gz
run1.enhancers.pair_counts.tsv.gz
run1.heterochromatin.cell_feature.tsv.gz
...
```

The label is the BED file name with any of `.bed.gz`, `.bed.bgz`, `.bed`, `.gz`, or `.bgz` stripped. If two BEDs resolve to the same label (for example `regions.bed` in two different directories), amet exits with an error before doing any work; rename one of the inputs.

### Outputs

`<prefix>.cell_feature.tsv.gz`. One row per `(cell, feature)`:
Expand Down
17 changes: 13 additions & 4 deletions method/src/cli.rs
Original file line number Diff line number Diff line change
Expand Up @@ -12,9 +12,17 @@ pub struct Cli {
#[arg(long, value_name = "TSV", required_unless_present = "build_cpg_only")]
pub cells: Option<PathBuf>,

/// BED file of features to score. Features should not overlap.
#[arg(long, value_name = "BED", required_unless_present = "build_cpg_only")]
pub features: Option<PathBuf>,
/// BED file of features to score. Features within a single BED should not overlap.
/// Pass --features multiple times to score the same cells against several feature
/// sets in one cell-read pass; each set writes its own output triplet keyed by the
/// BED basename. With a single --features the output paths are unchanged.
#[arg(
long,
value_name = "BED",
action = clap::ArgAction::Append,
required_unless_present = "build_cpg_only"
)]
pub features: Vec<PathBuf>,

/// FASTA of the reference genome. amet derives all CpG positions from it on first
/// use and caches them to <fasta>.cpg next to the input. Subsequent runs reuse the
Expand Down Expand Up @@ -63,7 +71,8 @@ pub struct Cli {
pub min_cells_per_group: u32,

/// Maximum CpG lag k for the I_total within-cell score: I_total = sum_{k=1..max} I_k.
#[arg(long, default_value_t = 3)]
/// Must be at least 1; lag 1 is required to compute JSD.
#[arg(long, default_value_t = 3, value_parser = clap::value_parser!(u32).range(1..))]
pub i_max_lag: u32,

/// Maximum nucleotide distance allowed between paired CpGs. Pairs whose genomic
Expand Down
5 changes: 3 additions & 2 deletions method/src/io.rs
Original file line number Diff line number Diff line change
Expand Up @@ -17,8 +17,9 @@ pub fn open_read(path: &Path) -> Result<Box<dyn BufRead>> {
}
}

/// Open a file for writing, gzipping if the path ends with .gz.
pub fn open_write(path: &Path) -> Result<Box<dyn Write>> {
/// Open a file for writing, gzipping if the path ends with .gz. The returned
/// writer is `Send` so it can be shared across worker threads behind a `Mutex`.
pub fn open_write(path: &Path) -> Result<Box<dyn Write + Send>> {
let file = File::create(path)?;
let ext = path.extension().and_then(|s| s.to_str()).unwrap_or("");
if ext == "gz" {
Expand Down
Loading
Loading