Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
15 changes: 15 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,21 @@
Notable changes to amet. Format: [Keep a Changelog](https://keepachangelog.com/en/1.1.0/);
versioning: [SemVer](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

### Added

- `cell_feature.tsv.gz` gains an `i_norm` column and `feature.tsv.gz` gains a
`jsd_norm` column: the methylation-normalized within-cell and across-cell
scores, `i_total / (k_max * H(mean_meth))` and `jsd / (2 * H(mean_meth_mean))`.
Both are NA outside the allowed methylation range [0.1, 0.9); no winsorizing.

### Changed

- Normalization moved into the binary. The workflow no longer recomputes the
normalized scores in R: `workflow/scripts/score_adjust.R` is removed, and the
eval scripts and report Rmds read amet's `i_norm`/`jsd_norm` columns directly.

## [0.1.1] - 2026-05-15

### Added
Expand Down
14 changes: 10 additions & 4 deletions Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@
##
## Not intended for laptops: the runs allocate hundreds of GB of virtual
## memory across many parallel amet jobs. The recipes set ulimit -v as a
## soft safeguard (100 GB per process, see ULIMIT_KB) and let snakemake fan
## soft safeguard (200 GB per process, see ULIMIT_KB) and let snakemake fan
## out across CORES cores. The ulimit is per-process: every job shell
## inherits the same cap, it is not a shared budget across rules.
##
Expand All @@ -21,14 +21,14 @@
## (default: proto)
## CORES snakemake --cores value (default 40)
## ULIMIT_KB per-process virtual memory cap in KB, inherited by every
## job shell (default 104857600, i.e. 100 GB)
## job shell (default 209715200, i.e. 200 GB)
## CONDA_ENV name of the conda env that holds snakemake (default snakemake)
## CONDA_INIT path to the conda activation script
## (default ~/miniconda3/bin/activate)

MODE ?= proto
CORES ?= 40
ULIMIT_KB ?= 104857600
ULIMIT_KB ?= 209715200
CONDA_ENV ?= snakemake
CONDA_INIT ?= $(HOME)/miniconda3/bin/activate

Expand All @@ -41,9 +41,14 @@ DATASETS_CONFIG := config/datasets_$(MODE).yaml
ACTIVATE := source $(CONDA_INIT) && conda activate $(CONDA_ENV) && \
ulimit -v $(ULIMIT_KB)

## --rerun-triggers mtime: a job is rerun only when an input file is newer
## than its outputs, not when rule code, params, or the conda env change.
## This keeps report reruns from cascading into the expensive amet jobs.
##
## The trailing `--` stops snakemake's option parsing so subsequent positional
## tokens are unambiguously targets, not extra --configfile values.
SNAKEMAKE := snakemake --use-conda --cores $(CORES) -p \
--rerun-triggers mtime \
--configfile $(DATASETS_CONFIG) --

.PHONY: all simulations argelaguet crc ecker dryrun unlock clean help \
Expand All @@ -68,7 +73,8 @@ ecker:

dryrun:
cd $(WORKFLOW_DIR) && bash -c '$(ACTIVATE) && \
snakemake --cores $(CORES) --configfile $(DATASETS_CONFIG) \
snakemake --cores $(CORES) --rerun-triggers mtime \
--configfile $(DATASETS_CONFIG) \
-n -- simulations argelaguet crc ecker'

unlock:
Expand Down
30 changes: 15 additions & 15 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,16 +2,16 @@

amet quantifies within- and across-cells epigenetic heterogeneity from single-cell DNA methylation data.

Two complementary scores:
Two complementary axes, each reported as a raw and a methylation-normalized score:

- Within-cell regularity along consecutive CpGs in one cell, scored by `I_total`, the sum of mutual information across CpG lags 1..k.
- Across-cell heterogeneity at a feature within a cell group, scored by `JSD` on per-cell lag-1 2-mer distributions.
- Within-cell regularity along consecutive CpGs in one cell, scored by `i_total`, the sum of mutual information across CpG lags 1..k, and its normalized form `i_norm`.
- Across-cell heterogeneity at a feature within a cell group, scored by `jsd` on per-cell lag-1 2-mer distributions, and its normalized form `jsd_norm`.

A sequence with no comethylation structure scores zero regardless of its methylation level, so no marginal-methylation adjustment is needed.
A sequence with no comethylation structure scores zero on `i_total` regardless of its methylation level. The normalized scores `i_norm`/`jsd_norm` divide by a binary-entropy term to correct the practical methylation dependence seen in real data; they are the headline scores.

## Status

v0.1. The Rust binary is feature-complete for the two scores. The Snakemake workflow runs the simulator-based validation grid and three reference dataset analyses (Argelaguet 2019 mouse gastrulation scNMT-seq, Liu 2021 mouse MOp snmC-seq2, Bian 2018 colorectal scTrio-seq2).
v0.1. The Rust binary is feature-complete for the four scores (`i_total`, `i_norm`, `jsd`, `jsd_norm`). The Snakemake workflow runs the simulator-based validation grid and three reference dataset analyses (Argelaguet 2019 mouse gastrulation scNMT-seq, Liu 2021 mouse MOp snmC-seq2, Bian 2018 colorectal scTrio-seq2).

## Repository layout

Expand Down Expand Up @@ -223,18 +223,18 @@ The label is the BED file name with any of `.bed.gz`, `.bed.bgz`, `.bed`, `.gz`,
`<prefix>.cell_feature.tsv.gz`. One row per `(cell, feature)`:

```
cell_id group feature_id n_covered mean_meth n_zeros n_ones i_total i_1 i_2 ... i_k
cell_id group feature_id n_covered mean_meth n_zeros n_ones i_total i_norm i_1 i_2 ... i_k
```

`i_k` columns are present up to `--i-max-lag`. When fewer than `--min-cpgs-per-feature` CpGs are covered, score columns are `NA`.
`i_norm` is `i_total / (k_max * H(mean_meth))`, the methylation-normalized within-cell score; it is `NA` when `mean_meth` is below 0.1 or at/above 0.9. `i_k` columns are present up to `--i-max-lag`. When fewer than `--min-cpgs-per-feature` CpGs are covered, score columns are `NA`.

`<prefix>.feature.tsv.gz`. One row per `(feature, group)`:

```
feature_id group n_cells mean_coverage mean_meth_mean mean_meth_var i_total_mean i_total_var jsd
feature_id group n_cells mean_coverage mean_meth_mean mean_meth_var i_total_mean i_total_var jsd jsd_norm
```

`jsd` is the multi-distribution Jensen-Shannon divergence across the cells in each group, using each cell's lag-1 2-mer distribution. If a group has fewer than `--min-cells-per-group` eligible cells, `jsd` is `NA`. Per-group is the only meaningful axis; pooled JSD is not reported.
`jsd` is the multi-distribution Jensen-Shannon divergence across the cells in each group, using each cell's lag-1 2-mer distribution. If a group has fewer than `--min-cells-per-group` eligible cells, `jsd` is `NA`. Per-group is the only meaningful axis; pooled JSD is not reported. `jsd_norm` is `jsd / (2 * H(mean_meth_mean))`, the methylation-normalized across-cell score; it is `NA` when `mean_meth_mean` is below 0.1 or at/above 0.9.

`<prefix>.pair_counts.tsv.gz`. One row per `(cell, feature, lag)` with the four 2-mer counts:

Expand All @@ -246,22 +246,22 @@ Useful for downstream analyses that want to recompute scores under alternative t

## Scores

### Within-cell: `I_total`
### Within-cell: `i_total` and `i_norm`

For a single cell, with binarised CpG calls along a feature, compute mutual information between CpGs separated by lag k:

```
I_k = H(X_i) + H(X_{i+k}) - H(X_i, X_{i+k})
I_total = sum_{k=1..k_max} I_k
i_total = sum_{k=1..k_max} I_k
```

For an i.i.d. sequence with marginal p, every `I_k = 0` for any p, so `I_total` has a p-invariant zero baseline. No methylation-level adjustment is needed.
For an i.i.d. sequence with marginal p, every `I_k = 0` for any p, so `i_total` has a p-invariant zero baseline.

`I_total` carries practical methylation dependence in real data (cells with mostly homozygous calls have less information to capture). The zero-baseline property covers the null only; for differential testing across conditions with different marginal methylation, residualise on `mean_meth`.
`i_total` still carries practical methylation dependence in real data (cells with mostly homozygous calls have less information to capture). `i_norm = i_total / (k_max * H(mean_meth))` divides by that entropy ceiling to correct it, where `H` is binary Shannon entropy in bits. `H` collapses near methylation 0 and 1, so `i_norm` is defined only over an allowed methylation range and is `NA` when `mean_meth` is below 0.1 or at/above 0.9. `i_norm` is the headline within-cell score; no winsorizing is applied. For differential testing across conditions with different marginal methylation, the workflow instead residualises `i_total` on `mean_meth`.

### Across-cell, per group: `JSD` (lag-1 2-mer)
### Across-cell, per group: `jsd` and `jsd_norm` (lag-1 2-mer)

For each cell, build a lag-1 2-mer histogram per feature (4 bins: 00, 01, 10, 11). Compute multi-distribution Jensen-Shannon divergence across the cells in each group. Reported per `(feature, group)`, never pooled.
For each cell, build a lag-1 2-mer histogram per feature (4 bins: 00, 01, 10, 11). Compute multi-distribution Jensen-Shannon divergence across the cells in each group. Reported per `(feature, group)`, never pooled. `jsd_norm = jsd / (2 * H(mean_meth_mean))` is the methylation-normalized form, the headline across-cell score, `NA` when `mean_meth_mean` is below 0.1 or at/above 0.9.

## Simulations

Expand Down
5 changes: 2 additions & 3 deletions TODO.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,8 +4,7 @@

amet currently has two distinct ways of expressing a "methylation-decoupled" within-cell entropy, and the workflow uses both under different names:

- `i_norm` (in `workflow/scripts/eval_*.R` and `simulations_report.Rmd`): analytical normalization, defined as `i_total / (k_max * H(p_hat))`. Headline score used in the simulations and tool-comparison benchmarks.
- `i_norm`: analytical normalization, `i_total / (k_max * H(mean_meth))`, NA outside the allowed methylation range [0.1, 0.9). Computed by the amet binary (`method/src/scores/normalize.rs`) and emitted as a column of `cell_feature.tsv.gz`. The headline within-cell score, read directly across the simulations, tool-comparison benchmarks, eval scripts, and dataset Rmds.
- `i_total_resid` (in `workflow/Rmd/crc_windows_sce.Rmd`): empirical residuals from a per-window `lm(i_total ~ mean_meth + I(mean_meth^2))` fit. Used as the input to the SCE-based differential entropy testing and the per-cell embeddings.

These are different quantities computed by different math. Pick one canonical decoupling strategy (or document the regimes where each is preferred) and harmonize naming across the simulations, evals, dataset Rmds, and figure Rmds.

These are different quantities computed by different math. The working split is `i_norm` everywhere as the general score and `i_total_resid` only inside the differential-testing chain. Confirm this is the intended canonical decision, document it, and harmonize naming across the simulations, evals, dataset Rmds, and figure Rmds.
20 changes: 17 additions & 3 deletions method/src/main.rs
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@ use amet::kmer::{PairCounts, build_window, marginal_counts, pair_counts_all_lags
use amet::manifest::{CellRow, read_manifest};
use amet::parsers::{CellFormat, read_cell};
use amet::reference::read_cpg_reference;
use amet::scores::normalize::{i_norm, jsd_norm};
use amet::scores::{i_total::i_total, jsd::JsdAccumulator};
use anyhow::{Context, Result, anyhow};
use rayon::prelude::*;
Expand Down Expand Up @@ -259,6 +260,11 @@ fn main() -> Result<()> {
write_opt(w, e.itotal.var())?;
write!(w, "\t")?;
write_opt(w, jsd)?;
write!(w, "\t")?;
// jsd_norm: jsd normalized by 2 * H(mean_meth_mean), NA outside the
// allowed methylation range.
let jsd_norm_value = jsd.and_then(|j| e.meth.mean().and_then(|m| jsd_norm(j, m)));
write_opt(w, jsd_norm_value)?;
writeln!(w)?;
}

Expand Down Expand Up @@ -333,11 +339,19 @@ fn write_cell_feature_row(
write!(w, "\t{}\t{}", row.n_zeros, row.n_ones)?;
if row.n_covered >= min_n {
write!(w, "\t{:.6}", row.i_total_value)?;
// i_norm: i_total normalized by k_max * H(mean_meth), NA outside the
// allowed methylation range.
let i_norm_value = row
.mean_meth
.and_then(|m| i_norm(row.i_total_value, i_max_lag as f64, m));
write!(w, "\t")?;
write_opt(w, i_norm_value)?;
for v in &row.i_per_lag {
write!(w, "\t{:.6}", v)?;
}
} else {
write!(w, "\tNA")?;
// i_total, i_norm, then one NA per lag.
write!(w, "\tNA\tNA")?;
for _ in 0..i_max_lag {
write!(w, "\tNA")?;
}
Expand Down Expand Up @@ -407,15 +421,15 @@ fn write_headers(
) -> std::io::Result<()> {
write!(
cf_writer,
"cell_id\tgroup\tfeature_id\tn_covered\tmean_meth\tn_zeros\tn_ones\ti_total"
"cell_id\tgroup\tfeature_id\tn_covered\tmean_meth\tn_zeros\tn_ones\ti_total\ti_norm"
)?;
for k in 1..=i_max_lag {
write!(cf_writer, "\ti_{}", k)?;
}
writeln!(cf_writer)?;
writeln!(
feat_writer,
"feature_id\tgroup\tn_cells\tmean_coverage\tmean_meth_mean\tmean_meth_var\ti_total_mean\ti_total_var\tjsd"
"feature_id\tgroup\tn_cells\tmean_coverage\tmean_meth_mean\tmean_meth_var\ti_total_mean\ti_total_var\tjsd\tjsd_norm"
)?;
writeln!(
pair_writer,
Expand Down
1 change: 1 addition & 0 deletions method/src/scores/mod.rs
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
pub mod i_total;
pub mod jsd;
pub mod normalize;

/// Plug-in Shannon entropy in bits, with optional Miller-Madow bias correction.
/// Counts that sum to 0 return 0.
Expand Down
115 changes: 115 additions & 0 deletions method/src/scores/normalize.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,115 @@
//! Methylation-normalized scores i_norm and jsd_norm.
//!
//! i_norm and jsd_norm divide a raw score by the binary Shannon entropy H(p)
//! of the average methylation p. H(p) collapses to zero as p approaches 0 or
//! 1, so the ratio is only defined over a fixed allowed methylation range: a
//! value is normalized only when 0.1 <= p < 0.9, and is NA (None) otherwise.
//! The range edges are fixed constants, not data-derived quantiles.

/// Lower edge of the allowed methylation range (inclusive) for normalization.
pub const METH_LO: f64 = 0.1;
/// Upper edge of the allowed methylation range (exclusive) for normalization.
pub const METH_HI: f64 = 0.9;

/// Binary Shannon entropy H(p) in bits. Returns 0 at p <= 0 or p >= 1.
pub fn shannon_binary(p: f64) -> f64 {
if p <= 0.0 || p >= 1.0 {
0.0
} else {
-p * p.log2() - (1.0 - p) * (1.0 - p).log2()
}
}

/// True when p lies in the allowed methylation range [METH_LO, METH_HI).
fn in_allowed_range(p: f64) -> bool {
(METH_LO..METH_HI).contains(&p)
}

/// Within-cell normalized score: i_total / (k_max * H(mean_meth)).
/// None when mean_meth is outside the allowed range or the denominator is not
/// positive.
pub fn i_norm(i_total: f64, k_max: f64, mean_meth: f64) -> Option<f64> {
if !in_allowed_range(mean_meth) {
return None;
}
let denom = k_max * shannon_binary(mean_meth);
if denom > 0.0 {
Some(i_total / denom)
} else {
None
}
}

/// Across-cell normalized score: jsd / (2 * H(mean_meth_mean)).
/// None when mean_meth_mean is outside the allowed range or the denominator is
/// not positive.
pub fn jsd_norm(jsd: f64, mean_meth_mean: f64) -> Option<f64> {
if !in_allowed_range(mean_meth_mean) {
return None;
}
let denom = 2.0 * shannon_binary(mean_meth_mean);
if denom > 0.0 { Some(jsd / denom) } else { None }
}

#[cfg(test)]
mod tests {
use super::*;

#[test]
fn shannon_binary_half_is_one_bit() {
assert!((shannon_binary(0.5) - 1.0).abs() < 1e-12);
}

#[test]
fn shannon_binary_extremes_are_zero() {
assert_eq!(shannon_binary(0.0), 0.0);
assert_eq!(shannon_binary(1.0), 0.0);
assert_eq!(shannon_binary(-0.1), 0.0);
assert_eq!(shannon_binary(1.5), 0.0);
}

#[test]
fn shannon_binary_at_range_edge() {
// H(0.1) = H(0.9) ~ 0.468996 bits.
assert!((shannon_binary(0.1) - 0.4689956).abs() < 1e-6);
assert!((shannon_binary(0.9) - 0.4689956).abs() < 1e-6);
}

#[test]
fn i_norm_in_band_divides_by_k_times_entropy() {
// mean_meth 0.5 -> H = 1 bit, k_max 3 -> denominator 3.
assert!((i_norm(3.0, 3.0, 0.5).unwrap() - 1.0).abs() < 1e-12);
}

#[test]
fn i_norm_allowed_range_is_lo_inclusive_hi_exclusive() {
assert!(i_norm(1.0, 3.0, 0.1).is_some());
assert!(i_norm(1.0, 3.0, 0.9).is_none());
}

#[test]
fn i_norm_outside_allowed_range_is_none() {
assert!(i_norm(1.0, 3.0, 0.05).is_none());
assert!(i_norm(1.0, 3.0, 0.95).is_none());
assert!(i_norm(1.0, 3.0, 0.0).is_none());
assert!(i_norm(1.0, 3.0, 1.0).is_none());
}

#[test]
fn i_norm_zero_k_max_is_none() {
assert!(i_norm(1.0, 0.0, 0.5).is_none());
}

#[test]
fn jsd_norm_in_band_divides_by_two_entropy() {
// mean_meth 0.5 -> H = 1 bit -> denominator 2.
assert!((jsd_norm(1.0, 0.5).unwrap() - 0.5).abs() < 1e-12);
}

#[test]
fn jsd_norm_outside_allowed_range_is_none() {
assert!(jsd_norm(1.0, 0.05).is_none());
assert!(jsd_norm(1.0, 0.9).is_none());
assert!(jsd_norm(1.0, 0.95).is_none());
}
}
Loading
Loading