imallona · imallona · May 15, 2026 · May 16, 2026 · May 18, 2026 · May 18, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -3,6 +3,21 @@
 Notable changes to amet. Format: [Keep a Changelog](https://keepachangelog.com/en/1.1.0/);
 versioning: [SemVer](https://semver.org/spec/v2.0.0.html).
 
+## [Unreleased]
+
+### Added
+
+- `cell_feature.tsv.gz` gains an `i_norm` column and `feature.tsv.gz` gains a
+  `jsd_norm` column: the methylation-normalized within-cell and across-cell
+  scores, `i_total / (k_max * H(mean_meth))` and `jsd / (2 * H(mean_meth_mean))`.
+  Both are NA outside the allowed methylation range [0.1, 0.9); no winsorizing.
+
+### Changed
+
+- Normalization moved into the binary. The workflow no longer recomputes the
+  normalized scores in R: `workflow/scripts/score_adjust.R` is removed, and the
+  eval scripts and report Rmds read amet's `i_norm`/`jsd_norm` columns directly.
+
 ## [0.1.1] - 2026-05-15
 
 ### Added

diff --git a/Makefile b/Makefile
@@ -3,7 +3,7 @@
 ##
 ## Not intended for laptops: the runs allocate hundreds of GB of virtual
 ## memory across many parallel amet jobs. The recipes set ulimit -v as a
-## soft safeguard (100 GB per process, see ULIMIT_KB) and let snakemake fan
+## soft safeguard (200 GB per process, see ULIMIT_KB) and let snakemake fan
 ## out across CORES cores. The ulimit is per-process: every job shell
 ## inherits the same cap, it is not a shared budget across rules.
 ##
@@ -21,14 +21,14 @@
 ##                                    (default: proto)
 ##   CORES        snakemake --cores value (default 40)
 ##   ULIMIT_KB    per-process virtual memory cap in KB, inherited by every
-##                job shell (default 104857600, i.e. 100 GB)
+##                job shell (default 209715200, i.e. 200 GB)
 ##   CONDA_ENV    name of the conda env that holds snakemake (default snakemake)
 ##   CONDA_INIT   path to the conda activation script
 ##                (default ~/miniconda3/bin/activate)
 
 MODE        ?= proto
 CORES       ?= 40
-ULIMIT_KB   ?= 104857600
+ULIMIT_KB   ?= 209715200
 CONDA_ENV   ?= snakemake
 CONDA_INIT  ?= $(HOME)/miniconda3/bin/activate
 
@@ -41,9 +41,14 @@ DATASETS_CONFIG := config/datasets_$(MODE).yaml
 ACTIVATE := source $(CONDA_INIT) && conda activate $(CONDA_ENV) && \
             ulimit -v $(ULIMIT_KB)
 
+## --rerun-triggers mtime: a job is rerun only when an input file is newer
+## than its outputs, not when rule code, params, or the conda env change.
+## This keeps report reruns from cascading into the expensive amet jobs.
+##
 ## The trailing `--` stops snakemake's option parsing so subsequent positional
 ## tokens are unambiguously targets, not extra --configfile values.
 SNAKEMAKE := snakemake --use-conda --cores $(CORES) -p \
+             --rerun-triggers mtime \
              --configfile $(DATASETS_CONFIG) --
 
 .PHONY: all simulations argelaguet crc ecker dryrun unlock clean help \
@@ -68,7 +73,8 @@ ecker:
 
 dryrun:
 	cd $(WORKFLOW_DIR) && bash -c '$(ACTIVATE) && \
-	  snakemake --cores $(CORES) --configfile $(DATASETS_CONFIG) \
+	  snakemake --cores $(CORES) --rerun-triggers mtime \
+	  --configfile $(DATASETS_CONFIG) \
 	  -n -- simulations argelaguet crc ecker'
 
 unlock:

diff --git a/README.md b/README.md
@@ -2,16 +2,16 @@
 
 amet quantifies within- and across-cells epigenetic heterogeneity from single-cell DNA methylation data.
 
-Two complementary scores:
+Two complementary axes, each reported as a raw and a methylation-normalized score:
 
-- Within-cell regularity along consecutive CpGs in one cell, scored by `I_total`, the sum of mutual information across CpG lags 1..k.
-- Across-cell heterogeneity at a feature within a cell group, scored by `JSD` on per-cell lag-1 2-mer distributions.
+- Within-cell regularity along consecutive CpGs in one cell, scored by `i_total`, the sum of mutual information across CpG lags 1..k, and its normalized form `i_norm`.
+- Across-cell heterogeneity at a feature within a cell group, scored by `jsd` on per-cell lag-1 2-mer distributions, and its normalized form `jsd_norm`.
 
-A sequence with no comethylation structure scores zero regardless of its methylation level, so no marginal-methylation adjustment is needed.
+A sequence with no comethylation structure scores zero on `i_total` regardless of its methylation level. The normalized scores `i_norm`/`jsd_norm` divide by a binary-entropy term to correct the practical methylation dependence seen in real data; they are the headline scores.
 
 ## Status
 
-v0.1. The Rust binary is feature-complete for the two scores. The Snakemake workflow runs the simulator-based validation grid and three reference dataset analyses (Argelaguet 2019 mouse gastrulation scNMT-seq, Liu 2021 mouse MOp snmC-seq2, Bian 2018 colorectal scTrio-seq2).
+v0.1. The Rust binary is feature-complete for the four scores (`i_total`, `i_norm`, `jsd`, `jsd_norm`). The Snakemake workflow runs the simulator-based validation grid and three reference dataset analyses (Argelaguet 2019 mouse gastrulation scNMT-seq, Liu 2021 mouse MOp snmC-seq2, Bian 2018 colorectal scTrio-seq2).
 
 ## Repository layout
 
@@ -223,18 +223,18 @@ The label is the BED file name with any of `.bed.gz`, `.bed.bgz`, `.bed`, `.gz`,
 `<prefix>.cell_feature.tsv.gz`. One row per `(cell, feature)`:
 
 ```
-cell_id  group  feature_id  n_covered  mean_meth  n_zeros  n_ones  i_total  i_1  i_2  ... i_k
+cell_id  group  feature_id  n_covered  mean_meth  n_zeros  n_ones  i_total  i_norm  i_1  i_2  ... i_k
 ```
 
-`i_k` columns are present up to `--i-max-lag`. When fewer than `--min-cpgs-per-feature` CpGs are covered, score columns are `NA`.
+`i_norm` is `i_total / (k_max * H(mean_meth))`, the methylation-normalized within-cell score; it is `NA` when `mean_meth` is below 0.1 or at/above 0.9. `i_k` columns are present up to `--i-max-lag`. When fewer than `--min-cpgs-per-feature` CpGs are covered, score columns are `NA`.
 
 `<prefix>.feature.tsv.gz`. One row per `(feature, group)`:
 
 ```
-feature_id  group  n_cells  mean_coverage  mean_meth_mean  mean_meth_var  i_total_mean  i_total_var  jsd
+feature_id  group  n_cells  mean_coverage  mean_meth_mean  mean_meth_var  i_total_mean  i_total_var  jsd  jsd_norm
 ```
 
-`jsd` is the multi-distribution Jensen-Shannon divergence across the cells in each group, using each cell's lag-1 2-mer distribution. If a group has fewer than `--min-cells-per-group` eligible cells, `jsd` is `NA`. Per-group is the only meaningful axis; pooled JSD is not reported.
+`jsd` is the multi-distribution Jensen-Shannon divergence across the cells in each group, using each cell's lag-1 2-mer distribution. If a group has fewer than `--min-cells-per-group` eligible cells, `jsd` is `NA`. Per-group is the only meaningful axis; pooled JSD is not reported. `jsd_norm` is `jsd / (2 * H(mean_meth_mean))`, the methylation-normalized across-cell score; it is `NA` when `mean_meth_mean` is below 0.1 or at/above 0.9.
 
 `<prefix>.pair_counts.tsv.gz`. One row per `(cell, feature, lag)` with the four 2-mer counts:
 
@@ -246,22 +246,22 @@ Useful for downstream analyses that want to recompute scores under alternative t
 
 ## Scores
 
-### Within-cell: `I_total`
+### Within-cell: `i_total` and `i_norm`
 
 For a single cell, with binarised CpG calls along a feature, compute mutual information between CpGs separated by lag k:
 
 ```
 I_k     = H(X_i) + H(X_{i+k}) - H(X_i, X_{i+k})
-I_total = sum_{k=1..k_max} I_k
+i_total = sum_{k=1..k_max} I_k
 ```
 
-For an i.i.d. sequence with marginal p, every `I_k = 0` for any p, so `I_total` has a p-invariant zero baseline. No methylation-level adjustment is needed.
+For an i.i.d. sequence with marginal p, every `I_k = 0` for any p, so `i_total` has a p-invariant zero baseline.
 
-`I_total` carries practical methylation dependence in real data (cells with mostly homozygous calls have less information to capture). The zero-baseline property covers the null only; for differential testing across conditions with different marginal methylation, residualise on `mean_meth`.
+`i_total` still carries practical methylation dependence in real data (cells with mostly homozygous calls have less information to capture). `i_norm = i_total / (k_max * H(mean_meth))` divides by that entropy ceiling to correct it, where `H` is binary Shannon entropy in bits. `H` collapses near methylation 0 and 1, so `i_norm` is defined only over an allowed methylation range and is `NA` when `mean_meth` is below 0.1 or at/above 0.9. `i_norm` is the headline within-cell score; no winsorizing is applied. For differential testing across conditions with different marginal methylation, the workflow instead residualises `i_total` on `mean_meth`.
 
-### Across-cell, per group: `JSD` (lag-1 2-mer)
+### Across-cell, per group: `jsd` and `jsd_norm` (lag-1 2-mer)
 
-For each cell, build a lag-1 2-mer histogram per feature (4 bins: 00, 01, 10, 11). Compute multi-distribution Jensen-Shannon divergence across the cells in each group. Reported per `(feature, group)`, never pooled.
+For each cell, build a lag-1 2-mer histogram per feature (4 bins: 00, 01, 10, 11). Compute multi-distribution Jensen-Shannon divergence across the cells in each group. Reported per `(feature, group)`, never pooled. `jsd_norm = jsd / (2 * H(mean_meth_mean))` is the methylation-normalized form, the headline across-cell score, `NA` when `mean_meth_mean` is below 0.1 or at/above 0.9.
 
 ## Simulations
 

diff --git a/TODO.md b/TODO.md
@@ -4,8 +4,7 @@
 
 amet currently has two distinct ways of expressing a "methylation-decoupled" within-cell entropy, and the workflow uses both under different names:
 
-- `i_norm` (in `workflow/scripts/eval_*.R` and `simulations_report.Rmd`): analytical normalization, defined as `i_total / (k_max * H(p_hat))`. Headline score used in the simulations and tool-comparison benchmarks.
+- `i_norm`: analytical normalization, `i_total / (k_max * H(mean_meth))`, NA outside the allowed methylation range [0.1, 0.9). Computed by the amet binary (`method/src/scores/normalize.rs`) and emitted as a column of `cell_feature.tsv.gz`. The headline within-cell score, read directly across the simulations, tool-comparison benchmarks, eval scripts, and dataset Rmds.
 - `i_total_resid` (in `workflow/Rmd/crc_windows_sce.Rmd`): empirical residuals from a per-window `lm(i_total ~ mean_meth + I(mean_meth^2))` fit. Used as the input to the SCE-based differential entropy testing and the per-cell embeddings.
 
-These are different quantities computed by different math. Pick one canonical decoupling strategy (or document the regimes where each is preferred) and harmonize naming across the simulations, evals, dataset Rmds, and figure Rmds.
-
+These are different quantities computed by different math. The working split is `i_norm` everywhere as the general score and `i_total_resid` only inside the differential-testing chain. Confirm this is the intended canonical decision, document it, and harmonize naming across the simulations, evals, dataset Rmds, and figure Rmds.
diff --git a/method/src/main.rs b/method/src/main.rs
@@ -6,6 +6,7 @@ use amet::kmer::{PairCounts, build_window, marginal_counts, pair_counts_all_lags
 use amet::manifest::{CellRow, read_manifest};
 use amet::parsers::{CellFormat, read_cell};
 use amet::reference::read_cpg_reference;
+use amet::scores::normalize::{i_norm, jsd_norm};
 use amet::scores::{i_total::i_total, jsd::JsdAccumulator};
 use anyhow::{Context, Result, anyhow};
 use rayon::prelude::*;
@@ -259,6 +260,11 @@ fn main() -> Result<()> {
             write_opt(w, e.itotal.var())?;
             write!(w, "\t")?;
             write_opt(w, jsd)?;
+            write!(w, "\t")?;
+            // jsd_norm: jsd normalized by 2 * H(mean_meth_mean), NA outside the
+            // allowed methylation range.
+            let jsd_norm_value = jsd.and_then(|j| e.meth.mean().and_then(|m| jsd_norm(j, m)));
+            write_opt(w, jsd_norm_value)?;
             writeln!(w)?;
         }
 
@@ -333,11 +339,19 @@ fn write_cell_feature_row(
     write!(w, "\t{}\t{}", row.n_zeros, row.n_ones)?;
     if row.n_covered >= min_n {
         write!(w, "\t{:.6}", row.i_total_value)?;
+        // i_norm: i_total normalized by k_max * H(mean_meth), NA outside the
+        // allowed methylation range.
+        let i_norm_value = row
+            .mean_meth
+            .and_then(|m| i_norm(row.i_total_value, i_max_lag as f64, m));
+        write!(w, "\t")?;
+        write_opt(w, i_norm_value)?;
         for v in &row.i_per_lag {
             write!(w, "\t{:.6}", v)?;
         }
     } else {
-        write!(w, "\tNA")?;
+        // i_total, i_norm, then one NA per lag.
+        write!(w, "\tNA\tNA")?;
         for _ in 0..i_max_lag {
             write!(w, "\tNA")?;
         }
@@ -407,15 +421,15 @@ fn write_headers(
 ) -> std::io::Result<()> {
     write!(
         cf_writer,
-        "cell_id\tgroup\tfeature_id\tn_covered\tmean_meth\tn_zeros\tn_ones\ti_total"
+        "cell_id\tgroup\tfeature_id\tn_covered\tmean_meth\tn_zeros\tn_ones\ti_total\ti_norm"
     )?;
     for k in 1..=i_max_lag {
         write!(cf_writer, "\ti_{}", k)?;
     }
     writeln!(cf_writer)?;
     writeln!(
         feat_writer,
-        "feature_id\tgroup\tn_cells\tmean_coverage\tmean_meth_mean\tmean_meth_var\ti_total_mean\ti_total_var\tjsd"
+        "feature_id\tgroup\tn_cells\tmean_coverage\tmean_meth_mean\tmean_meth_var\ti_total_mean\ti_total_var\tjsd\tjsd_norm"
     )?;
     writeln!(
         pair_writer,

diff --git a/method/src/scores/mod.rs b/method/src/scores/mod.rs
@@ -1,5 +1,6 @@
 pub mod i_total;
 pub mod jsd;
+pub mod normalize;
 
 /// Plug-in Shannon entropy in bits, with optional Miller-Madow bias correction.
 /// Counts that sum to 0 return 0.

diff --git a/method/src/scores/normalize.rs b/method/src/scores/normalize.rs
@@ -0,0 +1,115 @@
+//! Methylation-normalized scores i_norm and jsd_norm.
+//!
+//! i_norm and jsd_norm divide a raw score by the binary Shannon entropy H(p)
+//! of the average methylation p. H(p) collapses to zero as p approaches 0 or
+//! 1, so the ratio is only defined over a fixed allowed methylation range: a
+//! value is normalized only when 0.1 <= p < 0.9, and is NA (None) otherwise.
+//! The range edges are fixed constants, not data-derived quantiles.
+
+/// Lower edge of the allowed methylation range (inclusive) for normalization.
+pub const METH_LO: f64 = 0.1;
+/// Upper edge of the allowed methylation range (exclusive) for normalization.
+pub const METH_HI: f64 = 0.9;
+
+/// Binary Shannon entropy H(p) in bits. Returns 0 at p <= 0 or p >= 1.
+pub fn shannon_binary(p: f64) -> f64 {
+    if p <= 0.0 || p >= 1.0 {
+        0.0
+    } else {
+        -p * p.log2() - (1.0 - p) * (1.0 - p).log2()
+    }
+}
+
+/// True when p lies in the allowed methylation range [METH_LO, METH_HI).
+fn in_allowed_range(p: f64) -> bool {
+    (METH_LO..METH_HI).contains(&p)
+}
+
+/// Within-cell normalized score: i_total / (k_max * H(mean_meth)).
+/// None when mean_meth is outside the allowed range or the denominator is not
+/// positive.
+pub fn i_norm(i_total: f64, k_max: f64, mean_meth: f64) -> Option<f64> {
+    if !in_allowed_range(mean_meth) {
+        return None;
+    }
+    let denom = k_max * shannon_binary(mean_meth);
+    if denom > 0.0 {
+        Some(i_total / denom)
+    } else {
+        None
+    }
+}
+
+/// Across-cell normalized score: jsd / (2 * H(mean_meth_mean)).
+/// None when mean_meth_mean is outside the allowed range or the denominator is
+/// not positive.
+pub fn jsd_norm(jsd: f64, mean_meth_mean: f64) -> Option<f64> {
+    if !in_allowed_range(mean_meth_mean) {
+        return None;
+    }
+    let denom = 2.0 * shannon_binary(mean_meth_mean);
+    if denom > 0.0 { Some(jsd / denom) } else { None }
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    #[test]
+    fn shannon_binary_half_is_one_bit() {
+        assert!((shannon_binary(0.5) - 1.0).abs() < 1e-12);
+    }
+
+    #[test]
+    fn shannon_binary_extremes_are_zero() {
+        assert_eq!(shannon_binary(0.0), 0.0);
+        assert_eq!(shannon_binary(1.0), 0.0);
+        assert_eq!(shannon_binary(-0.1), 0.0);
+        assert_eq!(shannon_binary(1.5), 0.0);
+    }
+
+    #[test]
+    fn shannon_binary_at_range_edge() {
+        // H(0.1) = H(0.9) ~ 0.468996 bits.
+        assert!((shannon_binary(0.1) - 0.4689956).abs() < 1e-6);
+        assert!((shannon_binary(0.9) - 0.4689956).abs() < 1e-6);
+    }
+
+    #[test]
+    fn i_norm_in_band_divides_by_k_times_entropy() {
+        // mean_meth 0.5 -> H = 1 bit, k_max 3 -> denominator 3.
+        assert!((i_norm(3.0, 3.0, 0.5).unwrap() - 1.0).abs() < 1e-12);
+    }
+
+    #[test]
+    fn i_norm_allowed_range_is_lo_inclusive_hi_exclusive() {
+        assert!(i_norm(1.0, 3.0, 0.1).is_some());
+        assert!(i_norm(1.0, 3.0, 0.9).is_none());
+    }
+
+    #[test]
+    fn i_norm_outside_allowed_range_is_none() {
+        assert!(i_norm(1.0, 3.0, 0.05).is_none());
+        assert!(i_norm(1.0, 3.0, 0.95).is_none());
+        assert!(i_norm(1.0, 3.0, 0.0).is_none());
+        assert!(i_norm(1.0, 3.0, 1.0).is_none());
+    }
+
+    #[test]
+    fn i_norm_zero_k_max_is_none() {
+        assert!(i_norm(1.0, 0.0, 0.5).is_none());
+    }
+
+    #[test]
+    fn jsd_norm_in_band_divides_by_two_entropy() {
+        // mean_meth 0.5 -> H = 1 bit -> denominator 2.
+        assert!((jsd_norm(1.0, 0.5).unwrap() - 0.5).abs() < 1e-12);
+    }
+
+    #[test]
+    fn jsd_norm_outside_allowed_range_is_none() {
+        assert!(jsd_norm(1.0, 0.05).is_none());
+        assert!(jsd_norm(1.0, 0.9).is_none());
+        assert!(jsd_norm(1.0, 0.95).is_none());
+    }
+}