Figure updates by imallona · Pull Request #6 · imallona/amet

imallona · 2026-05-15T15:05:35Z

No description provided.

Borrowed from the yamet pipeline and wired into amet's simulation report. simulations_01_sim_data.Rmd (Emanuel Sonder's simulator, logic unchanged) generates single-cell methylation across a grid of CpG count, coverage model and transition matrix. The lowReal coverage regime draws real missingness from the argelaguet gastrulation cpg_level cells, so generate_emanuel_sim_data chains on the argelaguet data exactly as yamet's sim_simulate_data did.

Co-authored-by: Izaskun Mallona <izaskun.mallona@gmail.com>

Copilot

Pull request overview

This PR updates the scoring/figure pipeline to treat methylation-normalized scores (i_norm, jsd_norm) as first-class outputs from the Rust binary, then updates the Snakemake workflow + R eval/reporting scripts to consume those normalized columns directly (and to emit paired “unadjusted vs adjusted” figures in the simulations report). It also adds an HDF5 pivot for large windows runs to avoid loading the full long table into memory.

Changes:

Add i_norm (cell_feature) and jsd_norm (feature) to the amet binary outputs and update snapshots/tests/docs accordingly.
Update simulation eval scripts + simulations_report.Rmd to produce/plot adjusted + unadjusted variants (including scatter comparisons for wcVI/acVI recovery).
Add an HDF5-based windows store builder/loader and wire it into the Ecker windows/embeddings reports.
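
For orientation, a minimal sketch of the `i_norm` normalization, assuming it matches the in-R computation this PR removes (`i_total / (k_max * H(mean_meth))`, with `H` the binary Shannon entropy in bits) and the `[0.1, 0.9)` methylation band mentioned in the review comments. The Rust code in `method/src/scores/normalize.rs` is not shown on this page, so the function names and default band below are assumptions, and `jsd_norm` is left out because its formula does not appear here.

```python
import math


def shannon_binary(p):
    """Binary Shannon entropy in bits; None when p is missing or not in (0, 1)."""
    if p is None or not (0.0 < p < 1.0):
        return None
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def i_norm(i_total, mean_meth, k_max, lo=0.1, hi=0.9):
    """Return i_total / (k_max * H(mean_meth)), or None (NA) outside [lo, hi).

    The band [lo, hi) is an assumption taken from the review comments.
    """
    if i_total is None or mean_meth is None:
        return None
    if not (lo <= mean_meth < hi):
        return None
    # Inside the band the entropy is strictly positive, so the division is safe.
    return i_total / (k_max * shannon_binary(mean_meth))
```

With `mean_meth = 0.5` the entropy is exactly 1 bit, so `i_norm(2.0, 0.5, 4)` reduces to `2.0 / 4`.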

Reviewed changes

Copilot reviewed 47 out of 47 changed files in this pull request and generated 8 comments.


File	Description
workflow/Snakefile	Updates simulation eval targets and several eval rule outputs for adjusted/unadjusted artifacts.
workflow/scripts/windows_h5.R	Adds helper to load HDF5-backed windows×cells matrices (DelayedArray/HDF5Array).
workflow/scripts/render_logging.R	Removes old normalization helpers; keeps render logging/thread helpers.
workflow/scripts/palettes.R	Extends CRC patient palette/shape mappings for additional patients.
workflow/scripts/eval_wcvi_recovery.R	Emits adjusted/unadjusted wcVI recovery figures + scatter; reads `i_norm/jsd_norm` directly.
workflow/scripts/eval_vs_marginal_baseline.R	Splits baseline comparison into unadjusted vs adjusted outputs.
workflow/scripts/eval_tool_comparison.R	Expands amet score set to include raw + normalized scores and updates palette mapping.
workflow/scripts/eval_sparsity.R	Emits unadjusted/adjusted sparsity figures; stops recomputing `i_norm` in-script.
workflow/scripts/eval_p_decoupling.R	Emits unadjusted/adjusted decoupling figures; stops recomputing `i_norm` in-script.
workflow/scripts/eval_n_cells.R	Emits unadjusted/adjusted JSD-vs-n figures.
workflow/scripts/eval_jsd_mixture_k.R	Emits unadjusted/adjusted JSD-vs-K figures.
workflow/scripts/eval_jsd_divergence.R	Emits unadjusted/adjusted JSD-vs-divergence figures.
workflow/scripts/eval_feature_variability.R	Switches to amet-emitted `i_norm/jsd_norm` and includes `i_total` in comparisons.
workflow/scripts/eval_feature_length.R	Emits unadjusted/adjusted feature-length figures; stops recomputing `i_norm`.
workflow/scripts/eval_consensus_perturbation.R	Switches to amet-emitted normalized columns and expands score sets.
workflow/scripts/eval_benchmark_summary.R	Updates simulator figure styling and removes redundant in-R normalization.
workflow/scripts/eval_acvi_recovery.R	Emits adjusted/unadjusted acVI recovery figures + scatter; reads `i_norm/jsd_norm` directly.
workflow/scripts/driver_utils.R	Updates driver categorization to use normalized group medians (`median_i_norm`, `median_jsd_norm`).
workflow/scripts/build_windows_h5.R	New: streams windows long table into an HDF5 store (windows×cells matrices).
workflow/rules/ecker.smk	Adds rule to build windows HDF5 store and wires it into Ecker windows/embeddings renders.
workflow/rules/crc.smk	Fixes Rmd parameter interpolation for `windows_annotation` (brace escaping issue).
workflow/rules/common.smk	Adds `WINDOWS_H5_R` path constant for rule inputs.
workflow/rules/argelaguet.smk	Renames output artifacts to `_i_norm` / `_jsd_norm`-based filenames.
workflow/Rmd/simulations_report.Rmd	Converts several sections to tabsets and plots adjusted/unadjusted variants explicitly.
workflow/Rmd/fig_ecker.Rmd	Switches figure panels and labels from `i_total/jsd` to `i_norm/jsd_norm`.
workflow/Rmd/fig_crc.Rmd	Switches figure panels and labels from `i_total/jsd` to `i_norm/jsd_norm`.
workflow/Rmd/fig_argelaguet.Rmd	Switches figure panels and labels from `i_total/jsd` to `i_norm/jsd_norm`.
workflow/Rmd/ecker.Rmd	Switches entropy assembly, heatmaps, and embeddings from raw to normalized scores; adds NA-safe clustering.
workflow/Rmd/ecker_windows.Rmd	Loads windows data from HDF5 (realized matrices) and pivots downstream summaries to `i_norm/jsd_norm`.
workflow/Rmd/ecker_embeddings.Rmd	Loads windows matrices from HDF5 and embeds using `i_norm` (plus methylation).
workflow/Rmd/crc_windows.Rmd	Uses `i_norm/jsd_norm` as headline; keeps raw `i_total` only for residualization chain.
workflow/Rmd/crc_embeddings.Rmd	Embeds/plots using `i_norm` as primary; updates labels and summaries accordingly.
workflow/Rmd/argelaguet.Rmd	Switches report assembly/heatmaps/embeddings to `i_norm/jsd_norm`.
workflow/Rmd/argelaguet_windows.Rmd	Switches windows QC and annotation summaries to `i_norm`.
workflow/Rmd/argelaguet_embeddings.Rmd	Switches window embeddings/diagnostics to `i_norm`.
workflow/envs/r-tools.yml	Adds Bioconductor `rhdf5` and `HDF5Array` dependencies.
TODO.md	Updates canonicalization note: `i_norm` from binary as general score, `i_total_resid` only for differential testing.
README.md	Documents the four scores and the `i_norm/jsd_norm` formulas + NA range behavior.
method/tests/snapshot/golden/feature.tsv	Updates golden snapshot to include `jsd_norm`.
method/tests/snapshot/golden/cell_feature.tsv	Updates golden snapshot to include `i_norm`.
method/tests/integration.rs	Adds integration test asserting `i_norm/jsd_norm` are NA outside allowed methylation range.
method/src/scores/normalize.rs	New: implements `i_norm/jsd_norm` normalization and allowed methylation range logic.
method/src/scores/mod.rs	Exposes the new `normalize` module.
method/src/main.rs	Adds `i_norm/jsd_norm` to TSV headers and writes normalized columns per row/group.
Makefile	Raises per-process ulimit and adds `--rerun-triggers mtime` to prevent expensive cascaded reruns.
CHANGELOG.md	Adds Unreleased notes about normalized score columns + workflow change to consume them directly.

Comments suppressed due to low confidence (6)

workflow/Snakefile:58

EVAL_OUTPUTS is used to trigger eval rules for render_simulations_report, but simulations_report.Rmd renders plots from the .svg files. Some eval rules (e.g. eval_simulator_diagnostics and eval_lag_profile) declare only a .pdf in their `output:`, even though their scripts also write .svg/.csv via save_eval(). If the .svg is missing (or cleaned), Snakemake may still consider the job up to date, and the report will silently omit the plot. Consider declaring these outputs with multiext(..., ".pdf", ".svg", ".csv") (or changing the report to use PDFs).

EVAL_OUTPUTS = [
    op.join(SIM, "eval", "p_decoupling_unadjusted.pdf"),
    op.join(SIM, "eval", "p_decoupling_adjusted.pdf"),
    op.join(SIM, "eval", "simulator_diagnostics.pdf"),
    op.join(SIM, "eval", "lag_profile.pdf"),
    op.join(SIM, "eval", "sparsity_unadjusted.pdf"),
    op.join(SIM, "eval", "sparsity_adjusted.pdf"),
    op.join(SIM, "eval", "feature_length_unadjusted.pdf"),
    op.join(SIM, "eval", "feature_length_adjusted.pdf"),
    op.join(SIM, "eval", "jsd_mixture_k_unadjusted.pdf"),
    op.join(SIM, "eval", "jsd_mixture_k_adjusted.pdf"),
    op.join(SIM, "eval", "jsd_divergence_unadjusted.pdf"),
    op.join(SIM, "eval", "jsd_divergence_adjusted.pdf"),
    op.join(SIM, "eval", "n_cells_unadjusted.pdf"),
    op.join(SIM, "eval", "n_cells_adjusted.pdf"),
    op.join(SIM, "eval", "vs_marginal_baseline_unadjusted.pdf"),
    op.join(SIM, "eval", "vs_marginal_baseline_adjusted.pdf"),
    op.join(SIM, "eval", "wcvi_recovery_unadjusted.pdf"),
    op.join(SIM, "eval", "wcvi_recovery_adjusted.pdf"),
    op.join(SIM, "eval", "wcvi_recovery_scatter.pdf"),
    op.join(SIM, "eval", "acvi_recovery_unadjusted.pdf"),
    op.join(SIM, "eval", "acvi_recovery_adjusted.pdf"),
    op.join(SIM, "eval", "acvi_recovery_scatter.pdf"),
    op.join(SIM, "eval", "benchmark_summary.pdf"),
    op.join(SIM, "eval", "tool_comparison.pdf"),
    op.join(SIM, "eval", "consensus_perturbation.pdf"),
    op.join(SIM, "eval", "feature_variability.pdf"),
]
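
A hedged sketch of the suggestion above, written in plain Python so it runs outside Snakemake: every eval stem is expanded into its `.pdf`, `.svg` and `.csv` siblings, so a missing `.svg` would make the target incomplete. The `with_exts` helper is a hypothetical stand-in for Snakemake's `multiext`, and the `SIM` value is a placeholder, not the project's real path.

```python
import os.path as op

SIM = "results/simulations"  # placeholder; the real value lives in the workflow config


def with_exts(stem, *exts):
    """Hypothetical stand-in for multiext: one path per extension."""
    return [stem + ext for ext in exts]


# Only the two rules named in the comment are listed here, for brevity.
EVAL_STEMS = ["simulator_diagnostics", "lag_profile"]

EVAL_OUTPUTS = [
    path
    for stem in EVAL_STEMS
    for path in with_exts(op.join(SIM, "eval", stem), ".pdf", ".svg", ".csv")
]
```

The same extension list would also need to appear in each rule's `output:` for Snakemake to track the extra files.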

workflow/scripts/eval_p_decoupling.R:45

This aggregation filters on is.finite(i_norm) as well as i_total. That means the unadjusted i_total plot will drop cells/windows outside the allowed methylation band ([0.1, 0.9)), which may unintentionally remove the methylation-extreme conditions from the unadjusted diagnostic. Consider filtering i_total and i_norm separately per variant (or clearly documenting the shared in-band restriction).

workflow/scripts/eval_sparsity.R:20

shannon_binary() is now unused in this script after moving normalization into the amet binary. Consider removing it to avoid implying the script recomputes i_norm.

This issue also appears on line 27 of the same file.

workflow/scripts/eval_sparsity.R:33

Filtering on both i_total and i_norm will exclude observations where i_total is defined but i_norm is NA (outside [0.1, 0.9)). That makes the unadjusted (i_total) robustness plot omit the methylation extremes. Consider computing the unadjusted and adjusted aggregations from different filtered subsets.

workflow/scripts/eval_feature_length.R:19

shannon_binary() is now unused in this script after switching to amet-emitted i_norm. Consider removing it to keep the script focused on plotting/aggregation.

This issue also appears on line 27 of the same file.

workflow/scripts/eval_feature_length.R:32

This aggregation filters on is.finite(i_norm) in addition to i_total, so the unadjusted (i_total) variant drops methylation-extreme rows where i_norm is NA. If the intent is to show unadjusted behavior across the full marginal range, consider filtering separately for the unadjusted vs adjusted plots.
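
The shared-filter concern raised for eval_p_decoupling.R, eval_sparsity.R and eval_feature_length.R can be illustrated with a small sketch. It is not the scripts' code (those are R); the rows and the `subset_finite` helper are made up for illustration, with `None` standing in for NA.

```python
import math

# Toy rows: "b" has a defined i_total but no i_norm (outside the methylation band).
rows = [
    {"cell_id": "a", "i_total": 1.2, "i_norm": 0.6},
    {"cell_id": "b", "i_total": 0.3, "i_norm": None},
    {"cell_id": "c", "i_total": None, "i_norm": None},
]


def is_finite(x):
    """True for real, finite numbers; None plays the role of NA."""
    return x is not None and math.isfinite(x)


def subset_finite(rows, column):
    """Keep rows whose value in `column` is finite (hypothetical helper)."""
    return [r for r in rows if is_finite(r[column])]


# Per-variant filtering: the unadjusted plot keeps the methylation extremes.
unadjusted = subset_finite(rows, "i_total")
adjusted = subset_finite(rows, "i_norm")

# Shared filtering, as in the diff: the unadjusted variant silently loses "b".
shared = [r for r in rows if is_finite(r["i_total"]) and is_finite(r["i_norm"])]
```

Here the per-variant subsets differ by exactly the row that the shared filter drops, which is the effect the comments describe.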


 shannon_binary <- function(p) {
    out <- numeric(length(p)); safe <- !is.na(p) & p > 0 & p < 1
    out[safe] <- -p[safe] * log2(p[safe]) - (1 - p[safe]) * log2(1 - p[safe])
    out[!safe] <- NA_real_; out
 }


 df <- read.table(gzfile(opt$cell_feature), header = TRUE, sep = "\t",
                 na.strings = "NA", stringsAsFactors = FALSE)
 df$shannon_marginal <- shannon_h(df$mean_meth)
-i_cols <- grep("^i_[0-9]+$", names(df), value = TRUE); k_max <- length(i_cols)
-df$i_norm <- df$i_total / (k_max * shannon_h(df$mean_meth))
-df <- df %>% filter(is.finite(i_norm))
+df <- df %>% filter(is.finite(i_total), is.finite(i_norm))
 df$structure <- ifelse(grepl("^iid_", df$cell_id), "iid", "structured")


+shannon_binary <- function(p) {
+    out <- numeric(length(p)); safe <- !is.na(p) & p > 0 & p < 1
+    out[safe] <- -p[safe] * log2(p[safe]) - (1 - p[safe]) * log2(1 - p[safe])
+    out[!safe] <- NA_real_; out
+}




+    windows x cells i_total and meth matrices. Streams the input cell block
+    by cell block so peak memory stays near one cell."""


+if (file.exists(opt$output)) file.remove(opt$output)
+h5createFile(opt$output)
+
+con <- pipe(sprintf("zcat %s", shQuote(opt$cell_feature)), open = "rt")


+  for (cid in unique(complete$cell_id))
+    flush_cell(complete[cell_id == cid])
+}
+if (!is.null(pending) && nrow(pending) > 0L)
+  for (cid in unique(pending$cell_id))
+    flush_cell(pending[cell_id == cid])
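
The idea behind these hunks, streaming the long table one cell at a time so that peak memory stays near a single cell, can be sketched as follows. This is an illustration under assumptions, not a port of build_windows_h5.R: it assumes the rows of each cell are contiguous in the input, the column names other than `cell_id` are invented, and `flush_cell` is a caller-supplied callback rather than the script's function.

```python
import csv
import io
from itertools import groupby


def stream_cells(handle, flush_cell):
    """Read a tab-separated long table and hand over one cell block at a time.

    Assumes all rows of a cell are adjacent, so only one block is held in memory.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    for cell_id, block in groupby(reader, key=lambda row: row["cell_id"]):
        flush_cell(cell_id, list(block))


# Usage on an in-memory table; a real run would pass a decompressed file handle.
demo = io.StringIO(
    "cell_id\tfeature\ti_total\n"
    "c1\tw1\t0.5\n"
    "c1\tw2\t0.7\n"
    "c2\tw1\t0.1\n"
)
seen = []
stream_cells(demo, lambda cid, block: seen.append((cid, len(block))))
```

In the real script the callback would write one column of the windows×cells matrices to the HDF5 store; here it only records block sizes.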



imallona and others added 11 commits May 15, 2026 17:04

Fix CRC figures, including all patients

11d7bd3

Patch Ecker

ef543e7

Switch to hd5 for Ecker / not downsampled, windows

2a132ca

fix indexing

c9cb27c

Fix windows H5

99dd9da

Process Ecker's count table in memory, fix minor in CRC

089136c

Plot adj vs unadj scores, again, on simulations

d3ccc7c

Update simulation plots

2cb30c3

Add unadjusted to benchmark

5fe8162

Fully switch to norm/adj scores; Rust calculated

03a3bbb

imallona requested a review from Copilot May 18, 2026 17:43

Copilot started reviewing on behalf of imallona May 18, 2026 17:44

Copilot AI reviewed May 18, 2026


imallona and others added 4 commits May 19, 2026 09:06

Demote i_total_resid, make diff entropic windows crc01-only

341dd94

Merge pull request #7 from imallona/emanuel_simuls

59cd0ad

Port Emanuel's simulations from `yamet`

Added additional transition matrices to parameter grid (different lm…

da9d967

…r & hmr states)

Potential fix for pull request finding

40864c0

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>


Figure updates #6

imallona wants to merge 15 commits into main from dev

3 participants
