Conversation
Borrowed from the yamet pipeline and wired into amet's simulation report. simulations_01_sim_data.Rmd (Emanuel Sonder's simulator, logic unchanged) generates single-cell methylation across a grid of CpG count, coverage model and transition matrix. The lowReal coverage regime draws real missingness from the argelaguet gastrulation cpg_level cells, so generate_emanuel_sim_data chains on the argelaguet data exactly as yamet's sim_simulate_data did. Co-authored-by: Izaskun Mallona <izaskun.mallona@gmail.com>
There was a problem hiding this comment.
Pull request overview
This PR updates the scoring/figure pipeline to treat methylation-normalized scores (i_norm, jsd_norm) as first-class outputs from the Rust binary, then updates the Snakemake workflow + R eval/reporting scripts to consume those normalized columns directly (and to emit paired “unadjusted vs adjusted” figures in the simulations report). It also adds an HDF5 pivot for large windows runs to avoid loading the full long table into memory.
Changes:
- Add
i_norm(cell_feature) andjsd_norm(feature) to theametbinary outputs and update snapshots/tests/docs accordingly. - Update simulation eval scripts +
simulations_report.Rmdto produce/plot adjusted + unadjusted variants (including scatter comparisons for wcVI/acVI recovery). - Add an HDF5-based windows store builder/loader and wire it into the Ecker windows/embeddings reports.
Reviewed changes
Copilot reviewed 47 out of 47 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| workflow/Snakefile | Updates simulation eval targets and several eval rule outputs for adjusted/unadjusted artifacts. |
| workflow/scripts/windows_h5.R | Adds helper to load HDF5-backed windows×cells matrices (DelayedArray/HDF5Array). |
| workflow/scripts/render_logging.R | Removes old normalization helpers; keeps render logging/thread helpers. |
| workflow/scripts/palettes.R | Extends CRC patient palette/shape mappings for additional patients. |
| workflow/scripts/eval_wcvi_recovery.R | Emits adjusted/unadjusted wcVI recovery figures + scatter; reads i_norm/jsd_norm directly. |
| workflow/scripts/eval_vs_marginal_baseline.R | Splits baseline comparison into unadjusted vs adjusted outputs. |
| workflow/scripts/eval_tool_comparison.R | Expands amet score set to include raw + normalized scores and updates palette mapping. |
| workflow/scripts/eval_sparsity.R | Emits unadjusted/adjusted sparsity figures; stops recomputing i_norm in-script. |
| workflow/scripts/eval_p_decoupling.R | Emits unadjusted/adjusted decoupling figures; stops recomputing i_norm in-script. |
| workflow/scripts/eval_n_cells.R | Emits unadjusted/adjusted JSD-vs-n figures. |
| workflow/scripts/eval_jsd_mixture_k.R | Emits unadjusted/adjusted JSD-vs-K figures. |
| workflow/scripts/eval_jsd_divergence.R | Emits unadjusted/adjusted JSD-vs-divergence figures. |
| workflow/scripts/eval_feature_variability.R | Switches to amet-emitted i_norm/jsd_norm and includes i_total in comparisons. |
| workflow/scripts/eval_feature_length.R | Emits unadjusted/adjusted feature-length figures; stops recomputing i_norm. |
| workflow/scripts/eval_consensus_perturbation.R | Switches to amet-emitted normalized columns and expands score sets. |
| workflow/scripts/eval_benchmark_summary.R | Updates simulator figure styling and removes redundant in-R normalization. |
| workflow/scripts/eval_acvi_recovery.R | Emits adjusted/unadjusted acVI recovery figures + scatter; reads i_norm/jsd_norm directly. |
| workflow/scripts/driver_utils.R | Updates driver categorization to use normalized group medians (median_i_norm, median_jsd_norm). |
| workflow/scripts/build_windows_h5.R | New: streams windows long table into an HDF5 store (windows×cells matrices). |
| workflow/rules/ecker.smk | Adds rule to build windows HDF5 store and wires it into Ecker windows/embeddings renders. |
| workflow/rules/crc.smk | Fixes Rmd parameter interpolation for windows_annotation (brace escaping issue). |
| workflow/rules/common.smk | Adds WINDOWS_H5_R path constant for rule inputs. |
| workflow/rules/argelaguet.smk | Renames output artifacts to *_i_norm / *_jsd_norm-based filenames. |
| workflow/Rmd/simulations_report.Rmd | Converts several sections to tabsets and plots adjusted/unadjusted variants explicitly. |
| workflow/Rmd/fig_ecker.Rmd | Switches figure panels and labels from i_total/jsd to i_norm/jsd_norm. |
| workflow/Rmd/fig_crc.Rmd | Switches figure panels and labels from i_total/jsd to i_norm/jsd_norm. |
| workflow/Rmd/fig_argelaguet.Rmd | Switches figure panels and labels from i_total/jsd to i_norm/jsd_norm. |
| workflow/Rmd/ecker.Rmd | Switches entropy assembly, heatmaps, and embeddings from raw to normalized scores; adds NA-safe clustering. |
| workflow/Rmd/ecker_windows.Rmd | Loads windows data from HDF5 (realized matrices) and pivots downstream summaries to i_norm/jsd_norm. |
| workflow/Rmd/ecker_embeddings.Rmd | Loads windows matrices from HDF5 and embeds using i_norm (plus methylation). |
| workflow/Rmd/crc_windows.Rmd | Uses i_norm/jsd_norm as headline; keeps raw i_total only for residualization chain. |
| workflow/Rmd/crc_embeddings.Rmd | Embeds/plots using i_norm as primary; updates labels and summaries accordingly. |
| workflow/Rmd/argelaguet.Rmd | Switches report assembly/heatmaps/embeddings to i_norm/jsd_norm. |
| workflow/Rmd/argelaguet_windows.Rmd | Switches windows QC and annotation summaries to i_norm. |
| workflow/Rmd/argelaguet_embeddings.Rmd | Switches window embeddings/diagnostics to i_norm. |
| workflow/envs/r-tools.yml | Adds Bioconductor rhdf5 and HDF5Array dependencies. |
| TODO.md | Updates canonicalization note: i_norm from binary as general score, i_total_resid only for differential testing. |
| README.md | Documents the four scores and the i_norm/jsd_norm formulas + NA range behavior. |
| method/tests/snapshot/golden/feature.tsv | Updates golden snapshot to include jsd_norm. |
| method/tests/snapshot/golden/cell_feature.tsv | Updates golden snapshot to include i_norm. |
| method/tests/integration.rs | Adds integration test asserting i_norm/jsd_norm are NA outside allowed methylation range. |
| method/src/scores/normalize.rs | New: implements i_norm/jsd_norm normalization and allowed methylation range logic. |
| method/src/scores/mod.rs | Exposes the new normalize module. |
| method/src/main.rs | Adds i_norm/jsd_norm to TSV headers and writes normalized columns per row/group. |
| Makefile | Raises per-process ulimit and adds --rerun-triggers mtime to prevent expensive cascaded reruns. |
| CHANGELOG.md | Adds Unreleased notes about normalized score columns + workflow change to consume them directly. |
Comments suppressed due to low confidence (6)
workflow/Snakefile:58
EVAL_OUTPUTSis used to trigger eval rules forrender_simulations_report, butsimulations_report.Rmdrenders plots from the.svgfiles. Some eval rules (e.g.eval_simulator_diagnosticsandeval_lag_profile) only declare a.pdfin theiroutput:even though their scripts write.svg/.csvviasave_eval(). If the.svgis missing (or cleaned) Snakemake may consider the job up-to-date and the report will silently omit the plot. Consider declaring these outputs withmultiext(..., ".pdf", ".svg", ".csv")(or changing the report to use PDFs).
EVAL_OUTPUTS = [
op.join(SIM, "eval", "p_decoupling_unadjusted.pdf"),
op.join(SIM, "eval", "p_decoupling_adjusted.pdf"),
op.join(SIM, "eval", "simulator_diagnostics.pdf"),
op.join(SIM, "eval", "lag_profile.pdf"),
op.join(SIM, "eval", "sparsity_unadjusted.pdf"),
op.join(SIM, "eval", "sparsity_adjusted.pdf"),
op.join(SIM, "eval", "feature_length_unadjusted.pdf"),
op.join(SIM, "eval", "feature_length_adjusted.pdf"),
op.join(SIM, "eval", "jsd_mixture_k_unadjusted.pdf"),
op.join(SIM, "eval", "jsd_mixture_k_adjusted.pdf"),
op.join(SIM, "eval", "jsd_divergence_unadjusted.pdf"),
op.join(SIM, "eval", "jsd_divergence_adjusted.pdf"),
op.join(SIM, "eval", "n_cells_unadjusted.pdf"),
op.join(SIM, "eval", "n_cells_adjusted.pdf"),
op.join(SIM, "eval", "vs_marginal_baseline_unadjusted.pdf"),
op.join(SIM, "eval", "vs_marginal_baseline_adjusted.pdf"),
op.join(SIM, "eval", "wcvi_recovery_unadjusted.pdf"),
op.join(SIM, "eval", "wcvi_recovery_adjusted.pdf"),
op.join(SIM, "eval", "wcvi_recovery_scatter.pdf"),
op.join(SIM, "eval", "acvi_recovery_unadjusted.pdf"),
op.join(SIM, "eval", "acvi_recovery_adjusted.pdf"),
op.join(SIM, "eval", "acvi_recovery_scatter.pdf"),
op.join(SIM, "eval", "benchmark_summary.pdf"),
op.join(SIM, "eval", "tool_comparison.pdf"),
op.join(SIM, "eval", "consensus_perturbation.pdf"),
op.join(SIM, "eval", "feature_variability.pdf"),
]
workflow/scripts/eval_p_decoupling.R:45
- This aggregation filters on
is.finite(i_norm)as well asi_total. That means the unadjustedi_totalplot will drop cells/windows outside the allowed methylation band ([0.1, 0.9)), which may unintentionally remove the methylation-extreme conditions from the unadjusted diagnostic. Consider filteringi_totalandi_normseparately per variant (or clearly documenting the shared in-band restriction).
workflow/scripts/eval_sparsity.R:20 shannon_binary()is now unused in this script after moving normalization into the amet binary. Consider removing it to avoid implying the script recomputesi_norm.
This issue also appears on line 27 of the same file.
workflow/scripts/eval_sparsity.R:33
- Filtering on both
i_totalandi_normwill exclude observations wherei_totalis defined buti_normis NA (outside [0.1, 0.9)). That makes the unadjusted (i_total) robustness plot omit the methylation extremes. Consider computing the unadjusted and adjusted aggregations from different filtered subsets.
workflow/scripts/eval_feature_length.R:19 shannon_binary()is now unused in this script after switching to amet-emittedi_norm. Consider removing it to keep the script focused on plotting/aggregation.
This issue also appears on line 27 of the same file.
workflow/scripts/eval_feature_length.R:32
- This aggregation filters on
is.finite(i_norm)in addition toi_total, so the unadjusted (i_total) variant drops methylation-extreme rows wherei_normis NA. If the intent is to show unadjusted behavior across the full marginal range, consider filtering separately for the unadjusted vs adjusted plots.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| shannon_binary <- function(p) { | ||
| out <- numeric(length(p)); safe <- !is.na(p) & p > 0 & p < 1 | ||
| out[safe] <- -p[safe] * log2(p[safe]) - (1 - p[safe]) * log2(1 - p[safe]) | ||
| out[!safe] <- NA_real_; out | ||
| } |
| df <- read.table(gzfile(opt$cell_feature), header = TRUE, sep = "\t", | ||
| na.strings = "NA", stringsAsFactors = FALSE) | ||
| df$shannon_marginal <- shannon_h(df$mean_meth) | ||
| i_cols <- grep("^i_[0-9]+$", names(df), value = TRUE); k_max <- length(i_cols) | ||
| df$i_norm <- df$i_total / (k_max * shannon_h(df$mean_meth)) | ||
| df <- df %>% filter(is.finite(i_norm)) | ||
| df <- df %>% filter(is.finite(i_total), is.finite(i_norm)) | ||
| df$structure <- ifelse(grepl("^iid_", df$cell_id), "iid", "structured") |
| shannon_binary <- function(p) { | ||
| out <- numeric(length(p)); safe <- !is.na(p) & p > 0 & p < 1 | ||
| out[safe] <- -p[safe] * log2(p[safe]) - (1 - p[safe]) * log2(1 - p[safe]) | ||
| out[!safe] <- NA_real_; out | ||
| } |
| shannon_binary <- function(p) { | ||
| out <- numeric(length(p)); safe <- !is.na(p) & p > 0 & p < 1 | ||
| out[safe] <- -p[safe] * log2(p[safe]) - (1 - p[safe]) * log2(1 - p[safe]) | ||
| out[!safe] <- NA_real_; out | ||
| } |
| shannon_binary <- function(p) { | ||
| out <- numeric(length(p)); safe <- !is.na(p) & p > 0 & p < 1 | ||
| out[safe] <- -p[safe] * log2(p[safe]) - (1 - p[safe]) * log2(1 - p[safe]) | ||
| out[!safe] <- NA_real_; out | ||
| } |
| windows x cells i_total and meth matrices. Streams the input cell block | ||
| by cell block so peak memory stays near one cell.""" |
| if (file.exists(opt$output)) file.remove(opt$output) | ||
| h5createFile(opt$output) | ||
|
|
||
| con <- pipe(sprintf("zcat %s", shQuote(opt$cell_feature)), open = "rt") |
| for (cid in unique(complete$cell_id)) | ||
| flush_cell(complete[cell_id == cid]) | ||
| } | ||
| if (!is.null(pending) && nrow(pending) > 0L) | ||
| for (cid in unique(pending$cell_id)) | ||
| flush_cell(pending[cell_id == cid]) |
Port Emanuel's simulations from `yamet`
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
No description provided.