Skip to content

Figure updates#6

Open
imallona wants to merge 15 commits into
mainfrom
dev
Open

Figure updates#6
imallona wants to merge 15 commits into
mainfrom
dev

Conversation

@imallona
Copy link
Copy Markdown
Owner

No description provided.

imallona and others added 11 commits May 15, 2026 17:04
Borrowed from the yamet pipeline and wired into amet's simulation
report. simulations_01_sim_data.Rmd (Emanuel Sonder's simulator, logic
unchanged) generates single-cell methylation across a grid of CpG
count, coverage model and transition matrix. The lowReal coverage
regime draws real missingness from the argelaguet gastrulation
cpg_level cells, so generate_emanuel_sim_data chains on the argelaguet
data exactly as yamet's sim_simulate_data did.

Co-authored-by: Izaskun Mallona <izaskun.mallona@gmail.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the scoring/figure pipeline to treat methylation-normalized scores (i_norm, jsd_norm) as first-class outputs from the Rust binary, then updates the Snakemake workflow + R eval/reporting scripts to consume those normalized columns directly (and to emit paired “unadjusted vs adjusted” figures in the simulations report). It also adds an HDF5 pivot for large windows runs to avoid loading the full long table into memory.

Changes:

  • Add i_norm (cell_feature) and jsd_norm (feature) to the amet binary outputs and update snapshots/tests/docs accordingly.
  • Update simulation eval scripts + simulations_report.Rmd to produce/plot adjusted + unadjusted variants (including scatter comparisons for wcVI/acVI recovery).
  • Add an HDF5-based windows store builder/loader and wire it into the Ecker windows/embeddings reports.

Reviewed changes

Copilot reviewed 47 out of 47 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
workflow/Snakefile Updates simulation eval targets and several eval rule outputs for adjusted/unadjusted artifacts.
workflow/scripts/windows_h5.R Adds helper to load HDF5-backed windows×cells matrices (DelayedArray/HDF5Array).
workflow/scripts/render_logging.R Removes old normalization helpers; keeps render logging/thread helpers.
workflow/scripts/palettes.R Extends CRC patient palette/shape mappings for additional patients.
workflow/scripts/eval_wcvi_recovery.R Emits adjusted/unadjusted wcVI recovery figures + scatter; reads i_norm/jsd_norm directly.
workflow/scripts/eval_vs_marginal_baseline.R Splits baseline comparison into unadjusted vs adjusted outputs.
workflow/scripts/eval_tool_comparison.R Expands amet score set to include raw + normalized scores and updates palette mapping.
workflow/scripts/eval_sparsity.R Emits unadjusted/adjusted sparsity figures; stops recomputing i_norm in-script.
workflow/scripts/eval_p_decoupling.R Emits unadjusted/adjusted decoupling figures; stops recomputing i_norm in-script.
workflow/scripts/eval_n_cells.R Emits unadjusted/adjusted JSD-vs-n figures.
workflow/scripts/eval_jsd_mixture_k.R Emits unadjusted/adjusted JSD-vs-K figures.
workflow/scripts/eval_jsd_divergence.R Emits unadjusted/adjusted JSD-vs-divergence figures.
workflow/scripts/eval_feature_variability.R Switches to amet-emitted i_norm/jsd_norm and includes i_total in comparisons.
workflow/scripts/eval_feature_length.R Emits unadjusted/adjusted feature-length figures; stops recomputing i_norm.
workflow/scripts/eval_consensus_perturbation.R Switches to amet-emitted normalized columns and expands score sets.
workflow/scripts/eval_benchmark_summary.R Updates simulator figure styling and removes redundant in-R normalization.
workflow/scripts/eval_acvi_recovery.R Emits adjusted/unadjusted acVI recovery figures + scatter; reads i_norm/jsd_norm directly.
workflow/scripts/driver_utils.R Updates driver categorization to use normalized group medians (median_i_norm, median_jsd_norm).
workflow/scripts/build_windows_h5.R New: streams windows long table into an HDF5 store (windows×cells matrices).
workflow/rules/ecker.smk Adds rule to build windows HDF5 store and wires it into Ecker windows/embeddings renders.
workflow/rules/crc.smk Fixes Rmd parameter interpolation for windows_annotation (brace escaping issue).
workflow/rules/common.smk Adds WINDOWS_H5_R path constant for rule inputs.
workflow/rules/argelaguet.smk Renames output artifacts to *_i_norm / *_jsd_norm-based filenames.
workflow/Rmd/simulations_report.Rmd Converts several sections to tabsets and plots adjusted/unadjusted variants explicitly.
workflow/Rmd/fig_ecker.Rmd Switches figure panels and labels from i_total/jsd to i_norm/jsd_norm.
workflow/Rmd/fig_crc.Rmd Switches figure panels and labels from i_total/jsd to i_norm/jsd_norm.
workflow/Rmd/fig_argelaguet.Rmd Switches figure panels and labels from i_total/jsd to i_norm/jsd_norm.
workflow/Rmd/ecker.Rmd Switches entropy assembly, heatmaps, and embeddings from raw to normalized scores; adds NA-safe clustering.
workflow/Rmd/ecker_windows.Rmd Loads windows data from HDF5 (realized matrices) and pivots downstream summaries to i_norm/jsd_norm.
workflow/Rmd/ecker_embeddings.Rmd Loads windows matrices from HDF5 and embeds using i_norm (plus methylation).
workflow/Rmd/crc_windows.Rmd Uses i_norm/jsd_norm as headline; keeps raw i_total only for residualization chain.
workflow/Rmd/crc_embeddings.Rmd Embeds/plots using i_norm as primary; updates labels and summaries accordingly.
workflow/Rmd/argelaguet.Rmd Switches report assembly/heatmaps/embeddings to i_norm/jsd_norm.
workflow/Rmd/argelaguet_windows.Rmd Switches windows QC and annotation summaries to i_norm.
workflow/Rmd/argelaguet_embeddings.Rmd Switches window embeddings/diagnostics to i_norm.
workflow/envs/r-tools.yml Adds Bioconductor rhdf5 and HDF5Array dependencies.
TODO.md Updates canonicalization note: i_norm from binary as general score, i_total_resid only for differential testing.
README.md Documents the four scores and the i_norm/jsd_norm formulas + NA range behavior.
method/tests/snapshot/golden/feature.tsv Updates golden snapshot to include jsd_norm.
method/tests/snapshot/golden/cell_feature.tsv Updates golden snapshot to include i_norm.
method/tests/integration.rs Adds integration test asserting i_norm/jsd_norm are NA outside allowed methylation range.
method/src/scores/normalize.rs New: implements i_norm/jsd_norm normalization and allowed methylation range logic.
method/src/scores/mod.rs Exposes the new normalize module.
method/src/main.rs Adds i_norm/jsd_norm to TSV headers and writes normalized columns per row/group.
Makefile Raises per-process ulimit and adds --rerun-triggers mtime to prevent expensive cascaded reruns.
CHANGELOG.md Adds Unreleased notes about normalized score columns + workflow change to consume them directly.
Comments suppressed due to low confidence (6)

workflow/Snakefile:58

  • EVAL_OUTPUTS is used to trigger eval rules for render_simulations_report, but simulations_report.Rmd renders plots from the .svg files. Some eval rules (e.g. eval_simulator_diagnostics and eval_lag_profile) only declare a .pdf in their output: even though their scripts write .svg/.csv via save_eval(). If the .svg is missing (or cleaned) Snakemake may consider the job up-to-date and the report will silently omit the plot. Consider declaring these outputs with multiext(..., ".pdf", ".svg", ".csv") (or changing the report to use PDFs).
EVAL_OUTPUTS = [
    op.join(SIM, "eval", "p_decoupling_unadjusted.pdf"),
    op.join(SIM, "eval", "p_decoupling_adjusted.pdf"),
    op.join(SIM, "eval", "simulator_diagnostics.pdf"),
    op.join(SIM, "eval", "lag_profile.pdf"),
    op.join(SIM, "eval", "sparsity_unadjusted.pdf"),
    op.join(SIM, "eval", "sparsity_adjusted.pdf"),
    op.join(SIM, "eval", "feature_length_unadjusted.pdf"),
    op.join(SIM, "eval", "feature_length_adjusted.pdf"),
    op.join(SIM, "eval", "jsd_mixture_k_unadjusted.pdf"),
    op.join(SIM, "eval", "jsd_mixture_k_adjusted.pdf"),
    op.join(SIM, "eval", "jsd_divergence_unadjusted.pdf"),
    op.join(SIM, "eval", "jsd_divergence_adjusted.pdf"),
    op.join(SIM, "eval", "n_cells_unadjusted.pdf"),
    op.join(SIM, "eval", "n_cells_adjusted.pdf"),
    op.join(SIM, "eval", "vs_marginal_baseline_unadjusted.pdf"),
    op.join(SIM, "eval", "vs_marginal_baseline_adjusted.pdf"),
    op.join(SIM, "eval", "wcvi_recovery_unadjusted.pdf"),
    op.join(SIM, "eval", "wcvi_recovery_adjusted.pdf"),
    op.join(SIM, "eval", "wcvi_recovery_scatter.pdf"),
    op.join(SIM, "eval", "acvi_recovery_unadjusted.pdf"),
    op.join(SIM, "eval", "acvi_recovery_adjusted.pdf"),
    op.join(SIM, "eval", "acvi_recovery_scatter.pdf"),
    op.join(SIM, "eval", "benchmark_summary.pdf"),
    op.join(SIM, "eval", "tool_comparison.pdf"),
    op.join(SIM, "eval", "consensus_perturbation.pdf"),
    op.join(SIM, "eval", "feature_variability.pdf"),
]

workflow/scripts/eval_p_decoupling.R:45

  • This aggregation filters on is.finite(i_norm) as well as i_total. That means the unadjusted i_total plot will drop cells/windows outside the allowed methylation band ([0.1, 0.9)), which may unintentionally remove the methylation-extreme conditions from the unadjusted diagnostic. Consider filtering i_total and i_norm separately per variant (or clearly documenting the shared in-band restriction).
    workflow/scripts/eval_sparsity.R:20
  • shannon_binary() is now unused in this script after moving normalization into the amet binary. Consider removing it to avoid implying the script recomputes i_norm.

This issue also appears on line 27 of the same file.
workflow/scripts/eval_sparsity.R:33

  • Filtering on both i_total and i_norm will exclude observations where i_total is defined but i_norm is NA (outside [0.1, 0.9)). That makes the unadjusted (i_total) robustness plot omit the methylation extremes. Consider computing the unadjusted and adjusted aggregations from different filtered subsets.
    workflow/scripts/eval_feature_length.R:19
  • shannon_binary() is now unused in this script after switching to amet-emitted i_norm. Consider removing it to keep the script focused on plotting/aggregation.

This issue also appears on line 27 of the same file.
workflow/scripts/eval_feature_length.R:32

  • This aggregation filters on is.finite(i_norm) in addition to i_total, so the unadjusted (i_total) variant drops methylation-extreme rows where i_norm is NA. If the intent is to show unadjusted behavior across the full marginal range, consider filtering separately for the unadjusted vs adjusted plots.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 30 to 34
shannon_binary <- function(p) {
out <- numeric(length(p)); safe <- !is.na(p) & p > 0 & p < 1
out[safe] <- -p[safe] * log2(p[safe]) - (1 - p[safe]) * log2(1 - p[safe])
out[!safe] <- NA_real_; out
}
Comment on lines 24 to 28
df <- read.table(gzfile(opt$cell_feature), header = TRUE, sep = "\t",
na.strings = "NA", stringsAsFactors = FALSE)
df$shannon_marginal <- shannon_h(df$mean_meth)
i_cols <- grep("^i_[0-9]+$", names(df), value = TRUE); k_max <- length(i_cols)
df$i_norm <- df$i_total / (k_max * shannon_h(df$mean_meth))
df <- df %>% filter(is.finite(i_norm))
df <- df %>% filter(is.finite(i_total), is.finite(i_norm))
df$structure <- ifelse(grepl("^iid_", df$cell_id), "iid", "structured")
Comment on lines +16 to +20
shannon_binary <- function(p) {
out <- numeric(length(p)); safe <- !is.na(p) & p > 0 & p < 1
out[safe] <- -p[safe] * log2(p[safe]) - (1 - p[safe]) * log2(1 - p[safe])
out[!safe] <- NA_real_; out
}
Comment on lines +16 to +20
shannon_binary <- function(p) {
out <- numeric(length(p)); safe <- !is.na(p) & p > 0 & p < 1
out[safe] <- -p[safe] * log2(p[safe]) - (1 - p[safe]) * log2(1 - p[safe])
out[!safe] <- NA_real_; out
}
Comment on lines +16 to +20
shannon_binary <- function(p) {
out <- numeric(length(p)); safe <- !is.na(p) & p > 0 & p < 1
out[safe] <- -p[safe] * log2(p[safe]) - (1 - p[safe]) * log2(1 - p[safe])
out[!safe] <- NA_real_; out
}
Comment thread workflow/rules/ecker.smk
Comment on lines +488 to +489
windows x cells i_total and meth matrices. Streams the input cell block
by cell block so peak memory stays near one cell."""
if (file.exists(opt$output)) file.remove(opt$output)
h5createFile(opt$output)

con <- pipe(sprintf("zcat %s", shQuote(opt$cell_feature)), open = "rt")
Comment on lines +116 to +121
for (cid in unique(complete$cell_id))
flush_cell(complete[cell_id == cid])
}
if (!is.null(pending) && nrow(pending) > 0L)
for (cid in unique(pending$cell_id))
flush_cell(pending[cell_id == cid])
imallona and others added 4 commits May 19, 2026 09:06
Port Emanuel's simulations from `yamet`
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants