diff --git a/.github/workflows/R-CMD-check.yaml b/.github/workflows/R-CMD-check.yaml index 150fc867..e4bee172 100644 --- a/.github/workflows/R-CMD-check.yaml +++ b/.github/workflows/R-CMD-check.yaml @@ -30,6 +30,10 @@ jobs: env: GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }} + # Run skip_on_cran() tests in CI (incl. the varPro grow tests): CI is + # not a sanitizer build, so the upstream UBSAN path is harmless here; + # only CRAN's own check machines (NOT_CRAN unset) skip them. + NOT_CRAN: "true" R_KEEP_PKG_SOURCE: yes # Skip vdiffr visual-regression tests in CI until reference SVGs are # committed. To regenerate: run testthat::snapshot_accept() locally, diff --git a/.github/workflows/check-release.yaml b/.github/workflows/check-release.yaml index b2217c66..c7944d19 100644 --- a/.github/workflows/check-release.yaml +++ b/.github/workflows/check-release.yaml @@ -13,6 +13,10 @@ jobs: runs-on: ubuntu-latest env: GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }} + # Run skip_on_cran() tests in CI (incl. the varPro grow tests): CI is + # not a sanitizer build, so the upstream UBSAN path is harmless here; + # only CRAN's own check machines (NOT_CRAN unset) skip them. + NOT_CRAN: "true" R_KEEP_PKG_SOURCE: yes VDIFFR_RUN_TESTS: "false" steps: diff --git a/.github/workflows/check-standard.yaml b/.github/workflows/check-standard.yaml index d6ea388d..a44f74c9 100644 --- a/.github/workflows/check-standard.yaml +++ b/.github/workflows/check-standard.yaml @@ -29,6 +29,10 @@ jobs: env: GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }} + # Run skip_on_cran() tests in CI (incl. the varPro grow tests): CI is + # not a sanitizer build, so the upstream UBSAN path is harmless here; + # only CRAN's own check machines (NOT_CRAN unset) skip them. + NOT_CRAN: "true" R_KEEP_PKG_SOURCE: yes VDIFFR_RUN_TESTS: "false" diff --git a/.github/workflows/test-coverage.yaml b/.github/workflows/test-coverage.yaml index 9045b734..628f6dc8 100644 --- a/.github/workflows/test-coverage.yaml +++ b/.github/workflows/test-coverage.yaml @@ -14,6 +14,10 @@ jobs: runs-on: ubuntu-latest env: GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }} + # Run skip_on_cran() tests in CI (incl. the varPro grow tests): CI is + # not a sanitizer build, so the upstream UBSAN path is harmless here; + # only CRAN's own check machines (NOT_CRAN unset) skip them. + NOT_CRAN: "true" VDIFFR_RUN_TESTS: "false" steps: diff --git a/CRAN-SUBMISSION b/CRAN-SUBMISSION index 6d2dd745..f9795b1e 100644 --- a/CRAN-SUBMISSION +++ b/CRAN-SUBMISSION @@ -1,3 +1,3 @@ -Version: 3.1.0 -Date: 2026-06-11 15:26:24 UTC -SHA: a7d805290e69ae517d04846bf13fae6a01062fce +Version: 3.1.2 +Date: 2026-06-13 13:53:53 UTC +SHA: 83d8021b0014e72613d35d02caf19bf9e4bca6eb diff --git a/DESCRIPTION b/DESCRIPTION index 62aeb02f..8f1562a4 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -1,8 +1,8 @@ Package: ggRandomForests Type: Package Title: Visually Exploring Random Forests -Version: 3.1.0.9000 -Date: 2026-06-11 +Version: 3.1.2.9000 +Date: 2026-06-13 Authors@R: person("John", "Ehrlinger", role = c("aut", "cre"), email = "john.ehrlinger@gmail.com") diff --git a/NEWS.md b/NEWS.md index a3d1ae38..9f91730d 100644 --- a/NEWS.md +++ b/NEWS.md @@ -1,9 +1,10 @@ Package: ggRandomForests -Version: 3.1.0.9000 +Version: 3.1.2.9000 ggRandomForests v4.0.0 (development) ==================================== -* Development version 3.1.0.9000, opened after the v3.1.0 CRAN release. +* Development version 3.1.2.9000, opened after the v3.1.2 CRAN release + (forward-merged the v3.1.1 and v3.1.2 CRAN fixes onto the dev line). * `gg_auct()` / `plot.gg_auct()`: tidy wrapper and plot for time-varying AUC from `randomForestRHF::auct.rhf()` (RHF Phase 2). Returns a long frame `time / auc / se / lower / upper / marker` with an `iauc` @@ -17,6 +18,44 @@ ggRandomForests v4.0.0 (development) `requireNamespace("randomForestRHF")`. No change for users who do not install it. +ggRandomForests v3.1.2 +====================== +* CRAN fix: skip only the single test grow that trips the upstream + `randomForestSRC` gcc-UBSAN report at `entry.c:184` — the *unsupervised* + isolation forest in `gg_isopro` (`varPro::isopro(method = "unsupv")`). Only + an unsupervised grow has a 0-length `yvar.wt`, the vector `rfsrcGrow` + decrements to an out-of-bounds pointer; supervised grows are unaffected. + We verified this under `-fsanitize=undefined`: of every varPro/rfsrc grow + in the test suite, only `isopro(method = "unsupv")` fires `entry.c:184`. + `make_iso_fit()` therefore calls `skip_on_cran()` only for + `method = "unsupv"`. ggRandomForests is pure R and unchanged. +* The broader `skip_on_cran()` guards added in v3.1.1 (the `varpro`, + `uvarpro`, `ivarpro`, `beta.varpro`, and `isopro(method = "rnd")` test + fixtures) are removed: those grows are supervised (or synthetic-supervised) + and gcc-UBSAN-clean, so they run on CRAN again, restoring that test + coverage. The upstream issue is fixed in `randomForestSRC` and pending a + CRAN release. + +ggRandomForests v3.1.1 +====================== +* CRAN fix: the varPro tests now call `skip_on_cran()` so they do not run + on CRAN's check machines, including the gcc-UBSAN additional check. They + were triggering an upstream `randomForestSRC` sanitizer issue (a 0-length + array access in `rfsrcGrow`, `entry.c:184`) that surfaces when any + `varPro` grow (`varpro()`, `beta.varpro()`, `uvarpro()`, `isopro()`, + `ivarpro()`) builds a forest. ggRandomForests is pure R and its code is + unchanged; the varPro tests still run in our CI (the workflows set + `NOT_CRAN=true`) and locally; they are skipped only on CRAN's check + machines, including the gcc-UBSAN check. The upstream issue has been + reported to the randomForestSRC maintainers. +* The `varpro` vignette now loads every varPro fit from a precomputed + file (`vignettes/varpro_precomputed.rds`, built by + `vignettes/precompute_varpro.R`), so the vignette performs no live + varPro grow during `R CMD check`. This removes the same upstream + sanitizer path from the vignette build and trims check time. Each chunk + falls back to a live fit if the precomputed object is absent, so the + vignette remains reproducible from source. + ggRandomForests v3.1.0 ====================== * Fix: `gg_vimp()` for single-outcome rfsrc forests now correctly flags diff --git a/cran-comments.md b/cran-comments.md index 561003d6..5dfffa66 100644 --- a/cran-comments.md +++ b/cran-comments.md @@ -1,88 +1,64 @@ -## v3.1.0 — varPro integration + bug fix (major release) - -### Resubmission - -This resubmission addresses the one issue raised on the previous 3.1.0 -submission: the overall check time exceeded 10 minutes on -r-devel-windows-x86_64, driven by the vignette rebuild (~331 s) and the -tests (~209 s). We have reduced both, per the three levers suggested -(small toy data, fewer iterations, precomputed results for the lengthiest -parts): - -- The regression and survival vignettes use lighter forests (`ntree` - 200 / 150, imputation `ntree` 100) and coarser partial-dependence grids. -- The varpro vignette's three `gg_partial_varpro()` calls and the Boston - `beta.varpro()` fit — together ~34 s, the bulk of that vignette's - rebuild — are now precomputed offline (`vignettes/precompute_varpro.R`) - and loaded from a 167 KB `vignettes/varpro_precomputed.rds`, with an - automatic live-computation fallback if the file is absent. -- The `gg_udependent()` tests previously recomputed the same per-fit - entropy matrix (~1.5 s) once per test; they now memoise it. - -Locally the full vignette rebuild drops from ~80 s to ~38 s and the test -suite from ~44 s to ~36 s (R 4.6.0, aarch64-apple-darwin23), with no -change in test coverage or vignette content. - -### Release context - -v2.7.3 is the version currently published on CRAN. A v3.0.0 submission -(2026-05-28) cleared the Windows and Debian pretests (0/0/0) but did not -complete the review cycle; this 3.1.0 release supersedes it, carrying the -full v3.0.0 feature layer plus the v3.1.0 bug fix and documentation work. -The version is in 3.x territory because it adds a substantial new feature -layer and soft-deprecates one user-facing function. - -### Changes in v3.1.0 - -- **Bug fix.** `gg_vimp()` on single-outcome `rfsrc` forests now correctly - flags variables with non-positive VIMP in the `positive` column (used for - plot colouring). The single-outcome fit names the column `VIMP` - (uppercase) while the flag check read `$vimp` (lowercase), so `positive` - stayed `TRUE` for every variable. Added a regression test. -- **Documentation.** Deepened the varPro-family and rfsrc importance / - partial / survival help pages, made the `gg_vimp()` (permutation - importance) vs `gg_varpro()` (release-rule importance) distinction - explicit and cross-linked, and fixed a stale `@return` in `gg_roc()`. No - user-facing behaviour change beyond the bug fix above. - -### Major changes carried from v3.0.0 - -- **New varPro wrapper family.** Tidy extractors and `plot()` methods for - the `varPro` package: `gg_partial_varpro()` (partial dependence), - `gg_varpro()` (variable importance), `gg_udependent()` (cross-variable - dependency graph), `gg_isopro()` (isolation-forest anomaly scores), - `gg_beta_varpro()` (per-rule beta importance), and `gg_ivarpro()` - (individual / local importance), each with `print` / `summary` / - `autoplot` companions and a dedicated "varpro" vignette. -- **Soft-deprecation.** `gg_partialpro()` now warns and forwards to - `gg_partial_varpro()`; it will be removed in a future release. -- **randomForest engine fixes.** `gg_variable.randomForest()` - classification, `gg_roc()` / `calc_roc()` for `randomForest` (true - probability-based, macro-averaged ROC), per-class one-vs-rest ROC - (`per_class = TRUE`), and `plot.gg_variable()` now returns a single - `ggplot` / `patchwork` object rather than a bare list. -- **Importance-plot ordering.** All importance plots now place the - most-important variable at the top, matching the `gg_vimp` convention. - -### Dependency change (reverse-dependency safe) - -`randomForestSRC` and `randomForest` move from `Depends:` to `Imports:`, -and `varPro` is added to `Imports:`. `library(ggRandomForests)` no longer -attaches the forest packages to the search path. There are **0 reverse -dependencies** on CRAN, so no downstream packages are affected. - -## Test environments +## v3.1.2 — CRAN fix (gcc-UBSAN additional check, follow-up to 3.1.1) + +This patch completes the gcc-UBSAN "additional issue" fix. v3.1.1 guarded +the varPro forest-growing tests with `skip_on_cran()`, but one fixture was +missed and the auto-check re-flagged the issue (win-builder incoming +pre-test, 2026-06-12; correct before 2026-06-29). + +### The issue + +The sanitizer reports, once, during the package tests: + +``` +entry.c:184:55: runtime error: pointer index expression with base +0x000000000001 overflowed to 0xfffffffffffffff9 + #0 rfsrcGrow .../randomForestSRC/src/entry.c:184 +``` + +This is a 0-length weight-vector access in the compiled code of the +`randomForestSRC` package (`rfsrcGrow`, `entry.c:184`): when `yvar.wt` has +length 0 — which happens for the **unsupervised** family — the unconditional +`RF_yWeight--` forms an out-of-bounds pointer (undefined behaviour, though +never dereferenced). ggRandomForests is a pure-R package +(`NeedsCompilation: no`); it contains no C/C++/Fortran of its own, so the +overflow is not in our code. It is surfaced because our test suite grows an +unsupervised forest via `varPro::isopro()`. + +### The fix + +No package R code changes (ggRandomForests remains `NeedsCompilation: no`). +We built CRAN's `randomForestSRC` 3.6.2 with `-fsanitize=undefined` and ran +every varPro/rfsrc grow in our test suite under it. Only one grow fires +`entry.c:184`: the *unsupervised* isolation forest +(`varPro::isopro(method = "unsupv")`). The bug is a 0-length `yvar.wt` +(present only for the unsupervised family) that `rfsrcGrow` decrements to an +out-of-bounds pointer; supervised grows have a non-empty `yvar.wt` and are +unaffected. + +`make_iso_fit()` in the `gg_isopro` tests therefore calls +`testthat::skip_on_cran()` only when `method = "unsupv"`, so the one +offending grow never runs on CRAN's check machines (including the gcc-UBSAN +additional check). All other varPro tests run on CRAN. We confirmed the full +test suite produces zero gcc-UBSAN errors under the sanitizer configuration +CRAN uses (GCC `-fsanitize=undefined`, which — unlike clang — does not +include `float-cast-overflow`). + +The upstream issue has been reported to the randomForestSRC maintainers. + +### Test environments * **Local:** R 4.6.0 on macOS (aarch64-apple-darwin23). - `R CMD check --as-cran` returns 0 errors, 0 warnings, 0 notes. + `R CMD check --as-cran` (with the manual) returns 0 errors, 0 warnings, + 1 NOTE (days since last update, see below); the single unsupervised isopro + test (`method = "unsupv"`) skips under the CRAN check environment as + intended, all other varPro tests run. * **GitHub Actions matrix:** ubuntu-latest (R-devel / R-release / - R-oldrel-1), windows-latest (R-release), macos-latest (R-release), - all green on the head commit. -* **Reverse-dependency check:** `tools::package_dependencies(reverse = - TRUE)` returns 0. -* **URLs:** `urlchecker::url_check()` clean. + R-oldrel-1), windows-latest (R-release), macos-latest (R-release). +* **Reverse-dependency check:** 0 reverse dependencies on CRAN. -## NOTE disposition +### NOTE disposition -No notes in local `R CMD check --as-cran`. Examples that fit survival -forests are wrapped in `\donttest` and run in a few seconds each. +One NOTE on the incoming feasibility check, "Days since last update: N". +This is expected: this patch is the follow-up to the gcc-UBSAN fix the CRAN +team requested (correct before 2026-06-29), so the short interval is +unavoidable. No other notes. diff --git a/dev/randomForestSRC-ubsan-report.md b/dev/randomForestSRC-ubsan-report.md new file mode 100644 index 00000000..c489f44b --- /dev/null +++ b/dev/randomForestSRC-ubsan-report.md @@ -0,0 +1,90 @@ +# UBSAN: 0-length-array pointer arithmetic in `rfsrcGrow` (entry.c:184) + +**Package:** randomForestSRC (observed in 3.6.2, latent in earlier versions +incl. 3.3.5) +**Reporter:** John Ehrlinger (ggRandomForests maintainer) +**Severity:** Undefined behaviour flagged by gcc-UBSAN; numerically benign in +practice, but CRAN treats it as an "Additional issue." + +## What CRAN reports + +gcc-UBSAN, surfaced via the ggRandomForests test suite (which grows varPro +forests): + +``` +entry.c:184:55: runtime error: pointer index expression with base +0x000000000001 overflowed to 0xfffffffffffffff9 + #0 rfsrcGrow .../randomForestSRC/src/entry.c:184 +``` + +## Root cause + +`rfsrcGrow()` builds 1-offset working pointers from incoming R weight vectors: + +```c +// src/entry.c (3.6.2), lines 183-185 +RF_xWeightStat = REAL(xWeightStat); RF_xWeightStat--; // 183 +RF_yWeight = REAL(yWeight); RF_yWeight--; // 184 <-- UBSAN +RF_xWeight = REAL(xWeight); RF_xWeight--; // 185 +``` + +When the corresponding R vector has **length 0**, `REAL()` returns R's +empty-vector sentinel address `0x1`. The unconditional `ptr--` then forms +`0x1 - sizeof(double) = 0x1 - 8 = 0xfffffffffffffff9`. *Forming* this +out-of-bounds pointer is undefined behaviour (UBSAN, column 55 = the `--`), +even though the package never dereferences `RF_yWeight[i]` when the weight is +empty — so the result is numerically correct and the bug is otherwise silent. + +`yWeight` (R-side `yvar.wt`) is length 0 whenever the y-weight vector is +empty. The R glue produces exactly that for the **unsupervised** family: + +```r +# R/rfsrc.R +if (family == "unspv") { + yvar.wt <- NULL # -> as.double(NULL) -> numeric(0) -> length-0 yWeight +} else { + yvar.wt <- get.weight(yvar.wt, length(yvar.types)) +} +``` + +So every unsupervised forest already forms this pointer; varPro reaches it +because its rule-grow machinery fits forests through that path. + +### Minimal reproduction (no varPro needed) + +```r +library(randomForestSRC) +d <- data.frame(y = rnorm(60), x1 = rnorm(60), x2 = rnorm(60)) +rfsrc(Unsupervised() ~ ., d, ntree = 10) # yvar.wt is numeric(0) -> entry.c:184 +``` + +Confirmed by tracing `randomForestSRC:::get.weight`: the unsupervised path +returns a length-0 `yvar.wt` (`n=0, outlen=0`); regression returns length 1. + +Note the sibling pointers `RF_subjWeight`, `RF_eventType`, etc. are guarded +with `if (VECTOR_ELT(...) != R_NilValue)`, but these three are not — and +`as.double(numeric(0))` is never `R_NilValue`, so nothing catches the empty +case. + +## Suggested fix + +Only decrement when the vector is non-empty (mirrors the existing +NULL-guarded idiom; the empty pointer is never indexed downstream): + +```c +RF_xWeightStat = REAL(xWeightStat); if (LENGTH(xWeightStat) > 0) RF_xWeightStat--; +RF_yWeight = REAL(yWeight); if (LENGTH(yWeight) > 0) RF_yWeight--; +RF_xWeight = REAL(xWeight); if (LENGTH(xWeight) > 0) RF_xWeight--; +``` + +(In this repo's amalgamated `src/randomForestSRC.c` the same three lines are +at ~36965-36967.) The identical guard should be applied anywhere else +`rfsrcGrow`/`rfsrcPredict` does `REAL(w); w--` / `INTEGER(w); w--` on a vector +that can arrive length 0. + +## Downstream impact + +ggRandomForests 3.1.1 works around this by not exercising the path under +`R CMD check` (varPro tests `skip_on_cran()`; the varpro vignette loads +precomputed fits). A guard in randomForestSRC would remove the UB at the +source for all callers. diff --git a/tests/testthat/helper-varpro-fixtures.R b/tests/testthat/helper-varpro-fixtures.R index 2f3a3c8f..ec76252d 100644 --- a/tests/testthat/helper-varpro-fixtures.R +++ b/tests/testthat/helper-varpro-fixtures.R @@ -1,6 +1,12 @@ # Session-memoised varpro + beta.varpro fixtures for the gg_beta_varpro tests. # beta.varpro() is the expensive call (per-rule glmnet); compute once per R # session and reuse. In-memory only — no disk cache. +# +# These all grow *supervised* varpro/ivarpro/beta.varpro forests (a real Y), so +# yvar.wt is non-empty and they do NOT trip randomForestSRC's gcc-UBSAN report +# at entry.c:184 (that fires only for the unsupervised family — verified under +# -fsanitize=undefined). They intentionally run on CRAN; do not skip_on_cran(). +# Only the unsupervised isopro grow is skipped (see test_gg_isopro.R). .varpro_cache <- new.env(parent = emptyenv()) diff --git a/tests/testthat/test_gg_isopro.R b/tests/testthat/test_gg_isopro.R index 5fa28eb0..acac8479 100644 --- a/tests/testthat/test_gg_isopro.R +++ b/tests/testthat/test_gg_isopro.R @@ -2,7 +2,14 @@ # Phase 4 of the v2.8.0 varPro integration. make_iso_fit <- function(seed = 1L, method = "rnd", ntree = 25, sampsize = 16) { - skip_if_not_installed("varPro") + # Only the *unsupervised* isopro grow (method = "unsupv") trips + # randomForestSRC's gcc-UBSAN report at entry.c:184 (length-0 yvar.wt + # decremented to an out-of-bounds pointer; upstream bug, ggRandomForests is + # pure R). The default method = "rnd" builds a synthetic-supervised forest and + # is UBSAN-clean — verified under -fsanitize=undefined. So skip_on_cran() only + # the unsupervised path; the rnd tests run on CRAN. Both run in CI and locally. + if (identical(method, "unsupv")) testthat::skip_on_cran() + testthat::skip_if_not_installed("varPro") set.seed(seed) varPro::isopro( data = iris[, 1:4], diff --git a/tests/testthat/test_gg_udependent.R b/tests/testthat/test_gg_udependent.R index 39a2f4c2..0c0e7fe7 100644 --- a/tests/testthat/test_gg_udependent.R +++ b/tests/testthat/test_gg_udependent.R @@ -3,6 +3,9 @@ ## ── Helpers ────────────────────────────────────────────────────────────────── make_uvp <- function(ntree = 25L) { + # uvarpro() grows a *synthetic-supervised* forest (yxyz123 ~ ., random Y), so + # yvar.wt is non-empty and it does NOT trip the entry.c:184 gcc-UBSAN report + # (only the truly-unsupervised isopro grow does). Runs on CRAN. set.seed(42L) varPro::uvarpro(iris[, -5L], ntree = ntree) } diff --git a/tests/testthat/test_gg_varpro.R b/tests/testthat/test_gg_varpro.R index 68cba93a..69428f48 100644 --- a/tests/testthat/test_gg_varpro.R +++ b/tests/testthat/test_gg_varpro.R @@ -4,6 +4,8 @@ # Regression fit — fast, always available (mtcars is base R) make_vp_regr <- function(ntree = 25L) { + # Supervised varpro grow (real Y) — UBSAN-clean; runs on CRAN. See + # helper-varpro-fixtures.R for why varPro grows are safe except isopro(unsupv). set.seed(42L) varPro::varpro(mpg ~ ., data = mtcars, ntree = ntree) } diff --git a/tests/testthat/test_gg_vimp.R b/tests/testthat/test_gg_vimp.R index 276a87c5..da349a1d 100644 --- a/tests/testthat/test_gg_vimp.R +++ b/tests/testthat/test_gg_vimp.R @@ -336,14 +336,16 @@ test_that("gg_vimp.rfsrc single-outcome: positive flag correctly uses the VIMP c # is named "VIMP" (uppercase) in single-outcome rfsrc fits, leaving positive # always TRUE even for variables with non-positive VIMP. # - # `const` is a zero-variance predictor: rfsrc cannot split on it, so its VIMP - # is exactly 0 (non-positive) on every platform. That deterministically - # exercises the bug condition (the pre-fix code left every `positive` TRUE). + # Force one VIMP non-positive so the test deterministically exercises the bug + # condition. A zero-variance ("const") predictor is unreliable here: + # permutation VIMP of a weak/constant variable can land slightly > 0 (and + # rfsrc may drop a constant column), so it does not guarantee a non-positive + # VIMP on every platform (observed failing on R-release). The pre-fix code + # left `positive` TRUE regardless of the VIMP sign. set.seed(2024L) - aq <- na.omit(airquality) - aq$const <- 1 - rf <- randomForestSRC::rfsrc(Ozone ~ ., data = aq, + rf <- randomForestSRC::rfsrc(Ozone ~ ., data = na.omit(airquality), ntree = 200, importance = TRUE) + rf$importance[1] <- -1 gg_dta <- gg_vimp(rf) expect_s3_class(gg_dta, "gg_vimp") @@ -353,8 +355,8 @@ test_that("gg_vimp.rfsrc single-outcome: positive flag correctly uses the VIMP c expect_false(is.na(vimp_col)) # The invariant, asserted for every row: positive is TRUE exactly when VIMP > 0. expect_equal(gg_dta$positive, gg_dta[[vimp_col]] > 0) - # The constant predictor guarantees at least one non-positive VIMP, so the - # pre-fix all-TRUE behaviour would fail here. + # The injected -1 guarantees at least one non-positive VIMP, so the pre-fix + # all-TRUE behaviour would fail here. expect_true(any(!gg_dta$positive)) expect_s3_class(plot(gg_dta), "ggplot") }) diff --git a/vignettes/precompute_varpro.R b/vignettes/precompute_varpro.R index 4abdcd57..7bf60d1e 100644 --- a/vignettes/precompute_varpro.R +++ b/vignettes/precompute_varpro.R @@ -1,16 +1,23 @@ -# Precompute the expensive varPro objects used by the varpro vignette. +# Precompute the varPro objects used by the varpro vignette. # -# The three gg_partial_varpro() calls (~31s) and the Boston beta.varpro() -# fit (~3s) dominate the vignette rebuild time. Computing them once here and -# loading the result in varpro.qmd keeps the R CMD check vignette rebuild -# fast (CRAN flagged the overall check time). Every other varPro call in the -# vignette is sub-second and stays live. +# Two reasons this file exists: +# +# 1. Speed. The three gg_partial_varpro() calls (~31s) and the beta.varpro() +# fits dominate the vignette rebuild time (CRAN flagged the overall check +# time). +# 2. Safety. Every varPro grow (varpro(), uvarpro(), isopro(), ivarpro(), +# beta.varpro()) reaches randomForestSRC's compiled rule-grow path, which +# trips a gcc-UBSAN "0-length array" report (rfsrcGrow, entry.c:184). By +# loading these objects from disk, the vignette performs NO live varPro +# grow during R CMD check, so it cannot surface that upstream sanitizer +# issue even if a CRAN flavour builds the vignette under UBSAN. # # Run from the package root after changing any of the varpro vignette fits: # Rscript vignettes/precompute_varpro.R # # Produces vignettes/varpro_precomputed.rds. The vignette falls back to live -# computation if that file is absent, so the result is reproducible either way. +# computation for any object missing from that file, so the result is +# reproducible either way. suppressMessages({ library(varPro) @@ -28,36 +35,96 @@ if (requireNamespace("ggRandomForests", quietly = TRUE)) { } options(mc.cores = 1, rf.cores = 1) -# --- Regression: Boston housing ------------------------------------------ if (!requireNamespace("MASS", quietly = TRUE)) { - stop("Package 'MASS' is required for the Boston housing data; install it to ", - "run this script.") + stop("Install 'MASS' to run this script (the regression fits use MASS::Boston).") } + +# --- Regression: Boston housing ------------------------------------------ data("Boston", package = "MASS") +boston_x <- Boston[, setdiff(names(Boston), "medv")] +set.seed(20260527L) +v_boston <- varPro::varpro(medv ~ ., data = Boston, ntree = 50) +pd_boston <- gg_partial_varpro(object = v_boston) +b_boston <- varPro::beta.varpro(v_boston) +iv_boston <- varPro::ivarpro(v_boston) +set.seed(20260527L) +u_boston <- varPro::uvarpro(boston_x, ntree = 50) set.seed(20260527L) -v_boston <- varPro::varpro(medv ~ ., data = Boston, ntree = 50) -pd_boston <- gg_partial_varpro(object = v_boston) -b_boston <- varPro::beta.varpro(v_boston) +iso_boston <- varPro::isopro(data = boston_x, method = "rnd", + sampsize = 256, ntree = 50) -# --- Classification: iris (multi-class) ---------------------------------- +# --- Classification: iris ------------------------------------------------ +iris_binary <- iris[iris$Species != "setosa", ] +iris_binary$Species <- droplevels(iris_binary$Species) +set.seed(20260527L) +v_iris_binary <- varPro::varpro(Species ~ ., data = iris_binary, ntree = 50) +b_iris_binary <- varPro::beta.varpro(v_iris_binary) set.seed(20260527L) v_iris_multi <- varPro::varpro(Species ~ ., data = iris, ntree = 50) pd_iris_multi <- gg_partial_varpro(object = v_iris_multi) +b_iris_multi <- varPro::beta.varpro(v_iris_multi) # --- Survival: PBC (C-path partial dependence) --------------------------- data(pbc, package = "randomForestSRC") pbc_small <- na.omit(pbc[, c("days", "status", "age", "albumin", "bili", "edema", "platelet")]) set.seed(20260527L) -v_pbc <- varPro::varpro(Surv(days, status) ~ ., data = pbc_small, ntree = 50) -pd_pbc <- gg_partial_varpro(object = v_pbc) +v_pbc <- varPro::varpro(Surv(days, status) ~ ., data = pbc_small, ntree = 50) +pd_pbc <- gg_partial_varpro(object = v_pbc) +set.seed(20260527L) +iso_pbc <- varPro::isopro(data = pbc_small[, c("age", "albumin", "bili", + "platelet")], + method = "rnd", sampsize = 256, ntree = 50) + +# --- Trim to keep the shipped .rds (and the source tarball) small -------- +# The gg_* calls this vignette makes (on its cached path) read only +# importance/rule summaries, not the embedded forests. The survival C-path +# gg_partial_varpro() and gg_isopro(newdata=) DO use the forest, but the +# vignette never invokes those on a stripped object (pd_pbc is cached and +# gg_isopro() is training-path only). Dropping those heavy +# slots takes the file from ~1.6 MB to ~0.4 MB (validated: every vignette +# wrapper call returns output identical to the un-stripped object). Two +# exceptions keep their forest: v_boston is printed in the vignette +# (print.varpro reads $rf), and u_boston feeds gg_udependent(), which uses it. +.strip_varpro <- function(v) { # $rf: unused on the cached path + v$rf <- NULL + v +} +.strip_isopro <- function(o) { # training-path gg_isopro reads only $ntree + o$isoforest <- list(ntree = o$isoforest$ntree) + o +} +.strip_ivarpro <- function(iv) { # drop redundant model/path attrs + for (a in c("ivarpro.path", "model", "data")) attr(iv, a) <- NULL + iv +} +b_boston <- .strip_varpro(b_boston) +v_pbc <- .strip_varpro(v_pbc) +v_iris_binary <- .strip_varpro(v_iris_binary) +v_iris_multi <- .strip_varpro(v_iris_multi) +iso_boston <- .strip_isopro(iso_boston) +iso_pbc <- .strip_isopro(iso_pbc) +iv_boston <- .strip_ivarpro(iv_boston) saveRDS( list( + # Boston (regr) + v_boston = v_boston, pd_boston = pd_boston, b_boston = b_boston, + iv_boston = iv_boston, + u_boston = u_boston, + iso_boston = iso_boston, + # iris (class) + v_iris_binary = v_iris_binary, + b_iris_binary = b_iris_binary, + v_iris_multi = v_iris_multi, pd_iris_multi = pd_iris_multi, - pd_pbc = pd_pbc + b_iris_multi = b_iris_multi, + # PBC (surv) + v_pbc = v_pbc, + pd_pbc = pd_pbc, + iso_pbc = iso_pbc ), file = "vignettes/varpro_precomputed.rds", version = 2, diff --git a/vignettes/varpro.qmd b/vignettes/varpro.qmd index b63f4d48..0bd7bb44 100644 --- a/vignettes/varpro.qmd +++ b/vignettes/varpro.qmd @@ -145,7 +145,12 @@ predictors, median home value as the response. ```{r boston-fit} data("Boston", package = "MASS") set.seed(20260527L) -v_boston <- varPro::varpro(medv ~ ., data = Boston, ntree = 50) +# Precomputed offline (see precompute_varpro.R); falls back to a live fit. +v_boston <- if (is.null(.vp$v_boston)) { + varPro::varpro(medv ~ ., data = Boston, ntree = 50) +} else { + .vp$v_boston +} v_boston ``` @@ -264,8 +269,12 @@ configurable threshold. The visual is built with `ggraph`, which is in `Suggests` rather than `Imports`; install it if you want this view. ```{r boston-uvarpro, cache=TRUE} -u_boston <- varPro::uvarpro(Boston[, setdiff(names(Boston), "medv")], - ntree = 50) +# Precomputed offline (see precompute_varpro.R); falls back to a live fit. +u_boston <- if (is.null(.vp$u_boston)) { + varPro::uvarpro(Boston[, setdiff(names(Boston), "medv")], ntree = 50) +} else { + .vp$u_boston +} ``` ```{r boston-gg-udependent, eval = requireNamespace("ggraph", quietly = TRUE)} @@ -294,10 +303,13 @@ on `[0, 1]`; the wrapper's convention is "higher = more anomalous" consistency). ```{r boston-isopro, cache=TRUE} -iso_boston <- varPro::isopro(data = Boston[, setdiff(names(Boston), - "medv")], - method = "rnd", sampsize = 256, - ntree = 50) +# Precomputed offline (see precompute_varpro.R); falls back to a live fit. +iso_boston <- if (is.null(.vp$iso_boston)) { + varPro::isopro(data = Boston[, setdiff(names(Boston), "medv")], + method = "rnd", sampsize = 256, ntree = 50) +} else { + .vp$iso_boston +} ``` ```{r boston-gg-isopro} @@ -331,7 +343,12 @@ local importances across variables. accepts a pre-computed `ivarpro_fit` for reuse across views. ```{r boston-ivarpro-fit, cache=TRUE} -iv_boston <- varPro::ivarpro(v_boston) +# Precomputed offline (see precompute_varpro.R); falls back to a live fit. +iv_boston <- if (is.null(.vp$iv_boston)) { + varPro::ivarpro(v_boston) +} else { + .vp$iv_boston +} ``` ```{r boston-gg-ivarpro-distribution} @@ -375,12 +392,22 @@ and the full three-class problem. iris_binary <- iris[iris$Species != "setosa", ] iris_binary$Species <- droplevels(iris_binary$Species) set.seed(20260527L) -v_iris_binary <- varPro::varpro(Species ~ ., data = iris_binary, ntree = 50) +# Precomputed offline (see precompute_varpro.R); falls back to a live fit. +v_iris_binary <- if (is.null(.vp$v_iris_binary)) { + varPro::varpro(Species ~ ., data = iris_binary, ntree = 50) +} else { + .vp$v_iris_binary +} ``` ```{r iris-fit-multi} set.seed(20260527L) -v_iris_multi <- varPro::varpro(Species ~ ., data = iris, ntree = 50) +# Precomputed offline (see precompute_varpro.R); falls back to a live fit. +v_iris_multi <- if (is.null(.vp$v_iris_multi)) { + varPro::varpro(Species ~ ., data = iris, ntree = 50) +} else { + .vp$v_iris_multi +} ``` ## Class-conditional importance with `gg_varpro(conditional = TRUE)` @@ -437,7 +464,12 @@ such as 30-day mortality: drop the negative class, set the event class as the last factor level, and read the figure for the event panel. ```{r iris-binary-beta-fit, cache=TRUE} -b_iris_binary <- varPro::beta.varpro(v_iris_binary) +# Precomputed offline (see precompute_varpro.R); falls back to a live fit. +b_iris_binary <- if (is.null(.vp$b_iris_binary)) { + varPro::beta.varpro(v_iris_binary) +} else { + .vp$b_iris_binary +} ``` ```{r iris-binary-gg-beta-varpro} @@ -449,7 +481,12 @@ class sharing the row order set by the unified ranking, same trick as `gg_varpro(conditional = TRUE)`. ```{r iris-multi-beta-fit, cache=TRUE} -b_iris_multi <- varPro::beta.varpro(v_iris_multi) +# Precomputed offline (see precompute_varpro.R); falls back to a live fit. +b_iris_multi <- if (is.null(.vp$b_iris_multi)) { + varPro::beta.varpro(v_iris_multi) +} else { + .vp$b_iris_multi +} ``` ```{r iris-multi-gg-beta-varpro} @@ -491,10 +528,12 @@ pbc_small <- pbc[, c("days", "status", "age", "albumin", "bili", "edema", "platelet")] pbc_small <- na.omit(pbc_small) set.seed(20260527L) -v_pbc <- varPro::varpro( - Surv(days, status) ~ ., - data = pbc_small, ntree = 50 -) +# Precomputed offline (see precompute_varpro.R); falls back to a live fit. +v_pbc <- if (is.null(.vp$v_pbc)) { + varPro::varpro(Surv(days, status) ~ ., data = pbc_small, ntree = 50) +} else { + .vp$v_pbc +} ``` ## Variable importance: `gg_varpro()` @@ -532,9 +571,13 @@ Because `isopro()` only sees the predictor matrix, it doesn't care about the family. The same call from section 3 works here. ```{r pbc-isopro, cache=TRUE} -iso_pbc <- varPro::isopro(data = pbc_small[, c("age", "albumin", "bili", - "platelet")], - method = "rnd", sampsize = 256, ntree = 50) +# Precomputed offline (see precompute_varpro.R); falls back to a live fit. +iso_pbc <- if (is.null(.vp$iso_pbc)) { + varPro::isopro(data = pbc_small[, c("age", "albumin", "bili", "platelet")], + method = "rnd", sampsize = 256, ntree = 50) +} else { + .vp$iso_pbc +} plot(gg_isopro(iso_pbc)) ``` diff --git a/vignettes/varpro_precomputed.rds b/vignettes/varpro_precomputed.rds index d9694604..82eea83b 100644 Binary files a/vignettes/varpro_precomputed.rds and b/vignettes/varpro_precomputed.rds differ