Commit 295344c

Author: Zachary DeBruine
RcppML 1.0.0: complete rewrite with IRLS distributions, GPU support, StreamPress I/O, factor networks
Major changes from 0.3.7:
- S4 nmf class replaces list output
- Statistical distributions via IRLS: MSE, Generalized Poisson, Negative Binomial, Gamma, Inverse Gaussian, Tweedie
- Zero-inflation models (ZINB/ZIGP)
- Built-in cross-validation with configurable test masking
- Optional CUDA GPU acceleration
- StreamPress/SparsePress I/O (.spz format)
- FactorNet graph API for multi-modal/deep/branching NMF
- Automatic distribution selection
- 11 comprehensive vignettes
- Full pkgdown documentation site
1 parent 205a968 commit 295344c

553 files changed

Lines changed: 30002 additions & 14653 deletions

.Rbuildignore

Lines changed: 8 additions & 0 deletions
@@ -89,3 +89,11 @@
8989
^FACTOR_NET_VIGNETTE_PLAN\.md$
9090
^cran-comments\.md$
9191
^vignettes/figure$
92+
^GUIDE_EMBEDDING_NOTES\.md$
93+
^VIGNETTE_UNADDRESSED_CRITIQUES\.md$
94+
^sandbox$
95+
^rendered_vignettes$
96+
^doc$
97+
^GuidedNMFManuscript$
98+
^NEWS\.md\.bak$
99+
^README\.md$
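
`.Rbuildignore` entries are regular expressions that `R CMD build` matches, case-insensitively, against paths relative to the package root. The sketch below approximates that matching with `grep -Ei` so the patterns added in this hunk can be sanity-checked without building a tarball. The helper name `is_build_ignored` and the sample paths are illustrative assumptions, and `grep -E` is only a stand-in for R's Perl-like regex engine.

```bash
# Approximate .Rbuildignore matching with grep (illustrative only).
# grep -Ei agrees with R CMD build for the simple anchored patterns below.

# The eight patterns added in this hunk, one per line.
patterns='^GUIDE_EMBEDDING_NOTES\.md$
^VIGNETTE_UNADDRESSED_CRITIQUES\.md$
^sandbox$
^rendered_vignettes$
^doc$
^GuidedNMFManuscript$
^NEWS\.md\.bak$
^README\.md$'

# is_build_ignored PATH: prints "ignored" or "kept" (hypothetical helper).
# A newline-separated pattern list passed to -e matches if any line matches.
is_build_ignored() {
  if printf '%s\n' "$1" | grep -Eiq -e "$patterns"; then
    echo "ignored"
  else
    echo "kept"
  fi
}

is_build_ignored "sandbox"   # exact match for ^sandbox$
is_build_ignored "docs"      # ^doc$ is anchored, so "docs" is not matched
is_build_ignored "NEWS.md"   # only NEWS.md.bak is excluded
```

The anchors matter here: `^doc$` excludes a top-level `doc` entry but leaves `docs/` alone.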

.github/copilot-instructions.md

Lines changed: 16 additions & 1 deletion
@@ -111,7 +111,7 @@ bash tools/fix_rcpp_info_bug.sh
111111

112112
5. **Thread safety**: OpenMP parallelization requires proper reduction clauses for shared variables.
113113

114-
6. **C++ API indexing**: `bipartiteMatch()` returns 0-indexed `$assignment` (not `$pairs`). `dclust()` returns 0-indexed `$samples` and numeric `$id` (not character). Always check C++ struct definitions for actual field names and types.
114+
6. **C++ API indexing**: `bipartiteMatch()` returns 0-indexed `$assignment` (not `$pairs`). `dclust()` returns 0-indexed `$samples` and character `$id` (binary path string encoding split hierarchy, e.g. "01" means root→left→right). Always check C++ struct definitions for actual field names and types.
115115

116116
7. **Rcpp 1.1.0 info bug**: After every `roxygenise()`, run `bash tools/fix_rcpp_info_bug.sh`. The bug inserts an undefined `info` symbol into `RcppExports.cpp`. Never use `roxygenise(".", clean = TRUE)` — it fails because it tries to dyn.load the .so with the bug.
117117

@@ -386,3 +386,18 @@ If a new command pattern is needed frequently, add it to `.vscode/settings.json`
386386
### Module Setup
387387
- `module load r/4.5.2` on compute nodes.
388388
- **OpenMP**: Set `OMP_NUM_THREADS=4` to `8` when sharing nodes with existing jobs (not `$SLURM_CPUS_PER_TASK`, since you are SSH'd, not in a SLURM allocation).
389+
390+
### Viewing Rendered HTML in VS Code on the HPC
391+
392+
To view rendered HTML files (e.g. vignettes, pkgdown output) in VS Code's Simple Browser:
393+
394+
1. Start a Python HTTP server **on the login node** (serving static files is lightweight and safe):
395+
```bash
396+
cd /path/to/html/files && python3 -m http.server 8899 --bind 127.0.0.1
397+
```
398+
2. VS Code Remote SSH auto-forwards the port. Open in Simple Browser:
399+
```
400+
http://localhost:8899/filename.html
401+
```
402+
403+
Do **not** use `file://` URIs — VS Code's Simple Browser cannot resolve remote filesystem paths. Do **not** run the server on a compute node — VS Code only auto-forwards ports from the host it's connected to (the login node).

.gitignore

Lines changed: 31 additions & 0 deletions
@@ -7,6 +7,9 @@
77
# IDE
88
.vscode/
99

10+
# Manuscript (not part of the R package)
11+
GuidedNMFManuscript/
12+
1013
# Superseded plans (now archived to docs/dev/)
1114
/HARDENING_PLAN.md
1215
/WORKSTREAMS.md
@@ -16,6 +19,7 @@ src/*.o
1619
src/*.so
1720
src/*.dll
1821
inst/lib/*.so
22+
tests/cpp/*.o
1923

2024
# pkgdown site (auto-generated)
2125
docs/site/
@@ -63,3 +67,30 @@ benchmarks/**/*.csv
6367
benchmarks/**/*.spz
6468
benchmarks/**/cellcensus_*_binary/
6569
benchmarks/results/overfitting_log.txt
70+
71+
# Development artifacts (not part of package distribution)
72+
logs/
73+
rendered_vignettes/
74+
sandbox/
75+
NEWS.md.bak
76+
inst/lib/.nfs*
77+
78+
# tools/ generated outputs (keep scripts, ignore results)
79+
tools/*.log
80+
tools/*.rds
81+
tools/*.png
82+
tools/*.html
83+
tools/*_progress.txt
84+
tools/*_log.txt
85+
86+
# GuidedNMFManuscript generated outputs (keep scripts & reports)
87+
GuidedNMFManuscript/**/*.log
88+
GuidedNMFManuscript/**/*.rds
89+
GuidedNMFManuscript/**/*.csv
90+
GuidedNMFManuscript/**/results/
91+
GuidedNMFManuscript/**/plots/
92+
GuidedNMFManuscript/**/logs/
93+
GuidedNMFManuscript/**/combined/
94+
GuidedNMFManuscript/**/tsv2_*/
95+
GuidedNMFManuscript/dead_code/
96+
GuidedNMFManuscript/sandbox/

CONTRIBUTING.md

Lines changed: 12 additions & 5 deletions
@@ -3,6 +3,8 @@
33
Thank you for your interest in contributing to RcppML! This document provides
44
guidelines and instructions for contributing to the project.
55

6+
RcppML v1.0.0 was written in early 2026 in VS Code with GitHub Copilot and Claude Opus 4.6 Agent mode. A similar development IDE is recommended for further contributions due to the large number of backends, algorithms, and optimization branches that are difficult to trace manually. Please ensure all pull requests are as narrow in scope as possible.
7+
68
## Getting Started
79

810
1. **Fork and clone** the repository:
@@ -28,8 +30,11 @@ guidelines and instructions for contributing to the project.
2830
| Directory | Contents |
2931
|-----------|----------|
3032
| `R/` | R source files (roxygen documented) |
31-
| `src/` | C++ Rcpp bridge (`RcppFunctions.cpp`) |
32-
| `inst/include/RcppML/` | Header-only C++ library (core algorithms) |
33+
| `src/` | C++ Rcpp bridge files (`RcppFunctions*.cpp`, `sparsepress_bridge.cpp`) |
34+
| `inst/include/RcppML/` | Header-only C++ library (core NNLS, clustering) |
35+
| `inst/include/FactorNet/` | NMF/SVD engine, GPU kernels, factor graph fitting |
36+
| `inst/include/streampress/` | StreamPress I/O (rANS coding) |
37+
| `inst/include/sparsepress/` | SparsePress compression |
3338
| `tests/testthat/` | Unit tests (testthat) |
3439
| `vignettes/` | Package vignettes |
3540
| `man/` | **Auto-generated** — do NOT edit |
@@ -55,9 +60,11 @@ guidelines and instructions for contributing to the project.
5560

5661
### Editing C++ Code
5762

58-
All C++ algorithm code lives in `inst/include/RcppML/` as a header-only library.
59-
The only compiled file is `src/RcppFunctions.cpp`, which provides `// [[Rcpp::export]]`
60-
wrappers.
63+
All C++ algorithm code lives in `inst/include/` as a header-only library
64+
(subdivided into `RcppML/`, `FactorNet/`, `streampress/`, and `sparsepress/`).
65+
The `src/` directory contains Rcpp bridge files (`RcppFunctions*.cpp`) that
66+
provide `// [[Rcpp::export]]` wrappers, plus `sparsepress_bridge.cpp` for
67+
StreamPress I/O.
6168

6269
After changing `// [[Rcpp::export]]` annotations, run:
6370
```r

CRAN_AUDIT.md

Lines changed: 204 additions & 0 deletions
@@ -0,0 +1,204 @@
1+
# RcppML 1.0.0 — CRAN Pre-Submission Audit
2+
3+
**Initial Audit**: 2026-03-13
4+
**Last Revised**: 2026-03-14
5+
**Package Version**: 1.0.0
6+
**Previous CRAN Version**: 0.3.7
7+
8+
---
9+
10+
## Executive Summary
11+
12+
The package passes `R CMD check --as-cran` (with full vignette rebuild) with **0 errors**, **2 warnings** (both from missing system tools on the HPC: `checkbashisms`, `qpdf`), and **1 note** (CRAN incoming feasibility: SeuratData not in mainstream repos, ~9.5 MB tarball).
13+
14+
### Verdict: **PASS** — Ready for CRAN submission
15+
16+
---
17+
18+
## 1. R CMD check (`--as-cran`, full vignette rebuild)
19+
20+
### Status: ✅ PASS (0 errors, 2 warnings, 1 note)
21+
22+
```
23+
Platform: x86_64-pc-linux-gnu, R 4.5.2, GCC 13.3.1, RHEL 9.7
24+
25+
* checking whether package 'RcppML' can be installed ... [274s] OK
26+
* checking R code for possible problems ... OK
27+
* checking Rd files ... OK
28+
* checking for missing documentation entries ... OK
29+
* checking for code/documentation mismatches ... OK
30+
* checking Rd \usage sections ... OK
31+
* checking Rd contents ... OK
32+
* checking examples ... OK
33+
* checking examples with --run-donttest ... OK
34+
* checking tests ... OK (testthat: 40s)
35+
* checking for unstated dependencies in vignettes ... OK
36+
* checking package vignettes ... OK
37+
* checking re-building of vignette outputs ... [679s] OK
38+
* checking compiled code ... OK
39+
40+
Status: 2 WARNINGs, 1 NOTE
41+
```
42+
43+
**WARNINGs** (system-tool gaps, not package defects — CRAN build machines have these tools):
44+
1. `checkbashisms` — not installed on HPC. Checks configure/cleanup shell scripts.
45+
2. `qpdf` — not installed on HPC. Checks PDF compression; all vignettes are HTML.
46+
47+
**NOTE** (CRAN incoming feasibility):
48+
- `SeuratData` in Suggests but not in mainstream repos. Expected; documented in `cran-comments.md`.
49+
- Tarball: 9,506,276 bytes (~9.5 MB, or ~9.1 MiB). Justified: 7 datasets (6.5 MB), 11 pre-built vignettes (3.2 MB), Eigen template headers (2.4 MB).
50+
51+
**Installed size**: 111.7 MB unstripped (98.6 MB debug symbols in `libs/`). After `strip -s` (which CRAN uses), the shared library is 1.8 MB.
52+
53+
### All Previously Identified Issues — RESOLVED:
54+
- ~~465 MB tarball from `GuidedNMFManuscript/` leak~~ → `.Rbuildignore`
55+
- ~~4 unused-variable compilation warnings~~ → `(void)var;` casts
56+
- ~~Test failures from deprecated API~~ → tests updated
57+
- ~~Example errors (GPU, file, unexported)~~ → properly guarded
58+
- ~~R >= 3.5.0 but `|>` pipe in examples~~ → bumped to R >= 4.1.0
59+
- ~~`stxBrain.SeuratData` undeclared vignette dependency~~ → `system.file()` check
60+
- ~~`https://yann.lecun.com/...` URL broken (no HTTPS support)~~ → reverted to working `http://`
61+
- ~~Missing `\value` in 10 internal `.Rd` files~~ → added `@return` tags
62+
- ~~`stop(paste(...))` in `plot_nmf.R`~~ → `stop(sprintf(...))`
63+
64+
---
65+
66+
## 2. Detailed Compliance Checks
67+
68+
### 2a. `\dontrun{}` Usage — ✅ JUSTIFIED (19 total)
69+
70+
| Category | Count | Files | Justification |
71+
|----------|-------|-------|---------------|
72+
| SPZ file-dependent | 12 | `streampress.R` | Require `.spz` files not shipped |
73+
| GPU hardware | 3 | `sp_gpu.R` | Require CUDA GPU; no CPU fallback |
74+
| Unexported internals | 3 | `random.R` | `@keywords internal`; not on search path |
75+
| Internal NMF methods | 1 | `nmf_methods.R` | `mse()` is `@keywords internal` |
76+
77+
All `\dontrun{}` blocks are genuinely non-runnable in a CRAN check environment.
78+
79+
### 2b. `\value` / `@return` Tags — ✅ ALL PRESENT
80+
81+
All `.Rd` files (including internal `dot-*` functions) now have `\value` sections.
82+
83+
### 2c. `T`/`F` Misuse — ✅ NONE
84+
85+
No standalone `T` or `F` used as booleans in any R source file.
86+
87+
### 2d. `cat()` Usage — ✅ COMPLIANT
88+
89+
All `cat()` calls are in `print.*` S3 methods. No unconditional console output in non-print functions.
90+
91+
### 2e. `par()` / `options()` State — ✅ PROPERLY RESTORED
92+
93+
All `par()` modifications have corresponding `on.exit()` restoration:
94+
- `R/dclust.R` L183-184
95+
- `R/training_log.R` L300-301
96+
97+
No `options()` modifications in package code (only reads via `getOption()`).
98+
99+
### 2f. Forbidden Functions — ✅ NONE
100+
101+
No `Sys.setenv()`, `setwd()`, or `sink()` calls.
102+
103+
### 2g. URLs — ✅ ALL VALID
104+
105+
One `http://` URL (`yann.lecun.com/exdb/mnist/`) — this site does not support HTTPS. All other URLs use `https://`.
106+
107+
### 2h. Makevars — ✅ PORTABLE
108+
109+
- Uses relative include paths (`-I../inst/include/`)
110+
- Standard R build variables (`$(SHLIB_OPENMP_CXXFLAGS)`, `$(LAPACK_LIBS)`, etc.)
111+
- Windows: `-Wa,-mbig-obj` for large template code
112+
- No hardcoded paths or non-portable flags
113+
114+
### 2i. configure Script — ✅ POSIX sh
115+
116+
- Shebang: `#!/bin/sh` (not bash)
117+
- No bash-isms
118+
- Standard `[ ]` conditionals
119+
- Graceful CUDA fallback
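
The package's actual `configure` script is not part of this diff, so the following is only a minimal sketch of what a POSIX `sh` check with a graceful CUDA fallback can look like. The function name `detect_cuda` and the variable `HAVE_CUDA` are hypothetical; the sketch uses only `command -v` and `[ ]`, with no bash-specific syntax.

```sh
# Hypothetical sketch of optional CUDA detection in POSIX sh.
# This is not the package's configure script.

# detect_cuda: prints "yes" if nvcc is on PATH, otherwise "no".
detect_cuda() {
  # command -v is the POSIX-specified way to look up a program on PATH
  if command -v nvcc >/dev/null 2>&1; then
    echo "yes"
  else
    echo "no"
  fi
}

HAVE_CUDA=$(detect_cuda)

# Fall back to a CPU-only build instead of failing when CUDA is absent.
if [ "$HAVE_CUDA" = "yes" ]; then
  echo "CUDA toolkit found: GPU support can be enabled"
else
  echo "CUDA toolkit not found: continuing with a CPU-only build"
fi
```

The key design point is that a missing toolkit selects a different branch rather than producing a non-zero exit status, so installation still succeeds on machines without a GPU.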
120+
121+
### 2j. C++ Headers — ✅ STANDARD GUARDS
122+
123+
All `#ifndef`/`#define`/`#endif` include guards. No `#pragma once`.
124+
125+
---
126+
127+
## 3. Vignettes — ✅ PASS (11 total)
128+
129+
All vignettes use current API, proper `eval` guards for optional packages, and declared datasets.
130+
131+
---
132+
133+
## 4. C++ Code & Compilation — ✅ PASS (0 package warnings)
134+
135+
Only remaining compiler warnings come from RcppEigen/Eigen external headers (not actionable; tolerated by CRAN).
136+
137+
---
138+
139+
## 5. R Unit Tests — ✅ GOOD
140+
141+
82 test files, 1291 passing, 488 skipped (GPU + streaming), 0 failures.
142+
143+
---
144+
145+
## 6. DESCRIPTION & NAMESPACE — ✅ PASS
146+
147+
- `R (>= 4.1.0)` dependency
148+
- `Matrix` in `Depends:` (auto-attached for examples)
149+
- 97+ exported symbols properly registered
150+
- `SystemRequirements: CUDA Toolkit >= 11.0 (optional)`
151+
- License: `GPL (>= 3)` — standard CRAN format
152+
153+
---
154+
155+
## 7. Reverse Dependencies — ✅ PASS
156+
157+
| Package | Type | Impact |
158+
|---------|------|--------|
159+
| GeneNMF | imports `nmf()` | ✅ Compatible |
160+
| phytoclass | imports `nnls()` (old API) | ✅ Backward-compat shim + deprecation warning |
161+
| scater (Bioc) | runtime | ✅ No direct function calls |
162+
| miloR (Bioc) | LinkingTo | ✅ No R function imports |
163+
| CARDspa, flashier | Suggests | ✅ No breakage possible |
164+
165+
---
166+
167+
## 8. Outstanding Items (Optional — Not Blocking)
168+
169+
1. **Tarball size (~9.5 MB)**: Above 5 MB guideline. Justified in `cran-comments.md`.
170+
2. **`NEWS.md.bak` on disk**: Excluded from tarball via `.Rbuildignore`.
171+
3. **`http://` URL for MNIST source**: Site does not support HTTPS; HTTP is the only working option.
172+
173+
4. **TODO comments in streampress headers**: 3 informational TODO comments remain in `inst/include/streampress/` (bundled third-party library). All have working implementations; comments are optimization notes. Not flagged by R CMD check.
174+
175+
5. **`RcppML.Rcheck/` and `RcppML_1.0.0.tar.gz` in root**: Transient check/build artifacts. Already excluded from tarball. Can be deleted.
176+
177+
6. **Stale `^manuscript$` in `.Rbuildignore`**: No longer matches anything (superseded by `^GuidedNMFManuscript$`). Harmless but could be removed for tidiness.
178+
179+
7. **`training_logger()` example uses `\dontrun{}`**: Could be converted to a self-contained `\donttest{}` example with synthetic data, but current form is acceptable.
180+
181+
8. **`R (>= 3.5.0)` in DESCRIPTION**: Resolved. The native pipe `|>` used in some examples and vignettes requires R >= 4.1.0, and DESCRIPTION now declares `R (>= 4.1.0)` (see Sections 1 and 6).
182+
183+
---
184+
185+
## Summary Scorecard
186+
187+
| Area | Status | Notes |
188+
|------|--------|-------|
189+
| R CMD check | ✅ | 0 errors, 2 warnings (missing system tools, not package defects), 1 note |
190+
| Vignettes (11) | ✅ | All valid, proper eval guards |
191+
| Roxygen Docs | ✅ | All exports have @examples |
192+
| C++ Code | ✅ | 0 warnings from package code |
193+
| C++ Tests | ✅ | 31 tests, 2320 assertions, 0 failures |
194+
| R Tests | ✅ | 1291 pass, 0 fail |
195+
| GPU/CPU Matrix | ✅ | Full coverage including GPU randomized SVD dense |
196+
| Float Precision | ✅ | Correct fp32/fp64 strategy |
197+
| Dead Code | ✅ | Deprecated shims properly managed |
198+
| Build Hygiene | ✅ | Tarball ~9.5 MB, no leaked artifacts |
199+
| DESCRIPTION | ✅ | Complete and accurate |
200+
| NAMESPACE | ✅ | All exports registered |
201+
| NEWS.md | ✅ | Comprehensive changelog |
202+
| Reverse Deps | ✅ | Backward compat maintained |
203+
204+
**Overall**: Package is ready for CRAN submission. Update `cran-comments.md` to explain the tarball size and `SeuratData` NOTE.
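
The audit quotes the tarball size in several roundings, so a small conversion sketch helps keep the figures straight. The helpers `bytes_to_mb` and `bytes_to_mib` are illustrative names, not part of the package tooling; they convert the byte count reported above into decimal megabytes and binary mebibytes.

```bash
# Convert a byte count to decimal megabytes (1 MB = 1,000,000 bytes)
# and binary mebibytes (1 MiB = 1,048,576 bytes). Helper names are illustrative.
bytes_to_mb()  { awk -v b="$1" 'BEGIN { printf "%.2f\n", b / 1000000 }'; }
bytes_to_mib() { awk -v b="$1" 'BEGIN { printf "%.2f\n", b / 1048576 }'; }

# The tarball size reported in the audit.
bytes_to_mb 9506276    # 9.51
bytes_to_mib 9506276   # 9.07

# For a tarball on disk, the byte count could be obtained with:
#   wc -c < RcppML_1.0.0.tar.gz
```

Both results are well above the 5 MB guideline mentioned in the audit, which is why the size needs a justification in `cran-comments.md` whichever unit is used.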

DESCRIPTION

Lines changed: 10 additions & 3 deletions
@@ -21,11 +21,11 @@ Description: High-performance non-negative matrix factorization (NMF),
2121
described in DeBruine, Melcher, and Triche (2021)
2222
<doi:10.1101/2021.09.01.458620>.
2323
Depends:
24-
R (>= 3.5.0)
24+
R (>= 4.1.0),
25+
Matrix
2526
License: GPL (>= 3)
2627
Imports:
2728
Rcpp,
28-
Matrix,
2929
methods,
3030
stats,
3131
utils
@@ -38,12 +38,19 @@ Suggests:
3838
hdf5r,
3939
jsonlite,
4040
knitr,
41+
pheatmap,
4142
randomForest,
43+
RColorBrewer,
4244
rmarkdown,
4345
patchwork,
4446
plotly,
4547
scales,
46-
testthat (>= 3.0.0)
48+
Seurat,
49+
SeuratData,
50+
SeuratObject,
51+
testthat (>= 3.0.0),
52+
uwot,
53+
viridis
4754
VignetteBuilder: knitr
4855
Config/testthat/edition: 3
4956
LazyData: true
