[r][cpp] Add sparse = TRUE as an option for pseudobulk_matrix()
#268
base: main
Conversation
* Make it possible to return sparse matrices from `pseudobulk_matrix()`
Hi @ycli1995, thanks for the PR! It might be a week or two before we can fully process this, but just letting you know I've seen the PR and I'll discuss it with @immanuelazn. A few initial ideas:
Also, for the use-cases you have in mind, what would you anticipate for cell counts and pseudobulk sizes? If we add storage ordering requirements on the input, it would be possible to make the output an …
Hi @bnprks, thanks for your thoughtful response.
The memory usage for …
So sorry for the delay on this, I'll be taking over reviewing this. With the following results on sparse output: … And for the dense output: … Essentially, this shows that the difference in computation is pretty minimal, which is great! I'd actually be happy to set sparse as the default.

The main consideration I currently have is how fast a matrix-matrix multiplication approach to pseudobulking would be. We should be able to beat that as a baseline if we're creating specialized code for this function; otherwise, we might be unnecessarily adding some complicated tech debt to the repo. A matrix-matrix multiplication should be able to output everything except the variance pseudobulk, so creating these and A/B testing against …

As for the implementation itself, I felt like I had to refer to the Eigen docs more than I would want to for your implementation. You copied my comments, which helped align what steps did what in relation to the previous dense implementation, but we need additional explanation for why there are three new structs to support this. Overall, more explanation for your implementation as a whole would help, pretending that the reader is not familiar with the Eigen sparse methods. I'll leave some comments across today and tomorrow on spots that I think can be better documented.

Also, you can probably get closer to true speeds if you turn on cpp optimizations during devtools builds. I don't think your benchmarks portray the actual times we expect a user to be spending on these functions, but I could be wrong.
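For context, a rough sketch of the matrix-matrix multiplication baseline mentioned above, written against an in-memory `dgCMatrix` with the Matrix package rather than against BPCells itself; `counts` and `cell_groups` are assumed inputs and the function name is illustrative.

```r
library(Matrix)

# Rough sketch of the matrix-matrix multiplication baseline, not the BPCells code path:
# `counts` is assumed to be a genes x cells dgCMatrix and `cell_groups` a factor
# assigning each cell (column) to a pseudobulk group; every group is assumed non-empty.
pseudobulk_by_multiplication <- function(counts, cell_groups) {
  cell_groups <- as.factor(cell_groups)
  # cells x groups indicator matrix: entry (i, j) is 1 when cell i is in group j
  indicator <- sparseMatrix(
    i = seq_along(cell_groups),
    j = as.integer(cell_groups),
    x = 1,
    dims = c(length(cell_groups), nlevels(cell_groups)),
    dimnames = list(NULL, levels(cell_groups))
  )
  cells_per_group <- as.numeric(table(cell_groups))

  sums     <- counts %*% indicator                  # genes x groups per-group sums
  nonzeros <- ((counts != 0) * 1) %*% indicator     # genes x groups per-group nonzero counts
  means    <- sums %*% Diagonal(x = 1 / cells_per_group)
  colnames(means) <- levels(cell_groups)

  # Variance would still need a dedicated pass, matching the comment above
  list(nonzeros = nonzeros, sum = sums, mean = means)
}
```

Since this covers `nonzeros`, `sum`, and `mean` with a couple of sparse products, it seems like a reasonable baseline to A/B test the specialized implementation against.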
    // [[Rcpp::export]]
    List pseudobulk_matrix_sparse_cpp(SEXP mat,
This is probably unnecessary; you can just have a conditional on the original `pseudobulk_matrix_cpp()` for sparsity.
    namespace BPCells {
    struct PseudobulkStatsSparse {
Docstrings would be helpful: why do we need all three of these new structs? I would even put `PseudobulkStatsTriplet` and `PseudobulkStatsTemp` into just the cpp, since they'll only be used statically in the pseudobulking function.
    // transpose == false
    ncells += 1;
    if (ncells < group_count_vec[g_idx]) continue;
    for (uint32_t i = 0; i < tmp.var.size(); i++) {
It's unclear why we would do this at first glance (even though I wrote the original code for the sparsity corrections). That means we need some comments!
    }
    struct PseudobulkStatsSparse res;
    struct PseudobulkStatsTriplet trip;
    struct PseudobulkStatsTemp tmp;
More comments on what these are used for
    #'
    #' Current options are: `nonzeros`, `sum`, `mean`, `variance`.
    #' @param threads (integer) Number of threads to use.
    #' @param sparse (logical) Whether or not to return sparse matrices.
Suggested change:

    - #' @param sparse (logical) Whether or not to return sparse matrices.
    + #' @param sparse (logical) Whether to calculate outputs as sparse matrices of type `dgCMatrix`, rather than dense matrices of type `matrix`
    res <- pseudobulk_matrix_cpp(it, cell_groups = as.integer(cell_groups) - 1, method = method, transpose = mat@transpose)
    if (sparse) {
      new.order <- order(cell_groups)
I'm a little hesitant about this approach because shuffling actually isn't an O(1) operation per index (total O(n)). Due to the way subsetting logic works in BPCells, this is an O(n log n) operation on the matrix. Any clue how much having `cell_groups` pre-ordered changes the speed for the cpp implementation?
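For what it's worth, one rough way to isolate that cost; this is a sketch under the assumptions noted in the comments, not something benchmarked here.

```r
# Rough timing sketch, assuming `mat` is a BPCells matrix, `cell_groups` is a factor over
# its columns, and pseudobulk_matrix() is called with its matrix argument positionally.
# Dense output doesn't require ordered groups, so the difference between the two calls
# mostly reflects what the O(n log n) column permutation adds on top of a full pass.
new_order <- order(cell_groups)

system.time(pseudobulk_matrix(mat, cell_groups, method = "mean"))
system.time(pseudobulk_matrix(mat[, new_order], cell_groups[new_order], method = "mean"))
```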
      new.order <- order(cell_groups)
      cell_groups <- cell_groups[new.order]
      it <- mat[, new.order] %>%
        convert_matrix_type("double") %>%
If you make both the sparse and dense cases use the same cpp adapter function, you can deduplicate lines 145-149 and 151-155 and take it out of the conditional.
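A sketch of the shape that deduplication could take; everything here is illustrative rather than the actual contents of `singlecell_utils.R`, and the extra `sparse` argument on the cpp adapter is an assumption tied to the earlier comment about handling sparsity with a conditional.

```r
# Illustrative only: reorder up front when sparse output is requested, then run one
# shared pipeline and a single cpp adapter call instead of duplicated branches.
if (sparse) {
  new.order   <- order(cell_groups)
  mat         <- mat[, new.order]
  cell_groups <- cell_groups[new.order]
}
it <- mat %>%
  convert_matrix_type("double")   # ...followed by the rest of the existing pipeline
res <- pseudobulk_matrix_cpp(
  it,
  cell_groups = as.integer(cell_groups) - 1,
  method = method,
  transpose = mat@transpose,
  sparse = sparse   # assumes the cpp adapter gains a sparsity flag
)
```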
Hi @immanuelazn, sorry for the delayed reply, and thanks a lot for reviewing this PR. Unfortunately, with my current workload at the company I'm working for, I won't be able to follow up on this development on GitHub in the near term. If this PR needs to be closed for now to avoid leaving things hanging, I completely understand. I really appreciate your review. Maybe in the future, when I have more spare time, I can revisit and contribute to BPCells again. Thanks for your patience and understanding! Best regards,
Hi @ycli1995, no problem. I really appreciate the work you put into this. Do you mind if I take over development then?
Hi @immanuelazn, that would be perfect! I'd be more than happy for you to take over the development. Really glad to see … Best,
Hi, @bnprks, I added an option `sparse = TRUE/FALSE` to `pseudobulk_matrix()`.

Motivation

The previous `pseudobulk_matrix` only returns dense matrices for the aggregated data. That is fine when the number of cell groups is small. However, when `cell_groups` is large (e.g., due to many meta-cells), the resulting pseudobulk matrix can become extremely wide. In such cases, returning a dense matrix can lead to significant memory overhead, even though most entries are zeros. This PR addresses that issue by allowing `pseudobulk_matrix()` to return a sparse matrix, greatly improving memory efficiency for large and sparse groupings.

Changes Made

* Added `sparse = FALSE` (default) to `pseudobulk_matrix()` in `r/R/singlecell_utils.R`. When it is set to `TRUE`, the resulting matrices become `dgCMatrix`. Previous code depending on the old `pseudobulk_matrix` should not be affected.
* Added `r/src/bpcells-cpp/matrixUtils/PseudobulkSparse.h` and `.cpp` for the implementation. The general ideas and calculation methods were copied from `Pseudobulk.cpp`. Each resulting sparse matrix (`Eigen::SparseMatrix<double>`) is constructed with `.setFromTriplets()`. The internal trick is that `pseudobulk_matrix_sparse()` must receive an ordered `cell_groups` vector, and the input `mat` should also be ordered according to `cell_groups`. The ordering is done by the outside caller, i.e., the R function `pseudobulk_matrix()`.
* Added tests in `r/tests/testthat/test-singlecell_utils.R`.

Example
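A minimal usage sketch (not the original example from this PR): `counts_mat` and `meta_cell_ids` are placeholder names, and only `cell_groups`, `method`, `threads`, and `sparse` are parameter names confirmed elsewhere in this thread, so the matrix is passed positionally.

```r
library(BPCells)

# Placeholder inputs (not from the PR): `counts_mat` is a BPCells counts matrix and
# `meta_cell_ids` is a factor assigning each column (cell) to one of many meta-cells.
pb_dense <- pseudobulk_matrix(counts_mat, cell_groups = meta_cell_ids, method = "mean")

# With sparse = TRUE the returned matrices are dgCMatrix instead of dense base matrices
pb_sparse <- pseudobulk_matrix(counts_mat, cell_groups = meta_cell_ids,
                               method = "mean", sparse = TRUE)
```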
This operation may take a long time, but it avoids the RAM overhead of constructing a dense matrix.
Limitations and Request for Feedback
To maintain compatibility with the original `pseudobulk_matrix()`, which performs a single-pass traversal to compute `non_zeros`, `sum`, `mean`, and `variance` simultaneously, the current sparse implementation follows the same strategy. As a result, even if the user requests only one statistic (e.g., `mean`), all others are still computed and returned internally, including `non_zeros`. This design ensures a consistent output structure and avoids branching logic, but may introduce unnecessary computational overhead in simple use cases.

I'm wondering whether it would be worthwhile to refactor the logic so that only the requested statistics are computed when `sparse = TRUE`. This could reduce runtime and memory usage, especially when working with very large datasets and lightweight operations. Any suggestions or better ideas for this? Or should we just keep the original strategy for now?