[r][cpp] Add sparse = TRUE as an option for pseudobulk_matrix()
#268
base: main
Conversation
* Make it possible to return sparse matrices from `pseudobulk_matrix()`
Hi @ycli1995, thanks for the PR! It might be a week or two before we can fully process this, but just letting you know I've seen the PR and I'll discuss it with @immanuelazn. A few initial ideas:
Also, for the use-cases you have in mind, what would you anticipate for cell counts and pseudobulk sizes? If we add storage ordering requirements on the input, it would be possible to make the output an …
Hi @bnprks, thanks for your thoughtful response.
The memory usage for …
So sorry for the delay on this, I'll be taking over reviewing this. With the following results on sparse output: … And for the dense output: … Essentially, this shows that the difference in computation is pretty minimal, which is great! I'd actually be happy to set sparse as the default.

The main consideration I currently have is how fast a matrix-matrix multiplication approach to pseudobulking would be. We should be able to beat that as a baseline if we're creating specialized code for this function; otherwise, we might be unnecessarily adding some complicated tech debt to the repo. A matrix-matrix multiplication should be able to output everything except the variance pseudobulk, so creating these and A/B testing against …

As for the implementation itself, I felt like I had to refer to the Eigen docs more than I would want to for your implementation. You copied my comments, which helped align what steps did what in relation to the previous dense implementation, but we need additional explanation for why there are three new structs to support this. Overall, more explanation for your implementation as a whole would help, pretending that the reader is not familiar with the Eigen sparse methods. I'll leave some comments across today and tomorrow on spots that I think can be better documented.

Also, you can probably get closer to true speeds if you turn on cpp optimizations during devtools builds. I don't think your benchmarks portray the actual times we expect a user to be spending on these functions, but I could be wrong.
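For context, a rough sketch of the matrix-matrix multiplication baseline mentioned above, written against an in-memory `dgCMatrix` with the Matrix package rather than against BPCells itself; `counts` and `cell_groups` are assumed inputs and the function name is illustrative.

```r
library(Matrix)

# Rough sketch of the matrix-matrix multiplication baseline, not the BPCells code path:
# `counts` is assumed to be a genes x cells dgCMatrix and `cell_groups` a factor
# assigning each cell (column) to a pseudobulk group; every group is assumed non-empty.
pseudobulk_by_multiplication <- function(counts, cell_groups) {
  cell_groups <- as.factor(cell_groups)
  # cells x groups indicator matrix: entry (i, j) is 1 when cell i is in group j
  indicator <- sparseMatrix(
    i = seq_along(cell_groups),
    j = as.integer(cell_groups),
    x = 1,
    dims = c(length(cell_groups), nlevels(cell_groups)),
    dimnames = list(NULL, levels(cell_groups))
  )
  cells_per_group <- as.numeric(table(cell_groups))

  sums     <- counts %*% indicator                  # genes x groups per-group sums
  nonzeros <- ((counts != 0) * 1) %*% indicator     # genes x groups per-group nonzero counts
  means    <- sums %*% Diagonal(x = 1 / cells_per_group)
  colnames(means) <- levels(cell_groups)

  # Variance would still need a dedicated pass, matching the comment above
  list(nonzeros = nonzeros, sum = sums, mean = means)
}
```

Since this covers `nonzeros`, `sum`, and `mean` with a couple of sparse products, it seems like a reasonable baseline to A/B test the specialized implementation against.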
    // [[Rcpp::export]]
    List pseudobulk_matrix_sparse_cpp(SEXP mat,
This is probably unnecessary; you can just have a conditional on the original `pseudobulk_matrix_cpp()` for sparsity.
    namespace BPCells {
    struct PseudobulkStatsSparse {
Docstrings would be helpful: why do we need all three of these new structs? I would even put `PseudobulkStatsTriplet` and `PseudobulkStatsTemp` into just the cpp, since they'll only be used statically in the pseudobulking function.
    // transpose == false
    ncells += 1;
    if (ncells < group_count_vec[g_idx]) continue;
    for (uint32_t i = 0; i < tmp.var.size(); i++) {
It's unclear why we would do this at first glance (even though I wrote the original code for the sparsity corrections). That means we need some comments!
    }
    struct PseudobulkStatsSparse res;
    struct PseudobulkStatsTriplet trip;
    struct PseudobulkStatsTemp tmp;
More comments on what these are used for
    #'
    #' Current options are: `nonzeros`, `sum`, `mean`, `variance`.
    #' @param threads (integer) Number of threads to use.
    #' @param sparse (logical) Whether or not to return sparse matrices.
Suggested change:

    - #' @param sparse (logical) Whether or not to return sparse matrices.
    + #' @param sparse (logical) Whether to calculate outputs as sparse matrices of type `dgCMatrix`, rather than dense matrices of type `matrix`
    res <- pseudobulk_matrix_cpp(it, cell_groups = as.integer(cell_groups) - 1, method = method, transpose = mat@transpose)
    if (sparse) {
      new.order <- order(cell_groups)
I'm a little hesitant about this approach because shuffling actually isn't an O(1) operation per index (total O(n)). Due to the way subsetting logic works in BPCells, this is an O(n log n) operation on the matrix. Any clue how much having `cell_groups` pre-ordered changes the speed for the cpp implementation?
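For what it's worth, one rough way to isolate that cost; this is a sketch under the assumptions noted in the comments, not something benchmarked here.

```r
# Rough timing sketch, assuming `mat` is a BPCells matrix, `cell_groups` is a factor over
# its columns, and pseudobulk_matrix() is called with its matrix argument positionally.
# Dense output doesn't require ordered groups, so the difference between the two calls
# mostly reflects what the O(n log n) column permutation adds on top of a full pass.
new_order <- order(cell_groups)

system.time(pseudobulk_matrix(mat, cell_groups, method = "mean"))
system.time(pseudobulk_matrix(mat[, new_order], cell_groups[new_order], method = "mean"))
```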
      new.order <- order(cell_groups)
      cell_groups <- cell_groups[new.order]
      it <- mat[, new.order] %>%
        convert_matrix_type("double") %>%
If you make both the sparse and dense cases use the same cpp adapter function, you can deduplicate lines 145-149 and 151-155 and take it out of the conditional.
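A sketch of the shape that deduplication could take; everything here is illustrative rather than the actual contents of `singlecell_utils.R`, and the extra `sparse` argument on the cpp adapter is an assumption tied to the earlier comment about handling sparsity with a conditional.

```r
# Illustrative only: reorder up front when sparse output is requested, then run one
# shared pipeline and a single cpp adapter call instead of duplicated branches.
if (sparse) {
  new.order   <- order(cell_groups)
  mat         <- mat[, new.order]
  cell_groups <- cell_groups[new.order]
}
it <- mat %>%
  convert_matrix_type("double")   # ...followed by the rest of the existing pipeline
res <- pseudobulk_matrix_cpp(
  it,
  cell_groups = as.integer(cell_groups) - 1,
  method = method,
  transpose = mat@transpose,
  sparse = sparse   # assumes the cpp adapter gains a sparsity flag
)
```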
Hi @immanuelazn, sorry for the delayed reply, and thanks a lot for reviewing this PR. Unfortunately, with my current workload at the company I'm working for, I won't be able to follow up on this development on GitHub in the near term. If this PR needs to be closed for now to avoid leaving things hanging, I completely understand. I really appreciate your review. Maybe in the future, when I have more spare time, I can revisit and contribute to BPCells again. Thanks for your patience and understanding! Best regards,
Hi @ycli1995, no problem. I really appreciate the work you put into this. Do you mind if I take over development then?
Hi @immanuelazn, that would be perfect! I'd be more than happy for you to take over the development. Really glad to see … Best,
Hi, @bnprks, I added an option `sparse = TRUE/FALSE` to `pseudobulk_matrix()`.

Motivation

The previous `pseudobulk_matrix` only returns dense matrices for the aggregated data. That is fine when the number of cell groups is small. However, when `cell_groups` is large (e.g., due to many meta-cells), the resulting pseudobulk matrix can become extremely wide. In such cases, returning a dense matrix can lead to significant memory overhead, even though most entries are zeros. This PR addresses that issue by allowing `pseudobulk_matrix()` to return a sparse matrix, greatly improving memory efficiency for large and sparse groupings.

Changes Made

* Added `sparse = FALSE` (default) to `pseudobulk_matrix()` in `r/R/singlecell_utils.R`. When it is set to `TRUE`, the resulting matrices become `dgCMatrix`. Previous code depending on the old `pseudobulk_matrix` should not be affected.
* Added `r/src/bpcells-cpp/matrixUtils/PseudobulkSparse.h` and `.cpp` for the implementation. The general ideas and calculation methods were copied from `Pseudobulk.cpp`. Each resulting sparse matrix (`Eigen::SparseMatrix<double>`) is constructed with `.setFromTriplets()`. The internal trick is that `pseudobulk_matrix_sparse()` must receive an ordered `cell_groups` vector, and the input `mat` should also be ordered according to `cell_groups`. The ordering is done by the outside caller, i.e., the R function `pseudobulk_matrix()`.
* Added tests in `r/tests/testthat/test-singlecell_utils.R`.

Example
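A minimal usage sketch (not the original example from this PR): `counts_mat` and `meta_cell_ids` are placeholder names, and only `cell_groups`, `method`, `threads`, and `sparse` are parameter names confirmed elsewhere in this thread, so the matrix is passed positionally.

```r
library(BPCells)

# Placeholder inputs (not from the PR): `counts_mat` is a BPCells counts matrix and
# `meta_cell_ids` is a factor assigning each column (cell) to one of many meta-cells.
pb_dense <- pseudobulk_matrix(counts_mat, cell_groups = meta_cell_ids, method = "mean")

# With sparse = TRUE the returned matrices are dgCMatrix instead of dense base matrices
pb_sparse <- pseudobulk_matrix(counts_mat, cell_groups = meta_cell_ids,
                               method = "mean", sparse = TRUE)
```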
This operation may take a long time, but it avoids the RAM overhead of constructing a dense matrix.
Limitations and Request for Feedback
To maintain compatibility with the original `pseudobulk_matrix()`, which performs a single-pass traversal to compute `non_zeros`, `sum`, `mean`, and `variance` simultaneously, the current sparse implementation follows the same strategy. As a result, even if the user requests only one statistic (e.g., `mean`), all others are still computed and returned internally, including `non_zeros`. This design ensures a consistent output structure and avoids branching logic, but may introduce unnecessary computational overhead in simple use cases.

I'm wondering whether it would be worthwhile to refactor the logic so that only the requested statistics are computed when `sparse = TRUE`. This could reduce runtime and memory usage, especially when working with very large datasets and lightweight operations. Any suggestions or better ideas for this? Or should we just keep the original strategy for now?