Thoughts about integrating optimizations into the analysis stack #1

@LTLA

Thinking about integration with BiocSingular in particular.

I think the best option would be to write a separate package with an S4 generic that takes a matrix-like object and returns a BiocSingularParam object representing the "best" choice of algorithm (in terms of the speed/accuracy trade-off). This approach has several advantages:

  • Isolate the implementation (in BiocSingular) from decisions about the choice of algorithm. This makes BiocSingular easier to maintain, as I only have to implement the algorithms rather than explicitly choose between them.
  • Provide users with greater control - if they don't like the generated BiocSingularParam object, or if they know better (e.g., about their file system access speeds), they can just supply their own.
  • Simplify extensions for community-defined matrix representations, as anyone can write methods for the generic if they know that their representation is fast or slow to access.
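
To make the proposal concrete, here is a minimal sketch of what such a generic could look like. Everything in it is an assumption made for illustration: the generic name `chooseSVDParam`, the size threshold `small`, the `SlowDiskMatrix` class, and the `Demo*Param` classes, which are stand-ins for BiocSingular's real parameter classes so that the sketch runs with base R alone.

```r
library(methods)

# Stand-in parameter classes (hypothetical). A real package would return
# BiocSingular's own parameter objects instead of these.
setClass("DemoParam", contains = "VIRTUAL", slots = c(reason = "character"))
setClass("DemoExactParam", contains = "DemoParam")
setClass("DemoIrlbaParam", contains = "DemoParam")
setClass("DemoRandomParam", contains = "DemoParam")

# Hypothetical generic: inspect a matrix-like object and return a parameter
# object describing the suggested algorithm.
setGeneric("chooseSVDParam", function(x, ...) standardGeneric("chooseSVDParam"))

# Fallback for representations that have no dedicated method.
setMethod("chooseSVDParam", "ANY", function(x, ...) {
    new("DemoIrlbaParam", reason = "no specific method; generic default")
})

# Ordinary in-memory matrices: exact SVD when small, approximate otherwise.
# The threshold is an arbitrary placeholder, not a benchmarked value.
setMethod("chooseSVDParam", "matrix", function(x, ..., small = 1e5) {
    if (length(x) <= small) {
        new("DemoExactParam", reason = "small dense matrix")
    } else {
        new("DemoIrlbaParam", reason = "large dense matrix")
    }
})

# A community-defined representation (hypothetical) whose author knows that
# data access is slow, and so picks an algorithm accordingly.
setClass("SlowDiskMatrix", slots = c(path = "character"))
setMethod("chooseSVDParam", "SlowDiskMatrix", function(x, ...) {
    new("DemoRandomParam", reason = "expensive data access")
})

param <- chooseSVDParam(matrix(0, 10, 10))
class(param)
```

With real BiocSingular classes, the returned object would presumably be passed on via the `BSPARAM` argument of `runSVD()`, and a user who disagrees with the suggestion can simply construct a different parameter object and pass that instead.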

The downside of this strategy is that the choice of algorithm is not entirely transparent in the analysis stack, as the user has to actively call the new function that generates the BiocSingularParam object. But I would argue that the choice of algorithm would not have been transparent in the first place. For example, switching between IRLBA and RSVD can give results that differ beyond numerical precision, while switching between approximate and exact SVD changes the state of the random number generator, because the approximate algorithms consume random numbers and the exact one does not.
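
The point about the random seed can be illustrated with base R alone. The sketch below does not use IRLBA or RSVD; it substitutes a bare-bones randomized range finder (an assumption made purely for illustration) to show that any algorithm drawing random numbers shifts the RNG stream seen by downstream code, whereas an exact SVD does not.

```r
set.seed(42)
x <- matrix(rnorm(200), nrow = 20)

# Reference: the first uniform draw after seeding, with nothing in between.
set.seed(100)
baseline <- runif(1)

# Exact SVD is deterministic and does not draw from the RNG.
set.seed(100)
exact <- svd(x)
after_exact <- runif(1)

# Illustrative randomized approach: project onto a random test matrix,
# orthonormalize, then take the SVD of the smaller projected matrix.
# Drawing 'omega' consumes random numbers.
set.seed(100)
omega <- matrix(rnorm(ncol(x) * 5), ncol = 5)
q <- qr.Q(qr(x %*% omega))
approx <- svd(crossprod(q, x))
after_approx <- runif(1)

identical(after_exact, baseline)   # TRUE: RNG stream unchanged
identical(after_approx, baseline)  # FALSE: RNG stream has advanced
```

Any later step that relies on random numbers (clustering initialization, for instance) would therefore give different results depending on which SVD algorithm ran before it, even with the same seed.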

Plus, if you write a separate package, you can also put in functions to perform empirical benchmarking for people who are really interested in optimizing their SVD speeds. Those won't go into BiocSingular.
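
As a rough idea of what an empirical benchmarking function might look like, here is a small sketch in base R. The helper name `benchmarkSVD`, its arguments, and the two candidate implementations are all hypothetical; a real version would presumably time BiocSingular's algorithms on the user's actual matrix representation.

```r
# Hypothetical helper: run each candidate on the same input several times
# and report the median elapsed time in seconds.
benchmarkSVD <- function(x, k, candidates, reps = 3) {
    vapply(candidates, function(f) {
        times <- replicate(reps, system.time(f(x, k))[["elapsed"]])
        median(times)
    }, numeric(1))
}

# Two illustrative candidates that both return the top k singular values.
candidates <- list(
    # Full decomposition, keeping only k singular vectors.
    full = function(x, k) svd(x, nu = k, nv = k),
    # Eigendecomposition of the cross-product; cheap for tall matrices but
    # less accurate for small singular values.
    crossprod = function(x, k) {
        e <- eigen(crossprod(x), symmetric = TRUE)
        idx <- seq_len(k)
        list(d = sqrt(pmax(e$values[idx], 0)),
             v = e$vectors[, idx, drop = FALSE])
    }
)

set.seed(1)
x <- matrix(rnorm(2000 * 50), nrow = 2000)
timings <- benchmarkSVD(x, k = 5, candidates)

# Name of the fastest candidate on this machine and this input.
names(which.min(timings))
```

The winner depends on the machine, the matrix dimensions and the storage backend, which is the argument for measuring rather than hard-coding the choice.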
