Cell atlas by jimmymathews · Pull Request #448 · nadeemlab/smprofiler

jimmymathews · 2026-05-26T19:19:23Z

This is my proposal for finishing the backend portion of smprofiler-web#224 (for @wooferclaw).

Add the main new code (train_atlas_models.py) to the smprofiler codebase, probably also modularizing it: group related functions together and place function/class definitions in a logical order. At a minimum, split it into a portion (separate file) that is ordinary library code and a portion that is the CLI under smprofiler/<module>/scripts/....
Manually rewrite the inline documentation to make sure it is really needed and accurate (it looks like some automatic tools were used to generate this). I can be more specific about this if that would help.
In the future the HGNC normalization may be helpful, but for the current purpose it is not needed because we have a complete manual mapping between the channel names as used in our datasets and the channel names appearing in the cell atlas. So this functionality should perhaps be split off as a separate module (file in the smprofiler codebase).
Refactor the initial portion of the training to accept its input from the Parquet-formatted table file generated by this aggregation script in smprofiler-data. (Eventually I'll probably bring this functionality into smprofiler as well, for convenience.)
Include a tiny test of the training using a tiny subset of the real atlas data.
Maybe deprecate the ability to load the channel annotations from file. That way we only get it from the API, to help make sure we don't use mismatched source code vs. live versions.
Consider deprecating or justifying some of the additional "utility" functions like _silence_fd. Maybe we can just use print. There is also a logging pattern used throughout the smprofiler codebase:

from smprofiler.standalone_utilities.log_formats import colorized_logger
logger = colorized_logger(__name__)
...
logger.info('...')
logger.error('...')
logger.warning('...')

Ensure that every model architecture under consideration provides a non-trivial variance/standard deviation estimate during inference. By non-trivial I mean that the estimate should actually depend on the input x provided for inference, not just on the global standard deviation of the predictions over all the training data (or similar). The only architectures I'm sure provide this are Bayesian ridge and Gaussian process regression. However, if we can make it work well, a bootstrapping procedure may work here for an arbitrary architecture, provided we weight the training data samples by their similarity to the provided input x.
Include some basic functionality for storing the models once trained. I think a new database table would be fine for this, as the models are quite small, and we could easily store important metadata this way. This could allow, for example, multiple models trained at different times (older/newer versions). The new table could be along the lines of:

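CREATE TABLE atlas_model (  -- table name is a placeholder
    onnx_model BYTEA,
    study VARCHAR,
    channel_predicted VARCHAR,  -- type should match chemical_species(id)
    training_time_minutes NUMERIC,
    created TIMESTAMP WITH TIME ZONE,
    size_bytes INTEGER,
    architecture_type VARCHAR,
    FOREIGN KEY (channel_predicted) REFERENCES chemical_species(id)
);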
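As a rough illustration of how such a table could be used to store and fetch models, here is a minimal sketch. It makes several assumptions that are not part of this PR: an in-memory SQLite database stands in for the real PostgreSQL database (so BYTEA becomes BLOB and the foreign key is omitted), the table name atlas_model and the helper names store_model and get_model are hypothetical, and "latest by created timestamp" is only one possible policy for choosing among multiple stored versions.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema adapted from the table sketch above.
# SQLite types are used only so that this example is self-contained.
SCHEMA = """
CREATE TABLE atlas_model (
    onnx_model BLOB,
    study TEXT,
    channel_predicted TEXT,
    training_time_minutes REAL,
    created TEXT,
    size_bytes INTEGER,
    architecture_type TEXT
)
"""


def store_model(conn, model_bytes, study, channel, minutes, architecture, created=None):
    """Insert one serialized model together with its metadata."""
    # Default to the current UTC time, stored as an ISO 8601 string.
    created = created or datetime.now(timezone.utc).isoformat()
    conn.execute(
        "INSERT INTO atlas_model VALUES (?, ?, ?, ?, ?, ?, ?)",
        (model_bytes, study, channel, minutes, created, len(model_bytes), architecture),
    )
    conn.commit()


def get_model(conn, study, channel):
    """Return the most recently created model bytes for (study, channel), or None."""
    row = conn.execute(
        "SELECT onnx_model FROM atlas_model "
        "WHERE study = ? AND channel_predicted = ? "
        "ORDER BY created DESC LIMIT 1",
        (study, channel),
    ).fetchone()
    return None if row is None else bytes(row[0])
```

Keeping every trained version as its own row, rather than overwriting, is what would allow older and newer models to coexist as described above.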

Add some separate entrypoints into the functionality that are documented for function usage, not implementation. Things like get_model (or similar-named) API endpoint, a model usage (inference) snippet (Python, to be tested in one of the small tests). A small JS usage snippet would also be useful.
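To make the uncertainty requirement stated earlier concrete (a standard deviation that depends on the inference input x), here is a small self-contained sketch of Bayesian linear regression with one feature. It is not the code in train_atlas_models.py. The function names are hypothetical, and the prior precision alpha and noise precision beta are fixed by hand here, whereas a Bayesian ridge implementation would typically estimate them from the data.

```python
import math


def fit_bayesian_linear(xs, ys, alpha=1.0, beta=25.0):
    """Posterior over the weights of y = w0 + w1 * x.

    alpha: precision of the zero-mean Gaussian prior on the weights (assumed).
    beta: precision of the Gaussian observation noise (assumed known).
    """
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Posterior precision A = alpha * I + beta * Phi^T Phi, where Phi has rows (1, x).
    a11 = alpha + beta * n
    a12 = beta * sx
    a22 = alpha + beta * sxx
    det = a11 * a22 - a12 * a12
    # Posterior covariance S = A^{-1} (closed-form 2x2 inverse).
    s11, s12, s22 = a22 / det, -a12 / det, a11 / det
    # Posterior mean m = beta * S * Phi^T y.
    m0 = beta * (s11 * sy + s12 * sxy)
    m1 = beta * (s12 * sy + s22 * sxy)
    return (m0, m1), (s11, s12, s22), beta


def predict(model, x):
    """Return the predictive mean and standard deviation at input x."""
    (m0, m1), (s11, s12, s22), beta = model
    mean = m0 + m1 * x
    # Predictive variance = noise variance + phi(x)^T S phi(x), with phi(x) = (1, x).
    var = 1.0 / beta + s11 + 2.0 * s12 * x + s22 * x * x
    return mean, math.sqrt(var)


if __name__ == "__main__":
    xs = [i / 10 for i in range(11)]   # training inputs in [0, 1]
    ys = [1 + 2 * x for x in xs]       # noiseless line, for illustration only
    model = fit_bayesian_linear(xs, ys)
    for x in (0.5, 5.0):
        mean, std = predict(model, x)
        print(f"x={x}: mean={mean:.3f}, std={std:.3f}")
```

The predictive standard deviation grows as x moves away from the training inputs, which is the input-dependent behavior asked for. A model that reports only the global standard deviation of its training residuals would return the same value at both points.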

Co-authored-by: Copilot <copilot@github.com>

Grigory Frantsuzov and others added 8 commits April 1, 2026 01:24

b9bcacc  Minimal implementation
721fe5d  updated notebook
e344d3e  onnx files
317b146  fixed atlas path
67fa4d1  Merge branch 'main' into mlops_experiments_allen
4eb14b6  updated model training
95b21e2  training script update
e28b587  updated training script


jimmymathews wants to merge 8 commits into main from mlops_experiments_allen
