
Cell atlas #448

Open
jimmymathews wants to merge 8 commits into main from mlops_experiments_allen

@jimmymathews
Collaborator

This is my proposal for finishing the backend portion of smprofiler-web#224 (for @wooferclaw).

  1. Add the main new code (train_atlas_models.py) to the smprofiler codebase, modularizing it along the way: group related functions together and put function/class definitions in a logical order. At minimum, split it into ordinary library code (a separate file) and a CLI under smprofiler/<module>/scripts/....
  2. Manually rewrite the inline documentation to make sure it is really needed and accurate (it looks like some automatic tools were used to generate this). I can be more specific about this if that would help.
  3. In the future the HGNC normalization may be helpful, but for the current purpose it is not needed because we have a complete manual mapping between the channel names as used in our datasets and the channel names appearing in the cell atlas. So this functionality should perhaps be split off as a separate module (file in the smprofiler codebase).
  4. Refactor the initial portion of the training to accept its input from the Parquet-formatted table file generated by this aggregation script in smprofiler-data. (Eventually I'll probably bring this functionality into smprofiler as well, for convenience.)
  5. Include a tiny test of the training using a tiny subset of the real atlas data.
  6. Maybe deprecate the ability to load the channel annotations from file. That way we only get them from the API, which helps make sure the source code is never used with annotations that don't match the live version.
  7. Consider deprecating or justifying some of the additional "utility" functions like _silence_fd. Maybe we can just use print. There is also a logging pattern used throughout the smprofiler codebase:
from smprofiler.standalone_utilities.log_formats import colorized_logger
logger = colorized_logger(__name__)
...
logger.info('...')
logger.error('...')
logger.warning('...')
  8. Ensure that every model architecture under consideration provides a non-trivial variance/standard deviation estimate during inference. By non-trivial I mean that it should actually depend on the input x provided for inference, not just on the global standard deviation of the predictions over all the training data (or similar). The only architectures I'm sure provide this are Bayesian ridge and Gaussian process regression. However, if we can make it work well, a bootstrapping procedure may work here for an arbitrary architecture, provided we weight the training data samples by their similarity to the provided input x.
  9. Include some basic functionality for storing the models once trained. I think a new database table would be fine for this, as the models are quite small, and we could easily store important metadata this way. This could allow, for example, multiple models trained at different times (older/newer versions). The new table could be along the lines of:
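CREATE TABLE atlas_model (  -- placeholder table name
    onnx_model BYTEA,
    study VARCHAR,
    channel_predicted VARCHAR,  -- type assumed to match chemical_species(id)
    training_time_minutes NUMERIC,
    created TIMESTAMP WITH TIME ZONE,
    size_bytes INTEGER,
    architecture_type VARCHAR,
    FOREIGN KEY (channel_predicted) REFERENCES chemical_species(id)
)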
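To make the storage idea concrete, here is a minimal sketch of writing a model into such a table and reading back the newest version for a given study and channel. It is only an illustration: it uses an in-memory SQLite database as a stand-in for PostgreSQL (BLOB instead of BYTEA), and the table name `atlas_model`, the helper names `store_model` and `get_latest_model`, the study name, and the channel name are all placeholders, not existing smprofiler code.

```python
import sqlite3
from datetime import datetime, timezone

# SQLite stand-in for the proposed table; BLOB plays the role of BYTEA.
# The table name "atlas_model" is a placeholder.
SCHEMA = """
CREATE TABLE atlas_model (
    onnx_model BLOB,
    study TEXT,
    channel_predicted TEXT,
    training_time_minutes REAL,
    created TEXT,
    size_bytes INTEGER,
    architecture_type TEXT
)
"""


def store_model(connection, model_bytes, study, channel, minutes, architecture, created=None):
    """Insert one serialized model together with its metadata (hypothetical helper)."""
    created = created or datetime.now(timezone.utc)
    connection.execute(
        "INSERT INTO atlas_model VALUES (?, ?, ?, ?, ?, ?, ?)",
        (model_bytes, study, channel, minutes, created.isoformat(), len(model_bytes), architecture),
    )
    connection.commit()


def get_latest_model(connection, study, channel):
    """Return the most recently created model bytes, or None if there is no match."""
    row = connection.execute(
        "SELECT onnx_model FROM atlas_model "
        "WHERE study = ? AND channel_predicted = ? "
        "ORDER BY created DESC LIMIT 1",
        (study, channel),
    ).fetchone()
    return None if row is None else row[0]


connection = sqlite3.connect(":memory:")
connection.execute(SCHEMA)

# Two versions of a model for the same (made-up) study and channel.
store_model(connection, b"old-model", "Example study", "CD3", 1.5, "bayesian_ridge",
            created=datetime(2024, 1, 1, tzinfo=timezone.utc))
store_model(connection, b"new-model", "Example study", "CD3", 2.0, "bayesian_ridge",
            created=datetime(2025, 1, 1, tzinfo=timezone.utc))

latest = get_latest_model(connection, "Example study", "CD3")
print(latest)
```

Keeping older rows rather than overwriting them is what allows multiple model versions to coexist; the `created` column alone decides which one is served.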
  10. Add some separate entrypoints into the functionality, documented in terms of usage rather than implementation. For example: a get_model (or similarly named) API endpoint, and a Python model usage (inference) snippet that is exercised in one of the small tests. A small JS usage snippet would also be useful.
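As a possible starting point for the Python inference snippet, and to illustrate the input-dependent uncertainty requirement above, here is a minimal sketch using scikit-learn's `BayesianRidge` and `GaussianProcessRegressor`, both of which accept `return_std=True` in `predict`. The data is synthetic, the kernel choice is arbitrary, and the Gaussian process hyperparameters are held fixed (`optimizer=None`) only to keep the sketch deterministic. None of this is taken from train_atlas_models.py.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.linear_model import BayesianRidge

# Synthetic stand-in for the atlas data: 200 cells, 3 input channels, 1 target channel.
rng = np.random.default_rng(0)
X_train = rng.uniform(-1.0, 1.0, size=(200, 3))
y_train = X_train @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

models = {
    "bayesian_ridge": BayesianRidge(),
    "gaussian_process": GaussianProcessRegressor(
        kernel=RBF(length_scale=1.0) + WhiteKernel(noise_level=0.01),
        optimizer=None,  # keep the kernel hyperparameters fixed (assumption for this sketch)
    ),
}

# One query inside the training range and one far outside it.
X_query = np.array([[0.0, 0.0, 0.0], [10.0, 10.0, 10.0]])

predictive_std = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # return_std=True yields a per-input standard deviation alongside the mean.
    mean, std = model.predict(X_query, return_std=True)
    predictive_std[name] = std
    print(name, "mean:", mean, "std:", std)
```

For both models the standard deviation reported for the far-away query is larger than for the query inside the training range, which is the kind of dependence on x the requirement asks for. A check along these lines could serve as one of the small tests.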
