Better handling of custom dataset loaders #2

@jproctor

I’m not 100% sold on the MHC-specific loaders being baked into the repo the way they are. They could move into a subdirectory/module within this repo, or they could move to a separate repo.

In custom_datasets.py, load_mhc_libs() relies on the existence of prepro_channels.npy in the data directory. It looks like load_mars_big() needs ccs_channels.npy too. Regardless of whether they stay in this repo or move, they shouldn’t rely on the existence of otherwise undocumented data files. The question is whether they belong with the data (in a documented way) or the loader.
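Wherever the files end up, one way to stop depending on them silently is to have the loader check for them up front and fail with a message that names the missing file. This is a minimal sketch, not the current code: `require_supporting_file` is a hypothetical helper, and the only names taken from the repo are the `.npy` filenames mentioned above.

```python
from pathlib import Path


def require_supporting_file(data_dir, filename):
    """Return the path to a supporting data file, or fail loudly if it is absent.

    Hypothetical helper: it only makes the dependency explicit; it does not
    decide whether the file belongs with the data or with the loader.
    """
    path = Path(data_dir) / filename
    if not path.is_file():
        raise FileNotFoundError(
            f"Loader requires supporting file '{filename}' in '{data_dir}', "
            "but it was not found."
        )
    return path
```

A loader such as load_mhc_libs() could then call something like `np.load(require_supporting_file(data_dir, "prepro_channels.npy"))` instead of assuming the file exists.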

Architecturally, a plugin pattern definitely fits both public datasets and their custom loader functions (and possibly a way to bundle them together), but I think separate repos are overkill without a better use case. Focusing on the loaders:

  1. Let’s define a place to add loader modules and some code to slurp in everything it finds there. backend/loaders/ makes sense to me, but I could argue for other places.
  2. Move the MHC loaders into a module there. Be clever with .gitignore so other things in the loaders directory are excluded from the repo.
  3. Move the .npy files into that module (subdir for supporting data?).
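The "slurp in everything" part of step 1 could look roughly like the sketch below. It assumes loaders are plain functions whose names start with `load_` (as load_mhc_libs() and load_mars_big() do) and that each loader module is a single `.py` file in the chosen directory; `discover_loaders` and the registry shape are hypothetical, not existing code.

```python
import importlib.util
import inspect
from pathlib import Path


def discover_loaders(loaders_dir):
    """Import every *.py file in loaders_dir and collect its load_* functions.

    Returns a dict mapping function name to function. Files whose names start
    with an underscore (e.g. __init__.py) are skipped.
    """
    registry = {}
    for path in sorted(Path(loaders_dir).glob("*.py")):
        if path.name.startswith("_"):
            continue
        # Load the file as a standalone module under a prefixed name.
        spec = importlib.util.spec_from_file_location(f"loaders_{path.stem}", path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        for name, func in inspect.getmembers(module, inspect.isfunction):
            # Keep only functions defined in this module, not ones it imported.
            if name.startswith("load_") and func.__module__ == module.__name__:
                registry[name] = func
    return registry
```

With this shape, dropping a new file into the loaders directory is enough to make its `load_*` functions available by name, and a later file silently overrides an earlier one that defines the same name. Whether that override behavior is acceptable is a design choice this sketch does not settle.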

There’s also room for a change in datasets.yml to make the files: key clearer about what’s going on, instead of baking those architectural decisions into the loaders as well (some types of data require two lines, and the second is always the metadata file; others require one line and assume the metadata filename; &c.). It would make sense to at least consider that before we declare this task done. This could easily end up being a major API overhaul and get split out into its own project.
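To make the files: idea concrete, here is one possible direction, sketched on the already-parsed value of a files: entry. It assumes an explicit mapping with `data` and `metadata` keys (both names are hypothetical, not the current schema) and keeps accepting the positional one- or two-item form described above. It does not guess the assumed metadata filename for the one-item form, since that convention lives in the loaders today.

```python
def normalize_files(entry):
    """Normalize a parsed files: entry into {'data': ..., 'metadata': ...}.

    Accepts either an explicit mapping (hypothetical new form) or the
    positional list form, where an optional second item is the metadata file.
    """
    if isinstance(entry, dict):
        unknown = set(entry) - {"data", "metadata"}
        if unknown:
            raise ValueError(f"Unknown keys in files entry: {sorted(unknown)}")
        if "data" not in entry:
            raise ValueError("files entry needs a 'data' key")
        return {"data": entry["data"], "metadata": entry.get("metadata")}
    if isinstance(entry, (list, tuple)) and 1 <= len(entry) <= 2:
        # Positional form: first item is data, optional second is metadata.
        metadata = entry[1] if len(entry) == 2 else None
        return {"data": entry[0], "metadata": metadata}
    raise ValueError("files entry must be a mapping or a list of one or two items")
```

A loader would then receive the same shape regardless of how the entry was written, and a `metadata` of `None` signals that it has to fall back on its own default.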
