Add override mechanism for algorithm containers, transfer explicit .sif images with HTCondor #464
…th HTCondor

This accomplishes two main things:

1. Users can explicitly state what container they want a given PRM to run in via the configuration file, using the PRM name (as defined in the config file) as the key.
2. When users specify an override `.sif` image, that image is added to an HTCondor transfer list such that Condor moves the file to the EP for execution (to avoid pulling during the job). Explicitly moving required input files is a "best practice" in HTCondor, because failure to resolve inputs at runtime squanders capacity.

When no override is provided or the HTCondor Snakemake executor isn't available, the new Snakefile resource rule should be a no-op.

In addition to adding the features, I tried to split up some other functions in and around container resolution to make them more testable.
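To make the no-op behavior described above concrete, here is a minimal sketch of how the transfer list could be derived. The function name `sif_transfer_inputs` and its parameters are hypothetical and are not taken from this PR; the sketch only assumes that overrides are stored in a mapping keyed by PRM name.

```python
from typing import List, Mapping


def sif_transfer_inputs(
    algorithm: str,
    image_overrides: Mapping[str, str],
    htcondor_executor_available: bool,
) -> List[str]:
    """Return the files HTCondor should transfer for one algorithm.

    Hypothetical helper: yields the local .sif override when there is one,
    and an empty list (a no-op) in every other case.
    """
    override = image_overrides.get(algorithm)
    # No override, or no HTCondor executor: nothing to transfer.
    if not override or not htcondor_executor_available:
        return []
    # Only local .sif files are transferred; registry references are not files.
    if override.endswith(".sif"):
        return [override]
    return []
```

With `{"omicsintegrator1": "images/omics-integrator-1_v2.sif"}` as the overrides, the call returns that one path when the executor is available and an empty list otherwise.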
This adds a more robust way to check whether the container framework is apptainer/singularity, which for the purposes of our codebase should be treated as synonyms. I decided to do this after noticing an issue in test logs where the container framework was set to apptainer and `unpack_singularity` was true. The unpacking happened correctly, but a logged warning claimed it wouldn't, because the warning only checked for singularity. I believe this diff makes that type of mistake a little harder.
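As an illustration of the synonym check, a helper along these lines would let both the unpacking logic and its warning share one test. The names `is_apptainer_like` and `should_unpack` are hypothetical, and the case-insensitive comparison is an assumption rather than the behavior of the helper in this diff.

```python
# Hypothetical set of spellings treated as the same framework.
_APPTAINER_ALIASES = frozenset({"apptainer", "singularity"})


def is_apptainer_like(framework: str) -> bool:
    """Return True when the configured framework is apptainer or singularity."""
    # Assumption: surrounding whitespace and letter case are ignored.
    return framework.strip().lower() in _APPTAINER_ALIASES


def should_unpack(framework: str, unpack_singularity: bool) -> bool:
    """Decide whether unpacking applies, using the shared synonym check."""
    return unpack_singularity and is_apptainer_like(framework)
```

Routing every framework comparison through a single predicate means a warning cannot check for only one of the two names.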
As I started writing the PR message, I realized things weren't quite the way I wanted them to be w.r.t. this hierarchy. This should fix it.
agitter left a comment
These changes make sense to me overall.
#
# Local .sif file path (e.g., "images/pathlinker_v2.sif"):
# Apptainer/Singularity only. Skips pulling from registry and uses the
# pre-built .sif directly. When running via HTCondor with shared-fs-usage: none,
Because shared-fs-usage: none isn't in this config file, it could help to state where it is set (the spras_profile config).
# Example (one of each type):
# images:
# omicsintegrator1: "images/omics-integrator-1_v2.sif" # local .sif (Apptainer only)
# pathlinker: "pathlinker:v1234" # image name only (base_url/owner prepended)
# omicsintegrator2: "some-other-owner/oi2:latest" # owner/image (base_url prepended)
# mincostflow: "ghcr.io/reed-compbio/mincostflow:v2" # full registry reference (used as-is)
This syntax makes sense to me.
Upon further consideration, would it make more sense to nest these image overrides under each algorithm below? Then the key would be image instead of the algorithm name, which may be less typo prone.
I considered that, but to me it felt more appropriate that the container overrides be defined under the container section of configuration. It could just as easily go the other way if you disagree.
        )
    else:
        print(f'Container image override (local .sif): {image_override}', flush=True)
elif image_override:
I'm wondering whether this block needs to be more robust to malformed overrides. What if I provide hello.world or a/b/c/d/e/f? Do we want to pass that through until there is an error later?
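One possible shape for an up-front check is sketched below. Everything in it is an assumption for discussion: the function name `check_override`, the simplified naming patterns, and the rule that anything deeper than registry/owner/image is rejected. Under these rules `a/b/c/d/e/f` fails early, while `hello.world` still passes as a syntactically plausible image name and would only fail later at pull time.

```python
import re

# Simplified, assumed patterns; real registries may accept more than this.
_COMPONENT = re.compile(r"[a-z0-9]+(?:[._-]+[a-z0-9]+)*")
_TAG = re.compile(r"[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}")


def check_override(override: str) -> str:
    """Classify an image override, or raise ValueError if it looks malformed."""
    if not override or override != override.strip():
        raise ValueError(f"empty or whitespace-padded override: {override!r}")
    # Local .sif paths are taken as-is; their existence is not checked here.
    if override.endswith(".sif"):
        return "local_sif"
    parts = override.split("/")
    if len(parts) > 3:
        raise ValueError(f"too many path components in override: {override!r}")
    # An optional ":tag" may follow the image name in the last component.
    name, sep, tag = parts[-1].partition(":")
    if sep and not _TAG.fullmatch(tag):
        raise ValueError(f"invalid tag in override: {override!r}")
    parts[-1] = name
    for part in parts:
        if not _COMPONENT.fullmatch(part):
            raise ValueError(f"invalid component {part!r} in override: {override!r}")
    return {1: "image", 2: "owner_image", 3: "full_reference"}[len(parts)]
```

Failing at config-parse time keeps the error next to the typo instead of surfacing it in a job log on an execution point.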
run_container(CONTAINER_SUFFIX, DUMMY_COMMAND, DUMMY_VOLUMES, DUMMY_WORKDIR, DUMMY_OUTDIR, settings)
container_arg = mock_singularity.call_args[0][0]
# The actual .sif is used inside run_container_singularity; run_container itself
Do we have a way to test that the actual .sif is used there?
This PR aims to accomplish a few things:
1. … foo, I want you to use container image bar". In the general case, I'm not sure how much this is needed, but @agitter and I have floated the concept in the past, and it helps me solve 2. This implements a 4-tier hierarchy over how much of the image name is overridden (with special logic for `.sif` extensions), e.g.: … would result in container URIs of

   omicsintegrator1 --> docker.io/reedcompbio/oi1:latest
   omicsintegrator2 --> docker.io/jhiemstra/oi2:latest1
   mincostflow --> hub.opensciencegrid.org/jhiemstra/mincostflow:latest
   allpairs --> images/allpairs.sif (local file)

   The caveat here is that I don't believe all container registries follow this kind of `<base_url>/<owner>/<container>` hierarchy, but it's at least true for docker, ghcr and hub.opensciencegrid.org. If a user finds themselves somewhere outside those, they can always declare the entire URI explicitly.
2. … `.sif` extension, the file is added to the reconstruct rule's `htcondor_transfer_input_files` resource key. This triggers HTCondor to transfer the sif image as part of the job's input sandbox.
3. … `.sif` override is present and apptainer/singularity is the configured container framework, SPRAS uses the local file instead of pulling/building from a remote registry. This, combined with 2, is the key fix for "Provide guidance on working around docker rate limiting for large CHTC runs" (#462).
4. … `singularity` and `apptainer` container frameworks differently in some cases. I added a helper function that makes it easier to treat these as synonyms.

Note that I haven't yet documented this guidance, as #462 requests. I'd rather get this over the finish line first, then add the documentation to #459 (so I don't create conflicts for myself).
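The 4-tier hierarchy in this description can be sketched roughly as follows. The function name `resolve_image`, its parameters, and the slash-counting rule are hypothetical; the default `docker.io` and `reedcompbio` values are inferred from the example URIs in this description, not read from the code.

```python
from typing import Optional


def resolve_image(
    override: Optional[str],
    default_image: str,
    base_url: str = "docker.io",   # assumed default registry
    owner: str = "reedcompbio",    # assumed default owner
) -> str:
    """Resolve the container reference for one algorithm (illustrative only)."""
    if not override:
        # No override: fall back to the algorithm's default image.
        return f"{base_url}/{owner}/{default_image}"
    if override.endswith(".sif"):
        # Tier 4: a local .sif file is used directly, with no registry pull.
        return override
    depth = override.count("/")
    if depth == 0:
        # Tier 1: image name only, so base_url and owner are prepended.
        return f"{base_url}/{owner}/{override}"
    if depth == 1:
        # Tier 2: owner/image, so only base_url is prepended.
        return f"{base_url}/{override}"
    # Tier 3: a full registry reference is used as-is.
    return override
```

For instance, an override of `jhiemstra/oi2:latest` would resolve to `docker.io/jhiemstra/oi2:latest`, and `images/allpairs.sif` would be returned unchanged.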