fix: offline prefetch downloads to wrong HF cache when using containers #17
Merged
Conversation
When offline=True, prefetch_assets() was downloading models/datasets to the login shell's default HF_HOME rather than the cluster-specific path set by container.env_file. The container sources this file at job runtime, so the two locations disagreed and the container couldn't find the cached assets. Now submit.py parses the env_file for HF cache vars (HF_HOME, HF_HUB_CACHE, etc.) and applies them to os.environ before calling prefetch_assets(), ensuring both the prefetch and the container use the same cache root.
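For context, a cluster env file of the kind `container.env_file` points at might look like this (illustrative variable values; the real file is cluster-specific):

```shell
# env/jupiter.env (sourced by the container at job runtime)
# Paths below are hypothetical; each cluster sets its own cache root.
export HF_HOME=/scratch/$USER/huggingface
export HF_HUB_CACHE=$HF_HOME/hub
export HF_DATASETS_CACHE=$HF_HOME/datasets
```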
Silently skipping a missing env file would let prefetch_assets() run against the wrong HF cache without any indication of the misconfiguration. Raising FileNotFoundError makes the error explicit and fails fast.
huggingface_hub and datasets snapshot HF_HOME/HF_HUB_CACHE into internal constants at import time. Importing prefetch_assets at module level meant those libraries were initialised before _apply_hf_env_from_file ran, so the env vars we set were ignored and downloads went to the default cache location. Move the import to just before the call, after the env vars are applied.
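The ordering problem can be reproduced without huggingface_hub at all. The stand-in module below (purely illustrative) resolves `HF_HOME` into a module-level constant at import time, just as the real libraries do:

```python
import importlib.util
import os
import tempfile
from pathlib import Path

# A stand-in for huggingface_hub: it snapshots HF_HOME into a module-level
# constant the moment it is imported, so later env changes are invisible.
FAKE_HUB_SRC = 'import os\nCACHE_ROOT = os.environ.get("HF_HOME", "default")\n'

def import_fake_hub(directory: Path):
    """Import fake_hub.py from `directory`, executing its top level now."""
    path = directory / "fake_hub.py"
    path.write_text(FAKE_HUB_SRC)
    spec = importlib.util.spec_from_file_location("fake_hub", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module

with tempfile.TemporaryDirectory() as d:
    os.environ["HF_HOME"] = "/login/default"
    hub = import_fake_hub(Path(d))         # module-level import: snapshots the old value
    os.environ["HF_HOME"] = "/cluster/hf"  # env fix applied too late
    print(hub.CACHE_ROOT)                  # prints /login/default, not /cluster/hf
```

Deferring the import until after the env vars are applied sidesteps this, which is what moving `prefetch_assets` to a lazy import achieves.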
Summary
When `offline: true`, `scripts/submit.py` calls `prefetch_assets()` to download models and datasets on the login node before submitting the SLURM job. However, the prefetch was downloading to the wrong HF cache location, so the container couldn't find the assets at runtime.

Three bugs combined to cause this:

1. **Wrong cache path:** `prefetch_assets()` used the login shell's default `HF_HOME` (`~/.cache/huggingface/`), while the containerised SLURM job sources `container.env_file` (e.g. `env/jupiter.env`), which sets a cluster-specific `HF_HOME`. If the user hadn't manually sourced the env file before running `submit.py`, the two locations disagreed.
2. **Import-time cache read:** `huggingface_hub` and `datasets` snapshot `HF_HOME`/`HF_HUB_CACHE` into internal constants at import time. Importing `prefetch_assets` at module level meant those libraries initialised before the env vars were applied, so the corrected values were silently ignored.
3. **Missing file not caught:** a misconfigured or missing `env_file` path would silently skip the env setup and let the prefetch proceed against the wrong cache.

This PR fixes all three:

- Parse `container.env_file` for HF cache vars (`HF_HOME`, `HF_HUB_CACHE`, `HUGGINGFACE_HUB_CACHE`, `HF_DATASETS_CACHE`, `TRANSFORMERS_CACHE`) and apply them to `os.environ` before the prefetch.
- Move the `prefetch_assets` import to just before the call (after the env vars are set), mirroring the lazy-import pattern already used in `train.py` for `HF_HUB_OFFLINE`.
- Raise `FileNotFoundError` if `container.env_file` is specified but not present.

Type of change