fix: offline prefetch downloads to wrong HF cache when using containers #17

Merged
Neonkraft merged 3 commits into main from fix/offline-prefetch-env-vars on Apr 29, 2026
Conversation

@Neonkraft
Collaborator

Summary

When offline: true, scripts/submit.py calls prefetch_assets() to download models and datasets on the login node before submitting the SLURM job. However, the prefetch wrote to a different HF cache location than the one the container reads from, so the container couldn't find the assets at runtime.

Three bugs combined to cause this:

  1. Wrong cache path: prefetch_assets() used the login shell's default HF_HOME (~/.cache/huggingface/), while the containerised SLURM job sources container.env_file (e.g. env/jupiter.env) which sets a cluster-specific HF_HOME. If the user hadn't manually sourced the env file before running submit.py, the two locations disagreed.

  2. Import-time cache read: huggingface_hub and datasets snapshot HF_HOME/HF_HUB_CACHE into internal constants at import time. Importing prefetch_assets at module level meant those libraries initialised before the env vars were applied, so the corrected values were silently ignored.

  3. Missing file not caught: A misconfigured or missing env_file path would silently skip the env setup and let the prefetch proceed against the wrong cache.
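The import-time behaviour in bug 2 can be reproduced without huggingface_hub installed. The sketch below is illustrative only: `fake_hf_lib` and its `CACHE_ROOT` constant are hypothetical stand-ins for a library that reads HF_HOME once at import, and `/cluster/hf_cache` is a made-up path.

```python
import importlib
import os
import sys
import tempfile
import textwrap
from pathlib import Path

# Hypothetical stand-in for a library that reads HF_HOME once, at import time.
FAKE_LIB_SOURCE = textwrap.dedent(
    '''
    import os
    # Evaluated exactly once, when the module is first imported.
    CACHE_ROOT = os.environ.get("HF_HOME", "~/.cache/huggingface")
    '''
)


def demo_import_time_snapshot() -> tuple[str, str]:
    """Return (value seen by an early import, value seen by a late import)."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "fake_hf_lib.py").write_text(FAKE_LIB_SOURCE)
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        saved = os.environ.pop("HF_HOME", None)
        try:
            # Early import: HF_HOME is not set yet, so the default is captured.
            early = importlib.import_module("fake_hf_lib")
            os.environ["HF_HOME"] = "/cluster/hf_cache"  # too late for `early`
            early_value = early.CACHE_ROOT

            # Late import: forget the module, then import after the env is set.
            del sys.modules["fake_hf_lib"]
            late = importlib.import_module("fake_hf_lib")
            late_value = late.CACHE_ROOT
        finally:
            # Restore the interpreter and environment to their previous state.
            sys.path.remove(tmp)
            sys.modules.pop("fake_hf_lib", None)
            os.environ.pop("HF_HOME", None)
            if saved is not None:
                os.environ["HF_HOME"] = saved
    return early_value, late_value
```

The early import keeps the default path even though HF_HOME is changed afterwards; only an import that happens after the variable is set picks up the cluster-specific value.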

This PR fixes all three:

  • Parses container.env_file for HF cache vars (HF_HOME, HF_HUB_CACHE, HUGGINGFACE_HUB_CACHE, HF_DATASETS_CACHE, TRANSFORMERS_CACHE) and applies them to os.environ before prefetch.
  • Moves the prefetch_assets import to just before the call (after env vars are set), mirroring the lazy-import pattern already used in train.py for HF_HUB_OFFLINE.
  • Raises FileNotFoundError if container.env_file is specified but not present.
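A minimal sketch of what the env-file step might look like, assuming the file contains simple `KEY=value` or `export KEY=value` lines. The PR's actual helper is `_apply_hf_env_from_file`, whose body is not shown here; the function name, parsing rules, and return value below are assumptions for illustration.

```python
import os
from pathlib import Path

# Cache-related variables named in this PR.
HF_CACHE_VARS = (
    "HF_HOME",
    "HF_HUB_CACHE",
    "HUGGINGFACE_HUB_CACHE",
    "HF_DATASETS_CACHE",
    "TRANSFORMERS_CACHE",
)


def apply_hf_env_from_file(env_file: str) -> dict[str, str]:
    """Copy HF cache variables from env_file into os.environ.

    Hypothetical sketch: handles only plain KEY=value lines, optionally
    prefixed with `export`, and ignores every other variable in the file.
    """
    path = Path(env_file)
    if not path.is_file():
        # Fail fast instead of prefetching into the wrong cache.
        raise FileNotFoundError(f"container.env_file not found: {path}")

    applied: dict[str, str] = {}
    for raw in path.read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("export "):
            line = line[len("export "):].lstrip()
        key, sep, value = line.partition("=")
        key = key.strip()
        if not sep or key not in HF_CACHE_VARS:
            continue  # not an assignment, or not a cache variable
        # Drop surrounding quotes, then expand ~ and $VARS like a shell would.
        value = value.strip().strip("'\"")
        value = os.path.expandvars(os.path.expanduser(value))
        os.environ[key] = value
        applied[key] = value
    return applied
```

In this sketch the function would be called before anything imports huggingface_hub or datasets, with the prefetch_assets import placed after the call so that those libraries read the updated variables.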

Type of change

  • Bug fix
  • New feature
  • Refactor
  • Performance
  • Documentation
  • Maintenance

When offline=True, prefetch_assets() was downloading models/datasets to
the login shell's default HF_HOME rather than the cluster-specific path
set by container.env_file. The container sources this file at job runtime,
so the two locations disagreed and the container couldn't find the cached
assets.

Now submit.py parses the env_file for HF cache vars (HF_HOME,
HF_HUB_CACHE, etc.) and applies them to os.environ before calling
prefetch_assets(), ensuring both the prefetch and the container use the
same cache root.

Silently skipping a missing env file would let prefetch_assets() run
against the wrong HF cache without any indication of the misconfiguration.
Raising FileNotFoundError makes the error explicit and fails fast.

huggingface_hub and datasets cache HF_HOME/HF_HUB_CACHE at import time.
Importing prefetch_assets at module level meant those libraries were
initialised before _apply_hf_env_from_file ran, so the env vars we set
were ignored and downloads went to the default cache location.

Move the import to just before the call, after the env vars are applied.

Neonkraft changed the title from "Fix/offline prefetch env vars" to "fix: offline prefetch downloads to wrong HF cache when using containers" on Apr 29, 2026
Neonkraft merged commit 6b064fb into main on Apr 29, 2026
3 of 4 checks passed