[adapter,hparams] feat: support resuming from Hugging Face checkpoint paths#160
Conversation
… paths Extend `BaseAdapter.load_checkpoint` to transparently resolve a Hugging Face repo spec (either `hf://owner/repo[/subdir][@rev]` or bare `owner/repo[/...]`) to a local cache directory via `huggingface_hub.snapshot_download`, reusing the existing `lora` / `full` / `state` loading branches unchanged. Logic priority at `_resolve_checkpoint_path`: 1. `hf://` prefix -> force HF (overrides any colliding local dir) 2. Local path exists -> return as-is 3. Otherwise parse as `owner/repo[/subfolder][@revision]` and download Multi-node-safe: download gated on `is_local_main_process` (one per node), not `is_main_process` (one global). On non-shared filesystems each node populates its own HF cache once; on shared filesystems huggingface_hub's per-blob `WeakFileLock` dedupes the concurrent `snapshot_download` calls so only one node transfers bytes. Fail-fast: narrow `except (RepositoryNotFoundError, HfHubHTTPError)` re-raised as `FileNotFoundError` with full context (path, repo_id, subfolder, revision, HF_TOKEN hint) for actionable error messages. Also updates the `resume_path` help text and normalizes the inline comment on all 59 example YAMLs to document the new HF support. No new config fields; `resume_path: Optional[str]` keeps the same shape. Note: pre-existing black/isort lint debt in the touched src files is unrelated to this change. Co-authored-by: Cursor <cursoragent@cursor.com>
Move the two `huggingface_hub` imports added in the previous commit out of function bodies and up to the module's import block: - `from huggingface_hub import snapshot_download` in `src/flow_factory/utils/checkpoint.py` - `from huggingface_hub.errors import RepositoryNotFoundError, HfHubHTTPError` in `src/flow_factory/models/abc.py` `huggingface_hub` is already a hard dependency (`pyproject.toml:39`), so there is no import-cost concern; lazy-loading only hid the dependency surface from readers and `isort`. Codify the rule so this pattern is caught in review going forward: - Extend constraint X-GenGroup#22 ("Import Style") in `.agents/knowledge/constraints.md` with an explicit "Top-level imports only" bullet, listing the three sanctioned exceptions (optional deps via `try/except ImportError`, backend-gated imports under runtime feature checks like DeepSpeed/FSDP, and unresolvable circular imports). - Add a matching checklist item to the Code Style section of `.agents/skills/ff-review/SKILL.md`. Pre-existing inline FSDP/DeepSpeed imports in `models/abc.py` (lines ~997-1152) are grandfathered under the new rule's exception (b). Co-authored-by: Cursor <cursoragent@cursor.com>
[adapter,hparams] feat: support resuming from Hugging Face checkpoint paths
There was a problem hiding this comment.
Pull request overview
This PR extends adapter checkpoint loading so resume_path can point either to a local directory or to a Hugging Face Hub repo spec, which is resolved to a cached snapshot directory before the existing LoRA/full/state loading logic runs.
Changes:
- Add HF checkpoint spec parsing + snapshot download helpers in
utils/checkpoint.py. - Add
BaseAdapter._resolve_checkpoint_path()and invoke it fromBaseAdapter.load_checkpoint()to transparently download/resolve HF resume paths. - Update
resume_pathhelp text and normalize example YAML comments to document HF support.
Reviewed changes
Copilot reviewed 64 out of 64 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| src/flow_factory/utils/checkpoint.py | Adds HF path prefix constant, parser, and snapshot_download wrapper for checkpoint resolution. |
| src/flow_factory/models/abc.py | Resolves load_checkpoint() paths via HF Hub when needed (new _resolve_checkpoint_path). |
| src/flow_factory/hparams/model_args.py | Expands resume_path help text to document HF repo specs + multi-node caveats. |
| .agents/skills/ff-review/SKILL.md | Codifies “top-level imports only” guidance in the review checklist. |
| .agents/knowledge/constraints.md | Codifies “top-level imports only” as a repository constraint (with exceptions). |
| examples/template/sd3_5/async_reward.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/nft/lora/z_image/default.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/nft/lora/wan22/t2v.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/nft/lora/wan21/t2v.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/nft/lora/wan21/i2v.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/nft/lora/sd3_5/default.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/nft/lora/qwen_image/rational_rewards_t2i.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/nft/lora/qwen_image_edit_plus/rational_rewards_edit.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/nft/lora/flux2_klein_base/default.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/nft/lora/flux1/rational_rewards_t2i.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/nft/lora/flux1/default.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/nft/lora/flux1_kontext/rational_rewards_edit.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/nft/full/z_image/default.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/nft/full/z_image_turbo/default.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/nft/full/wan22/t2v.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/nft/full/flux2_klein_base/default.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/nft/full/flux1/default.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/grpo/lora/z_image/default.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/grpo/lora/z_image_turbo/default.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/grpo/lora/wan22/t2v.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/grpo/lora/wan22/i2v.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/grpo/lora/wan21/v2v.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/grpo/lora/wan21/t2v.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/grpo/lora/wan21/i2v.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/grpo/lora/sd3_5/nocfg.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/grpo/lora/sd3_5/default.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/grpo/lora/qwen_image/default.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/grpo/lora/qwen_image_edit_plus/default.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/grpo/lora/ltx2/t2av.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/grpo/lora/ltx2/t2av_pickscore.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/grpo/lora/ltx2/i2av.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/grpo/lora/flux2/t2i.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/grpo/lora/flux2/i2i.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/grpo/lora/flux2_klein/default.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/grpo/lora/flux2_klein_base/default.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/grpo/lora/flux1/default.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/grpo/lora/flux1_kontext/default.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/grpo/full/z_image/default.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/grpo/full/z_image_turbo/default.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/grpo/full/wan22/t2v.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/grpo/full/wan22/i2v.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/grpo/full/wan21/t2v.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/grpo/full/wan21/i2v.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/grpo/full/sd3_5/default.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/grpo/full/qwen_image/default.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/grpo/full/qwen_image_edit_plus/default.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/grpo/full/flux2/t2i.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/grpo/full/flux2/i2i.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/grpo/full/flux2_klein/default.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/grpo/full/flux2_klein_base/default.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/grpo/full/flux1/default.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/grpo/full/flux1_kontext/default.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/dpo/lora/sd3_5/default.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/dgpo/lora/sd3_5/nocfg.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/dgpo/lora/sd3_5/default.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/crd/lora/sd3_5/default.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/awm/lora/sd3_5/default.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/awm/lora/flux2_klein_base/default.yaml | Updates resume_path comment to mention HF repo specs. |
| examples/awm/lora/flux1/default.yaml | Updates resume_path comment to mention HF repo specs. |
Comments suppressed due to low confidence (1)
src/flow_factory/models/abc.py:1509
- After the barrier, every process calls
download_hf_checkpoint(...), which wrapssnapshot_download. Even when the cache is already populated this can cause lock contention and repeated hub metadata checks across all ranks. Consider having onlyis_local_main_processcallsnapshot_download, then broadcast the resulting local path to other ranks (or have non-local ranks call withlocal_files_only=Truevia an option ondownload_hf_checkpoint) to make the post-barrier step strictly local/cheap.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| repo_id, subfolder, revision = parse_hf_checkpoint_path(spec) | ||
|
|
||
| if self.accelerator.is_local_main_process: | ||
| local_path = download_hf_checkpoint(repo_id, subfolder, revision) | ||
| logger.info( | ||
| f"[local rank 0 / global rank {self.accelerator.process_index}] " | ||
| f"resolved checkpoint '{path}' -> {local_path}" | ||
| ) | ||
| self.accelerator.wait_for_everyone() | ||
|
|
There was a problem hiding this comment.
Fixed in commit aa883aa by un-gating the download.
All ranks now call snapshot_download directly. huggingface_hub's per-blob WeakFileLock serializes concurrent calls within each filesystem domain (cross-node on POSIX-locking shared FS, per-node on non-shared FS), so we still get exactly one actual download per filesystem domain — same byte-count as the gated version, but without the failure-before-barrier deadlock.
How the four FS x failure quadrants behave now:
- Shared FS, success: 1 download globally (cross-node lock dedup).
- Non-shared FS, success: N downloads in parallel (one per node).
- Symmetric failure (auth/network/repo-not-found): every rank raises uniformly; no one reaches the trailing barrier; no deadlock.
- Asymmetric single-rank failure (rare): residual hazard called out explicitly in the docstring — surviving ranks would trip the NCCL watchdog on the trailing
wait_for_everyone().
Trailing wait_for_everyone() is kept so downstream _load_lora / _load_full_model (which may add collective ops over time) enter in lockstep.
Side effect: also resolves the post-barrier HEAD-request overhead concern from the low-confidence summary comment (no more post-barrier snapshot_download call).
| - [ ] English comments and docstrings | ||
| - [ ] Apache 2.0 license header on new files | ||
| - [ ] No unnecessary wildcard imports (except `hparams`) | ||
| - [ ] **Top-level imports only** — no `import` / `from ... import ...` inside function or method bodies (constraint #22). Exceptions: documented optional deps via `try/except ImportError`, or genuine unresolvable circular imports. |
There was a problem hiding this comment.
Fixed in commit aa883aa.
Replaced the duplicated exception list with a reference to constraint #22, so the skill checklist no longer needs to be kept in sync with constraints.md. New text:
- [ ] **Top-level imports only** (constraint #22) — see that file for the three sanctioned exceptions (optional deps via `try/except ImportError`, backend-gated runtime feature checks like DeepSpeed/FSDP, unresolvable circular imports).
Follows the agents-docs-maintenance.mdc principle "Reference constraint numbers ... do not re-explain the rule."
Three fixes from the Copilot review on the upstream PR: 1. Path-traversal validation (utils/checkpoint.py): parse_hf_checkpoint_path now rejects '.', '..', and backslash segments with an informative ValueError that preserves the original spec. Validates at the parser front door rather than at the download site so error messages keep the user's exact input. Without this, a spec like 'owner/repo/..' would escape the snapshot directory via os.path.join. 2. Un-gate _resolve_checkpoint_path (models/abc.py): Remove the `if is_local_main_process:` gate and the post-barrier double-call pattern. All ranks now call snapshot_download directly; huggingface_hub's per-blob WeakFileLock serializes concurrent calls within each filesystem domain (cross-node on POSIX-locking shared FS, per-node on non-shared FS), so we still get exactly one download per filesystem domain. This eliminates the distributed-deadlock hazard where a download failure on the gated rank would raise before reaching wait_for_everyone(), leaving siblings blocked at the barrier until NCCL watchdog timeout. The trailing wait_for_everyone() is kept to maintain lockstep entry into the downstream loaders. Residual asymmetric-failure risk (one rank's network blip while others succeed) is documented in the docstring. 3. Skill-checklist alignment (.agents/skills/ff-review/SKILL.md): Replace the duplicated import-exception list with a reference to constraint X-GenGroup#22, where the full set of three sanctioned exceptions (optional deps, backend-gated runtime feature checks, circular imports) lives. Prevents future drift between the two documents. Verified with 8 happy-path + 5 original-error + 6 path-traversal parser test cases (all pass). Co-authored-by: Cursor <cursoragent@cursor.com>
Summary
Extend
BaseAdapter.load_checkpointto transparently resolve a Hugging Face repo spec (either explicithf://owner/repo[/subdir][@rev]or bareowner/repo[/...]) to a local cache directory viahuggingface_hub.snapshot_download. The existinglora/full/stateloading branches are reused unchanged.No new config fields.
resume_path: Optional[str]keeps the same shape and gains a second meaning — local directory or HF repo spec — mirroring howmodel_name_or_pathalready accepts both.Logic priority (resolved inside
BaseAdapter._resolve_checkpoint_path)huggingface_hubis already a hard dependency (pyproject.toml:39), so no dependency change.Multi-node correctness
The download is gated on
accelerator.is_local_main_process(one process per node), notis_main_process(one global). Trade-offs across the four FS-vs-gating combinations:is_local_main_process(chosen)WeakFileLockdedupes across nodes)is_main_processis_local_main_process(chosen)is_main_processis_local_main_processwins decisively on non-shared filesystems (2x faster wall clock) and avoids a latent asymmetric-progress hazard where rank 0 could enter_load_lora/_load_full_modelwhile other ranks are still inside the resolver — which would deadlock the moment any downstream loader adds a collective op (FSDP state-dict broadcast, DeepSpeedGatheredParameters, etc.).Fail-fast errors
A single narrow
except (RepositoryNotFoundError, HfHubHTTPError)re-raises asFileNotFoundErrorwith full context (path, repo_id, subfolder, revision,HF_TOKENhint) — complies withno-defensive-except.mdc's "narrow + documented" exception rule.Operational caveats (documented in the
resume_pathhelp string)HF_TOKENmust be set on every node (Slurm--export=NONEand K8s withoutenvFromcan drop env vars on workers).HF_HUB_ENABLE_HF_TRANSFER=1(~5x faster) or bumpTORCH_NCCL_BLOCKING_WAIT/NCCL_TIMEOUTto avoid the watchdog killing the job while local rank 0 downloads.hf download owner/repo) and a plain local path.Commits (2 substantive + 1 merge)
2207c92[adapter,hparams] feat: support resuming from Hugging Face checkpoint paths— the actual feature.540ff6d[adapter,docs] refactor: hoist HF imports to module top; codify rule— small post-merge cleanup: move twohuggingface_hubimports from function bodies to the module's import block, and extend constraint [Feat] Support Z-Image-Omni #22 ("Import Style") + theff-reviewskill checklist with an explicit "top-level imports only" rule (with the three sanctioned exceptions: optional deps, backend-gated imports, circular-import workarounds). Pre-existing inline FSDP/DeepSpeed imports inmodels/abc.pyare grandfathered.5fb0b14is the fork's merge commit; happy to have you squash if you prefer linear history.Files changed
src/flow_factory/utils/checkpoint.py— addedparse_hf_checkpoint_path,download_hf_checkpoint, andHF_PATH_PREFIX.src/flow_factory/models/abc.py— addedBaseAdapter._resolve_checkpoint_path; one-line insertion inload_checkpointto resolve the path before any local-only dispatch.src/flow_factory/hparams/model_args.py— extendedresume_path'shelpwith HF + multi-node notes.examples/**/*.yaml(59 files) — normalized the inline comment onresume_path:to document the new HF support; YAML semantics unchanged..agents/knowledge/constraints.md,.agents/skills/ff-review/SKILL.md— codified the import-style rule (one line each).Test plan
yaml.safe_loadafter the bulk-comment update.black --checkstandalone; pre-existing lint debt in the touched files predates this PR.resume_path: "<small-public-lora>".accelerate launchto confirm one download per node and no asymmetric progress.Out of scope
push_to_hub).resume_revisionconfig field — inline@revis simpler and can be promoted later if users ask.Made with Cursor