Skip to content

[Feat] Support Z-Image-Omni#22

Open
Jayce-Ping wants to merge 12 commits into
mainfrom
z-image-omni
Open

[Feat] Support Z-Image-Omni#22
Jayce-Ping wants to merge 12 commits into
mainfrom
z-image-omni

Conversation

@Jayce-Ping
Copy link
Copy Markdown
Collaborator

No description provided.

Jayce-Ping added a commit to Jayce-Ping/Flow-Factory-Private that referenced this pull request May 17, 2026
Move the two `huggingface_hub` imports added in the previous commit
out of function bodies and up to the module's import block:

- `from huggingface_hub import snapshot_download` in
  `src/flow_factory/utils/checkpoint.py`
- `from huggingface_hub.errors import RepositoryNotFoundError,
  HfHubHTTPError` in `src/flow_factory/models/abc.py`

`huggingface_hub` is already a hard dependency (`pyproject.toml:39`),
so there is no import-cost concern; lazy-loading only hid the
dependency surface from readers and `isort`.

Codify the rule so this pattern is caught in review going forward:
- Extend constraint X-GenGroup#22 ("Import Style") in
  `.agents/knowledge/constraints.md` with an explicit "Top-level imports
  only" bullet, listing the three sanctioned exceptions (optional deps
  via `try/except ImportError`, backend-gated imports under runtime
  feature checks like DeepSpeed/FSDP, and unresolvable circular imports).
- Add a matching checklist item to the Code Style section of
  `.agents/skills/ff-review/SKILL.md`.

Pre-existing inline FSDP/DeepSpeed imports in `models/abc.py` (lines
~997-1152) are grandfathered under the new rule's exception (b).

Co-authored-by: Cursor <cursoragent@cursor.com>
Jayce-Ping added a commit to Jayce-Ping/Flow-Factory-Private that referenced this pull request May 17, 2026
…olver

Unifies the two parallel Hugging Face Hub checkpoint-loading implementations
that landed independently on each branch.

Adopted approach (from feat/resume-from-huggingface):
  - Pure helpers in utils/checkpoint.py:
      parse_hf_checkpoint_path(path)         -> (repo_id, subfolder, revision)
      download_hf_checkpoint(repo_id, ...)   -> local dir (allow_patterns for subdir)
  - New orchestrator resolve_checkpoint_path(path, accelerator=None) combines:
      * input validation (TypeError/ValueError on bad input),
      * `hf://`-prefix override + local-wins-first ordering,
      * subfolder support via parse_hf_checkpoint_path,
      * per-node multi-node gating (is_local_main_process + wait_for_everyone),
      * narrow re-raise of (RepositoryNotFoundError, HfHubHTTPError) as
        FileNotFoundError with full context.

Removed (from flow-opd-ours 0ad2636):
  - resolve_lora_dir + _HF_REPO_RE (regex parser without subfolder support).
  - The redundant resolve_lora_dir() call at the top of BaseAdapter._load_lora;
    BaseAdapter.load_checkpoint() now pre-resolves before dispatching, and
    load_lora_as_named_parameters() resolves before calling _load_lora.

Rewired call sites:
  - BaseAdapter.load_checkpoint  -> resolve_checkpoint_path(path, accelerator=self.accelerator)
  - utils/lora_loader.py::load_lora_as_named_parameters -> same
  - Deleted BaseAdapter._resolve_checkpoint_path method (its logic moved into
    the standalone resolve_checkpoint_path function so lora_loader can share it
    without holding an adapter reference).

Net: single source of truth for HF resolution, broader scope (LoRA + full +
training-state, with subfolder support), OPD teacher_paths preserved, and
constraint X-GenGroup#22 (top-level imports) introduced by the merge applies cleanly.

Co-authored-by: Cursor <cursoragent@cursor.com>
Jayce-Ping added a commit to Jayce-Ping/Flow-Factory-Private that referenced this pull request May 17, 2026
Three fixes from the Copilot review on the upstream PR:

1. Path-traversal validation (utils/checkpoint.py):
   parse_hf_checkpoint_path now rejects '.', '..', and backslash
   segments with an informative ValueError that preserves the original
   spec. Validates at the parser front door rather than at the
   download site so error messages keep the user's exact input.
   Without this, a spec like 'owner/repo/..' would escape the snapshot
   directory via os.path.join.

2. Un-gate _resolve_checkpoint_path (models/abc.py):
   Remove the `if is_local_main_process:` gate and the post-barrier
   double-call pattern. All ranks now call snapshot_download directly;
   huggingface_hub's per-blob WeakFileLock serializes concurrent calls
   within each filesystem domain (cross-node on POSIX-locking shared
   FS, per-node on non-shared FS), so we still get exactly one
   download per filesystem domain.

   This eliminates the distributed-deadlock hazard where a download
   failure on the gated rank would raise before reaching
   wait_for_everyone(), leaving siblings blocked at the barrier until
   NCCL watchdog timeout. The trailing wait_for_everyone() is kept to
   maintain lockstep entry into the downstream loaders.

   Residual asymmetric-failure risk (one rank's network blip while
   others succeed) is documented in the docstring.

3. Skill-checklist alignment (.agents/skills/ff-review/SKILL.md):
   Replace the duplicated import-exception list with a reference to
   constraint X-GenGroup#22, where the full set of three sanctioned exceptions
   (optional deps, backend-gated runtime feature checks, circular
   imports) lives. Prevents future drift between the two documents.

Verified with 8 happy-path + 5 original-error + 6 path-traversal
parser test cases (all pass).

Co-authored-by: Cursor <cursoragent@cursor.com>
Jayce-Ping added a commit that referenced this pull request May 17, 2026
… paths (#160)

* [adapter,hparams] feat: support resuming from Hugging Face checkpoint paths

Extend `BaseAdapter.load_checkpoint` to transparently resolve a Hugging
Face repo spec (either `hf://owner/repo[/subdir][@rev]` or bare
`owner/repo[/...]`) to a local cache directory via
`huggingface_hub.snapshot_download`, reusing the existing `lora` /
`full` / `state` loading branches unchanged.

Logic priority at `_resolve_checkpoint_path`:
  1. `hf://` prefix -> force HF (overrides any colliding local dir)
  2. Local path exists -> return as-is
  3. Otherwise parse as `owner/repo[/subfolder][@revision]` and download

Multi-node-safe: download gated on `is_local_main_process` (one per
node), not `is_main_process` (one global). On non-shared filesystems
each node populates its own HF cache once; on shared filesystems
huggingface_hub's per-blob `WeakFileLock` dedupes the concurrent
`snapshot_download` calls so only one node transfers bytes.

Fail-fast: narrow `except (RepositoryNotFoundError, HfHubHTTPError)`
re-raised as `FileNotFoundError` with full context (path, repo_id,
subfolder, revision, HF_TOKEN hint) for actionable error messages.

Also updates the `resume_path` help text and normalizes the inline
comment on all 59 example YAMLs to document the new HF support. No
new config fields; `resume_path: Optional[str]` keeps the same shape.

Note: pre-existing black/isort lint debt in the touched src files is
unrelated to this change.

Co-authored-by: Cursor <cursoragent@cursor.com>

* [adapter,docs] refactor: hoist HF imports to module top; codify rule

Move the two `huggingface_hub` imports added in the previous commit
out of function bodies and up to the module's import block:

- `from huggingface_hub import snapshot_download` in
  `src/flow_factory/utils/checkpoint.py`
- `from huggingface_hub.errors import RepositoryNotFoundError,
  HfHubHTTPError` in `src/flow_factory/models/abc.py`

`huggingface_hub` is already a hard dependency (`pyproject.toml:39`),
so there is no import-cost concern; lazy-loading only hid the
dependency surface from readers and `isort`.

Codify the rule so this pattern is caught in review going forward:
- Extend constraint #22 ("Import Style") in
  `.agents/knowledge/constraints.md` with an explicit "Top-level imports
  only" bullet, listing the three sanctioned exceptions (optional deps
  via `try/except ImportError`, backend-gated imports under runtime
  feature checks like DeepSpeed/FSDP, and unresolvable circular imports).
- Add a matching checklist item to the Code Style section of
  `.agents/skills/ff-review/SKILL.md`.

Pre-existing inline FSDP/DeepSpeed imports in `models/abc.py` (lines
~997-1152) are grandfathered under the new rule's exception (b).

Co-authored-by: Cursor <cursoragent@cursor.com>

* [adapter,hparams] fix: address Copilot review on PR #160

Three fixes from the Copilot review on the upstream PR:

1. Path-traversal validation (utils/checkpoint.py):
   parse_hf_checkpoint_path now rejects '.', '..', and backslash
   segments with an informative ValueError that preserves the original
   spec. Validates at the parser front door rather than at the
   download site so error messages keep the user's exact input.
   Without this, a spec like 'owner/repo/..' would escape the snapshot
   directory via os.path.join.

2. Un-gate _resolve_checkpoint_path (models/abc.py):
   Remove the `if is_local_main_process:` gate and the post-barrier
   double-call pattern. All ranks now call snapshot_download directly;
   huggingface_hub's per-blob WeakFileLock serializes concurrent calls
   within each filesystem domain (cross-node on POSIX-locking shared
   FS, per-node on non-shared FS), so we still get exactly one
   download per filesystem domain.

   This eliminates the distributed-deadlock hazard where a download
   failure on the gated rank would raise before reaching
   wait_for_everyone(), leaving siblings blocked at the barrier until
   NCCL watchdog timeout. The trailing wait_for_everyone() is kept to
   maintain lockstep entry into the downstream loaders.

   Residual asymmetric-failure risk (one rank's network blip while
   others succeed) is documented in the docstring.

3. Skill-checklist alignment (.agents/skills/ff-review/SKILL.md):
   Replace the duplicated import-exception list with a reference to
   constraint #22, where the full set of three sanctioned exceptions
   (optional deps, backend-gated runtime feature checks, circular
   imports) lives. Prevents future drift between the two documents.

Verified with 8 happy-path + 5 original-error + 6 path-traversal
parser test cases (all pass).

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Jayce-Ping added a commit to Jayce-Ping/Flow-Factory-Private that referenced this pull request May 17, 2026
Adopt main's `_resolve_checkpoint_path` adapter method for HF Hub checkpoint
resolution (drops the duplicate top-level `resolve_checkpoint_path` helper from
this branch). Picks up main's path-traversal security check in
`parse_hf_checkpoint_path`. Updates `ff-review` SKILL.md wording for
constraint X-GenGroup#22 to main's more complete phrasing.

Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant