Feature request / experience sharing: run_locally resume via name-based UUID matching #873

@Barabama

Problem

When a workflow containing dynamically generated jobs (e.g. from a Maker) fails mid-execution, rerunning it breaks downstream OutputReference objects. Maker.make() generates new UUIDs on each call, while stored jobs from the previous run have different UUIDs.

Minimal example with atomate2's DoubleRelaxMaker:

from atomate2.vasp.flows.core import DoubleRelaxMaker
from jobflow import Flow, run_locally

maker = DoubleRelaxMaker(...)

# First run: Relax2 fails
flow = Flow([maker.make(structure), downstream_static, downstream_nscf])
run_locally(flow)  # Relax1 ✓ → Relax2 ✗

# Rerun: new Maker call → new UUIDs → downstream references stale
flow = Flow([maker.make(structure), downstream_static, downstream_nscf])
run_locally(flow)  # ValueError: Could not resolve reference

Root cause: jobflow assigns UUIDs at Job creation time (inside Maker.make()), not at execution time. When the user reconstructs the same logical workflow, every Job gets a fresh UUID while downstream OutputReference objects still point to the old ones.
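The mechanism can be reproduced without jobflow. The following is a minimal sketch using only the standard library; `ToyJob`, `make_relax`, and the dict-based store are hypothetical stand-ins, not jobflow classes or APIs. It only illustrates why a reference captured from the first build no longer resolves after the job is rebuilt.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ToyJob:
    """Hypothetical stand-in for a jobflow Job: the UUID is fixed at creation."""

    name: str
    uuid: str = field(default_factory=lambda: str(uuid.uuid4()))


def make_relax(name: str = "relax 2") -> ToyJob:
    # Mimics Maker.make(): every call builds a new job with a new UUID.
    return ToyJob(name)


first = make_relax()
downstream_ref = first.uuid  # a downstream job captures this UUID once

store = {}  # toy store: {uuid: output}
rerun = make_relax()  # rerun: same logical job, fresh UUID
store[rerun.uuid] = {"energy": -1.0}  # the rebuilt job succeeds this time

same_logical_job = first.name == rerun.name
reference_resolves = downstream_ref in store
print(same_logical_job, reference_resolves)
```

In this toy model the two jobs share a name but not a UUID, so the lookup by the captured UUID fails even though the output exists in the store.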

Related discussions

I'm aware of existing threads on this or adjacent problems:

Proposed approach

Use job.name + job.index as a stable identifier to match jobs across reruns, then restore original UUIDs before execution. All logic is contained in a custom run_locally wrapper — no changes to jobflow core.

Three steps before starting execution:

  1. Match by name + index: call store.query_one({"name": job.name, "index": job.index}) to look up each job from a previous run, and build a {new_uuid: old_uuid} mapping.
  2. Update OutputReferences: recursively walk all jobs' function_args and function_kwargs, calling obj.set_uuid(old_uuid, inplace=True) on any OutputReference whose new uuid is in the mapping. This must happen before step 3.
  3. Restore UUIDs: call job.set_uuid(old_uuid) on matched jobs. Validate that stored outputs are actually loadable before marking a job as completed.
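The three steps above can be sketched with plain Python objects. This is not the linked implementation and does not use jobflow: `ToyRef`, `ToyJob`, `walk_refs`, and `restore_uuids` are hypothetical names, the previous run is modelled as a list of dicts, and "loadable output" is simplified to "the stored document has a non-empty output field".

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ToyRef:
    """Hypothetical stand-in for OutputReference (only the uuid matters here)."""

    uuid: str


@dataclass
class ToyJob:
    """Hypothetical stand-in for Job, with name, index, uuid and arguments."""

    name: str
    index: int = 1
    uuid: str = field(default_factory=lambda: str(uuid.uuid4()))
    args: list = field(default_factory=list)


def walk_refs(obj):
    """Yield every ToyRef nested inside lists, tuples and dicts."""
    if isinstance(obj, ToyRef):
        yield obj
    elif isinstance(obj, dict):
        for value in obj.values():
            yield from walk_refs(value)
    elif isinstance(obj, (list, tuple)):
        for item in obj:
            yield from walk_refs(item)


def restore_uuids(jobs, stored_docs):
    """Apply the three steps; stored_docs are dicts from a previous run."""
    by_key = {(d["name"], d["index"]): d for d in stored_docs}

    # Step 1: match by (name, index) and build {new_uuid: old_uuid}.
    mapping = {}
    for job in jobs:
        doc = by_key.get((job.name, job.index))
        if doc is not None:
            mapping[job.uuid] = doc["uuid"]

    # Step 2: repoint references (the proposal does this before step 3).
    for job in jobs:
        for ref in walk_refs(job.args):
            ref.uuid = mapping.get(ref.uuid, ref.uuid)

    # Step 3: restore the jobs' own UUIDs; only docs with output count as done.
    completed = set()
    for job in jobs:
        if job.uuid in mapping:
            job.uuid = mapping[job.uuid]
            if by_key[(job.name, job.index)].get("output") is not None:
                completed.add(job.uuid)
    return completed


relax = ToyJob("relax 1")
static = ToyJob("static", args=[{"structure": ToyRef(relax.uuid)}])
stored = [{"name": "relax 1", "index": 1, "uuid": "old-relax", "output": {"e": -1.0}}]
completed = restore_uuids([relax, static], stored)
```

After the call, the rebuilt relax job and the reference held by the static job both carry the stored UUID, while the unmatched static job keeps its fresh one.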

Then in the main loop, skip any job whose uuid is in completed_uuids and whose parents are all completed:

if job.uuid in completed_uuids and set(parents).issubset(completed_uuids):
    response = _load_completed_output(job, store)
    continue  # skip re-execution
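To make the skip condition concrete, here is a minimal sketch of such a loop over toy data. `run_with_resume`, the `parents` dict, and the `execute` callback are hypothetical and only model the control flow; the real loop operates on jobflow Job objects and a JobStore.

```python
def run_with_resume(job_uuids, parents, completed_uuids, store, execute):
    """Toy main loop: reuse stored output only if the job and all parents are done."""
    outputs = {}
    executed = []
    for job_uuid in job_uuids:
        parents_done = set(parents[job_uuid]).issubset(completed_uuids)
        if job_uuid in completed_uuids and parents_done:
            outputs[job_uuid] = store[job_uuid]  # reuse the stored output
            continue  # skip re-execution
        outputs[job_uuid] = execute(job_uuid)
        store[job_uuid] = outputs[job_uuid]
        executed.append(job_uuid)
    return outputs, executed


order = ["relax1", "relax2", "static"]
parents = {"relax1": [], "relax2": ["relax1"], "static": ["relax2"]}
# "static" has a stored output, but its parent "relax2" never completed.
completed = {"relax1", "static"}
store = {"relax1": "relax1-old", "static": "static-stale"}

outputs, executed = run_with_resume(
    order, parents, completed, store, execute=lambda u: f"{u}-new"
)
```

In this example relax1 is reused, relax2 runs because it never completed, and static runs again because its parent was not in the completed set. This sketch deliberately does not add re-executed jobs to `completed_uuids`, so a stale child output is never reused after its parent is rerun.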

Full implementation: https://gitee.com/wubo-movers/ht-vasp/blob/master/htvasp/utils/local.py
(I use Gitee because my HPC cluster cannot access GitHub.)

Real-world validation

Tested across three VASP workflows on a Slurm cluster, all with a resume=True default:

  • NSCF (DoubleRelax→Static→NSCF-DOS+NSCF-Band) — resumes from cluster failures (OOM, walltime) without re-running relaxations.
  • QHA (DoubleRelax(init)→DoubleRelax(eos)→EOS Deformations×N→Phonon→analyze_free_energy) — phantom frequencies at the final step are a common failure; resume preserves all prior compute.
  • Magnetic exchange (custom OJMaker: generate→flip_jobs×N→solve) — validates resume with user-defined dynamic Makers.

Known limitations

  • name + index stability: validated with per-structure JSONStore files (no shared store). May not hold if multiple workflows write to the same MongoDB — I haven't tested that.
  • Diversions (replace/detour/addition): currently rely on jobflow core's iterflow() for expansion, not handled explicitly in the resume logic.
  • Manager-level only: this only helps run_locally users. It complements, rather than replaces, core-level work like PR #850 ("introducing flow output").
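The shared-store concern in the first limitation can be illustrated with a toy query. This is an untested assumption about how collisions could arise, not observed behaviour of any real store: `shared_store` is a plain list, `query_one` is a hypothetical first-match lookup, and `structure_id` is an invented field used only to show how an extra criterion would disambiguate.

```python
def query_one(docs, criteria):
    """Return the first doc matching all criteria, or None (toy query)."""
    for doc in docs:
        if all(doc.get(key) == value for key, value in criteria.items()):
            return doc
    return None


# Two workflows for different structures wrote jobs with the same name and index.
shared_store = [
    {"name": "relax 1", "index": 1, "uuid": "A-relax", "structure_id": "Si"},
    {"name": "relax 1", "index": 1, "uuid": "B-relax", "structure_id": "Ge"},
]

# Matching on name + index alone returns whichever document comes first.
ambiguous = query_one(shared_store, {"name": "relax 1", "index": 1})

# Adding a workflow-specific criterion (hypothetical field) removes the ambiguity.
scoped = query_one(
    shared_store, {"name": "relax 1", "index": 1, "structure_id": "Ge"}
)
```

With per-structure JSON files each store holds only one workflow, which is why the name + index key is unambiguous in the tested setup.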

Questions

  1. Is a resume: bool parameter on run_locally (default False, backward-compatible) worth adding?
  2. How stable is name + index across different JobStore setups? My per-structure JSON files work — are there collision patterns in shared stores I should be aware of?
  3. If the core-level approach in PR #850 ("introducing flow output") is likely to land soon, I'd rather contribute tests or docs. If not, I'm happy to submit a PR for a manager-level resume.

Thanks for building jobflow — it's been essential for my research.
