Feature request / experience sharing: run_locally resume via name-based UUID matching
Problem
When a workflow containing dynamically generated jobs (e.g. from a Maker) fails mid-execution, rerunning it breaks downstream OutputReference objects. Maker.make() generates new UUIDs on each call, while stored jobs from the previous run have different UUIDs.
Minimal example with atomate2's DoubleRelaxMaker:
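```python
from atomate2.vasp.flows.core import DoubleRelaxMaker
from jobflow import Flow, run_locally

# structure, downstream_static and downstream_nscf are defined elsewhere
maker = DoubleRelaxMaker(...)

# First run: Relax2 fails
flow = Flow([maker.make(structure), downstream_static, downstream_nscf])
run_locally(flow)  # Relax1 ✓ → Relax2 ✗

# Rerun: new Maker call → new UUIDs → downstream references stale
flow = Flow([maker.make(structure), downstream_static, downstream_nscf])
run_locally(flow)  # ValueError: Could not resolve reference
```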
Root cause: jobflow assigns UUIDs at Job creation time (inside Maker.make()), not at execution time. When the user reconstructs the same logical workflow, every Job gets a fresh UUID while downstream OutputReference objects still point to the old ones.
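The mismatch can be reproduced without jobflow or VASP. The following is only a toy model of the root cause: ToyJob, make_relax2 and the dict used as a store are hypothetical stand-ins, not jobflow's Job, Maker or JobStore.

```python
from dataclasses import dataclass, field
from uuid import uuid4


@dataclass
class ToyJob:
    """Hypothetical stand-in for a jobflow Job: the UUID is fixed at creation."""

    name: str
    index: int = 1
    uuid: str = field(default_factory=lambda: str(uuid4()))


def make_relax2() -> ToyJob:
    """Plays the role of Maker.make(): every call builds a brand-new job."""
    return ToyJob(name="relax 2")


store = {}  # maps job UUID -> stored output

# First run: the job fails, so nothing is stored under its UUID.
old_job = make_relax2()
downstream_reference = old_job.uuid  # downstream jobs point at this UUID

# Rerun: the rebuilt job succeeds, but under a fresh UUID.
new_job = make_relax2()
store[new_job.uuid] = {"energy": -10.5}

same_logical_job = (old_job.name, old_job.index) == (new_job.name, new_job.index)
resolvable = downstream_reference in store
print(same_logical_job, resolvable)
```

The two jobs agree on name and index, yet the downstream reference cannot be resolved because the output was written under a different UUID.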
Related discussions
I'm aware of existing threads on this or adjacent problems:
store_inputs limitations in dynamic flows
Proposed approach
Use job.name + job.index as a stable identifier to match jobs across reruns, then restore original UUIDs before execution. All logic is contained in a custom run_locally wrapper, with no changes to jobflow core.
Three steps before starting execution:
1. Match by name + index: call store.query_one({"name": job.name, "index": job.index}) to find each job's counterpart from a previous run, building a {new_uuid: old_uuid} mapping.
2. Update OutputReferences: recursively walk every job's function_args and function_kwargs, calling obj.set_uuid(old_uuid, inplace=True) on any OutputReference whose new UUID is in the mapping. This must happen before step 3.
3. Restore UUIDs: call job.set_uuid(old_uuid) on matched jobs. Validate that stored outputs are actually loadable before marking a job as completed.
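The three steps can be sketched on plain Python objects. This is not the linked implementation: ToyJob, ToyRef, walk_refs and restore_uuids are hypothetical names, the stored documents are plain dicts rather than JobStore query results, and "loadable" is reduced to "an output is present".

```python
from dataclasses import dataclass, field
from uuid import uuid4


@dataclass
class ToyRef:
    """Hypothetical stand-in for an OutputReference."""

    uuid: str


@dataclass
class ToyJob:
    """Hypothetical stand-in for a jobflow Job."""

    name: str
    index: int = 1
    args: list = field(default_factory=list)
    uuid: str = field(default_factory=lambda: str(uuid4()))


def walk_refs(obj):
    """Yield every ToyRef nested inside lists, tuples and dicts."""
    if isinstance(obj, ToyRef):
        yield obj
    elif isinstance(obj, dict):
        for value in obj.values():
            yield from walk_refs(value)
    elif isinstance(obj, (list, tuple)):
        for item in obj:
            yield from walk_refs(item)


def restore_uuids(jobs, stored_docs):
    """Apply the three steps; return the UUIDs considered completed."""
    stored = {(d["name"], d["index"]): d for d in stored_docs}

    # Step 1: match by (name, index) and build {new_uuid: old_uuid}.
    mapping = {}
    for job in jobs:
        doc = stored.get((job.name, job.index))
        if doc is not None:
            mapping[job.uuid] = doc["uuid"]

    # Step 2: repoint references while they still carry the new UUIDs.
    for job in jobs:
        for ref in walk_refs(job.args):
            ref.uuid = mapping.get(ref.uuid, ref.uuid)

    # Step 3: restore job UUIDs; only jobs with a stored output count as done.
    completed = set()
    for job in jobs:
        old_uuid = mapping.get(job.uuid)
        if old_uuid is None:
            continue
        job.uuid = old_uuid
        if stored[(job.name, job.index)].get("output") is not None:
            completed.add(old_uuid)
    return completed


relax = ToyJob("relax 1")
static = ToyJob("static", args=[{"structure": ToyRef(relax.uuid)}])
docs = [{"name": "relax 1", "index": 1, "uuid": "old-relax", "output": {"energy": -1.0}}]
completed = restore_uuids([relax, static], docs)
```

After the call, the relax job and the reference held by the static job both carry the stored UUID, while the static job, which has no stored counterpart, keeps its fresh one.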
Then, in the main loop, skip any job whose UUID is in completed_uuids and whose parents are all completed.
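A minimal sketch of that skip rule, assuming the flow has already been flattened into an ordered list of UUIDs and a parent map. Both structures and the jobs_to_run helper are hypothetical and only illustrate the condition, not how jobflow iterates a flow.

```python
def jobs_to_run(jobs, parents, completed_uuids):
    """Return the UUIDs that still need to execute, in the given order.

    jobs: UUIDs in execution order (hypothetical flattened flow).
    parents: maps each UUID to the UUIDs it depends on.
    """
    to_run = []
    for job_uuid in jobs:
        parents_done = all(p in completed_uuids for p in parents.get(job_uuid, ()))
        if job_uuid in completed_uuids and parents_done:
            continue  # stored output is reused, nothing to execute
        to_run.append(job_uuid)
    return to_run


order = ["relax1", "relax2", "static", "nscf"]
parents = {"relax2": ["relax1"], "static": ["relax2"], "nscf": ["static"]}
print(jobs_to_run(order, parents, completed_uuids={"relax1"}))
# prints ['relax2', 'static', 'nscf']
```

Requiring completed parents means a job with a stored output is still re-run when something upstream of it has to be recomputed.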
Real-world validation
Tested across three VASP workflows on a Slurm cluster, all with a resume=True default:
NSCF (DoubleRelax→Static→NSCF-DOS+NSCF-Band) — resumes from cluster failures (OOM, walltime) without re-running relaxations.
QHA (DoubleRelax(init)→DoubleRelax(eos)→EOS Deformations×N→Phonon→analyze_free_energy) — phantom frequencies at the final step are a common failure; resume preserves all prior compute.
Magnetic exchange (custom OJMaker: generate→flip_jobs×N→solve) — validates resume with user-defined dynamic Makers.
Known limitations
name + index stability: validated with per-structure JSONStore files (no shared store). May not hold if multiple workflows write to the same MongoDB — I haven't tested that.
Diversions (replace/detour/addition): currently rely on jobflow core's iterflow() for expansion, not handled explicitly in the resume logic.
Manager-level only: this only helps run_locally users. It complements, rather than replaces, core-level work such as PR #850 ("introducing flow output").
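For the first limitation, one possible guard (not part of the implementation described here, and untested against a shared MongoDB store) is to check whether any name + index pair matches more than one stored document before trusting the match. The ambiguous_keys helper and the dict-shaped documents are hypothetical.

```python
from collections import Counter


def ambiguous_keys(stored_docs):
    """Return the (name, index) pairs that occur in more than one stored document."""
    counts = Counter((doc["name"], doc["index"]) for doc in stored_docs)
    return sorted(key for key, n in counts.items() if n > 1)


shared_store_docs = [
    {"name": "relax 1", "index": 1, "uuid": "workflow-a"},
    {"name": "relax 1", "index": 1, "uuid": "workflow-b"},
    {"name": "static", "index": 1, "uuid": "workflow-a-static"},
]
collisions = ambiguous_keys(shared_store_docs)
```

A resume wrapper could fall back to a fresh run, or ask for an extra disambiguating key, whenever this list is non-empty.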
Questions
Is a resume: bool parameter on run_locally (default False, backward-compatible) worth adding?
How stable is name + index across different JobStore setups? My per-structure JSON files work — are there collision patterns in shared stores I should be aware of?
If the core-level approach in PR #850 ("introducing flow output") is likely to land soon, I'd rather contribute tests or docs. If not, I'm happy to submit a PR for a manager-level resume.
Thanks for building jobflow — it's been essential for my research.
Full implementation: https://gitee.com/wubo-movers/ht-vasp/blob/master/htvasp/utils/local.py
(I use Gitee because my HPC cluster cannot access GitHub.)