Feature request / experience sharing: `run_locally` resume via name-based UUID matching

### Problem

When a workflow containing dynamically generated jobs (e.g. from a `Maker`) fails mid-execution, rerunning it breaks downstream `OutputReference` objects. `Maker.make()` generates **new UUIDs** on each call, while stored jobs from the previous run have different UUIDs.

**Minimal example** with `atomate2`'s `DoubleRelaxMaker`:

```python
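from atomate2.vasp.flows.core import DoubleRelaxMaker
from jobflow import Flow, run_locally

# `structure`, `downstream_static` and `downstream_nscf` are defined elsewhere.
maker = DoubleRelaxMaker(...)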

# First run: Relax2 fails
flow = Flow([maker.make(structure), downstream_static, downstream_nscf])
run_locally(flow)  # Relax1 ✓ → Relax2 ✗

# Rerun: new Maker call → new UUIDs → downstream references stale
flow = Flow([maker.make(structure), downstream_static, downstream_nscf])
run_locally(flow)  # ValueError: Could not resolve reference
```

Root cause: jobflow assigns UUIDs at Job creation time (inside `Maker.make()`), not at execution time. When the user reconstructs the same logical workflow, every Job gets a fresh UUID while downstream `OutputReference` objects still point to the old ones.
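The effect can be reproduced without jobflow. The sketch below is only an illustration: `ToyJob` and `make_relax` are hypothetical stand-ins (not jobflow's `Job` or `Maker.make()`), written under the assumption that the only relevant behavior is "a UUID is drawn when the object is constructed".

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ToyJob:
    """Hypothetical stand-in for a Job: the UUID is drawn at construction."""

    name: str
    index: int = 1
    uuid: str = field(default_factory=lambda: str(uuid.uuid4()))


def make_relax() -> ToyJob:
    """Stand-in for Maker.make(): every call builds a brand-new job object."""
    return ToyJob(name="relax 1")


first = make_relax()
reference_uuid = first.uuid  # what a downstream reference would hold

second = make_relax()  # the same logical job, rebuilt for the rerun

same_identity = (first.name, first.index) == (second.name, second.index)
same_uuid = first.uuid == second.uuid
print(same_identity, same_uuid)  # True False
```

The name and index survive the rebuild while the UUID does not, which is what motivates matching on `name + index`.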

### Related discussions

I'm aware of existing threads on this or adjacent problems:

- [materialsproject/jobflow#842](https://github.com/materialsproject/jobflow/issues/842) — `store_inputs` limitations in dynamic flows
- [materialsproject/jobflow#374](https://github.com/materialsproject/jobflow/issues/374) — insufficient info to recover parent-child relationships
- [materialsproject/jobflow#519](https://github.com/materialsproject/jobflow/issues/519) — design discussion on UUIDs (rerun acknowledged as a pain point)
- [Matgenix/jobflow-remote#139](https://github.com/Matgenix/jobflow-remote/issues/139) — same UUID breakage in distributed context
- [PR #850](https://github.com/materialsproject/jobflow/pull/850) — Flow output at the core level (draft)

### Proposed approach

**Use `job.name` + `job.index` as a stable identifier** to match jobs across reruns, then restore original UUIDs before execution. All logic is contained in a custom `run_locally` wrapper — no changes to jobflow core.

Three steps before starting execution:

1. **Match by name + index**: query `store.query_one({"name": job.name, "index": job.index})` to find jobs from a previous run, building a `{new_uuid: old_uuid}` mapping.
2. **Update OutputReferences**: recursively walk all jobs' `function_args` and `function_kwargs`, calling `obj.set_uuid(old_uuid, inplace=True)` on any `OutputReference` whose new uuid is in the mapping. This must happen *before* step 3.
3. **Restore UUIDs**: call `job.set_uuid(old_uuid)` on matched jobs. Validate that stored outputs are actually loadable before marking a job as completed.
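The three steps can be sketched in plain Python as follows. This is a simplified model, not the linked implementation: `ToyRef`, `ToyJob`, `build_uuid_map`, `update_refs` and `restore_uuids` are hypothetical names, the store is modeled as a list of dicts, and attribute assignment stands in for the `set_uuid` calls. Validation of stored outputs is left out.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToyRef:
    """Stand-in for OutputReference: remembers which job UUID it points at."""

    uuid: str


@dataclass
class ToyJob:
    """Stand-in for Job, with arguments that may contain nested references."""

    name: str
    index: int = 1
    args: list[Any] = field(default_factory=list)
    uuid: str = field(default_factory=lambda: str(uuid.uuid4()))


def build_uuid_map(jobs, stored_docs):
    """Step 1: map new_uuid -> old_uuid for jobs whose (name, index) is stored."""
    by_key = {(d["name"], d["index"]): d["uuid"] for d in stored_docs}
    return {
        j.uuid: by_key[(j.name, j.index)]
        for j in jobs
        if (j.name, j.index) in by_key
    }


def update_refs(obj, mapping):
    """Step 2: recursively retarget references nested in lists, tuples, dicts."""
    if isinstance(obj, ToyRef):
        obj.uuid = mapping.get(obj.uuid, obj.uuid)
    elif isinstance(obj, dict):
        for value in obj.values():
            update_refs(value, mapping)
    elif isinstance(obj, (list, tuple)):
        for value in obj:
            update_refs(value, mapping)


def restore_uuids(jobs, mapping):
    """Step 3: give matched jobs their original UUIDs back."""
    for j in jobs:
        j.uuid = mapping.get(j.uuid, j.uuid)


# One document left over from the previous run (made-up values).
stored = [{"name": "relax 1", "index": 1, "uuid": "old-relax-1"}]

# The rebuilt workflow: fresh UUIDs, and a reference nested inside a dict.
relax = ToyJob("relax 1")
static = ToyJob("static", args=[{"structure": ToyRef(relax.uuid)}])
jobs = [relax, static]

mapping = build_uuid_map(jobs, stored)
for j in jobs:
    update_refs(j.args, mapping)  # step 2, run before step 3 as in the list
restore_uuids(jobs, mapping)

print(relax.uuid)                         # old-relax-1
print(static.args[0]["structure"].uuid)   # old-relax-1
```

Only `relax` is matched, so `static` keeps its fresh UUID but its nested reference now points at the stored job.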

Then in the main loop, skip any job whose uuid is in `completed_uuids` **and** whose parents are all completed:

```python
if job.uuid in completed_uuids and set(parents).issubset(completed_uuids):
    response = _load_completed_output(job, store)
    continue  # skip re-execution
```
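To make the effect of the second half of that condition concrete, here is a small self-contained sketch. `split_jobs` is a hypothetical helper, jobs are reduced to plain string IDs, and `completed_uuids` is treated as fixed for the whole pass (an assumption made for brevity; a real loop may update it as jobs finish).

```python
def split_jobs(order, parents_of, completed_uuids):
    """Return (skipped, to_run) using the same condition as the snippet above."""
    skipped, to_run = [], []
    for job_uuid in order:
        parents = parents_of.get(job_uuid, [])
        if job_uuid in completed_uuids and set(parents).issubset(completed_uuids):
            skipped.append(job_uuid)
        else:
            to_run.append(job_uuid)
    return skipped, to_run


order = ["relax1", "relax2", "static", "nscf"]
parents_of = {"relax2": ["relax1"], "static": ["relax2"], "nscf": ["static"]}

# Hypothetical state: relax1 finished, relax2 failed, and a stale "static"
# record exists from an older attempt.
completed = {"relax1", "static"}

skipped, to_run = split_jobs(order, parents_of, completed)
print(skipped)  # ['relax1']
print(to_run)   # ['relax2', 'static', 'nscf']
```

The stale `static` record is not reused because its parent `relax2` is not completed, so a job is never skipped on top of inputs that are about to be recomputed.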

Full implementation: https://gitee.com/wubo-movers/ht-vasp/blob/master/htvasp/utils/local.py  
(I use Gitee because my HPC cluster cannot access GitHub.)

### Real-world validation

Tested across three VASP workflows on a Slurm cluster, all with a `resume=True` default:

- **NSCF** (`DoubleRelax→Static→NSCF-DOS+NSCF-Band`): resumes after cluster failures (OOM, walltime) without re-running relaxations.
- **QHA** (`DoubleRelax(init)→DoubleRelax(eos)→EOS Deformations×N→Phonon→analyze_free_energy`): imaginary phonon frequencies at the final step are a common failure; resume preserves all prior compute.
- **Magnetic exchange** (custom `OJMaker`: `generate→flip_jobs×N→solve`): validates resume with user-defined dynamic `Maker`s.

### Known limitations

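- **`name + index` stability**: validated only with per-structure `JSONStore` files (no shared store). It may not hold if multiple workflows write to the same MongoDB, where a `query_one` on name and index alone could return a same-named job from an unrelated workflow. I haven't tested that.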
- **Diversions** (`replace`/`detour`/`addition`): currently rely on jobflow core's `iterflow()` for expansion, not handled explicitly in the resume logic.
- **Manager-level only**: this only helps `run_locally` users. It complements, rather than replaces, core-level work like PR #850.
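The first limitation can be illustrated with a toy store. In the sketch below the store is a plain list of dicts and `match_previous` is a hypothetical helper. Refusing to match when several candidates exist is an assumption of this sketch, not something the linked implementation is known to do.

```python
def match_previous(docs, name, index):
    """Return the stored UUID for (name, index), or None if absent or ambiguous."""
    candidates = {
        d["uuid"] for d in docs if d["name"] == name and d["index"] == index
    }
    if len(candidates) == 1:
        return next(iter(candidates))
    return None  # no match, or several workflows share this name + index


# One store per structure: the key is unique (made-up UUIDs).
per_structure_store = [{"name": "relax 1", "index": 1, "uuid": "si-relax"}]

# A shared store: a second workflow wrote a job with the same name and index.
shared_store = per_structure_store + [
    {"name": "relax 1", "index": 1, "uuid": "ge-relax"}
]

print(match_previous(per_structure_store, "relax 1", 1))  # si-relax
print(match_previous(shared_store, "relax 1", 1))         # None
```

In the shared case, a first-match query would silently pick one of the two records, which is the collision pattern the limitation is about.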

### Questions

1. Is a `resume: bool` parameter on `run_locally` (default `False`, backward-compatible) worth adding?
2. How stable is `name + index` across different JobStore setups? My per-structure JSON files work — are there collision patterns in shared stores I should be aware of?
3. If the core-level approach in PR #850 is likely to land soon, I'd rather contribute tests or docs. If not, I'm happy to submit a PR for a manager-level `resume`.

Thanks for building jobflow — it's been essential for my research.

