I want to open a discussion about the potential limitations of the store_inputs Job that is added at the end of replace Flows. I think there are two main problems at the moment. I have a partial solution for a subset of the cases, but I thought it would still be worth opening a discussion to keep track of the remaining issues (if the PR is acceptable) or to find alternative solutions.
Problem 1: Multiple replace
If the replacement in a dynamic Flow is a single Job, the replacing Job inherits the uuid of the original Job, with the index increased by 1. This works nicely even in cases where the replace happens multiple times.
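The index bump described above can be pictured with a toy store. This is a minimal sketch in plain Python, not jobflow code: `ToyStore`, `put`, and `get_latest` are hypothetical names, and the only assumption is that outputs are keyed by (uuid, index) and that a lookup by uuid returns the entry with the highest index.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStore:
    """Hypothetical stand-in for an output store keyed by (uuid, index)."""

    docs: dict = field(default_factory=dict)

    def put(self, uuid, index, output):
        self.docs[(uuid, index)] = output

    def get_latest(self, uuid):
        """Return (index, output) for the highest index stored under uuid."""
        indices = [i for (u, i) in self.docs if u == uuid]
        if not indices:
            raise KeyError(uuid)
        latest = max(indices)
        return latest, self.docs[(uuid, latest)]


store = ToyStore()
store.put("job-A", 1, None)  # original Job: it was replaced, so no final value
store.put("job-A", 2, None)  # first replacement: replaced again
store.put("job-A", 3, 42)    # second replacement: produces the final value

latest_index, value = store.get_latest("job-A")
print(latest_index, value)  # 3 42
```

In this picture a single lookup by uuid is enough, no matter how many times the Job was replaced.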
In contrast, if the Job is replaced by a Flow, a store_inputs Job is added whenever the Flow has an output. If many replacements take place, the Flow ends up with a concatenation of many store_inputs Jobs. For example, consider the code below:
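And this is the Flow structure (from jobflow remote):
flowchart TD
    classDef WAITING fill:#aaaaaa
    classDef READY fill:#DAF7A6
    classDef CHECKED_OUT fill:#5E6BFF
    classDef UPLOADED fill:#5E6BFF
    classDef SUBMITTED fill:#5E6BFF
    classDef RUNNING fill:#5E6BFF
    classDef RUN_FINISHED fill:#5E6BFF
    classDef DOWNLOADED fill:#5E6BFF
    classDef REMOTE_ERROR fill:#fC3737
    classDef COMPLETED fill:#47bf00
    classDef FAILED fill:#fC3737
    classDef PAUSED fill:#EAE200
    classDef STOPPED fill:#fC3737
    classDef USER_STOPPED fill:#fC3737
    classDef BATCH_SUBMITTED fill:#5E6BFF
    classDef BATCH_RUNNING fill:#5E6BFF
    849(add) --> 850(check_add)
    846(add) --> 849(add)
    846(add) --> 847(check_add)
    850(check_add) --> 851(store_inputs)
    840(add) --> 841(check_add)
    844(check_add) --> 845(store_inputs)
    848(store_inputs) --> 845(store_inputs)
    832(add) --> 840(add)
    840(add) --> 843(add)
    843(add) --> 846(add)
    832(add) --> 833(check_add)
    841(check_add) --> 842(store_inputs)
    845(store_inputs) --> 842(store_inputs)
    843(add) --> 844(check_add)
    847(check_add) --> 848(store_inputs)
    851(store_inputs) --> 848(store_inputs)
    833(check_add) --> 834(add)
    842(store_inputs) --> 834(add)
    847(check_add) -.-> aba8cfa6-a10b-47e0-8ce8-6d0da72a891f
    841(check_add) -.-> ebdb4040-8654-4a43-ad53-69f96189d274
    833(check_add) -.-> b34eb9b5-dfe2-4175-81eb-acc65892d048
    844(check_add) -.-> eafacaa5-e18e-41da-82f8-7a344ed146d6
    832:::COMPLETED
    833:::COMPLETED
    834:::COMPLETED
    subgraph b34eb9b5-dfe2-4175-81eb-acc65892d048[ ]
        840:::COMPLETED
        841:::COMPLETED
        842:::COMPLETED
        subgraph ebdb4040-8654-4a43-ad53-69f96189d274[ ]
            843:::COMPLETED
            844:::COMPLETED
            845:::COMPLETED
            subgraph eafacaa5-e18e-41da-82f8-7a344ed146d6[ ]
                846:::COMPLETED
                847:::COMPLETED
                848:::COMPLETED
                subgraph aba8cfa6-a10b-47e0-8ce8-6d0da72a891f[ ]
                    849:::COMPLETED
                    850:::COMPLETED
                    851:::COMPLETED
                end
            end
        end
    end
    style b34eb9b5-dfe2-4175-81eb-acc65892d048 fill:#2B65EC,opacity:0.2
    style ebdb4040-8654-4a43-ad53-69f96189d274 fill:#2B65EC,opacity:0.2
    style eafacaa5-e18e-41da-82f8-7a344ed146d6 fill:#2B65EC,opacity:0.2
    style aba8cfa6-a10b-47e0-8ce8-6d0da72a891f fill:#2B65EC,opacity:0.2
If this happens, there are a series of negative consequences: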
- The nice feature of the single-Job replacement, where the last Job has the highest index, is lost.
- A bunch of pointless Jobs are executed and their outputs stored (this may be more annoying when using jobflow-remote/fireworks, where every job needs to be submitted).
- When the last add Job (job3) runs, jobflow needs to recursively get the outputs of all the store_inputs Jobs to resolve the references before getting to the actual value that needs to be fetched. This results in many pointless (although small) queries to the JobStore.
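The cost of the last point can be made concrete with a toy model. This is a sketch in plain Python, not jobflow's implementation: `Ref`, `CountingStore`, `build_chain`, and `resolve` are hypothetical names. It assumes only that each store_inputs output is itself a reference to the next Job in the chain, so that resolving the outermost reference takes one store query per link.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in for an output reference to another job's uuid."""

    uuid: str


class CountingStore:
    """Toy store that counts how many times an output is fetched."""

    def __init__(self, outputs):
        self.outputs = outputs
        self.queries = 0

    def get_output(self, uuid):
        self.queries += 1
        return self.outputs[uuid]


def build_chain(depth, final_value):
    """uuid-0 -> uuid-1 -> ... -> uuid-<depth>, which holds the real value."""
    outputs = {f"uuid-{depth}": final_value}
    for level in range(depth):
        # Each store_inputs-like entry only stores a reference to the next one.
        outputs[f"uuid-{level}"] = Ref(f"uuid-{level + 1}")
    return outputs


def resolve(ref, store):
    """Follow references until a concrete value is reached."""
    value = store.get_output(ref.uuid)
    while isinstance(value, Ref):
        value = store.get_output(value.uuid)
    return value


# Four nested replacements, as in the Flow structure shown above.
store = CountingStore(build_chain(depth=4, final_value=10))
value = resolve(Ref("uuid-0"), store)
print(value, store.queries)  # 10 5
```

With four chained store_inputs-like entries, the toy needs five lookups to reach a single value; with single-Job replacement one lookup would be enough.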
Problem 2: Failed jobs before store_inputs
Consider a case where a simple Flow with two steps replaces a single Job, with no recursive replacement. If the last Job of the replacing Flow fails, the store_inputs job is still executed normally by all managers, since it has on_missing_references=OnMissing.NONE. However, if there is a subsequent job that has on_missing_references=OnMissing.ERROR and depends on the output of the Flow, that job has no way of knowing that the real Reference is missing.
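The mechanism can be sketched with a toy model. This is plain Python, not jobflow code: the `OnMissing` enum here is a local two-value copy of the policies named above, and `Ref`, `resolve`, and `parent_finished` are hypothetical. It assumes that store_inputs stores an unresolved reference as its output, and that a manager only checks whether the direct parent produced an output.

```python
from dataclasses import dataclass
from enum import Enum


class OnMissing(Enum):
    """Toy copy of the two policies named in the text."""

    ERROR = "error"
    NONE = "none"


@dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in for an output reference to another job's uuid."""

    uuid: str


def resolve(value, outputs, on_missing):
    """Follow references until a concrete value is found."""
    while isinstance(value, Ref):
        if value.uuid not in outputs:
            if on_missing is OnMissing.ERROR:
                raise ValueError(
                    f"Could not resolve reference - {value.uuid} not in store"
                )
            return None
        value = outputs[value.uuid]
    return value


def parent_finished(ref, outputs):
    """Naive readiness check (an assumption): looks at the direct parent only."""
    return ref.uuid in outputs


# The "fail" Job raised, so it stored nothing. The store_inputs-like step
# still ran and stored the unresolved reference under the replaced Job's uuid.
outputs = {"replaced-job": Ref("fail")}

downstream_input = Ref("replaced-job")
ready = parent_finished(downstream_input, outputs)

error = None
try:
    resolve(downstream_input, outputs, OnMissing.ERROR)
except ValueError as exc:
    error = str(exc)

print(ready)  # True
print(error)  # Could not resolve reference - fail not in store
```

In the toy, the downstream job looks ready because its direct parent has an output, and the missing value only surfaces deep inside reference resolution.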
Here is an example:
2026-01-21 16:10:15,994 INFO Started executing jobs locally
2026-01-21 16:10:16,253 INFO Starting job - generate (27068f05-5a17-49a5-bdd5-2cca8867e2da)
2026-01-21 16:10:16,309 INFO Finished job - generate (27068f05-5a17-49a5-bdd5-2cca8867e2da)
2026-01-21 16:10:16,310 INFO Starting job - add (246af488-3376-421f-a414-e685eca3f5f6)
2026-01-21 16:10:16,311 INFO Finished job - add (246af488-3376-421f-a414-e685eca3f5f6)
2026-01-21 16:10:16,311 INFO Starting job - fail (11d0d634-6569-4678-a30c-1e7dc18a84b7)
2026-01-21 16:10:16,316 INFO fail failed with exception:
Traceback (most recent call last):
File "./jobflow/src/jobflow/managers/local.py", line 117, in _run_job
response = job.run(store=store)
^^^^^^^^^^^^^^^^^^^^
File ".../jobflow/src/jobflow/core/job.py", line 604, in run
response = function(*self.function_args, **self.function_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".../jobflow/store_inputs_failure/store_inputs_failure.py", line 13, in fail
raise RuntimeError("Expected failure!")
RuntimeError: Expected failure!
2026-01-21 16:10:16,317 INFO Starting job - store_inputs (27068f05-5a17-49a5-bdd5-2cca8867e2da, 2)
2026-01-21 16:10:16,317 INFO Finished job - store_inputs (27068f05-5a17-49a5-bdd5-2cca8867e2da, 2)
2026-01-21 16:10:16,317 INFO Starting job - add (db902382-9455-4151-bbad-73f7380b7034)
2026-01-21 16:10:16,320 INFO add failed with exception:
Traceback (most recent call last):
File ".../jobflow/src/jobflow/managers/local.py", line 117, in _run_job
response = job.run(store=store)
^^^^^^^^^^^^^^^^^^^^
File ".../jobflow/src/jobflow/core/job.py", line 593, in run
self.resolve_args(store=store)
File ".../jobflow/src/jobflow/core/job.py", line 703, in resolve_args
resolved_args = find_and_resolve_references(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".../jobflow/src/jobflow/core/reference.py", line 473, in find_and_resolve_references
resolved_references = resolve_references(
^^^^^^^^^^^^^^^^^^^
File ".../jobflow/src/jobflow/core/reference.py", line 356, in resolve_references
cache[uuid][index] = store.get_output(
^^^^^^^^^^^^^^^^^
File ".../jobflow/src/jobflow/core/store.py", line 523, in get_output
return find_and_resolve_references(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".../jobflow/src/jobflow/core/reference.py", line 452, in find_and_resolve_references
return arg.resolve(
^^^^^^^^^^^^
File ".../jobflow/src/jobflow/core/reference.py", line 166, in resolve
raise ValueError(
ValueError: Could not resolve reference - 11d0d634-6569-4678-a30c-1e7dc18a84b7 not in store or index=None, cache={'11d0d634-6569-4678-a30c-1e7dc18a84b7': {}}
2026-01-21 16:10:16,320 INFO Finished executing jobs locally
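And this is the final Flow structure (again from jobflow remote):
flowchart TD
    classDef WAITING fill:#aaaaaa
    classDef READY fill:#DAF7A6
    classDef CHECKED_OUT fill:#5E6BFF
    classDef UPLOADED fill:#5E6BFF
    classDef SUBMITTED fill:#5E6BFF
    classDef RUNNING fill:#5E6BFF
    classDef RUN_FINISHED fill:#5E6BFF
    classDef DOWNLOADED fill:#5E6BFF
    classDef REMOTE_ERROR fill:#fC3737
    classDef COMPLETED fill:#47bf00
    classDef FAILED fill:#fC3737
    classDef PAUSED fill:#EAE200
    classDef STOPPED fill:#fC3737
    classDef USER_STOPPED fill:#fC3737
    classDef BATCH_SUBMITTED fill:#5E6BFF
    classDef BATCH_RUNNING fill:#5E6BFF
    837(add) --> 838(fail)
    835(generate) --> 836(add)
    839(store_inputs) --> 836(add)
    838(fail) --> 839(store_inputs)
    835(generate) -.-> 0eca4ab8-4193-481e-9616-925436e8aacc
    subgraph 0eca4ab8-4193-481e-9616-925436e8aacc[ ]
        837:::COMPLETED
        838:::FAILED
        839:::COMPLETED
    end
    835:::COMPLETED
    836:::REMOTE_ERROR
    style 0eca4ab8-4193-481e-9616-925436e8aacc fill:#2B65EC,opacity:0.2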
The problem is that the last Job fails with an unexpected and unclear error, while it should not have been executed in the first place. Note that this is true for both run_locally and jobflow-remote (I did not test, but expect it will be the same for fireworks).
I can probably find some hackish workaround in jobflow-remote to prevent the final Job from becoming READY, but I think it would be preferable if this could be solved uniformly for all the managers directly in jobflow, although I am not sure whether this is possible.
Conclusion
I will propose a potential solution for a subset of the cases where store_inputs is added, but that does not address the problem in general. A complete solution may require rethinking the store_inputs mechanism more deeply.
Discussing with @davidwaroquiers, we considered the option of having a Flow output stored in the DB, which could probably solve part of the store_inputs problem. However, I believe it may not be trivial to implement this in a way that differs from what store_inputs does while also avoiding storing redundant data.
In general I would like to hear about the experience of the other users and check if anyone has some ideas on how to improve this.