Description
(FYI to @leftwo and @jmpesp, whom I chatted with about this scenario)
Suppose we have the following saga DAG:
1. Action: Record "new resource" in a database, with state "creating". Undo: Delete "new resource" from the database.
2. Action: Make a request to an external service to provision the resource. Undo: Make a request to the external service to delete said resource.
3. Action: Record "new resource" in the database as "created". (No undo action)
In this example saga graph, we can happily move "forward" and "backward" through saga states, but we can end up in an awkward state if saga execution fails while we are communicating with the external service.
Suppose we do the following:
- Action for (1) (record the new resource in the DB)
- Action for (2) (we send the request to the external service, but do not yet write the result to the saga log)
- Crash
- Action for (2) (we send the request to the external service, but this time, suppose the external service is not responding)
In this scenario, if we simply perform the undo action for node 1, there is a chance we're leaking resources: the first request in (2) may have succeeded on the external service even though we never recorded its result. Concretely, let's suppose the "external service request" would be to provision storage from Crucible, or to provision an instance on a sled. If we simply delete our database record, the resource remains consumed, but Nexus no longer knows about it. In this particular case, it may be more correct to record that the provisioned resources exist, but are in a "failed" state.
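To make the leak concrete, here is a minimal sketch that replays the sequence above against in-memory stand-ins. Everything in it (`ExternalService`, `Database`, `replay_scenario`) is hypothetical and invented for illustration; it does not use the saga framework's API or any real Nexus, Crucible, or sled-agent code.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical stand-in for the external service (e.g. Crucible or a sled).
struct ExternalService {
    provisioned: HashSet<u32>,
    reachable: bool,
}

impl ExternalService {
    /// Idempotent provision request; fails if the service is not responding.
    fn provision(&mut self, id: u32) -> Result<(), &'static str> {
        if !self.reachable {
            return Err("service not responding");
        }
        self.provisioned.insert(id);
        Ok(())
    }
}

/// Hypothetical stand-in for the database holding resource records.
struct Database {
    resources: HashMap<u32, &'static str>,
}

/// Replays the failure sequence and returns the ids that the external
/// service still holds but the database no longer tracks.
fn replay_scenario() -> Vec<u32> {
    let mut svc = ExternalService { provisioned: HashSet::new(), reachable: true };
    let mut db = Database { resources: HashMap::new() };
    let id = 1;

    // Action for (1): record the new resource as "creating".
    db.resources.insert(id, "creating");

    // Action for (2): the request succeeds remotely, but we crash before
    // the result is written to the saga log.
    svc.provision(id).expect("service is reachable on the first attempt");

    // Crash. On recovery the log shows (2) as unfinished, so it is re-run,
    // but this time the service is not responding.
    svc.reachable = false;
    let retry = svc.provision(id);

    if retry.is_err() {
        // Naive unwind: (2) is treated as failed, so only the undo action
        // for node 1 runs and the database record is deleted.
        db.resources.remove(&id);
    }

    let mut leaked: Vec<u32> = svc
        .provisioned
        .iter()
        .copied()
        .filter(|r| !db.resources.contains_key(r))
        .collect();
    leaked.sort();
    leaked
}
```

In this model, `replay_scenario()` reports resource `1` as still provisioned on the service with no matching database record, which is the leak described above.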
Proposal: I think we need a way of identifying a "different" error pathway for actions, allowing us to distinguish between "clean errors" and "fatal errors". Just as we record the results of successful actions, it seems equally useful to record the results of unsuccessful actions, so that we know how to treat the database records associated with resources: either delete them or mark their state as dirty.
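As a rough sketch of what this could look like, the snippet below models a recorded action failure with two variants and derives the record cleanup from it. The names `ActionFailure`, `RecordCleanup`, and `cleanup_for` are assumptions made up for this illustration. They are not part of the existing saga API, and the real design may well look different.

```rust
/// Hypothetical failure result that would be written to the saga log,
/// alongside the results of successful actions.
#[derive(Debug, Clone, PartialEq)]
enum ActionFailure {
    /// The action is known to have had no external effect
    /// (e.g. the service rejected the request outright).
    Clean(String),
    /// The outcome is unknown or partial (e.g. the service stopped
    /// responding after a request may already have been applied).
    Fatal(String),
}

/// What to do with the database record associated with the resource.
#[derive(Debug, PartialEq)]
enum RecordCleanup {
    /// Safe to delete: nothing exists outside the database.
    Delete,
    /// Keep the record, but mark the resource as "failed" (dirty) so the
    /// possibly-provisioned resource stays visible.
    MarkFailed,
}

/// Chooses the cleanup for a record based on the recorded failure kind.
fn cleanup_for(failure: &ActionFailure) -> RecordCleanup {
    match failure {
        ActionFailure::Clean(_) => RecordCleanup::Delete,
        ActionFailure::Fatal(_) => RecordCleanup::MarkFailed,
    }
}
```

With something like this, the unresponsive-service case above would be recorded as `ActionFailure::Fatal`, and unwinding would mark the record as failed instead of deleting it.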