---
description: How Conductor guarantees durable code execution for distributed workflows — what persists at every step, at-least-once task delivery, saga pattern compensation, failure matrix, task state transitions, retry logic with exponential backoff, and distributed consistency. The open source distributed workflow engine built for reliability.
---

# Durable Execution Semantics

Conductor is a durable execution engine for distributed workflows and durable agents. Every workflow execution is persisted at every step, survives infrastructure failures, and guarantees at-least-once task delivery. This durable execution model means your workflows and agents never lose progress. This page defines exactly what that means.

## What persists

When a workflow executes, Conductor persists:

- The **workflow definition snapshot** used for this execution (immutable after start).
- The **workflow state**: status, input, output, correlation ID, and variables.
- Every **task execution**: status, input, output, timestamps, retry count, and worker ID.
- The **task queue state**: which tasks are scheduled, in progress, or completed.

All state is written to the configured persistence store (Redis, PostgreSQL, MySQL, or Cassandra) before the next step proceeds. If the server restarts, execution resumes from the last persisted state.
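
The persist-before-advance rule can be illustrated with a toy write-ahead journal: each step's output is written durably before the engine moves on, so a restart resumes from the journal instead of starting over. This is an illustrative sketch under simplified assumptions, not Conductor's actual storage code; the `Journal` class and step functions are hypothetical.

```python
import json
import os
import tempfile

class Journal:
    """Toy write-ahead journal: state is persisted before the next step runs."""
    def __init__(self, path):
        self.path = path

    def load(self):
        if os.path.exists(self.path):
            with open(self.path) as f:
                return json.load(f)
        return {"completed": [], "outputs": {}}

    def save(self, state):
        # Write atomically so a crash mid-write cannot corrupt the journal.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(self.path) or ".")
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
        os.replace(tmp, self.path)

def run_workflow(journal, steps):
    """Run (name, fn) steps in order, persisting after each before advancing."""
    state = journal.load()
    for name, fn in steps:
        if name in state["completed"]:
            continue  # already persisted as done; skipped on resume
        state["outputs"][name] = fn(state["outputs"])
        state["completed"].append(name)
        journal.save(state)  # persist BEFORE the next step proceeds
    return state["outputs"]
```

Re-running `run_workflow` with the same journal after a crash skips every step that already reached durable storage, which is the behavior the paragraph above describes at the engine level.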


## Task delivery guarantees

Conductor provides **at-least-once delivery** for all tasks:

- When a task is scheduled, it is placed in a persistent task queue.
- A worker polls for the task and receives it. The task moves to `IN_PROGRESS`.
- If the worker completes the task, it reports `COMPLETED` and Conductor advances the workflow.
- If the worker fails or crashes, the task is **redelivered** based on the retry and timeout configuration.

A task is never silently lost. If a worker polls a task but never responds, the response timeout triggers redelivery.
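
At-least-once delivery means a worker can receive the same task twice, for example after a response timeout. A common defense is to key side effects on the task ID so a redelivered task becomes a no-op. The sketch below is illustrative and not part of the Conductor SDK; the in-memory dictionary stands in for a durable deduplication store.

```python
processed = {}  # task_id -> result; stands in for a durable dedup store

def execute_task(task_id, payload):
    """Idempotent worker: redelivery of the same task ID returns the cached result."""
    if task_id in processed:
        return processed[task_id]  # redelivered: skip the side effect
    result = {"charged": payload["amount"]}  # the side effect (e.g. a payment)
    processed[task_id] = result
    return result
```

With this shape, the engine is free to redeliver as often as its timeouts dictate: the side effect happens at most once per task ID even though delivery is at-least-once.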


## Failure matrix

Here is exactly what happens in each failure scenario:

| Scenario | What Conductor does | Outcome |
|---|---|---|
| **Worker crashes after poll, before any work** | Response timeout fires. Task returns to `SCHEDULED`. New worker picks it up. | Task is retried automatically. No data loss. |
| **Worker crashes after side effect, before completion update** | Response timeout fires. Task is redelivered to another worker. | Task executes again. Workers must be idempotent for side effects, or use the task's `updateTime` to detect redelivery. |
| **Worker reports FAILED** | Conductor creates a new task execution based on retry configuration (`retryCount`, `retryDelaySeconds`, `retryLogic`). | Retried up to the configured limit. After exhaustion, task moves to `FAILED` and the workflow's failure handling kicks in. |
| **Worker reports FAILED_WITH_TERMINAL_ERROR** | No retry. Task is terminal. | Workflow fails or executes the configured `failureWorkflow`. |
| **Server restarts during workflow execution** | On restart, the sweeper service picks up in-progress workflows from persistent storage and re-evaluates them. | Execution resumes from the last persisted state. No manual intervention needed. |
| **Long wait across deploys** | WAIT and HUMAN tasks remain `IN_PROGRESS` in persistent storage. The timer or signal resolution is durable. | When the duration elapses or signal arrives (even days later, after multiple deploys), the task completes and the workflow advances. |
| **Signal/webhook arrives for a paused workflow** | The Task Update API or event handler sets the WAIT/HUMAN task to `COMPLETED` with the provided output. | Workflow resumes immediately with the signal payload available as task output. |
| **Workflow definition updated while executions are running** | Running executions continue using the **snapshot** of the definition taken at start time. New executions use the updated definition. | No running execution is affected by definition changes. Zero-downtime upgrades. |
| **Workflow version deleted while executions are running** | Running executions are decoupled from the metadata store. They continue using their embedded definition snapshot. | Existing executions complete normally. Only new starts are affected. |
| **Network partition between worker and server** | Worker's updates don't reach the server. Response timeout fires, task is requeued. | After partition heals, a new worker (or the same one) picks up the task. |
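
The first two matrix rows can be mimicked with a toy queue: a polled task that is not acknowledged within its response timeout goes back to `SCHEDULED` and is handed to the next poller. This is a simplified model of the mechanism, not Conductor's queue implementation; class and method names are hypothetical.

```python
import time

class ToyTaskQueue:
    """Minimal model of response-timeout redelivery (at-least-once)."""
    def __init__(self, response_timeout):
        self.response_timeout = response_timeout
        self.tasks = {}  # task_id -> {"status": ..., "polled_at": ...}

    def schedule(self, task_id):
        self.tasks[task_id] = {"status": "SCHEDULED", "polled_at": None}

    def poll(self):
        """Hand a SCHEDULED task to a worker, marking it IN_PROGRESS."""
        for tid, t in self.tasks.items():
            if t["status"] == "SCHEDULED":
                t["status"], t["polled_at"] = "IN_PROGRESS", time.monotonic()
                return tid
        return None

    def ack_complete(self, task_id):
        self.tasks[task_id]["status"] = "COMPLETED"

    def sweep(self):
        """Requeue IN_PROGRESS tasks whose worker has gone silent too long."""
        now = time.monotonic()
        for t in self.tasks.values():
            if t["status"] == "IN_PROGRESS" and now - t["polled_at"] > self.response_timeout:
                t["status"], t["polled_at"] = "SCHEDULED", None
```

A worker that polls and then crashes simply never calls `ack_complete`; the next `sweep` returns the task to the queue, which is why a task is never silently lost.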


## Task state transitions

Every task follows this state machine:

```
SCHEDULED ──→ IN_PROGRESS ──→ COMPLETED
    │              │
    │              ├──→ FAILED ──→ SCHEDULED (retry)
    │              │
    │              ├──→ FAILED_WITH_TERMINAL_ERROR
    │              │
    │              └──→ TIMED_OUT ──→ SCHEDULED (retry)
    │
    └──→ CANCELED (workflow terminated)
```

**Terminal states**: `COMPLETED`, `FAILED` (after retries exhausted), `FAILED_WITH_TERMINAL_ERROR`, `CANCELED`, `COMPLETED_WITH_ERRORS` (optional tasks).

Each transition is persisted before any subsequent action is taken.
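
The diagram above can be encoded as an allowed-transitions table, with every hop validated before it is persisted. This is a sketch of the idea over the simplified transition set shown in the diagram, not Conductor's internal code.

```python
# Allowed hops, taken directly from the state diagram above.
ALLOWED = {
    "SCHEDULED":   {"IN_PROGRESS", "CANCELED"},
    "IN_PROGRESS": {"COMPLETED", "FAILED", "FAILED_WITH_TERMINAL_ERROR", "TIMED_OUT"},
    "FAILED":      {"SCHEDULED"},   # retry re-schedules a fresh execution
    "TIMED_OUT":   {"SCHEDULED"},   # retry after timeout
}

def transition(current, nxt):
    """Return the new state, or raise if the hop is not in the state machine."""
    if nxt not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {nxt}")
    return nxt
```

Terminal states like `COMPLETED` have no outgoing edges, so any attempted hop out of them is rejected.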


## Timeout and retry configuration

Durability is configurable per task via the [task definition](../devguide/configuration/taskdef.md):

| Parameter | What it controls |
|---|---|
| `timeoutSeconds` | Maximum wall-clock time for the task to reach a terminal state. |
| `responseTimeoutSeconds` | Maximum time to wait for a worker status update before requeuing. |
| `pollTimeoutSeconds` | Maximum time a scheduled task waits to be polled before timeout. |
| `retryCount` | Number of retry attempts on failure or timeout. |
| `retryLogic` | `FIXED`, `EXPONENTIAL_BACKOFF`, or `LINEAR_BACKOFF`. |
| `retryDelaySeconds` | Base delay between retries. |
| `timeoutPolicy` | `RETRY`, `TIME_OUT_WF`, or `ALERT_ONLY`. |
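
The delay schedule produced by `retryLogic` can be sketched as below. The exact multipliers are an assumption (a common formulation doubles the base delay per attempt for exponential backoff and scales it linearly per attempt for linear backoff); consult the task definition reference for the precise schedule your Conductor version implements.

```python
def retry_delay(retry_logic, base_delay, attempt):
    """Illustrative delay (seconds) before retry number `attempt` (1-based).

    Formulas are assumptions for illustration, not Conductor's exact schedule.
    """
    if retry_logic == "FIXED":
        return base_delay                      # same delay every attempt
    if retry_logic == "LINEAR_BACKOFF":
        return base_delay * attempt            # grows linearly per attempt
    if retry_logic == "EXPONENTIAL_BACKOFF":
        return base_delay * (2 ** (attempt - 1))  # doubles each attempt
    raise ValueError(f"unknown retryLogic: {retry_logic}")
```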


## Workflow-level durability

Beyond individual tasks, Conductor provides workflow-level durability:

- **Compensation flows**: Configure a `failureWorkflow` that runs automatically when the main workflow fails, with full context (reason, failed task ID, workflow execution data).
- **Pause and resume**: Any running workflow can be paused via API and resumed later. State is fully preserved.
- **Restart, rerun, and retry**: See [Replay and recovery](#replay-and-recovery) below for full details on re-executing workflows.
- **Versioning**: Multiple workflow versions can run concurrently. Running executions are immutable against definition changes. Restarts can optionally use the latest definition.


## Replay and recovery

Every workflow execution is fully replayable. Conductor preserves the complete execution graph — inputs, outputs, and state for every task — so you can re-execute workflows at any time.

| Operation | What it does | When to use |
|-----------|-------------|-------------|
| **Restart** | Re-executes the entire workflow from the beginning | Definition changed, need a clean run |
| **Rerun** | Re-executes from a specific task, reusing outputs of prior tasks | Fix a task in the middle without re-running everything |
| **Retry** | Retries the last failed task and continues from that point | Transient failure, external dependency was down |

All three operations work on workflows in any terminal state (`COMPLETED`, `FAILED`, `TIMED_OUT`, `TERMINATED`) and are available indefinitely — Conductor preserves the full execution graph. Restart can optionally use the latest workflow definition, so you can fix a bug in the definition and replay immediately.
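
Rerun's "reuse outputs of prior tasks" behavior is essentially memoized replay over the persisted execution graph, which can be sketched like this (an illustrative model, not Conductor's implementation):

```python
def rerun(tasks, recorded_outputs, from_task):
    """Re-execute ordered (name, fn) tasks starting at `from_task`,
    reusing persisted outputs for every task before it."""
    outputs = {}
    rerunning = False
    for name, fn in tasks:
        if name == from_task:
            rerunning = True
        if rerunning:
            outputs[name] = fn(outputs)              # actually re-execute
        else:
            outputs[name] = recorded_outputs[name]   # reuse persisted output
    return outputs
```

Because the engine keeps every prior task's output, only the chosen task and its downstream tasks do real work on a rerun.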


## Distributed consistency

In multi-node deployments, Conductor ensures consistency through:

- **Distributed locking**: Only one `decide` evaluation runs per workflow at a time across the cluster (pluggable: Zookeeper, Redis).
- **Fencing tokens**: Prevent stale updates from nodes with expired locks.
- **Persistent queues**: Task queues survive node failures. Configurable sharding strategies (round-robin or local-only) trade off distribution vs. consistency.

See the [deployment guide](../devguide/running/deploy.md#locking) for distributed lock configuration.


## What this means for your code

1. **Workers should be idempotent.** Because of at-least-once delivery, a task may execute more than once. Design workers to handle redelivery safely.
2. **You don't need to build retry logic.** Conductor handles retries, timeouts, and requeuing. Your worker just reports success or failure.
3. **Long-running processes are safe.** Use WAIT and HUMAN tasks for pauses that span minutes to days. State is durable across deploys.
4. **Definition changes are safe.** Update workflow definitions without affecting running executions. Roll out new versions gradually with zero downtime.