Summary
Add a bounded retry mechanism for transient (cloud-style) failures while keeping IdLE fail-fast. Retries must be restricted to idempotent or explicitly safe-to-retry steps, use configurable max attempts and backoff, and record attempt counts and final outcome in structured step results.
Scope
Introduce a safe retry mechanism for transient failures common in cloud operations, while keeping the engine fail-fast.
Rules
- Fail-fast stays the default: if a step still fails after retries, the run stops.
- Retry is allowed only when safe:
  - Only for steps that are idempotent (e.g., Ensure*) or explicitly marked SafeToRetry by the step/provider.
- Retry only transient errors:
  - Prefer an explicit transient flag (e.g., error metadata) from provider/step.
  - If not available, use a minimal default heuristic (timeouts, rate limits, 5xx).
- Bounded attempts + backoff:
  - Configurable MaxAttempts (default 3–5).
  - Configurable delay/backoff between attempts (start simple; jitter optional later).
- Structured reporting:
  - Step result records Attempts, final Status, and the last error.
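The rules above can be sketched as a single retry wrapper. This is a language-neutral illustration in Python, not the IdLE engine's implementation. The names `StepError`, `StepResult`, `is_transient`, and `run_step`, the status strings, the default of 3 attempts, and the doubling delay are all assumptions made for the example.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


class StepError(Exception):
    """Hypothetical error type carrying optional retry metadata."""

    def __init__(self, message: str, transient: Optional[bool] = None,
                 status_code: Optional[int] = None):
        super().__init__(message)
        self.transient = transient      # explicit flag from provider/step
        self.status_code = status_code  # used only by the fallback heuristic


@dataclass
class StepResult:
    """Structured result: final status, attempt count, last error."""
    status: str                         # "Completed" or "Failed" (assumed names)
    attempts: int
    last_error: Optional[Exception] = None


def is_transient(error: Exception) -> bool:
    # Prefer the explicit flag when the provider/step supplied one.
    flag = getattr(error, "transient", None)
    if flag is not None:
        return bool(flag)
    # Minimal fallback heuristic: timeouts, rate limits (429), 5xx.
    if isinstance(error, TimeoutError):
        return True
    code = getattr(error, "status_code", None)
    return code is not None and (code == 429 or 500 <= code <= 599)


def run_step(action: Callable[[], None], safe_to_retry: bool,
             max_attempts: int = 3, base_delay: float = 1.0,
             sleep: Callable[[float], None] = time.sleep) -> StepResult:
    """Run a step, retrying only safe steps on transient errors."""
    attempts = 0
    while True:
        attempts += 1
        try:
            action()
            return StepResult("Completed", attempts)
        except Exception as error:
            can_retry = (safe_to_retry and is_transient(error)
                         and attempts < max_attempts)
            if not can_retry:
                # Fail-fast: the caller is expected to stop the run here.
                return StepResult("Failed", attempts, error)
            # Assumed backoff: delay doubles after each failed attempt.
            sleep(base_delay * 2 ** (attempts - 1))
```

Passing `sleep` as a parameter lets tests substitute a recorder for real waiting. A `Failed` result carries the last error, so the engine can stop the run and still report how many attempts were made.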
Out of Scope (Deferred)
Acceptance Criteria
- Steps can opt into retries via idempotency/safe-to-retry metadata.
- Transient failures trigger retry up to max attempts; non-transient failures do not retry.
- Default behavior remains fail-fast after retries are exhausted.
- Results include attempt count and clear final status/error.
- Tests cover: transient retry success, transient retry exhausted, non-transient no retry.
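The three test scenarios listed above could look roughly like this. The sketch is in Python and stands alone: `run_with_retry` is a deliberately small placeholder for the engine's retry loop (no delay, so the tests run fast), and `TransientError`, `failing_action`, and the tuple-shaped result are hypothetical names chosen for the example.

```python
class TransientError(Exception):
    """Hypothetical marker for errors flagged as transient."""


def run_with_retry(action, max_attempts=3):
    """Placeholder retry loop returning (status, attempts, last_error)."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            action()
            return ("Completed", attempt, None)
        except TransientError as error:
            last_error = error            # transient: try again if attempts remain
        except Exception as error:
            return ("Failed", attempt, error)  # non-transient: no retry
    return ("Failed", max_attempts, last_error)


def failing_action(failures, error_type):
    """Build an action that raises error_type on its first `failures` calls."""
    remaining = [failures]

    def action():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise error_type("simulated failure")

    return action


def test_transient_retry_success():
    result = run_with_retry(failing_action(2, TransientError))
    assert result == ("Completed", 3, None)


def test_transient_retry_exhausted():
    status, attempts, error = run_with_retry(failing_action(5, TransientError))
    assert status == "Failed" and attempts == 3
    assert isinstance(error, TransientError)


def test_non_transient_no_retry():
    status, attempts, error = run_with_retry(failing_action(1, ValueError))
    assert status == "Failed" and attempts == 1
    assert isinstance(error, ValueError)
```

Each test asserts on both the final status and the attempt count, which matches the criterion that results expose the attempt count along with a clear final status and error.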