Skip to content

Execution: safe retries for transient failures (fail-fast) #11

@blindzero

Description

@blindzero

Summary

Add a bounded retry mechanism for transient (cloud-style) failures while keeping IdLE fail-fast. Retries must be restricted to idempotent or explicitly safe-to-retry steps, use configurable max attempts and backoff, and record attempt counts and final outcome in structured step results.

Scope

Introduce a safe retry mechanism for transient failures common in cloud operations, while keeping the engine fail-fast.

Rules

  • Fail-fast stays the default: if a step still fails after retries, the run stops.
  • Retry is allowed only when safe:
    • Only for steps that are idempotent (e.g., Ensure*) or explicitly marked SafeToRetry by the step/provider.
  • Retry only transient errors:
    • Prefer an explicit transient flag (e.g., error metadata) from provider/step.
    • If not available, use a minimal default heuristic (timeouts, rate limits, 5xx).
  • Bounded attempts + backoff:
    • Configurable MaxAttempts (default 3–5).
    • Configurable delay/backoff between attempts (start simple; jitter optional later).
  • Structured reporting:
    • Step result records Attempts, final Status, and the last error.

Out of Scope (Deferred)

Acceptance Criteria

  • Steps can opt into retries via idempotency/safe-to-retry metadata.
  • Transient failures trigger retry up to max attempts; non-transient failures do not retry.
  • Default behavior remains fail-fast after retries are exhausted.
  • Results include attempt count and clear final status/error.
  • Tests cover: transient retry success, transient retry exhausted, non-transient no retry.

Metadata

Metadata

Assignees

Projects

No projects

Relationships

None yet

Development

No branches or pull requests

Issue actions