Skip to content

Execution – safe retries for transient failures (fail-fast)#51

Merged
blindzero merged 5 commits into
mainfrom
issues/11-Execution-safe-retries-for-transient-failures-fail-fast
Jan 3, 2026
Merged

Execution – safe retries for transient failures (fail-fast)#51
blindzero merged 5 commits into
mainfrom
issues/11-Execution-safe-retries-for-transient-failures-fail-fast

Conversation

@blindzero
Copy link
Copy Markdown
Owner

@blindzero blindzero commented Jan 3, 2026

Summary

This PR introduces safe, deterministic retries for transient execution failures in the IdLE execution engine.

Steps are now retried only when a failure is explicitly marked as transient.
All other errors remain fail-fast, preserving predictable execution and preventing unsafe retry loops.

Motivation

During real-world lifecycle executions (e.g. JML flows), transient failures can occur:

  • temporary directory sync delays
  • short-lived connectivity issues
  • eventually consistent backend systems

At the same time, blind or heuristic retries are dangerous:

  • they can hide logic or configuration errors
  • they may cause unintended side effects
  • they blur trust boundaries between workflow data and engine behavior

This PR strikes a balance:

  • retries are possible
  • but only when explicitly requested by trusted code paths

Design decisions

  • Safe by default
    • Retries happen only if an exception is explicitly marked as transient (Exception.Data['Idle.IsTransient'] = $true)
    • Non-transient errors fail immediately
  • Engine-owned behavior
    • Retry logic lives in the execution engine, not in workflows or plans
    • Workflow definitions remain pure, untrusted data
  • Deterministic execution
    • Exponential backoff with deterministic jitter
    • No randomness that would break reproducibility or testing
  • No new public configuration (yet)
    • Retry settings are currently hardcoded defaults
    • Host-level configurability is intentionally deferred (tracked separately)

Implementation overview

  • Added a private retry helper to the core engine
  • Step execution is wrapped with retry handling
  • Retry attempts emit a structured StepRetrying event
  • Step results include the final execution outcome without changing the public result contract
  • Fail-fast semantics preserved for all non-transient errors

Tests

New Pester tests cover:

  • transient failure → retry → success
  • non-transient failure → no retry → immediate failure
  • retry backoff logic without slowing down the test suite
  • no global scope pollution (ephemeral test module used)

Non-goals / Out of scope

  • Making retry behavior configurable via workflow or plan definitions
  • Provider-specific or heuristic transient error detection
  • Per-step or per-provider retry overrides
  • These are intentionally deferred to keep this change minimal and safe.

Follow-up

A follow-up issue is already tracked to make retry policies host-configurable (execution options), without exposing them to workflows.

Checklist

  • All tests passing
  • Examples verified
  • No breaking API changes
  • No workflow schema changes
  • Deterministic and safe-by-default behavior

Issue

Closes #11

@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

@blindzero blindzero merged commit 1f005ef into main Jan 3, 2026
4 checks passed
@blindzero blindzero deleted the issues/11-Execution-safe-retries-for-transient-failures-fail-fast branch January 4, 2026 15:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Execution: safe retries for transient failures (fail-fast)

1 participant