Retry use existing new attempt by vijayvammi · Pull Request #243 · AstraZeneca/runnable

vijayvammi · 2025-12-23T20:23:05Z

No description provided.

Design for resume-from-failure functionality that works across local and Argo executors, leveraging native capabilities like memoization.

- Add abstract _calculate_attempt_number method to BaseExecutor
- Remove deprecated step_attempt_number property with TODO
- Extract attempt calculation into dedicated method in GenericPipelineExecutor
- Update _execute_node to use new calculation logic
- Implement concrete method in BaseJobExecutor (reads from env var)
- Keep abstract method in BasePipelineExecutor for custom implementations
- Add comprehensive tests for attempt number calculation scenarios
- Delete complex retry design document in favor of simpler approach
- Verify all changes work with example pipelines and full test suite

This refactoring improves code organization and testability, and maintains backward compatibility while removing technical debt.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
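The split described in this commit can be pictured with a minimal sketch. Only the method name `_calculate_attempt_number` comes from the commit message; the class names, the `StepLog` shape, the "recorded attempts plus one" rule, and the environment variable name are assumptions made for illustration and are not the project's actual code.

```python
import os
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import List


@dataclass
class StepLog:
    """Hypothetical stand-in for a step log: one entry per recorded attempt."""
    name: str
    attempts: List[str] = field(default_factory=list)


class BaseExecutorSketch(ABC):
    @abstractmethod
    def _calculate_attempt_number(self, step_log: StepLog) -> int:
        """Return the 1-based number of the attempt that is about to run."""


class PipelineExecutorSketch(BaseExecutorSketch):
    def _calculate_attempt_number(self, step_log: StepLog) -> int:
        # Assumption: the next attempt is one more than the attempts on record.
        return len(step_log.attempts) + 1


class JobExecutorSketch(BaseExecutorSketch):
    ENV_VAR = "SKETCH_ATTEMPT_NUMBER"  # hypothetical variable name

    def _calculate_attempt_number(self, step_log: StepLog) -> int:
        # Jobs read the attempt number from the environment, defaulting to 1.
        return int(os.environ.get(self.ENV_VAR, "1"))
```

Keeping the calculation in its own method lets each executor be tested with a hand-built step log or environment, without running a node.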

- Add _get_parameters_for_retry method for original parameters
- Ignore new parameter files and environment variables during retry
- Show informative console messages about parameter usage
- Support both retry and normal execution modes
- Add comprehensive test coverage for parameter loading scenarios

🤖 Generated with Claude Code
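A rough sketch of the parameter rule above, under stated assumptions: the function name, its arguments, and the "environment overrides file" precedence for normal runs are hypothetical; only the behaviour "a retry uses the original parameters and ignores new files and environment variables" is taken from the commit message.

```python
from typing import Any, Dict


def resolve_parameters(
    original: Dict[str, Any],
    parameters_file_values: Dict[str, Any],
    env_values: Dict[str, Any],
    is_retry: bool,
) -> Dict[str, Any]:
    """Pick the parameters a run should start with."""
    if is_retry:
        # Retry: replay the original run's parameters and ignore new inputs.
        print("Retry run: using parameters recorded by the original run.")
        return dict(original)
    # Normal run (assumed precedence): environment values override file values.
    merged = dict(parameters_file_values)
    merged.update(env_values)
    return merged
```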

- Add _should_skip_step_in_retry method to check step execution status
- Modify execute_from_graph to skip successful steps during retry
- Skip steps if they were previously successful in original run
- Execute steps that failed or were never executed
- Support both retry and normal execution modes
- Add comprehensive test coverage for skip logic and integration

🤖 Generated with Claude Code
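The skip rule can be sketched as a small predicate. The status label `"SUCCESS"`, the dictionary of original statuses, and the helper `steps_to_execute` are assumptions for illustration; the real method works against the run log rather than a plain dictionary.

```python
from typing import Dict, List

SUCCESS = "SUCCESS"  # assumed status label


def should_skip_step_in_retry(
    step_name: str, original_statuses: Dict[str, str], is_retry: bool
) -> bool:
    """Skip only steps that succeeded in the original run, and only on retry."""
    if not is_retry:
        return False
    # Failed steps and steps with no recorded status are executed.
    return original_statuses.get(step_name) == SUCCESS


def steps_to_execute(
    order: List[str], original_statuses: Dict[str, str], is_retry: bool
) -> List[str]:
    """Filter a traversal order down to the steps that still need to run."""
    return [
        step
        for step in order
        if not should_skip_step_in_retry(step, original_statuses, is_retry)
    ]
```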

- Modify _set_up_run_log to handle retry runs properly
- Call retry validation and return early for retry runs
- Reuse existing run log instead of creating new one for retry
- Maintain normal run log creation logic unchanged
- Prevent RunLogExistsError for legitimate retry scenarios
- Add comprehensive integration tests for retry pipeline setup

🤖 Generated with Claude Code
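As a sketch of the early-return behaviour above, the following uses an in-memory dictionary as a hypothetical run log store. The function name, the store, and the validation check are illustrative assumptions; `RunLogExistsError` is redefined locally so the example is self-contained.

```python
from typing import Any, Dict


class RunLogExistsError(Exception):
    """Local stand-in for the error raised when a run log already exists."""


def set_up_run_log(
    store: Dict[str, Dict[str, Any]], run_id: str, is_retry: bool
) -> Dict[str, Any]:
    if is_retry:
        # Retry validation (assumed): the original run log must exist.
        if run_id not in store:
            raise KeyError(f"no run log found for retry of {run_id!r}")
        # Return early and reuse the existing run log.
        return store[run_id]
    # Normal run: creation logic is unchanged and refuses duplicates.
    if run_id in store:
        raise RunLogExistsError(run_id)
    store[run_id] = {"run_id": run_id, "steps": {}}
    return store[run_id]
```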

- Fix critical bug where attempt numbers were always 1 during retry
- During retry, reuse existing step logs to preserve original attempts
- For failed steps: get existing step log with previous attempts
- For never executed steps: create new step log as normal
- This allows _calculate_attempt_number to correctly increment attempts
- Add comprehensive tests for step log reuse behavior during retry

Root cause: execute_from_graph was always calling create_step_log, which created brand new step logs, wiping out the original attempts from the failed run and causing _calculate_attempt_number to always return 1.

🤖 Generated with Claude Code
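The root cause is easy to reproduce in miniature. In this sketch a step log is reduced to a list of attempt statuses; the function names and data shapes are assumptions for illustration and do not mirror the real run log store.

```python
from typing import Dict, List

# step name -> statuses of the attempts recorded so far
StepLogs = Dict[str, List[str]]


def step_log_for_execution(
    step_logs: StepLogs, step_name: str, is_retry: bool
) -> List[str]:
    """Return the step log a step should write its next attempt into."""
    if is_retry and step_name in step_logs:
        # Failed step on retry: keep the attempts from the original run.
        return step_logs[step_name]
    # Normal run, or a step that never executed: start a fresh step log.
    step_logs[step_name] = []
    return step_logs[step_name]


def next_attempt_number(attempts: List[str]) -> int:
    """Assumed rule: one more than the attempts already recorded."""
    return len(attempts) + 1
```

Calling `step_log_for_execution` with `is_retry=False` on a step that already has a failed attempt replaces its log, which is the "always 1" behaviour the commit describes.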

Investigation findings:
- Runtime error: nil pointer dereference in Argo workflows
- Root cause: run_id parameter changed from {{workflow.uid}} to "PLEASE_SET_RUN_ID"
- Memoization system uses {{workflow.parameters.run_id}} as key for all templates
- Placeholder string causes nil pointer dereference in Argo runtime
- Design document specifies run_id should default to workflow.uid when not provided
- Memoization is required for retry functionality, cannot be removed

Next: Fix run_id parameter to use {{workflow.uid}} instead of placeholder

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

The run_id parameter was using the placeholder "PLEASE_SET_RUN_ID", which caused a runtime nil pointer dereference when Argo tried to resolve the memoization key {{workflow.parameters.run_id}}.

Changes:
- Restore run_id default value from "PLEASE_SET_RUN_ID" to "{{workflow.uid}}" as specified in design document
- This provides a valid workflow identifier for memoization keys
- Maintains retry functionality while preventing runtime panics
- All existing tests continue to pass

Verified:
- argo lint passes
- All retry tests pass (14/14)
- Generated workflow validates correctly

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

Generated and validated YAML from multiple example pipelines:
- examples/01-tasks/python_tasks.py
- examples/02-sequential/traversal.py
- examples/06-parallel/parallel.py

All generated YAMLs include ConfigMap cache configuration in memoize blocks and pass Argo workflow linting validation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

Updated the retry design documentation to include the ConfigMap cache configuration in the memoize block. This reflects the implementation of persistent caching across workflow resubmissions using Argo's ConfigMap cache feature. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

Performed comprehensive integration testing across multiple pipeline types:
- Tested python_tasks, passing_parameters_python, and catalog_python examples
- Verified all generate valid Argo YAML with cache configuration
- Confirmed cache name consistency across templates (runnable-xxxxxx pattern)
- Validated backward compatibility (old-style memoize without cache still works)
- All tests passed with argo lint validation

Results documented in integration-test-results.txt

Task 7 complete: ConfigMap memoization implementation is production-ready

Previously, RUNNABLE_RETRY_RUN_ID was incorrectly set to {{workflow.parameters.run_id}} (current run ID) instead of {{workflow.parameters.retry_run_id}} (original run ID for retries). This caused is_retry to always return True, making the system look for run logs even on first runs where none exist.

Changes:
- Fix _add_retry_env_vars to use {{workflow.parameters.retry_run_id}}
- Now retry_run_id is empty string for normal runs (is_retry = False)
- And contains original run ID for actual retries (is_retry = True)
- All 23 Argo tests continue to pass

Resolves issue where first-time pipeline runs were incorrectly flagged as retries and attempted to load non-existent run logs.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
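The fix hinges on an empty string meaning "not a retry". A minimal sketch follows, assuming that retry detection is a truthiness check on the environment variable; the variable name and the Argo parameter expression come from the commit message, while both function names and the exact detection logic are assumptions.

```python
import os
from typing import Dict, Mapping, Optional

RETRY_ENV_VAR = "RUNNABLE_RETRY_RUN_ID"


def retry_env_var() -> Dict[str, str]:
    """Container env entry pointing at the retry parameter, not the run ID."""
    return {"name": RETRY_ENV_VAR, "value": "{{workflow.parameters.retry_run_id}}"}


def is_retry(environ: Optional[Mapping[str, str]] = None) -> bool:
    """True only when the variable holds a non-empty original run ID."""
    env = os.environ if environ is None else environ
    return bool(env.get(RETRY_ENV_VAR, ""))
```

With the earlier wiring the variable always carried the current run ID, so a check like this would report a retry on every first run.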

- Add configmap_cache_name parameter to ArgoExecutor config
- Allow users to specify custom ConfigMap names for memoization
- Default to random generation (runnable-xxxxxx) when not specified
- Update documentation with new configuration option
- Backward compatible with existing configurations

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
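One way to get the described default is sketched below. The function name and the choice of six hexadecimal characters are assumptions; the commit message only states the `runnable-xxxxxx` pattern and, elsewhere in this PR, that the `secrets` module is imported.

```python
import secrets
from typing import Optional


def resolve_cache_name(configmap_cache_name: Optional[str] = None) -> str:
    """Use the configured ConfigMap name, or generate a random one."""
    if configmap_cache_name:
        return configmap_cache_name
    # token_hex(3) yields 3 random bytes rendered as 6 hex characters.
    return f"runnable-{secrets.token_hex(3)}"
```

A fixed name matters when the cache must survive regeneration of the workflow YAML, since a freshly generated random name would point at a different ConfigMap.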

- Move secrets import from method level to module level
- Follow Python best practices for import organization

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

- Remove hardcoded run_id parameter from workflow arguments
- Users must provide run_id at workflow submission time
- Enables cache reuse: same run_id = cache hit, new run_id = fresh execution
- Resolves 'invalid cache key: {{workflow.uid}}' error
- Gives users full control over memoization behavior

Usage:
  # Cache reuse:
  argo submit argo-pipeline.yaml -p run_id=stable-run
  # Fresh run:
  argo submit argo-pipeline.yaml -p run_id=new-run-123

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

- Add run_id parameter declaration without default value
- Makes run_id required at workflow submission time
- Workflow now declares all expected parameters correctly
- Argo lint validation passes

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

- Add key field to ConfigMapCache model with default 'cache' value
- Resolves issue where memoization worked within workflows but didn't persist
- ConfigMap entries now properly stored and retrieved across workflow runs
- Fixes root cause: empty configMap.key prevented cache persistence

Issue analysis showed:
- Memoization was working (Hit: true/false status confirmed)
- ConfigMap key field was empty, breaking persistent storage
- Cache worked within single workflow but not across runs

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
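To make the shape of the fix concrete, here is a sketch of a memoize block with a ConfigMap cache, built from plain dataclasses. The class names and `to_argo` helper are hypothetical, and the memoize key shown is only the `{{workflow.parameters.run_id}}` expression mentioned earlier in this PR; the key the project actually generates may include more than that.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict


@dataclass
class ConfigMapCacheSketch:
    name: str
    key: str = "cache"  # default named in the commit message


@dataclass
class MemoizeSketch:
    key: str
    config_map: ConfigMapCacheSketch

    def to_argo(self) -> Dict[str, Any]:
        """Render the dictionary that would sit under a template's `memoize`."""
        return {
            "key": self.key,
            "cache": {"configMap": asdict(self.config_map)},
        }
```

Because `key` has a default, a cache declared with only a name still renders a non-empty `configMap.key`, which is the condition the commit identifies as necessary for entries to persist across runs.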

Add complete retry documentation with cross-environment debugging:
- Production failure → local debugging workflow
- Technical deep-dive on surgical retry mechanisms
- Working examples from examples/09-retry/ directory
- Clear distinction between failure handling and retry

Also fix broken links in visualization.md and clarify failure handling.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

CronWorkflow support:
- Add CronSchedule model with schedules and timezone fields
- Add cron_schedule config option to ArgoExecutor
- Generate CronWorkflow instead of Workflow when schedule is configured
- Add documentation in argo.md and example config

Retry CLI command:
- Add `runnable retry <run_id>` CLI command
- Add retry_pipeline() entrypoint that loads run log and re-executes
- Document CLI retry in retry-recovery.md

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
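The Workflow versus CronWorkflow switch can be sketched as follows. The `schedules` and `timezone` field names come from the commit message and match Argo's CronWorkflow spec, where the wrapped workflow lives under `workflowSpec`; the class name, function name, and metadata handling are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class CronScheduleSketch:
    schedules: List[str]
    timezone: Optional[str] = None


def build_manifest(
    metadata: Dict[str, Any],
    workflow_spec: Dict[str, Any],
    cron_schedule: Optional[CronScheduleSketch] = None,
) -> Dict[str, Any]:
    """Emit a Workflow, or a CronWorkflow wrapping the same spec."""
    manifest: Dict[str, Any] = {
        "apiVersion": "argoproj.io/v1alpha1",
        "metadata": metadata,
    }
    if cron_schedule is None:
        manifest["kind"] = "Workflow"
        manifest["spec"] = workflow_spec
        return manifest
    spec: Dict[str, Any] = {
        "schedules": list(cron_schedule.schedules),
        "workflowSpec": workflow_spec,
    }
    if cron_schedule.timezone:
        spec["timezone"] = cron_schedule.timezone
    manifest["kind"] = "CronWorkflow"
    manifest["spec"] = spec
    return manifest
```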

Python tasks are plain functions - any IDE debugger works without special configuration. Added callout box explaining this advantage over frameworks that require special debugging setup. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Added examples/07-map/dynamic_map.py demonstrating how to generate the list of items to iterate over at runtime from a previous step's return value, rather than from a static parameters file. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

vijayvammi and others added 30 commits December 14, 2025 22:17

docs: Add retry capability design specification

1d00aa0

feat: add retry validation framework for pipeline executors

7cdc059

feat: retry functionality for composite nodes

98edcec

feat: add ConfigMap cache support to Memoize schema

6250cda

feat: add ConfigMap cache name generation to ArgoExecutor

95cd204

test: verify YAML structure matches Argo memoization spec

4d657a2

fix: move imports to global scope in ArgoExecutor

c12b704

fix: still fixing

d5e2abf

feat: cron argo workflows

b6fde60

feat: cron argo workflows

9ef8201

feat: cron argo workflows

d104f71

vijayvammi merged commit 756d75c into main Dec 23, 2025
1 check passed

vijayvammi deleted the retry-use-existing-new-attempt branch December 23, 2025 20:44
