Retry database failures in the JobRunner #259

Merged
daniel-thom merged 2 commits into main from fix/add-runner-retries on Apr 8, 2026

@daniel-thom
Collaborator

This fixes an issue reported by a user in which an overloaded torc-server (running on a login node experiencing Lustre filesystem delays) caused a torc job_runner to exit. The runner now retries those failures for up to 20 minutes.
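
To illustrate the idea of retrying within a fixed time budget, here is a minimal sketch using only the Rust standard library. It is not the code from this PR: `RetryPolicy`, `retry_with_deadline`, and the backoff values are hypothetical names and numbers chosen for illustration. Only the 20-minute budget comes from the description above.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical retry policy; field names and values are illustrative only.
pub struct RetryPolicy {
    /// Total time budget for retries (the PR describes up to 20 minutes).
    pub max_elapsed: Duration,
    /// Delay before the first retry.
    pub initial_delay: Duration,
    /// Upper bound on the delay between attempts.
    pub max_delay: Duration,
}

/// Calls `op` until it succeeds, returns a non-retryable error,
/// or the next sleep would exceed the time budget.
pub fn retry_with_deadline<T, E>(
    policy: &RetryPolicy,
    mut op: impl FnMut() -> Result<T, E>,
    is_retryable: impl Fn(&E) -> bool,
) -> Result<T, E> {
    let start = Instant::now();
    let mut delay = policy.initial_delay;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(err) => {
                // Give up on permanent errors or once the budget is spent.
                if !is_retryable(&err) || start.elapsed() + delay > policy.max_elapsed {
                    return Err(err);
                }
                sleep(delay);
                // Exponential backoff, capped at `max_delay`.
                delay = (delay * 2).min(policy.max_delay);
            }
        }
    }
}
```

Under these assumptions, a caller would build a policy with `max_elapsed: Duration::from_secs(20 * 60)` and wrap each server call in `retry_with_deadline`. The last error is returned to the caller rather than triggering an exit, so the runner can decide what to do once the budget is exhausted.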

It also fixes cases where we were still using old CLI syntax, mostly in the docs.

@daniel-thom daniel-thom requested a review from Copilot April 8, 2026 19:42
Contributor

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR makes the JobRunner more resilient to transient backend/database failures by expanding retry behavior and avoiding hard panics, and updates various docs/tests to use the unified top-level CLI syntax. It also introduces multi-provider AI Chat support (OpenAI/Ollama/GitHub Models) in torc-dash.

Changes:

  • Add broader transient error detection + retry loop improvements for client API calls, and propagate/handle errors in JobRunner instead of panicking.
  • Update docs/tests/examples to use unified CLI commands (e.g., torc create, torc run, torc submit) and align “slurm generate” terminology.
  • Expand torc-dash AI Chat to support multiple LLM providers (Anthropic/OpenAI/Ollama/GitHub Models) with UI + backend wiring.
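
As a rough illustration of the first bullet, a transient-error classifier might look like the sketch below. The `ApiError` type, the `is_retryable` function, and the matched message strings are assumptions made for this example (the strings follow SQLite-style contention messages). The actual types and messages in `src/client/utils.rs` are not shown in this conversation.

```rust
/// Hypothetical error type for a client API call; not torc's actual type.
#[derive(Debug, Clone, PartialEq)]
pub enum ApiError {
    /// The server answered with an HTTP status code and a message body.
    Http { status: u16, message: String },
    /// The request never produced a response (timeout, connection refused, ...).
    Transport(String),
}

/// Returns true for failures that are likely transient and worth retrying.
pub fn is_retryable(err: &ApiError) -> bool {
    match err {
        // No response at all: assume the server or network is briefly unavailable.
        ApiError::Transport(_) => true,
        ApiError::Http { status, message } => {
            // Any HTTP 5xx is treated as a server-side, possibly transient failure.
            if (500..=599).contains(status) {
                return true;
            }
            // Assumed markers of database contention (SQLite-style wording).
            let msg = message.to_lowercase();
            msg.contains("database is locked") || msg.contains("database is busy")
        }
    }
}
```

Keeping the classification in one function means a retry loop only needs a single predicate, and client errors such as a 404 still fail fast instead of being retried for the whole time budget.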

Reviewed changes

Copilot reviewed 21 out of 22 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| torc-dash/static/js/app-chat.js | Updates chat setup UI behavior and request payload to support new providers/fields. |
| torc-dash/static/index.html | Adds provider/model/base-url fields for additional AI chat providers. |
| tests/workflows/timeout_auto_recovery_test/workflow.yaml | Updates test procedure to new CLI syntax. |
| tests/workflows/scale_test/README.md | Updates workflow creation command to torc create. |
| tests/workflows/oom_auto_recovery_test/workflow.yaml | Updates test procedure to new CLI syntax. |
| tests/workflows/database_contention_test/workflow.yaml | Updates documented commands to new CLI syntax. |
| tests/workflows/database_contention_test/README.md | Updates workflow creation command to torc create. |
| tests/workflows/README.md | Updates scale test instructions to use torc create. |
| src/mcp_server/tools.rs | Switches Slurm workflow creation to slurm generate + create two-step flow. |
| src/mcp_server/server.rs | Updates “exact CLI commands” guidance to new syntax. |
| src/client/utils.rs | Adds retryable error classifier + retries on more transient failures (HTTP 5xx / DB contention). |
| src/client/job_runner.rs | Propagates API errors, adds local kill path, replaces panics with logging + state rollback. |
| src/client/commands/slurm.rs | Aligns comment wording with slurm generate. |
| src/bin/torc-dash.rs | Adds multi-provider LLM configuration, tool filtering/token estimation, updates command wiring. |
| julia_client/Torc/test/test_workflow.jl | Updates Julia tests to call new CLI syntax. |
| examples/yaml/resource_monitoring_demo.yaml | Updates example command to torc create. |
| examples/README.md | Updates example workflow creation commands to torc create. |
| docs/src/specialized/design/recovery.md | Updates references from create-slurm to slurm generate. |
| docs/src/core/reference/cli.md | Removes create-slurm docs and clarifies lifecycle commands are top-level. |
| docs/src/core/monitoring/dashboard.md | Documents new AI Chat providers + CLI args/env vars. |
| README.md | Updates quickstart workflow creation command to torc create. |
| CLAUDE.md | Updates documented CLI usage to new syntax and removes submit-slurm. |
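
The summary for `src/client/job_runner.rs` mentions replacing panics with logging plus state rollback. A minimal sketch of that pattern follows, assuming a hypothetical `Job` struct, `JobState` enum, and `start_job` function. None of these names are taken from the torc source, and the real runner tracks more state than this.

```rust
/// Hypothetical job states, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum JobState {
    Ready,
    Running,
}

/// Hypothetical local record of a job held by the runner.
pub struct Job {
    pub id: u64,
    pub state: JobState,
}

/// Marks the job as running locally, then reports the change through `notify`.
/// If reporting fails, the error is logged, the local state is restored,
/// and the error is returned to the caller instead of panicking.
pub fn start_job<E: std::fmt::Display>(
    job: &mut Job,
    notify: impl FnOnce(u64) -> Result<(), E>,
) -> Result<(), E> {
    let previous = job.state;
    job.state = JobState::Running;
    if let Err(err) = notify(job.id) {
        eprintln!(
            "failed to report start of job {}: {}; restoring state {:?}",
            job.id, err, previous
        );
        // Roll back so the local view does not diverge from the server's.
        job.state = previous;
        return Err(err);
    }
    Ok(())
}
```

Returning the error lets the caller choose between retrying and skipping the job, which a panic would not allow.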

@daniel-thom daniel-thom force-pushed the fix/add-runner-retries branch from f9fa37c to 0ec39e4 April 8, 2026 19:59
@daniel-thom daniel-thom merged commit d39456b into main Apr 8, 2026
9 checks passed
@daniel-thom daniel-thom deleted the fix/add-runner-retries branch April 8, 2026 21:15