fix(elixir): keep ssh retries in orchestrator #54

Open: mstrautmann-oai wants to merge 1 commit into main from dev/mstrautmann-oai/symphony-single-host-retries

@mstrautmann-oai

Context

SSH failover currently happens inside AgentRunner, which can bypass per-host caps and rerun a ticket on a second host after the first host already started setup.

TL;DR

Keep each worker run on one SSH host and let the orchestrator own retries.

Summary

  • Remove AgentRunner's internal cross-host failover loop
  • Keep a worker lifetime pinned to one selected SSH host
  • Add a regression test proving startup failure on one host does not fall through to another
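
The behavior described in the bullets above can be sketched roughly as follows. This is an illustration only, not the code from this PR: the module `PinnedRunSketch`, the function `run_on_host/2`, and the `start_fun` callback are hypothetical names standing in for AgentRunner's real startup path.

```elixir
defmodule PinnedRunSketch do
  @moduledoc false
  # Hypothetical sketch: one run, one host, no internal failover.

  # `start_fun` stands in for whatever starts the SSH session (assumption).
  # It must return {:ok, result} or {:error, reason}.
  def run_on_host(host, start_fun) when is_function(start_fun, 1) do
    case start_fun.(host) do
      {:ok, result} ->
        {:ok, host, result}

      {:error, reason} ->
        # Surface the failure to the caller; do not try another host here.
        {:error, {host, reason}}
    end
  end
end

# Usage: a startup failure on "host-a" is reported as-is,
# and "host-b" is never contacted by this run.
start_fun = fn
  "host-a" -> {:error, :startup_failed}
  "host-b" -> {:ok, :started}
end

{:error, {"host-a", :startup_failed}} = PinnedRunSketch.run_on_host("host-a", start_fun)
```

Returning the failing host alongside the reason keeps the decision about whether to retry, and where, with the caller.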

Alternatives

  • Keep internal failover and classify retryable errors, but that duplicates orchestrator scheduling logic
  • Only patch per-host cap enforcement, but invisible cross-host reruns would still risk duplicate side effects

Test Plan

  • make -C elixir all
  • cd elixir && mise exec -- mix test test/symphony_elixir/core_test.exs:1167 test/symphony_elixir/core_test.exs:1237 test/symphony_elixir/core_test.exs:1337 --seed 0

Copilot AI left a comment

Pull request overview

This PR moves SSH retry/failover ownership out of AgentRunner and into the orchestrator by ensuring a single agent run stays pinned to one selected SSH host. This prevents a single worker lifetime from silently hopping to another host (which could bypass per-host caps and risk duplicate side effects).
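
One way to picture orchestrator-owned retries is the sketch below. It is not taken from the PR; the module `OrchestratorRetrySketch`, its functions, the `running` map of active runs per host, and the `cap` parameter are all assumptions made for illustration. The retry policy shown (try the next untried host that is under its cap) is likewise only one possible choice.

```elixir
defmodule OrchestratorRetrySketch do
  @moduledoc false
  # Hypothetical sketch of orchestrator-owned retries with a per-host cap.
  # `running` maps host => number of active runs; `cap` is the per-host limit.

  # Pick the first host that is under its cap and has not been tried yet.
  def next_host(hosts, running, cap, tried) do
    Enum.find(hosts, fn host ->
      host not in tried and Map.get(running, host, 0) < cap
    end)
  end

  # Each attempt is a separate, visible scheduling decision.
  def run_with_retries(hosts, running, cap, run_fun, tried \\ []) do
    case next_host(hosts, running, cap, tried) do
      nil ->
        # Nothing left to try: report which hosts were attempted, in order.
        {:error, :no_host_available, Enum.reverse(tried)}

      host ->
        case run_fun.(host) do
          {:ok, result} ->
            {:ok, host, result}

          {:error, _reason} ->
            run_with_retries(hosts, running, cap, run_fun, [host | tried])
        end
    end
  end
end
```

Because host selection and the cap check live in one place, every attempt passes through the same per-host accounting instead of being hidden inside a single worker run.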

Changes:

  • Remove AgentRunner’s internal multi-host failover loop; it now runs on exactly one worker_host.
  • Introduce selected_worker_host/2 to pick a single host (preferred host if provided, otherwise first configured host / local).
  • Add a regression test asserting a startup failure on one host is surfaced and does not fall through to another host.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| elixir/lib/symphony_elixir/agent_runner.ex | Removes cross-host retry logic and pins each run to a single selected worker_host. |
| elixir/test/symphony_elixir/core_test.exs | Adds a regression test that verifies startup failures don’t trigger silent host hopping. |
