Make remote git transport resilient to a session-capped DTN #5

Merged: ligon merged 3 commits into main from feat/remote-sync-resilience on Jun 30, 2026

ligon (Owner) commented Jun 30, 2026

Summary

Makes sucoder collaborate/bootstrap resilient to a flaky remote git mirror, prompted by a wedged PlayPen sync against the Savio DTN.

Diagnosis (confirmed against the live cluster):

  • The remote mirror was a husk from a prior failed bootstrap: `git init` ran but no push ever landed, so it had no `main` ref. Fetch reported `couldn't find remote ref main`; the push then tried to sync into a half-dead repo.
  • dtn.brc.berkeley.edu caps concurrent SSH sessions at ~1 (a 12-way probe refused 11 of 12). When the DTN ControlMaster refused the push's session, the transport fell back to a fresh dial that re-ran a broken login shell and wedged with `the remote end hung up unexpectedly`.
  • The GIT_SSH_COMMAND was the only SSH path missing the executor's standard hardening, so that fallback was both fragile and noisy.

Changes

  1. Harden git-over-SSH transport (57cb7f9): add BatchMode / ConnectTimeout / ServerAlive / LogLevel=ERROR to the GIT_SSH_COMMAND (scaffolding and fallback branches plus their inner ProxyCommand). A refused master session now fails fast and quietly instead of dialing a contaminated fresh connection. Verbosity is preserved under --debug-ssh.
  2. Rebuild husk mirrors (57cb7f9): ensure_remote_clone detects a remote repo that exists but has no commits and no base branch, and rebuilds it (rm -rf + git init) rather than pushing into it. A husk has no commits, so nothing is lost.
  3. DTN to login-node failover (988dc9d): new _git_transports / _is_transport_failure plus retry in _sync_remote / _pull_from_url. A broken connection fails over to the next transport; a real git answer (e.g. an empty mirror's `couldn't find remote ref`) does not retry. The pull failover is a correctness fix: a silently-failed fetch could let the next force-push overwrite agent commits.
  4. Prefer login node over DTN for git (b7ee2b7): flip the default transport ordering to login-node-first, DTN fallback-only, since the DTN's ~1 session cap makes it fragile for git. Filesystem scaffolding still uses the DTN directly.
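The hardening in item 1 could look roughly like the sketch below. It is not the code from 57cb7f9: the function name `hardened_ssh_command` and the specific timeout values are assumptions made for illustration, and the inner ProxyCommand hop is left out. Only the option names (BatchMode, ConnectTimeout, ServerAliveInterval, ServerAliveCountMax, LogLevel) are standard OpenSSH client options.

```python
import shlex


def hardened_ssh_command(
    debug_ssh: bool = False,
    connect_timeout: int = 10,   # assumed value, not taken from the PR
    alive_interval: int = 15,    # assumed value, not taken from the PR
    alive_count_max: int = 3,    # assumed value, not taken from the PR
) -> str:
    """Build a string suitable for the GIT_SSH_COMMAND environment variable."""
    options = [
        "BatchMode=yes",  # never prompt; fail instead of hanging on a password
        f"ConnectTimeout={connect_timeout}",
        f"ServerAliveInterval={alive_interval}",
        f"ServerAliveCountMax={alive_count_max}",
    ]
    if not debug_ssh:
        # Keep ssh quiet unless the user asked for SSH debugging output.
        options.append("LogLevel=ERROR")

    argv = ["ssh"]
    for option in options:
        argv += ["-o", option]
    return shlex.join(argv)
```

A caller would place the result in the environment passed to git, for example `env["GIT_SSH_COMMAND"] = hardened_ssh_command()`.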

Tests

464 passing. New coverage: transport ordering/dedup, push and fetch failover, no-failover-on-real-error, no-failover-on-empty-mirror, and husk rebuild.

Notes

The DTN's 1-session cap and a missing-Lmod /etc/bashrc error are genuine cluster misconfigurations on dtn00.brc.berkeley.edu, reported separately to BRC. This PR makes sucoder route around them.

🤖 Generated with Claude Code

Ethan Ligon & Sue Coder and others added 3 commits June 29, 2026 22:50
Two robustness fixes for `collaborate`/`bootstrap` against a remote
mirror, prompted by a wedged PlayPen sync on the Savio DTN.

Symptoms: the fetch reported `couldn't find remote ref main` and the
push died with `Session open refused by peer` -> `disabling
multiplexing` -> remote login-shell noise -> `the remote end hung up
unexpectedly` / `No refs in common`.

Root cause chain:
  1. The remote mirror existed but was a husk from a prior failed
     bootstrap: `git init` ran, no push ever landed, so there was no
     `main` ref. Fetch correctly swallowed this; the push then tried
     to sync into a half-dead repo.
  2. The DTN ControlMaster refused a session for the push, so ssh fell
     back to a fresh full dial that re-ran the remote login shell. A
     broken shared dotfile (allhands/.bashrc-el8 calling a missing lmod
     init on a non-el8 DTN) then killed git-receive-pack.

Changes:
  - _remote_git_env: the GIT_SSH_COMMAND was the only SSH path missing
    the executor's standard guards. Add BatchMode/ConnectTimeout/
    ServerAlive*/LogLevel=ERROR to both the scaffolding and fallback
    branches and their inner ProxyCommand, so a refused master session
    fails fast and quietly instead of dialing a contaminated fresh
    connection. Verbosity is preserved under debug_ssh.
  - ensure_remote_clone + _remote_repo_has_content: detect a remote
    repo that exists but has no commits and no base branch (a husk) and
    rebuild it (rm -rf + git init) rather than push into it. A husk has
    no commits, so nothing is lost.
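
A minimal sketch of the husk check, assuming a `run` callable that executes a git command inside the remote repository and returns `(returncode, stdout)`. That callable, the `Runner` alias, and the function names are hypothetical and are not the signatures used in sucoder. The two git invocations are standard: `git rev-parse --verify --quiet` to test for the base branch, and `git rev-list --max-count=1 --all` to test for any commit at all.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical runner: executes argv in the remote repo, returns (returncode, stdout).
Runner = Callable[[Sequence[str]], Tuple[int, str]]


def remote_repo_has_content(run: Runner, base_branch: str = "main") -> bool:
    """Return True if the repo has the base branch or any commit on any ref."""
    # Does the base branch resolve to a commit?
    rc, _ = run(
        ["git", "rev-parse", "--verify", "--quiet",
         f"refs/heads/{base_branch}^{{commit}}"]
    )
    if rc == 0:
        return True
    # No base branch: is there any commit reachable from any ref?
    rc, out = run(["git", "rev-list", "--max-count=1", "--all"])
    return rc == 0 and bool(out.strip())


def is_husk(run: Runner, base_branch: str = "main") -> bool:
    """A husk exists on disk but has no base branch and no commits."""
    return not remote_repo_has_content(run, base_branch)
```

Because a husk by this definition has no commits, rebuilding it cannot discard work; a repo with commits on some other branch is deliberately not treated as a husk.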

This makes the failure resilient and legible; it does not fix the
underlying broken DTN dotfile or the master-refused-session condition.

Tests: transport hardening (both hops), debug verbosity preserved, and
husk rebuild. Full suite: 457 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

When a scaffolding node (DTN) is configured, git push/fetch prefer it
for its fat pipes and spare CPU. But transfer nodes commonly cap
concurrent SSH sessions, and when the DTN ControlMaster refuses a
session ("Session open refused by peer") the transport had nowhere to
go but a contaminated fresh dial -> wedged push. The filesystem
scaffolding path already fails over off the DTN (executor
_DTN_FALLBACK_RETURNCODES); git transport did not. This closes that
asymmetry.

  - _remote_git_env(use_scaffolding=False): build transport against the
    target/login node, skipping the DTN.
  - _git_transports(): ordered [DTN, login node] transports, deduped to
    one when there is no distinct fallback.
  - _is_transport_failure(): classify a git/ssh result as a broken
    connection (255/-1, "session open refused", "remote end hung up",
    etc.) vs. a real answer like an empty mirror's "couldn't find remote
    ref" — the latter would repeat on any node, so it is not retried.
  - _sync_remote: try each transport; fail over only on a transport
    fault, raise a genuine rejection immediately.
  - _pull_from_url: now takes an ordered transports list and fails over
    the fetch. This is a correctness fix, not just convenience: a
    silently-failed DTN fetch would skip the pull and let the next push
    force-overwrite agent commits, so we fail over to the login node
    before giving up. _pull_from_local passes a single local transport.
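
The classification and retry described above can be sketched as follows. This is an illustration under assumptions, not the code from this commit: `GitResult`, `is_transport_failure`, and `run_with_failover` are hypothetical names, the marker list contains only the two strings quoted in this message (the real list is longer), and returning the result rather than raising is a simplification.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class GitResult:
    returncode: int
    stderr: str = ""


# Only the markers quoted in the commit message; the real list has more entries.
TRANSPORT_MARKERS = ("session open refused", "remote end hung up")
# A real answer from git that would repeat on any node, so it is never retried.
REAL_ANSWER_MARKERS = ("couldn't find remote ref",)


def is_transport_failure(result: GitResult) -> bool:
    """Classify a failed git/ssh result as a broken connection or a real answer."""
    if result.returncode == 0:
        return False
    text = result.stderr.lower()
    if any(marker in text for marker in REAL_ANSWER_MARKERS):
        return False
    if result.returncode in (255, -1):
        return True
    return any(marker in text for marker in TRANSPORT_MARKERS)


def run_with_failover(
    transports: Sequence[str],
    attempt: Callable[[str], GitResult],
) -> GitResult:
    """Try each transport in order; move on only after a transport fault."""
    if not transports:
        raise ValueError("at least one transport is required")
    last: Optional[GitResult] = None
    for transport in transports:
        last = attempt(transport)
        if last.returncode == 0 or not is_transport_failure(last):
            # Success, or a genuine git answer: stop and let the caller handle it.
            return last
    # Every transport failed with a connection fault; report the last one.
    return last
```

In this sketch the caller decides whether a non-zero, non-transport result is fatal (a rejected push) or benign (a fetch against an empty mirror).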

Tests: transport ordering/dedup, push failover + no-failover-on-real-
error, fetch failover + no-failover-on-empty-mirror. Full suite: 463
passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Diagnostics against Savio's DTN confirmed it caps concurrent SSH
sessions to ~1: a 12-way concurrent probe over the DTN ControlMaster
refused 11 of 12. That is why the original push was refused — its DTN
channel overlapped another DTN op and lost the single session slot.
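
A probe of this kind might be structured as in the sketch below. The actual probe script is not part of this PR, so everything here is an assumption: `probe_session_cap` is a hypothetical name, and `open_session` is a hypothetical callable that would open one session over the existing ControlMaster, hold it long enough to overlap with the others, and return True if the session was accepted.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Tuple


def probe_session_cap(
    open_session: Callable[[int], bool],
    ways: int = 12,
) -> Tuple[int, int]:
    """Open `ways` sessions concurrently and return (accepted, refused)."""
    # One worker per session so that all attempts are in flight together.
    with ThreadPoolExecutor(max_workers=ways) as pool:
        outcomes = list(pool.map(open_session, range(ways)))
    accepted = sum(1 for ok in outcomes if ok)
    return accepted, ways - accepted
```

The result only reflects a cap if each session stays open while the others are attempted; sessions that finish instantly would not contend for the slot.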

The DTN's fat pipes aren't worth that fragility for git's small delta
transfers, so flip _git_transports ordering: the target/login node is
now primary, and a configured DTN is kept only as a secondary fallback
(preserving a route home if the login master is down, at no cost in the
common path). Filesystem scaffolding still uses the DTN directly via
run_on_login_node — those ops are sequential and fine under a 1-session
cap. The login-node-first push/fetch now avoids the cap entirely in the
normal case.

No new config: this changes the default ordering only. The failover
machinery (transport classification, retry) is unchanged; only the
order of the transport list flips.
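
The new ordering amounts to something like the sketch below. The function name, the use of plain host strings as transports, and the example hostnames in the usage note are assumptions for illustration; the real _git_transports builds full transport environments.

```python
from typing import List, Optional


def git_transports(login_node: str, dtn: Optional[str] = None) -> List[str]:
    """Return git transports in preference order: login node first, DTN as fallback."""
    ordered = [login_node]
    # Keep the DTN only when it is configured and distinct from the login node,
    # so the list collapses to one entry when there is no real fallback.
    if dtn and dtn != login_node:
        ordered.append(dtn)
    return ordered
```

For example, `git_transports("login-host", "dtn-host")` gives `["login-host", "dtn-host"]`, while an unset or identical DTN leaves a single-entry list and therefore no failover.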

Tests updated for the new ordering, plus a prefers-login-node case.
Full suite: 464 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ligon force-pushed the feat/remote-sync-resilience branch from b7ee2b7 to fe20f3d on June 30, 2026 05:51
ligon merged commit fe20f3d into main on Jun 30, 2026
4 checks passed