Make remote git transport resilient to a session-capped DTN #5
Merged
Two robustness fixes for `collaborate`/`bootstrap` against a remote
mirror, prompted by a wedged PlayPen sync on the Savio DTN.
Symptoms: the fetch reported `couldn't find remote ref main` and the
push died with `Session open refused by peer` -> `disabling
multiplexing` -> remote login-shell noise -> `the remote end hung up
unexpectedly` / `No refs in common`.
Root cause chain:
1. The remote mirror existed but was a husk from a prior failed
bootstrap: `git init` ran, no push ever landed, so there was no
`main` ref. Fetch correctly swallowed this; the push then tried
to sync into a half-dead repo.
2. The DTN ControlMaster refused a session for the push, so ssh fell
back to a fresh full dial that re-ran the remote login shell. A
broken shared dotfile (allhands/.bashrc-el8 calling a missing lmod
init on a non-el8 DTN) then killed git-receive-pack.
Changes:
- _remote_git_env: the GIT_SSH_COMMAND was the only SSH path missing
the executor's standard guards. Add BatchMode/ConnectTimeout/
ServerAlive*/LogLevel=ERROR to both the scaffolding and fallback
branches and their inner ProxyCommand, so a refused master session
fails fast and quietly instead of dialing a contaminated fresh
connection. Verbosity is preserved under debug_ssh.
- ensure_remote_clone + _remote_repo_has_content: detect a remote
repo that exists but has no commits and no base branch (a husk) and
rebuild it (rm -rf + git init) rather than push into it. A husk has
no commits, so nothing is lost.
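The transport hardening described above could look roughly like the following. This is a minimal sketch, not the code from this PR: `build_git_ssh_command`, `HARDENING_OPTS`, and the timeout values are hypothetical, and the inner ProxyCommand hop is not modeled. The `-o` option names themselves are standard OpenSSH client options.

```python
import shlex

# Illustrative guard set; the numeric values are assumptions, not the
# executor's actual settings.
HARDENING_OPTS = {
    "BatchMode": "yes",           # never prompt; fail instead of hanging
    "ConnectTimeout": "10",       # give up quickly on an unreachable host
    "ServerAliveInterval": "15",  # probe the connection for liveness
    "ServerAliveCountMax": "3",   # drop after three missed probes
}


def build_git_ssh_command(debug_ssh=False, control_path=None):
    """Return a GIT_SSH_COMMAND string with fail-fast, quiet defaults."""
    opts = dict(HARDENING_OPTS)
    # Silence login-shell noise unless the caller asked for ssh debugging.
    if not debug_ssh:
        opts["LogLevel"] = "ERROR"
    if control_path is not None:
        opts["ControlPath"] = control_path
    argv = ["ssh"]
    for key, value in opts.items():
        argv += ["-o", f"{key}={value}"]
    return shlex.join(argv)


# git reads this variable to decide how to invoke ssh.
git_env = {"GIT_SSH_COMMAND": build_git_ssh_command()}
```

Leaving `LogLevel` unset under `debug_ssh` is one way to keep verbosity available for diagnosis while staying quiet by default.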
This makes the failure resilient and legible; it does not fix the
underlying broken DTN dotfile or the master-refused-session condition.
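For illustration, the husk check ("exists, but no commits and no base branch") can be sketched as below. This is an assumption-laden sketch rather than the PR's `_remote_repo_has_content`: `is_husk` and `_run_git` are hypothetical names, and the sketch runs git against a local path, whereas the real check runs on the remote host.

```python
import subprocess


def _run_git(repo, *args):
    """Run git in `repo`; return (succeeded, stripped stdout)."""
    result = subprocess.run(
        ["git", "-C", repo, *args], capture_output=True, text=True
    )
    return result.returncode == 0, result.stdout.strip()


def is_husk(repo, base_branch="main", run=_run_git):
    """True if `repo` is a git repo with no base branch and no commits."""
    # Not a repository at all: nothing to rebuild here.
    # (Caveat: git can also discover a repository in a parent directory.)
    is_repo, _ = run(repo, "rev-parse", "--git-dir")
    if not is_repo:
        return False
    has_branch, _ = run(
        repo, "show-ref", "--verify", "--quiet", f"refs/heads/{base_branch}"
    )
    # Prints one commit id if any ref points at a commit, otherwise nothing.
    _, any_commit = run(repo, "rev-list", "-n", "1", "--all")
    return not has_branch and not any_commit
```

Requiring both conditions is what makes the rebuild safe: a repo with any commit on any ref is never classified as a husk, so `rm -rf` cannot discard work.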
Tests: transport hardening (both hops), debug verbosity preserved, and
husk rebuild. Full suite: 457 passed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
When a scaffolding node (DTN) is configured, git push/fetch prefer it
for its fat pipes and spare CPU. But transfer nodes commonly cap
concurrent SSH sessions, and when the DTN ControlMaster refuses a
session ("Session open refused by peer") the transport had nowhere to
go but a contaminated fresh dial -> wedged push. The filesystem
scaffolding path already fails over off the DTN (executor
_DTN_FALLBACK_RETURNCODES); git transport did not. This closes that
asymmetry.
- _remote_git_env(use_scaffolding=False): build transport against the
target/login node, skipping the DTN.
- _git_transports(): ordered [DTN, login node] transports, deduped to
one when there is no distinct fallback.
- _is_transport_failure(): classify a git/ssh result as a broken
connection (255/-1, "session open refused", "remote end hung up",
etc.) vs. a real answer like an empty mirror's "couldn't find remote
ref" — the latter would repeat on any node, so it is not retried.
- _sync_remote: try each transport; fail over only on a transport
fault, raise a genuine rejection immediately.
- _pull_from_url: now takes an ordered transports list and fails over
the fetch. This is a correctness fix, not just convenience: a
silently-failed DTN fetch would skip the pull and let the next push
force-overwrite agent commits, so we fail over to the login node
before giving up. _pull_from_local passes a single local transport.
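The classify-then-fail-over logic above can be sketched as follows. The names here (`GitResult`, `GitRejected`, `run_with_failover`) and the exact marker list are hypothetical; only the return codes 255/-1 and the two quoted stderr fragments come from the description above. The loop models the push path, where a genuine rejection is raised immediately.

```python
from dataclasses import dataclass


@dataclass
class GitResult:
    returncode: int
    stderr: str = ""


# Fragments that indicate a broken connection rather than an answer from git.
_TRANSPORT_MARKERS = ("session open refused", "remote end hung up")
_TRANSPORT_RETURNCODES = (255, -1)


class GitRejected(RuntimeError):
    """The remote answered, and the answer was no."""


def is_transport_failure(result):
    if result.returncode == 0:
        return False
    if result.returncode in _TRANSPORT_RETURNCODES:
        return True
    text = result.stderr.lower()
    return any(marker in text for marker in _TRANSPORT_MARKERS)


def run_with_failover(transports, attempt):
    """Call attempt(transport) in order; fail over only on transport faults."""
    last = None
    for transport in transports:
        last = attempt(transport)
        if last.returncode == 0:
            return last
        if not is_transport_failure(last):
            # A real answer would repeat on any node, so do not retry.
            raise GitRejected(last.stderr)
    detail = last.stderr if last is not None else "no transports configured"
    raise ConnectionError(f"all transports failed: {detail}")
```

With this split, an empty mirror's `couldn't find remote ref` is treated as an answer and never triggers a second dial, while `Session open refused by peer` moves on to the next transport.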
Tests: transport ordering/dedup, push failover + no-failover-on-real-
error, fetch failover + no-failover-on-empty-mirror. Full suite: 463
passed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Diagnostics against Savio's DTN confirmed it caps concurrent SSH
sessions to ~1: a 12-way concurrent probe over the DTN ControlMaster
refused 11 of 12. That is why the original push was refused: its DTN
channel overlapped another DTN op and lost the single session slot.
The DTN's fat pipes aren't worth that fragility for git's small delta
transfers, so flip _git_transports ordering: the target/login node is
now primary, and a configured DTN is kept only as a secondary fallback
(preserving a route home if the login master is down, at no cost in
the common path).
Filesystem scaffolding still uses the DTN directly via
run_on_login_node; those ops are sequential and fine under a 1-session
cap. The login-node-first push/fetch now avoids the cap entirely in
the normal case.
No new config: this changes the default ordering only. The failover
machinery (transport classification, retry) is unchanged; only the
order of the transport list flips.
Tests updated for the new ordering, plus a prefers-login-node case.
Full suite: 464 passed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
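The login-node-first ordering with dedup might be sketched like this. It is a simplified stand-in for `_git_transports`, which presumably returns richer transport objects; `git_transports` and the plain-string host arguments are hypothetical.

```python
def git_transports(login_node, dtn=None):
    """Return transports in preference order: login node first, DTN last."""
    candidates = [login_node, dtn]
    # dict.fromkeys preserves insertion order, so this dedups while keeping
    # the login node ahead of the DTN.
    return [host for host in dict.fromkeys(candidates) if host]
```

For example, `git_transports("login.example", "dtn.example")` yields both hosts with the login node first, while an unset DTN, or one identical to the login node, collapses the list to a single transport.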
Summary
Makes `sucoder collaborate`/`bootstrap` resilient to a flaky remote git mirror, prompted by a wedged PlayPen sync against the Savio DTN.
Diagnosis (confirmed against the live cluster):
- `git init` ran but no push ever landed, so the remote mirror had no `main` ref. Fetch reported `couldn't find remote ref main`; the push then tried to sync into a half-dead repo.
- `dtn.brc.berkeley.edu` caps concurrent SSH sessions at ~1 (a 12-way probe refused 11 of 12). When the DTN ControlMaster refused the push's session, the transport fell back to a fresh dial that re-ran a broken login shell and wedged with `the remote end hung up`.
- `GIT_SSH_COMMAND` was the only SSH path missing the executor's standard hardening, so that fallback was both fragile and noisy.
Changes
- (57cb7f9): add BatchMode / ConnectTimeout / ServerAlive / LogLevel=ERROR to the `GIT_SSH_COMMAND` (scaffolding and fallback branches plus their inner ProxyCommand). A refused master session now fails fast and quietly instead of dialing a contaminated fresh connection. Verbosity is preserved under `--debug-ssh`.
- (57cb7f9): `ensure_remote_clone` detects a remote repo that exists but has no commits and no base branch, and rebuilds it (`rm -rf` + `git init`) rather than pushing into it. A husk has no commits, so nothing is lost.
- (988dc9d): new `_git_transports` / `_is_transport_failure` plus retry in `_sync_remote` / `_pull_from_url`. A broken connection fails over to the next transport; a real git answer (e.g. an empty mirror's `couldn't find remote ref`) does not retry. The pull failover is a correctness fix: a silently-failed fetch could let the next force-push overwrite agent commits.
- (b7ee2b7): flip the default transport ordering to login-node-first, DTN fallback-only, since the DTN's ~1 session cap makes it fragile for git. Filesystem scaffolding still uses the DTN directly.
Tests
464 passing. New coverage: transport ordering/dedup, push and fetch failover, no-failover-on-real-error, no-failover-on-empty-mirror, and husk rebuild.
Notes
The DTN's 1-session cap and a missing-Lmod `/etc/bashrc` error are genuine cluster misconfigurations on `dtn00.brc.berkeley.edu`, reported separately to BRC. This PR makes sucoder route around them.
🤖 Generated with Claude Code