[ddev] Wait for Agent cmd-server before returning from start() #23646
Draft
Kyle-Neale wants to merge 2 commits into master
Without this gate, callers race the Agent's startup: `docker run` returns before the in-container Agent finishes initializing the check loader, so an immediate `agent check <name>` can return "no valid check found" and exit 255. This has been showing up as the SNMP master.yml flake since agent-data-plane started failing fast on TLS init.

Adds a no-op `wait_until_ready` default on `AgentInterface` and a real implementation on `DockerAgent` that polls `agent status --json` via `stamina.retry_context`. Called at the end of `start()`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
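The readiness gate described in this commit can be sketched roughly as follows. This is not the PR's code. It swaps `stamina.retry_context` for a plain standard-library polling loop so the example has no third-party dependency. `run_status` is a hypothetical callable standing in for whatever executes `agent status --json` inside the container. Treating "exit code 0 plus parseable JSON" as ready is also an assumption.

```python
import json
import time
from typing import Callable, Tuple


def wait_until_ready(
    run_status: Callable[[], Tuple[int, str]],  # hypothetical: returns (exit_code, stdout)
    timeout: float = 30.0,
    interval: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Poll run_status() until it looks ready, or raise TimeoutError.

    Assumed readiness criterion: exit code 0 and stdout that parses as JSON.
    """
    deadline = clock() + timeout
    while True:
        exit_code, stdout = run_status()
        if exit_code == 0:
            try:
                json.loads(stdout)
                return  # the status command answered with valid JSON
            except json.JSONDecodeError:
                pass  # command ran but output is not usable yet; keep polling
        if clock() >= deadline:
            raise TimeoutError(f"Agent not ready after {timeout} seconds")
        sleep(interval)
```

Injecting `sleep` and `clock` keeps the loop testable without real delays, which is one way to cover both the happy path and the timeout path in unit tests.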
The `--config-file` path of `ddev env agent` previously renamed the original config away before writing the override, leaving a window in which the mounted conf.d directory had no config for the integration. Agent autodiscovery rescans on file events; if it scanned during that window it deregistered the check, and the immediately-following `agent check <name>` returned "no valid check found".

This is the actual SNMP master.yml flake fingerprint: agent runs cleanly for 10+ minutes (20-30 successful check cycles), then a single test using `dd_agent_check` (which goes through this code path) hits the race and fails. Two of the last three master.yml SNMP failures match it exactly.

Switch to read-modify-restore in place. `EnvData.write_config` now writes via tmp + os.replace so the file is never transiently absent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
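The tmp + `os.replace` pattern mentioned in this commit can be illustrated with a minimal sketch. `write_atomically` is a hypothetical helper, not the actual `EnvData.write_config` implementation, and the temp-file naming is an assumption. In particular, whether Agent autodiscovery ignores a dot-prefixed `.tmp` file in conf.d is not confirmed here.

```python
import os
import tempfile
from pathlib import Path


def write_atomically(path: Path, content: str) -> None:
    """Replace `path` with `content` without the file ever being absent."""
    # Create the temp file in the same directory so os.replace stays on one
    # filesystem, where the rename is atomic.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=f".{path.name}.", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(content)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before the swap
        # Readers see either the old file or the new one, never a gap.
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the temp file if anything failed before the swap.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

A read-modify-restore flow would then read the original content, call the helper with the override, run the check, and call the helper again with the saved original.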
Validation Report: All 20 validations passed.
Summary
- Adds a `wait_until_ready` gate on `AgentInterface` (no-op default) and a `DockerAgent` implementation that polls `agent status --json` via `stamina.retry_context`.
- Calls it at the end of `DockerAgent.start()` so callers no longer race the in-container Agent's check-loader initialization.

Why

SNMP master.yml has been flaking with `agent check snmp` returning "no valid check found" (exit 255). The first occurrence on 2026-04-06 ~14:00 UTC matches `agent-data-plane` PR #1177 (initialize TLS early, before supervisor) merging at 13:50 UTC. ADP now fails fast on TLS init and races the core Agent's HTTPS listener, which intermittently leaves the check loader uninitialized when the test invokes `agent check`. The fix lives in `ddev` because the race is local to E2E orchestration: the test was already racing the Agent, and the Saluki change just made the window long enough to hit.

Test plan

- `ddev --no-interactive test ddev` (946 passed, 17 skipped)
- `ddev test -fs ddev` (clean)
- `TestWaitUntilReady` covers happy path and timeout
- `TestStart` cases unaffected (autouse fixture no-ops the probe so exact `subprocess.run` call-list assertions still hold)

🤖 Generated with Claude Code