[ddev] Wait for Agent cmd-server before returning from start() #23646

Draft
Kyle-Neale wants to merge 2 commits into master from kyle.neale/ddev-agent-readiness-gate

Conversation

@Kyle-Neale
Contributor

Summary

  • Adds a `wait_until_ready` gate on `AgentInterface` (no-op default) and a `DockerAgent` implementation that polls `agent status --json` via `stamina.retry_context`.
  • Called at the end of `DockerAgent.start()` so callers no longer race the in-container Agent's check-loader initialization.
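
For readers who want the shape of the gate without opening the diff, here is a minimal sketch of the polling idea. It is not the code in this PR: it swaps `stamina.retry_context` for a plain standard-library loop so it runs without extra dependencies, and the names `docker_status_probe`, `probe`, `interval`, `sleep`, and `clock` are hypothetical. It also assumes "ready" means `agent status --json` exits 0 and prints parseable JSON, which the PR text does not spell out.

```python
import json
import subprocess
import time


def wait_until_ready(probe, timeout=30.0, interval=0.5, sleep=time.sleep, clock=time.monotonic):
    """Call `probe` until it returns True; raise TimeoutError after `timeout` seconds."""
    deadline = clock() + timeout
    while True:
        if probe():
            return
        if clock() >= deadline:
            raise TimeoutError(f"Agent not ready after {timeout} seconds")
        sleep(interval)


def docker_status_probe(container_name):
    """Hypothetical probe: True once `agent status --json` succeeds inside the container."""

    def probe():
        result = subprocess.run(
            ["docker", "exec", container_name, "agent", "status", "--json"],
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            return False
        try:
            # Assumption: parseable JSON on stdout is a good enough readiness signal.
            json.loads(result.stdout)
        except json.JSONDecodeError:
            return False
        return True

    return probe
```

Under these assumptions, a `start()` implementation would end with something like `wait_until_ready(docker_status_probe(name))`. Passing `sleep` and `clock` in as parameters keeps the happy-path and timeout cases testable without real waiting.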

Why

SNMP master.yml has been flaking with `agent check snmp` returning "no valid check found" (exit 255). The first occurrence, on 2026-04-06 at ~14:00 UTC, lines up with agent-data-plane PR #1177 (initialize TLS early, before supervisor), which merged at 13:50 UTC. ADP now fails fast on TLS init and races the core Agent's HTTPS listener, which intermittently leaves the check loader uninitialized when the test invokes `agent check`. The fix lives in ddev because the race is local to E2E orchestration: the test was already racing the Agent, and the Saluki change just made the window long enough to hit.

Test plan

  • ddev --no-interactive test ddev (946 passed, 17 skipped)
  • ddev test -fs ddev (clean)
  • New TestWaitUntilReady covers happy path and timeout
  • Existing TestStart cases unaffected (autouse fixture no-ops the probe so exact subprocess.run call-list assertions still hold)
  • Watch SNMP on next master run to confirm the flake clears

🤖 Generated with Claude Code

Without this gate, callers race the Agent's startup: `docker run` returns
before the in-container Agent finishes initializing the check loader, so
an immediate `agent check <name>` can return "no valid check found" and
exit 255. This has been showing up as the SNMP master.yml flake since
agent-data-plane started failing fast on TLS init.

Adds a no-op `wait_until_ready` default on `AgentInterface` and a real
implementation on `DockerAgent` that polls `agent status --json` via
`stamina.retry_context`. Called at the end of `start()`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dd-octo-sts Bot added the ddev label May 8, 2026
@codecov

codecov Bot commented May 8, 2026

Codecov Report

❌ Patch coverage is 90.62500% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 90.85%. Comparing base (772b9c9) to head (acdab00).
⚠️ Report is 1 commit behind head on master.

@datadog-datadog-prod-us1
Contributor

datadog-datadog-prod-us1 Bot commented May 8, 2026

Tests

⚠️ Warnings

🧪 1 Test failed

❄️ Known flaky: test_e2e_profile_cisco_asr_1001x from test_e2e_core_vs_python.py
[s6-init] making user provided files available at /var/run/s6/etc...exited 0.
[s6-init] ensuring user provided files have correct perms...exited 0.
[fix-attrs.d] applying ownership & permissions fixes...
[fix-attrs.d] done.
[cont-init.d] executing container initialization scripts...
[cont-init.d] 01-check-apikey.sh: executing... 
[cont-init.d] 01-check-apikey.sh: exited 0.
[cont-init.d] 50-ci.sh: executing... 
[cont-init.d] 50-ci.sh: exited 0.
[cont-init.d] 50-ecs-managed.sh: executing... 
...

ℹ️ Info

No other issues found

❄️ No new flaky tests detected

🎯 Code Coverage
Patch Coverage: 90.62%
Overall Coverage: 87.36% (+0.12%)

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: acdab00

The `--config-file` path of `ddev env agent` previously renamed the original
config away before writing the override, leaving a window in which the
mounted conf.d directory had no config for the integration. Agent
autodiscovery rescans on file events; if it scanned during that window it
deregistered the check, and the immediately-following `agent check <name>`
returned "no valid check found".

This is the actual SNMP master.yml flake fingerprint: agent runs cleanly for
10+ minutes (20-30 successful check cycles), then a single test using
`dd_agent_check` (which goes through this code path) hits the race and
fails. Two of the last three master.yml SNMP failures match it exactly.

Switch to read-modify-restore in place. `EnvData.write_config` now writes
to a temporary file and then calls `os.replace`, so the config file is never
transiently absent.
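
The write pattern described above can be sketched as follows. This is an illustration under stated assumptions, not the body of `EnvData.write_config`: the helper name `write_atomically` and the `.tmp-` prefix are hypothetical. The temporary file is created in the same directory as the target so that `os.replace` stays on one filesystem, which is what makes the swap atomic.

```python
import contextlib
import os
import tempfile


def write_atomically(path, content):
    """Write `content` to `path` without ever leaving `path` missing or half-written."""
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory as the target, so os.replace does not cross filesystems.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(content)
        # Atomic swap: readers see either the old file or the new one.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything failed before the swap.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_path)
        raise
```

Two caveats for a real implementation: `tempfile.mkstemp` creates the file with owner-only permissions, so the original file's mode may need to be copied across if the in-container Agent reads it as a different user, and the temporary name should not look like a config file that autodiscovery would pick up.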

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dd-octo-sts
Contributor

dd-octo-sts Bot commented May 8, 2026

Validation Report

All 20 validations passed.

| Validation | Description |
| --- | --- |
| agent-reqs | Verify check versions match the Agent requirements file |
| ci | Validate CI configuration and Codecov settings |
| codeowners | Validate every integration has a CODEOWNERS entry |
| config | Validate default configuration files against spec.yaml |
| dep | Verify dependency pins are consistent and Agent-compatible |
| http | Validate integrations use the HTTP wrapper correctly |
| imports | Validate check imports do not use deprecated modules |
| integration-style | Validate check code style conventions |
| jmx-metrics | Validate JMX metrics definition files and config |
| labeler | Validate PR labeler config matches integration directories |
| legacy-signature | Validate no integration uses the legacy Agent check signature |
| license-headers | Validate Python files have proper license headers |
| licenses | Validate third-party license attribution list |
| metadata | Validate metadata.csv metric definitions |
| models | Validate configuration data models match spec.yaml |
| openmetrics | Validate OpenMetrics integrations disable the metric limit |
| package | Validate Python package metadata and naming |
| readmes | Validate README files have required sections |
| saved-views | Validate saved view JSON file structure and fields |
| version | Validate version consistency between package and changelog |
