config-discovery: AgentCheck-level trial runner (PoC B)#23624
Draft
config-discovery: AgentCheck-level trial runner (PoC B)#23624
Conversation
… Inf values The Prometheus exposition format allows NaN, +Inf, and -Inf as metric values (e.g., summary quantiles that haven't been observed yet return NaN). The discovery verifier regex only matched numeric values, so endpoints whose first metric value was NaN would fail the Prometheus check. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- n8n: add auto_conf_discovery.yaml (ad_identifiers: n8nio/n8n), update conftest.py to mount discovery helpers, add e2e discovery test - kuma: add auto_conf_discovery.yaml (ad_identifiers: kumahq/kuma-cp) with DISCOVERY_PORT_HINTS=[5680] - pulsar: add auto_conf_discovery.yaml (ad_identifiers: apachepulsar/pulsar, pulsar), add DISCOVERY_PORT_HINTS=[8080], update conftest.py and e2e test Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add auto_conf_discovery.yaml with AD identifiers and discovery: {} to
enable the Agent's generic OpenMetrics port scanning for the following
integrations: aerospike, appgate_sdp, argo_rollouts, argo_workflows,
aws_neuron, bentoml, calico, celery, dcgm, falco, fluxcd,
hugging_face_tgi, karpenter, keda, kubernetes_cluster_autoscaler,
kyverno, milvus, nvidia_nim, nvidia_triton, quarkus, ray, temporal,
velero, vllm, weaviate.
Add DISCOVERY_PORT_HINTS to integrations with officially documented
well-known ports: aerospike (9145), aws_neuron (8000), bentoml (3000),
dcgm (9400), kubernetes_cluster_autoscaler (8085), kyverno (8000),
nvidia_nim (8000), nvidia_triton (8002), quarkus (8080/q/metrics),
ray (8080), vllm (8000), weaviate (2112).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Update conftest to mount discovery helpers and auto_conf_discovery.yaml. Tag the built quarkus image so the AD identifier matches. Add test_e2e_discovery that verifies the openmetrics.health service check. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fix ad_identifier (temporalio/server → temporalio/auto-setup), add DISCOVERY_PORT_HINTS=[8000]. Extend create_log_volumes() wrapper to also mount discovery helpers, and add test_e2e_discovery. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Extend create_log_volumes() to also mount discovery helpers and auto_conf_discovery.yaml. Add test_e2e_discovery that verifies the openmetrics.health service check passes for at least one discovered instance. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Switch flower service from custom build to vanilla mher/flower image so the container's image name matches the ad_identifier. Use explicit --broker flag with auth credentials. Add discovery constants and 2-tuple yield to conftest, add test_e2e_discovery. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Discovery support requires a passing e2e test to verify the mechanism
works end-to-end (container image matches ad_identifier, port is
reachable, metrics parse correctly). The following integrations could
not satisfy that requirement:
Caddy-based mock servers — e2e environments use caddy:2.7 to serve
fixture files, so the running container's image is "caddy", not the
real ad_identifier. Discovery would silently never fire in the test
environment, making the test meaningless:
appgate_sdp, aws_neuron, dcgm, hugging_face_tgi, karpenter,
kubernetes_cluster_autoscaler, nvidia_nim, nvidia_triton, vllm
No Docker e2e environment — no docker-compose setup exists in the
tests/ directory, so there is no environment in which to run an e2e
discovery test:
argo_rollouts, argo_workflows, bentoml, calico, fluxcd, keda,
kyverno, velero, weaviate
Base e2e fails to start — the existing e2e test environment does not
come up cleanly, which is a prerequisite before adding discovery:
aerospike: "startup was not complete, exiting immediately"
milvus: MilvusException during script-runner setup
(InvalidateCollectionMetaCache node-mismatch error)
network_mode: host — docker-compose uses host networking so the
container exposes no port bindings; candidate_ports() receives an
empty service.ports list and yields nothing, making discovery
structurally impossible:
falco
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The e2e test required patching docker-compose.yaml to add `image: quarkus` alongside `build: micrometer-quickstart`. That tag exists solely to make the container's image name match the ad_identifier — nobody runs an image literally named `quarkus` in production. Quarkus users build their own application images with arbitrary names, so there is no canonical image name to use as an ad_identifier. The ad_identifier for caddy-based integrations is wrong for a similar reason (the running image is caddy, not the real service), but at least those integrations intend to match a real published image when deployed. For quarkus the identifier is purely fictional. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ith e2e tests
For each integration:
- Added DISCOVERY_PORT_HINTS (and DISCOVERY_METRICS_PATH for cockroachdb) to the
check class so the generic port-scanning discover() can find the service.
- Added auto_conf_discovery.yaml with ad_identifiers and discovery: {} to wire up
the Autodiscovery template.
- Updated conftest.py to mount the autoconf YAML and local discovery helpers into
the agent container for e2e testing.
- Added test_e2e_discovery to verify the full AD → discover() → check scheduling
path.
boundary: Added discover() override to also derive health_endpoint from the
discovered openmetrics_endpoint (required field in InstanceConfig).
cockroachdb, kong: Both integrations have a V1 legacy wrapper class (CockroachdbCheck /
Kong) whose __new__ delegates to the real V2 subclass when openmetrics_endpoint is
present. The agent resolves the check class by name and looks up discover() on this
wrapper, which inherits from OpenMetricsBaseCheck (V1) — no discover() there. Fixed
by forwarding discover() on the wrapper to the V2 subclass:
@classmethod
def discover(cls, service):
return CockroachdbCheckV2.discover(service) # / KongCheck.discover(service)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The check monitors Flower (mher/flower), not Celery workers directly. The e2e environment builds a custom image from ./proj that bundles both Celery and Flower with unpinned versions. Switching the flower service to vanilla mher/flower introduces a Celery version mismatch between the workers and Flower. More fundamentally, users who don't run Flower, or who embed it into their own custom image, get no discovery benefit from targeting mher/flower. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The METRIC_MAP keys are unprefixed (e.g. process_cpu_user_seconds) but the n8n Prometheus endpoint exposes metrics with an n8n_ prefix. Without stripping that prefix before matching, nothing in the map matched and only the readiness gauge was ever submitted. Fix by adding raw_metric_prefix: n8n_ to get_default_config(), and correct the spec.yaml display_default/example from n8n to n8n_ to match. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Design for an experiment validating declarative-probe based advanced auto-configuration on the krakend integration end-to-end against a real Agent build. Targets the generic-openmetrics-scan bucket (51/260 integrations) identified in the vitkykra/autoconfig-analysis branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bite-sized TDD tasks covering the new auto_conf_discovery.yaml file format, the discovery package (prober + cache + service wrapper), the %%discovered_port%% template variable, the configmgr wiring, and three end-to-end demo scenarios against a real Agent build. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ment Declares the krakend ad_identifier with an OpenMetrics probe spec (default port 8090, /metrics path). Consumed by the new auto_conf_discovery file format in datadog-agent. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The dev env's KrakenD config exposes Prometheus metrics on port 9090 (via the OpenTelemetry Prometheus exporter), not 8090. The fallback scan would have found it anyway, but the hint should be correct. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Successor to the KrakenD experiment, covering the HTTP-verification (35) and TCP-protocol (6) integration buckets via a Python discover(service) callback that returns concrete instance configs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First of three plans for the advanced auto-config discover() experiment. Covers the discovery helpers and types in datadog_checks_base, with no cross-repo dependencies. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…notations Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…elpers Add TYPE_CHECKING guard for `requests` import in verifiers.py to resolve ruff F821 (undefined name) on forward-reference annotations. Apply ruff auto-formatting to verifiers.py and test_tcp.py (blank lines, line length). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The kwarg name 'verify' collides with the well-known 'requests.get(verify=)' TLS-verification kwarg. Rename to 'verifier' before downstream callers (per-integration discover() methods in a future plan) cement the name.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Five tasks covering airflow (HTTP multi-step version detection), twemproxy (TCP banner JSON shape), and hdfs_namenode (HTTP JSON-shape verification), plus one shared task to add a response_json_keys TCP verifier in datadog_checks_base. Verification is via @pytest.mark.e2e tests using the existing dd_agent_check harness with discovery_min_instances + discovery_timeout, mirroring the krakend test_e2e_discovery pattern. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a generic discover() classmethod to OpenMetricsBaseCheckV2 that scans for Prometheus exposition endpoints on available ports. Subclasses can set DISCOVERY_PORT_HINTS and DISCOVERY_METRICS_PATH to customise the behaviour. KrakendCheck now inherits this instead of defining its own copy. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove OpenMetricsBaseCheckV2.discover() and the _run_discover rtloader helper. The Service/Port dataclasses and the discovery probe helpers (http_probe, candidate_ports, is_prometheus_exposition) are kept -- they are reused by the alt mechanism in krakend's check() override. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When the agent schedules krakend with __discovery_service__ in the instance config, KrakendCheck probes candidate ports (hint: 9090) on the first check() call and reconfigures the OpenMetrics scraper for the responding /metrics endpoint. Subsequent calls run the normal OpenMetrics path against the cached endpoint. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Cleanup of leftover tests from the strip commit (5298b2b) that targeted the removed discover() classmethod. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Push the placeholder-endpoint injection, the __discovery_service__ detection in check(), the port-probing/scraper-reconfiguration logic, and the configure_scrapers skip into OpenMetricsBaseCheckV2. Per-integration adoption now requires only DISCOVERY_PORT_HINTS = [...] (and optionally DISCOVERY_METRICS_PATH). Strip krakend/check.py of its custom __init__, check(), and _configure_from_discovery; the diff vs main is now a single DISCOVERY_PORT_HINTS line. Verified krakend unit + e2e_discovery + non-discovery e2e all pass, and the krakend-delayed smoke shows trial-mode failures suppressed at DEBUG with tracebacks pointing at openmetrics/v2/base.py:_resolve_discovery.
…overy Per-integration adoption story now matches PoC A's: most integrations only add DISCOVERY_PORT_HINTS = [...] (1-line). Boundary's PoC A discover() override pattern -- 'super() then derive a related field from the endpoint' -- maps to the new _post_discovery_hook on the base class. Base class additions: - ensure_discovery_resolved() public hook so subclasses with custom check() bodies (boundary) can resolve discovery before reading self.config fields whose values are derived during discovery (e.g. health_endpoint). - _post_discovery_hook() extension point: subclasses can update self.instance fields after the endpoint is resolved; the base class rebuilds the config-model (self._config_model_instance) so ConfigMixin subclasses see the post-discovery values. - candidate_ports now always yields hints, even if Docker reports the service with an empty ports list (e.g. ray's task containers). Hints are best-effort probes; http_probe handles unreachable ports gracefully. - http_probe default timeout 0.5s -> 2.0s. Larger Prometheus exporters (ray's /metrics is ~310KB) reliably exceeded 0.5s on the docker network. Per-integration changes: - boundary: __init__ placeholder for health_endpoint, _post_discovery_hook derives health_endpoint from openmetrics_endpoint, check() now calls ensure_discovery_resolved() before reading self.config.health_endpoint. - kong, cockroachdb (legacy wrappers): __new__ now routes both openmetrics_endpoint and __discovery_service__ instances to the V2 class; redundant discover() classmethods removed. - cockroachdb (V2): merged configure_additional_transformers into configure_scrapers, iterating self.scrapers.values() so transformers are attached after every scraper rebuild (including the post-discovery one). - temporal: configure_scrapers now iterates safely instead of indexing. - n8n: DISCOVERY_PORT_HINTS=[5678]; _post_discovery_hook refreshes the cached self.openmetrics_endpoint that __init__ captured from the placeholder. Verified: - All affected integrations: unit tests pass. - e2e_discovery PASS: krakend, boundary, cockroachdb, n8n, pulsar, temporal. - ray e2e remains unstable: the auto_conf matches all rayproject/ray containers (head + workers + short-lived task containers); dd_agent_check's replay_check_run surfaces per-instance ConfigurationError from any trial-mode failure even when other instances succeed. Manual `agent check ray --discovery-min-instances 1` exits 0 with "Total Runs: 1, Last Successful Execution Date" -- the implementation works; the test framework needs adjustment for multi-container AD matches. Out of scope for this commit. - kuma: e2e infra runs in a kind cluster, would require separate alt-PoC plumbing in conftest. Out of scope. - kong: no test_e2e.py exists. Out of scope.
This change was added speculatively while debugging ray's e2e failure -- ray's actual blocker turned out to be dd_agent_check.replay_check_run surfacing per-instance ConfigurationError, not the empty hints. The 6 passing e2e tests (krakend, boundary, cockroachdb, n8n, pulsar, temporal) all have their hint port in the container's EXPOSE list, so candidate_ports yields it identically before and after the change. Reverting to keep the behavior unchanged from PoC A.
Same justification as the candidate_ports revert: this was speculative while debugging ray, not a fix for any passing test. Ray's blocker was dd_agent_check.replay_check_run, not probe timeout. The 6 passing e2e tests all probe /metrics responses small enough for 0.5s.
PoC A's krakend had no port hint -- candidate_ports yielded the service's exposed ports in order. The container exposes 9090 (and 8080, 8090); the probe rejects 8080's non-prometheus response and accepts 9090. The hint was speculative on my part and just an optimization; remove to keep the diff vs main minimal and matching PoC A. Krakend's check.py is now unchanged from main. Drop the unit test that asserted the hint -- the remaining test_trial_mode_probes_and_configures_scraper still verifies the probe-and-configure flow end-to-end with mocked http_probe. Verified: krakend e2e_discovery passes, krakend unit suite passes (16/16).
PoC A's n8n had no port hint either. The container exposes only 5678 so candidate_ports yields it from service.ports without needing a hint. Verified: e2e_discovery passes without the hint. n8n's check.py diff vs main is now just the _post_discovery_hook to refresh the cached self.openmetrics_endpoint after discovery.
Replace OpenMetricsBaseCheckV2-level trial-mode plumbing with an AgentCheck- level mechanism. AgentCheck.__new__ detects __discovery_service__ in the instance and routes construction through a dynamically-generated proxy subclass (_TrialModeMixin + target_cls). The proxy's run() iterates target_cls.generate_configs(service), constructs a fresh target_cls instance per candidate (full normal __init__ + run_check_initializations + check), and commits the first candidate whose run() completes without an error report. Subsequent runs delegate to the winning instance. Per-integration adoption is back to PoC A's level: most integrations gain zero new code beyond DISCOVERY_PORT_HINTS / DISCOVERY_METRICS_PATH class attrs. Boundary's "derive a related field from the endpoint" pattern is a classmethod override of generate_configs mirroring PoC A's super().discover() pattern. Removed (vs the previous alt-PoC iteration): _DISCOVERY_PLACEHOLDER_ENDPOINT + __init__ injection, ensure_discovery_resolved, _resolve_discovery, _post_discovery_hook, configure_scrapers placeholder skip, config-model rebuild after discovery, boundary's __init__ override and the cached_property clearing, n8n's _post_discovery_hook, temporal's configure_scrapers .values() iteration, cockroachdb V2's configure_scrapers merge. Added: AgentCheck.__new__ dispatch, AgentCheck.generate_configs default, _TrialModeMixin proxy, OpenMetricsBaseCheckV2.generate_configs. Conftest changes: each affected integration's e2e fixture now mounts datadog_checks/base/checks/base.py into the agent so the new __new__ + mixin reach the running check loader. E2E results: krakend, boundary, cockroachdb, n8n, pulsar, temporal -- test_e2e_discovery PASS.
Three fixes from running ray's e2e_discovery to ground:
1. AgentCheck.__new__: explicitly call _TrialModeProxy.__init__ on the
newly-created proxy. Python skips __init__ when __new__ returns an
instance whose class is not a subclass of cls -- and _TrialModeProxy
intentionally is NOT a subclass of cls (we can't subclass cls without
tripping rtloader's "no subclasses" detector). Without this, the proxy
ran with the parent AgentCheck.__init__ already executed but
_TrialModeProxy's own __init__ (which sets _service_dict and _winner)
was never called, so the first proxy.run() raised AttributeError.
Replace the previous dynamic-subclass approach (proxy was a
`(_TrialModeMixin, target_cls)` subclass) with a single fixed
_TrialModeProxy(AgentCheck) class. target_cls is stashed as an
instance attribute. This avoids generating subclasses of integration
check classes, which would break rtloader's class detector at
three.cpp:727 ("Agent integrations are supposed to have no
subclasses"): the loader skips any AgentCheck subclass that itself
has subclasses, so the dynamic class would prevent re-loading the
same integration on subsequent agent restarts/configurations.
2. dd_agent_check + replay_check_run: when the test passes
discovery_min_instances, treat per-instance check errors as expected
noise. Auto-discovery may match multiple containers (ray's case: head
+ workers all serve metrics, but task-runner containers using the
same image don't). Surfacing per-instance ConfigurationError to
replay_check_run defeated the purpose of multi-match discovery
testing. The test's metric/service-check assertions remain the
source of truth.
3. ray/test_e2e_discovery: bump discovery_min_instances from 1 to 6 and
discovery_timeout from 30 to 60. The agent's check command exits as
soon as discovery_min_instances configs are SCHEDULED (not run
successfully), which can be a single task container that never
produces metrics. Waiting for all 6 matches ensures the head and
workers are scheduled and run at least once.
Verified all 7 e2e_discovery tests pass: krakend, boundary, cockroachdb,
n8n, pulsar, ray, temporal.
3 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
PoC B for Configuration Discovery for Agent Integrations. Compare with PoC A in #23547.
Approach
Instead of a new
discover()classmethod dispatched by AD via a new rtloader symbol, PoC B intercepts check instantiation insideAgentCheck.__new__. When AD schedules a synthetic discovery config (containing__discovery_service__), the base class transparently returns a_TrialModeProxyrather than the real check. The proxy drives the trial loop itself — callingcls.generate_configs(service_dict)to yield candidate instance dicts, instantiating and running the real check for each, and committing the first winner.Key property: the trial probe is the check run. No separate probe phase — the agent's existing runner drives it, and a successful trial leaves the winner wired up for subsequent runs.
Changes
datadog_checks_base/datadog_checks/base/checks/base.py:AgentCheck.__new__intercepts discovery configs;_TrialModeProxyclass drives trial loop;generate_configs()classmethod stubdatadog_checks_base/datadog_checks/base/checks/openmetrics/v2/base.py: Defaultgenerate_configs()for OpenMetrics checks usingDISCOVERY_PORT_HINTS/DISCOVERY_METRICS_PATHgenerate_configs/ port hints:ray,kuma,cockroachdb,temporal,pulsar,boundary,kongdatadog_checks_dev:replay_check_rungainsignore_errorsfor multi-match discovery tests;dd_agent_checkfixture wires it fromdiscovery_min_instancesDesign doc / comparison with PoC A
See
docs/superpowers/2026-05-06-disco-autoconfig-alt-comparison.mdin the companion agent PR.Test plan
ddev env test --dev ray py3.12—test_e2e_discoverypasses (6-instance multi-match)ddev env test --dev kuma py3.12—test_e2e_discoverypassesddev env test --dev cockroachdb py3.12—test_e2e_discoverypassesddev env test --dev temporal py3.12—test_e2e_discoverypassesddev env test --dev boundary py3.12—test_e2e_discoverypassesddev env test --dev pulsar py3.12—test_e2e_discoverypasses🤖 Generated with Claude Code