Skip to content

config-discovery: AgentCheck-level trial runner (PoC B)#23624

Draft
vitkyrka wants to merge 48 commits intomasterfrom
vitkyrka/disco-autoconfig-alt
Draft

config-discovery: AgentCheck-level trial runner (PoC B)#23624
vitkyrka wants to merge 48 commits intomasterfrom
vitkyrka/disco-autoconfig-alt

Conversation

@vitkyrka
Copy link
Copy Markdown

@vitkyrka vitkyrka commented May 7, 2026

Summary

PoC B for Configuration Discovery for Agent Integrations. Compare with PoC A in #23547.

Approach

Instead of a new discover() classmethod dispatched by AD via a new rtloader symbol, PoC B intercepts check instantiation inside AgentCheck.__new__. When AD schedules a synthetic discovery config (containing __discovery_service__), the base class transparently returns a _TrialModeProxy rather than the real check. The proxy drives the trial loop itself — calling cls.generate_configs(service_dict) to yield candidate instance dicts, instantiating and running the real check for each, and committing the first winner.

Key property: the trial probe is the check run. No separate probe phase — the agent's existing runner drives it, and a successful trial leaves the winner wired up for subsequent runs.

Changes

  • datadog_checks_base/datadog_checks/base/checks/base.py: AgentCheck.__new__ intercepts discovery configs; _TrialModeProxy class drives trial loop; generate_configs() classmethod stub
  • datadog_checks_base/datadog_checks/base/checks/openmetrics/v2/base.py: Default generate_configs() for OpenMetrics checks using DISCOVERY_PORT_HINTS / DISCOVERY_METRICS_PATH
  • Seven integrations with generate_configs / port hints: ray, kuma, cockroachdb, temporal, pulsar, boundary, kong
  • datadog_checks_dev: replay_check_run gains ignore_errors for multi-match discovery tests; dd_agent_check fixture wires it from discovery_min_instances

Design doc / comparison with PoC A

See docs/superpowers/2026-05-06-disco-autoconfig-alt-comparison.md in the companion agent PR.

Test plan

  • ddev env test --dev ray py3.12test_e2e_discovery passes (6-instance multi-match)
  • ddev env test --dev kuma py3.12test_e2e_discovery passes
  • ddev env test --dev cockroachdb py3.12test_e2e_discovery passes
  • ddev env test --dev temporal py3.12test_e2e_discovery passes
  • ddev env test --dev boundary py3.12test_e2e_discovery passes
  • ddev env test --dev pulsar py3.12test_e2e_discovery passes

🤖 Generated with Claude Code

vitkyrka and others added 30 commits May 4, 2026 17:50
… Inf values

The Prometheus exposition format allows NaN, +Inf, and -Inf as metric
values (e.g., summary quantiles that haven't been observed yet return NaN).
The discovery verifier regex only matched numeric values, so endpoints
whose first metric value was NaN would fail the Prometheus check.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- n8n: add auto_conf_discovery.yaml (ad_identifiers: n8nio/n8n), update
  conftest.py to mount discovery helpers, add e2e discovery test
- kuma: add auto_conf_discovery.yaml (ad_identifiers: kumahq/kuma-cp)
  with DISCOVERY_PORT_HINTS=[5680]
- pulsar: add auto_conf_discovery.yaml (ad_identifiers: apachepulsar/pulsar,
  pulsar), add DISCOVERY_PORT_HINTS=[8080], update conftest.py and e2e test

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add auto_conf_discovery.yaml with AD identifiers and discovery: {} to
enable the Agent's generic OpenMetrics port scanning for the following
integrations: aerospike, appgate_sdp, argo_rollouts, argo_workflows,
aws_neuron, bentoml, calico, celery, dcgm, falco, fluxcd,
hugging_face_tgi, karpenter, keda, kubernetes_cluster_autoscaler,
kyverno, milvus, nvidia_nim, nvidia_triton, quarkus, ray, temporal,
velero, vllm, weaviate.

Add DISCOVERY_PORT_HINTS to integrations with officially documented
well-known ports: aerospike (9145), aws_neuron (8000), bentoml (3000),
dcgm (9400), kubernetes_cluster_autoscaler (8085), kyverno (8000),
nvidia_nim (8000), nvidia_triton (8002), quarkus (8080/q/metrics),
ray (8080), vllm (8000), weaviate (2112).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Update conftest to mount discovery helpers and auto_conf_discovery.yaml.
Tag the built quarkus image so the AD identifier matches.
Add test_e2e_discovery that verifies the openmetrics.health service check.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fix ad_identifier (temporalio/server → temporalio/auto-setup), add
DISCOVERY_PORT_HINTS=[8000]. Extend create_log_volumes() wrapper to
also mount discovery helpers, and add test_e2e_discovery.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Extend create_log_volumes() to also mount discovery helpers and
auto_conf_discovery.yaml. Add test_e2e_discovery that verifies the
openmetrics.health service check passes for at least one discovered instance.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Switch flower service from custom build to vanilla mher/flower image so
the container's image name matches the ad_identifier. Use explicit
--broker flag with auth credentials. Add discovery constants and 2-tuple
yield to conftest, add test_e2e_discovery.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Discovery support requires a passing e2e test to verify the mechanism
works end-to-end (container image matches ad_identifier, port is
reachable, metrics parse correctly). The following integrations could
not satisfy that requirement:

Caddy-based mock servers — e2e environments use caddy:2.7 to serve
fixture files, so the running container's image is "caddy", not the
real ad_identifier. Discovery would silently never fire in the test
environment, making the test meaningless:
  appgate_sdp, aws_neuron, dcgm, hugging_face_tgi, karpenter,
  kubernetes_cluster_autoscaler, nvidia_nim, nvidia_triton, vllm

No Docker e2e environment — no docker-compose setup exists in the
tests/ directory, so there is no environment in which to run an e2e
discovery test:
  argo_rollouts, argo_workflows, bentoml, calico, fluxcd, keda,
  kyverno, velero, weaviate

Base e2e fails to start — the existing e2e test environment does not
come up cleanly, which is a prerequisite before adding discovery:
  aerospike: "startup was not complete, exiting immediately"
  milvus: MilvusException during script-runner setup
          (InvalidateCollectionMetaCache node-mismatch error)

network_mode: host — docker-compose uses host networking so the
container exposes no port bindings; candidate_ports() receives an
empty service.ports list and yields nothing, making discovery
structurally impossible:
  falco

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The e2e test required patching docker-compose.yaml to add
`image: quarkus` alongside `build: micrometer-quickstart`. That tag
exists solely to make the container's image name match the
ad_identifier — nobody runs an image literally named `quarkus` in
production. Quarkus users build their own application images with
arbitrary names, so there is no canonical image name to use as an
ad_identifier.

The ad_identifier for caddy-based integrations is wrong for a similar
reason (the running image is caddy, not the real service), but at
least those integrations intend to match a real published image when
deployed. For quarkus the identifier is purely fictional.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ith e2e tests

For each integration:
- Added DISCOVERY_PORT_HINTS (and DISCOVERY_METRICS_PATH for cockroachdb) to the
  check class so the generic port-scanning discover() can find the service.
- Added auto_conf_discovery.yaml with ad_identifiers and discovery: {} to wire up
  the Autodiscovery template.
- Updated conftest.py to mount the autoconf YAML and local discovery helpers into
  the agent container for e2e testing.
- Added test_e2e_discovery to verify the full AD → discover() → check scheduling
  path.

boundary: Added discover() override to also derive health_endpoint from the
discovered openmetrics_endpoint (required field in InstanceConfig).

cockroachdb, kong: Both integrations have a V1 legacy wrapper class (CockroachdbCheck /
Kong) whose __new__ delegates to the real V2 subclass when openmetrics_endpoint is
present. The agent resolves the check class by name and looks up discover() on this
wrapper, which inherits from OpenMetricsBaseCheck (V1) — no discover() there. Fixed
by forwarding discover() on the wrapper to the V2 subclass:
    @classmethod
    def discover(cls, service):
        return CockroachdbCheckV2.discover(service)  # / KongCheck.discover(service)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The check monitors Flower (mher/flower), not Celery workers directly.
The e2e environment builds a custom image from ./proj that bundles both
Celery and Flower with unpinned versions. Switching the flower service to
vanilla mher/flower introduces a Celery version mismatch between the
workers and Flower. More fundamentally, users who don't run Flower, or
who embed it into their own custom image, get no discovery benefit from
targeting mher/flower.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The METRIC_MAP keys are unprefixed (e.g. process_cpu_user_seconds) but
the n8n Prometheus endpoint exposes metrics with an n8n_ prefix. Without
stripping that prefix before matching, nothing in the map matched and
only the readiness gauge was ever submitted.

Fix by adding raw_metric_prefix: n8n_ to get_default_config(), and
correct the spec.yaml display_default/example from n8n to n8n_ to match.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Design for an experiment validating declarative-probe based advanced
auto-configuration on the krakend integration end-to-end against a real
Agent build. Targets the generic-openmetrics-scan bucket (51/260
integrations) identified in the vitkykra/autoconfig-analysis branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bite-sized TDD tasks covering the new auto_conf_discovery.yaml file
format, the discovery package (prober + cache + service wrapper), the
%%discovered_port%% template variable, the configmgr wiring, and three
end-to-end demo scenarios against a real Agent build.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ment

Declares the krakend ad_identifier with an OpenMetrics probe spec
(default port 8090, /metrics path). Consumed by the new
auto_conf_discovery file format in datadog-agent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The dev env's KrakenD config exposes Prometheus metrics on port 9090
(via the OpenTelemetry Prometheus exporter), not 8090. The fallback
scan would have found it anyway, but the hint should be correct.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Successor to the KrakenD experiment, covering the HTTP-verification (35)
and TCP-protocol (6) integration buckets via a Python discover(service)
callback that returns concrete instance configs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First of three plans for the advanced auto-config discover() experiment.
Covers the discovery helpers and types in datadog_checks_base, with no
cross-repo dependencies.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…notations

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…elpers

Add TYPE_CHECKING guard for `requests` import in verifiers.py to resolve
ruff F821 (undefined name) on forward-reference annotations. Apply ruff
auto-formatting to verifiers.py and test_tcp.py (blank lines, line length).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The kwarg name 'verify' collides with the well-known 'requests.get(verify=)'
TLS-verification kwarg. Rename to 'verifier' before downstream callers
(per-integration discover() methods in a future plan) cement the name.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
vitkyrka and others added 18 commits May 6, 2026 03:32
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Five tasks covering airflow (HTTP multi-step version detection),
twemproxy (TCP banner JSON shape), and hdfs_namenode (HTTP JSON-shape
verification), plus one shared task to add a response_json_keys TCP
verifier in datadog_checks_base. Verification is via @pytest.mark.e2e
tests using the existing dd_agent_check harness with
discovery_min_instances + discovery_timeout, mirroring the krakend
test_e2e_discovery pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a generic discover() classmethod to OpenMetricsBaseCheckV2 that
scans for Prometheus exposition endpoints on available ports.
Subclasses can set DISCOVERY_PORT_HINTS and DISCOVERY_METRICS_PATH
to customise the behaviour. KrakendCheck now inherits this instead
of defining its own copy.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove OpenMetricsBaseCheckV2.discover() and the _run_discover rtloader
helper. The Service/Port dataclasses and the discovery probe helpers
(http_probe, candidate_ports, is_prometheus_exposition) are kept --
they are reused by the alt mechanism in krakend's check() override.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When the agent schedules krakend with __discovery_service__ in the
instance config, KrakendCheck probes candidate ports (hint: 9090) on
the first check() call and reconfigures the OpenMetrics scraper for
the responding /metrics endpoint. Subsequent calls run the normal
OpenMetrics path against the cached endpoint.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Cleanup of leftover tests from the strip commit (5298b2b) that
targeted the removed discover() classmethod.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Push the placeholder-endpoint injection, the __discovery_service__
detection in check(), the port-probing/scraper-reconfiguration logic, and
the configure_scrapers skip into OpenMetricsBaseCheckV2. Per-integration
adoption now requires only DISCOVERY_PORT_HINTS = [...] (and optionally
DISCOVERY_METRICS_PATH).

Strip krakend/check.py of its custom __init__, check(), and
_configure_from_discovery; the diff vs main is now a single
DISCOVERY_PORT_HINTS line.

Verified krakend unit + e2e_discovery + non-discovery e2e all pass, and
the krakend-delayed smoke shows trial-mode failures suppressed at DEBUG
with tracebacks pointing at openmetrics/v2/base.py:_resolve_discovery.
…overy

Per-integration adoption story now matches PoC A's: most integrations only
add DISCOVERY_PORT_HINTS = [...] (1-line). Boundary's PoC A discover()
override pattern -- 'super() then derive a related field from the
endpoint' -- maps to the new _post_discovery_hook on the base class.

Base class additions:
- ensure_discovery_resolved() public hook so subclasses with custom check()
  bodies (boundary) can resolve discovery before reading self.config fields
  whose values are derived during discovery (e.g. health_endpoint).
- _post_discovery_hook() extension point: subclasses can update
  self.instance fields after the endpoint is resolved; the base class
  rebuilds the config-model (self._config_model_instance) so ConfigMixin
  subclasses see the post-discovery values.
- candidate_ports now always yields hints, even if Docker reports the
  service with an empty ports list (e.g. ray's task containers). Hints
  are best-effort probes; http_probe handles unreachable ports gracefully.
- http_probe default timeout 0.5s -> 2.0s. Larger Prometheus exporters
  (ray's /metrics is ~310KB) reliably exceeded 0.5s on the docker network.

Per-integration changes:
- boundary: __init__ placeholder for health_endpoint, _post_discovery_hook
  derives health_endpoint from openmetrics_endpoint, check() now calls
  ensure_discovery_resolved() before reading self.config.health_endpoint.
- kong, cockroachdb (legacy wrappers): __new__ now routes both
  openmetrics_endpoint and __discovery_service__ instances to the V2 class;
  redundant discover() classmethods removed.
- cockroachdb (V2): merged configure_additional_transformers into
  configure_scrapers, iterating self.scrapers.values() so transformers are
  attached after every scraper rebuild (including the post-discovery one).
- temporal: configure_scrapers now iterates safely instead of indexing.
- n8n: DISCOVERY_PORT_HINTS=[5678]; _post_discovery_hook refreshes the
  cached self.openmetrics_endpoint that __init__ captured from the
  placeholder.

Verified:
- All affected integrations: unit tests pass.
- e2e_discovery PASS: krakend, boundary, cockroachdb, n8n, pulsar, temporal.
- ray e2e remains unstable: the auto_conf matches all rayproject/ray
  containers (head + workers + short-lived task containers); dd_agent_check's
  replay_check_run surfaces per-instance ConfigurationError from any
  trial-mode failure even when other instances succeed. Manual `agent
  check ray --discovery-min-instances 1` exits 0 with "Total Runs: 1, Last
  Successful Execution Date" -- the implementation works; the test
  framework needs adjustment for multi-container AD matches. Out of scope
  for this commit.
- kuma: e2e infra runs in a kind cluster, would require separate
  alt-PoC plumbing in conftest. Out of scope.
- kong: no test_e2e.py exists. Out of scope.
This change was added speculatively while debugging ray's e2e failure --
ray's actual blocker turned out to be dd_agent_check.replay_check_run
surfacing per-instance ConfigurationError, not the empty hints. The 6
passing e2e tests (krakend, boundary, cockroachdb, n8n, pulsar, temporal)
all have their hint port in the container's EXPOSE list, so candidate_ports
yields it identically before and after the change. Reverting to keep the
behavior unchanged from PoC A.
Same justification as the candidate_ports revert: this was speculative
while debugging ray, not a fix for any passing test. Ray's blocker was
dd_agent_check.replay_check_run, not probe timeout. The 6 passing e2e
tests all probe /metrics responses small enough for 0.5s.
PoC A's krakend had no port hint -- candidate_ports yielded the service's
exposed ports in order. The container exposes 9090 (and 8080, 8090);
the probe rejects 8080's non-prometheus response and accepts 9090. The
hint was speculative on my part and just an optimization; remove to keep
the diff vs main minimal and matching PoC A. Krakend's check.py is now
unchanged from main.

Drop the unit test that asserted the hint -- the remaining
test_trial_mode_probes_and_configures_scraper still verifies the
probe-and-configure flow end-to-end with mocked http_probe.

Verified: krakend e2e_discovery passes, krakend unit suite passes (16/16).
PoC A's n8n had no port hint either. The container exposes only 5678 so
candidate_ports yields it from service.ports without needing a hint.
Verified: e2e_discovery passes without the hint.

n8n's check.py diff vs main is now just the _post_discovery_hook to
refresh the cached self.openmetrics_endpoint after discovery.
Replace OpenMetricsBaseCheckV2-level trial-mode plumbing with an AgentCheck-
level mechanism. AgentCheck.__new__ detects __discovery_service__ in the
instance and routes construction through a dynamically-generated proxy
subclass (_TrialModeMixin + target_cls).

The proxy's run() iterates target_cls.generate_configs(service), constructs
a fresh target_cls instance per candidate (full normal __init__ +
run_check_initializations + check), and commits the first candidate whose
run() completes without an error report. Subsequent runs delegate to the
winning instance.

Per-integration adoption is back to PoC A's level: most integrations gain
zero new code beyond DISCOVERY_PORT_HINTS / DISCOVERY_METRICS_PATH class
attrs. Boundary's "derive a related field from the endpoint" pattern is a
classmethod override of generate_configs mirroring PoC A's super().discover()
pattern.

Removed (vs the previous alt-PoC iteration): _DISCOVERY_PLACEHOLDER_ENDPOINT
+ __init__ injection, ensure_discovery_resolved, _resolve_discovery,
_post_discovery_hook, configure_scrapers placeholder skip, config-model
rebuild after discovery, boundary's __init__ override and the cached_property
clearing, n8n's _post_discovery_hook, temporal's configure_scrapers
.values() iteration, cockroachdb V2's configure_scrapers merge.

Added: AgentCheck.__new__ dispatch, AgentCheck.generate_configs default,
_TrialModeMixin proxy, OpenMetricsBaseCheckV2.generate_configs.

Conftest changes: each affected integration's e2e fixture now mounts
datadog_checks/base/checks/base.py into the agent so the new __new__ +
mixin reach the running check loader.

E2E results: krakend, boundary, cockroachdb, n8n, pulsar, temporal --
test_e2e_discovery PASS.
Three fixes from running ray's e2e_discovery to ground:

1. AgentCheck.__new__: explicitly call _TrialModeProxy.__init__ on the
   newly-created proxy. Python skips __init__ when __new__ returns an
   instance whose class is not a subclass of cls -- and _TrialModeProxy
   intentionally is NOT a subclass of cls (we can't subclass cls without
   tripping rtloader's "no subclasses" detector). Without this, the proxy
   ran with the parent AgentCheck.__init__ already executed but
   _TrialModeProxy's own __init__ (which sets _service_dict and _winner)
   was never called, so the first proxy.run() raised AttributeError.

   Replace the previous dynamic-subclass approach (proxy was a
   `(_TrialModeMixin, target_cls)` subclass) with a single fixed
   _TrialModeProxy(AgentCheck) class. target_cls is stashed as an
   instance attribute. This avoids generating subclasses of integration
   check classes, which would break rtloader's class detector at
   three.cpp:727 ("Agent integrations are supposed to have no
   subclasses"): the loader skips any AgentCheck subclass that itself
   has subclasses, so the dynamic class would prevent re-loading the
   same integration on subsequent agent restarts/configurations.

2. dd_agent_check + replay_check_run: when the test passes
   discovery_min_instances, treat per-instance check errors as expected
   noise. Auto-discovery may match multiple containers (ray's case: head
   + workers all serve metrics, but task-runner containers using the
   same image don't). Surfacing per-instance ConfigurationError to
   replay_check_run defeated the purpose of multi-match discovery
   testing. The test's metric/service-check assertions remain the
   source of truth.

3. ray/test_e2e_discovery: bump discovery_min_instances from 1 to 6 and
   discovery_timeout from 30 to 60. The agent's check command exits as
   soon as discovery_min_instances configs are SCHEDULED (not run
   successfully), which can be a single task container that never
   produces metrics. Waiting for all 6 matches ensures the head and
   workers are scheduled and run at least once.

Verified all 7 e2e_discovery tests pass: krakend, boundary, cockroachdb,
n8n, pulsar, ray, temporal.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant