[Backport 7.80.x] [incident-54830] docker: tolerate invalid CIDRs in /info DefaultAddressPools#51382
Conversation
…AddressPools` (#51235) ### What does this PR do? Adds a tolerant decoder for the Docker daemon's `/info` response in `pkg/util/docker/`, so that daemons emitting invalid CIDRs in `DefaultAddressPools[].Base` no longer break `DockerUtil` and its downstream callers. Concretely: - New helper `safeInfo(ctx, cli)` (`pkg/util/docker/safe_info.go`): tries `cli.Info(...)` first; on the SDK's JSON-decode failure prefix only, reissues `GET /info` via `cli.Dialer()` and decodes into a mirror of `system.Info` where `DefaultAddressPools` is shadowed by `json.RawMessage`, so the strict `netip.Prefix` field stays at zero value and the rest of the struct still parses. - 5 callsites of `cli.Info(ctx, client.InfoOptions{})` switched to `safeInfo(...)`: `ConnectToDocker`, `DockerUtil.GetHostname`, `DockerUtil.GetStorageStats`, `GetMetadata`, `GetTags`. - `host_tags.go` refactored: pure `buildSwarmTags(info system.Info) []string` extracted; `host_tags_test.go` tests it directly without the `client.SystemAPIClient` mock. - `BUILD.bazel` updated: new files added, unused `//pkg/util/docker/fake` test dep removed. - Release note added. ### Motivation [Incident #54830](https://dd.slack.com/archives/C0B55994JGL/p1779448036913569). Reported symptom (from agent debug logs): ``` PROCESS | DEBUG | (pkg/util/hostname/container.go:34 in callContainerProvider) | GetHostname trying provider 'docker' ... PROCESS | DEBUG | (pkg/util/docker/global.go:40 in GetDockerUtilWithRetrier) | Docker init error: temporary failure in dockerutil, will retry later: Error reading remote info: netip.ParsePrefix("invalid Prefix"): no '/' ``` Root cause: PR #48777 migrated the Docker SDK to `moby/moby` v29 to fix CVE-2026-34040 / CVE-2026-33997. The new `system.Info.DefaultAddressPools[].Base` field is typed as `netip.Prefix`, whose `UnmarshalText` is strict and rejects any string without `/`. Some Docker daemons emit the literal string `"invalid Prefix"` (exactly the output of `netip.Prefix.String()` for the zero/invalid value) for that field, which makes the entire `/info` JSON decode fail. That single decode failure cascaded into: - `DockerUtil` init failure → workloadmeta Docker collector failure → Docker core check failure → missing container/image tags on metrics, traces, logs. - Hostname provider `docker` failure → in containerized environments without an explicit `DD_HOSTNAME`, the agent could refuse to start if other providers (kubelet, EC2/GCE metadata, OS) also failed. - Host metadata (`docker_version`, `docker_swarm`) and host tags (`docker_swarm_node_role`) silently missing. This PR makes the SDK's strict decoding tolerant of the offending field while preserving the strict typing of every other field. Relates to incident #54830. Functionally supersedes #51128 (which patches only the init probe via `Ping`); the two approaches can be discussed and merged into one. ### Describe how you validated your changes - `dda inv agent.build --build-exclude=systemd` — PASS - `dda inv cluster-agent.build` — PASS - `dda inv test --targets=./pkg/util/docker/...` — PASS (30 tests, 3 new in `safe_info_test.go` covering happy path, fallback on `"invalid Prefix"`, and non-decode-error propagation) - `dda inv test --targets=./pkg/util/docker/...,./pkg/collector/corechecks/containers/docker/...,./pkg/util/containers/metrics/docker/...,./comp/core/workloadmeta/collectors/internal/docker/...` — 71/71 PASS - Bazel verification: not run locally (no `bazel` binary on the dev box); BUILD.bazel was updated manually to add `safe_info.go`/`safe_info_test.go` srcs and drop the now-unused `//pkg/util/docker/fake` test dep — CI will validate. ### Additional Notes The fallback's URL construction mirrors the SDK's `getAPIPath` (preserves the configured base path and `cli.ClientVersion()` prefix) so daemons behind reverse proxies that route on `Host` header or `/v<version>/` path receive the request the same way the SDK does. A `log.Debugf` line fires whenever the fallback engages so operators can see when the workaround is active. The match on `"Error reading remote info"` is a string prefix on the SDK's wrapped JSON-decode error. It is stable today (`moby/moby/client@v0.4.1/system_info.go:32`) and unique to JSON-decode failures of `/info`, but is the most fragile part of the patch. If moby ever exposes a typed sentinel for this case, we should switch to `errors.As`/`errors.Is`. Co-authored-by: lenaic.huard <lenaic.huard@datadoghq.com> (cherry picked from commit 309b25f) ___ Co-authored-by: Lénaïc Huard <L3n41c@users.noreply.github.com>
|
Files inventory check summaryFile checks results against ancestor 8a5ef87d: Results for datadog-agent_7.80.0~rc.3.git.12.d42099d.pipeline.115456985-1_amd64.deb:No change detected |
Static quality checks✅ Please find below the results from static quality gates Successful checksInfo
11 successful checks with minimal change (< 2 KiB)
|
Regression DetectorRegression Detector ResultsMetrics dashboard Baseline: 8a5ef87 Optimization Goals: ✅ No significant changes detected
|
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | +1.65 | [-1.27, +4.57] | 1 | Logs |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | +1.65 | [-1.27, +4.57] | 1 | Logs |
| ➖ | quality_gate_logs | % cpu utilization | +1.39 | [+0.38, +2.41] | 1 | Logs bounds checks dashboard |
| ➖ | quality_gate_metrics_logs | memory utilization | +0.76 | [+0.51, +1.01] | 1 | Logs bounds checks dashboard |
| ➖ | otlp_ingest_metrics | memory utilization | +0.50 | [+0.35, +0.66] | 1 | Logs |
| ➖ | ddot_metrics | memory utilization | +0.21 | [+0.01, +0.41] | 1 | Logs |
| ➖ | quality_gate_idle | memory utilization | +0.19 | [+0.14, +0.24] | 1 | Logs bounds checks dashboard |
| ➖ | file_tree | memory utilization | +0.15 | [+0.10, +0.20] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulativetodelta_exporter | memory utilization | +0.15 | [-0.08, +0.38] | 1 | Logs |
| ➖ | quality_gate_idle_all_features | memory utilization | +0.14 | [+0.10, +0.18] | 1 | Logs bounds checks dashboard |
| ➖ | file_to_blackhole_100ms_latency | egress throughput | +0.09 | [-0.03, +0.21] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api_v3 | ingress throughput | +0.03 | [-0.18, +0.23] | 1 | Logs |
| ➖ | file_to_blackhole_500ms_latency | egress throughput | +0.01 | [-0.38, +0.41] | 1 | Logs |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | +0.00 | [-0.09, +0.10] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api | ingress throughput | +0.00 | [-0.19, +0.20] | 1 | Logs |
| ➖ | file_to_blackhole_1000ms_latency | egress throughput | -0.01 | [-0.46, +0.44] | 1 | Logs |
| ➖ | file_to_blackhole_0ms_latency | egress throughput | -0.03 | [-0.55, +0.48] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulative | memory utilization | -0.05 | [-0.21, +0.11] | 1 | Logs |
| ➖ | ddot_logs | memory utilization | -0.12 | [-0.19, -0.06] | 1 | Logs |
| ➖ | uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | -0.15 | [-0.20, -0.09] | 1 | Logs |
| ➖ | ddot_metrics_sum_delta | memory utilization | -0.18 | [-0.37, +0.01] | 1 | Logs |
| ➖ | docker_containers_memory | memory utilization | -0.31 | [-0.41, -0.21] | 1 | Logs |
| ➖ | tcp_syslog_to_blackhole | ingress throughput | -0.55 | [-0.73, -0.36] | 1 | Logs |
| ➖ | otlp_ingest_logs | memory utilization | -0.62 | [-0.71, -0.53] | 1 | Logs |
Bounds Checks: ❌ Failed
| perf | experiment | bounds_check_name | replicates_passed | observed_value | links |
|---|---|---|---|---|---|
| ✅ | docker_containers_cpu | simple_check_run | 10/10 | 536 ≥ 26 | |
| ✅ | docker_containers_memory | memory_usage | 10/10 | 247.24MiB ≤ 370MiB | |
| ✅ | docker_containers_memory | simple_check_run | 10/10 | 718 ≥ 26 | |
| ✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | 0.16GiB ≤ 1.20GiB | |
| ❌ | file_to_blackhole_0ms_latency | missed_bytes | 9/10 | 500MiB > 0MiB | |
| ✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | 0.20GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_1000ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | 0.17GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_100ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | 0.18GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_500ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | quality_gate_idle | intake_connections | 10/10 | 3 ≤ 4 | bounds checks dashboard |
| ❌ | quality_gate_idle | memory_usage | 9/10 | 148.31MiB > 147MiB | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | intake_connections | 10/10 | 3 ≤ 4 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | 479.67MiB ≤ 495MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | intake_connections | 10/10 | 3 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_logs | memory_usage | 10/10 | 176.65MiB ≤ 195MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | cpu_usage | 10/10 | 350.65 ≤ 2000 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | intake_connections | 10/10 | 4 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | memory_usage | 10/10 | 383.39MiB ≤ 430MiB | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
-
Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
-
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
-
Its configuration does not mark it "erratic".
CI Pass/Fail Decision
❌ Failed. Some Quality Gates were violated.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 9/10 replicas passed. Failed 1 which is > 0. Gate FAILED.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
Backport 309b25f from #51235.
What does this PR do?
Adds a tolerant decoder for the Docker daemon's
/inforesponse inpkg/util/docker/, so that daemons emitting invalid CIDRs inDefaultAddressPools[].Baseno longer breakDockerUtiland its downstream callers.Concretely:
safeInfo(ctx, cli)(pkg/util/docker/safe_info.go): triescli.Info(...)first; on the SDK's JSON-decode failure prefix only, reissuesGET /infoviacli.Dialer()and decodes into a mirror ofsystem.InfowhereDefaultAddressPoolsis shadowed byjson.RawMessage, so the strictnetip.Prefixfield stays at zero value and the rest of the struct still parses.cli.Info(ctx, client.InfoOptions{})switched tosafeInfo(...):ConnectToDocker,DockerUtil.GetHostname,DockerUtil.GetStorageStats,GetMetadata,GetTags.host_tags.gorefactored: purebuildSwarmTags(info system.Info) []stringextracted;host_tags_test.gotests it directly without theclient.SystemAPIClientmock.BUILD.bazelupdated: new files added, unused//pkg/util/docker/faketest dep removed.Motivation
Incident #54830. Reported symptom (from agent debug logs):
Root cause: PR #48777 migrated the Docker SDK to
moby/mobyv29 to fix CVE-2026-34040 / CVE-2026-33997. The newsystem.Info.DefaultAddressPools[].Basefield is typed asnetip.Prefix, whoseUnmarshalTextis strict and rejects any string without/. Some Docker daemons emit the literal string"invalid Prefix"(exactly the output ofnetip.Prefix.String()for the zero/invalid value) for that field, which makes the entire/infoJSON decode fail.That single decode failure cascaded into:
DockerUtilinit failure → workloadmeta Docker collector failure → Docker core check failure → missing container/image tags on metrics, traces, logs.dockerfailure → in containerized environments without an explicitDD_HOSTNAME, the agent could refuse to start if other providers (kubelet, EC2/GCE metadata, OS) also failed.docker_version,docker_swarm) and host tags (docker_swarm_node_role) silently missing.This PR makes the SDK's strict decoding tolerant of the offending field while preserving the strict typing of every other field.
Relates to incident #54830. Functionally supersedes #51128 (which patches only the init probe via
Ping); the two approaches can be discussed and merged into one.Describe how you validated your changes
dda inv agent.build --build-exclude=systemd— PASSdda inv cluster-agent.build— PASSdda inv test --targets=./pkg/util/docker/...— PASS (30 tests, 3 new insafe_info_test.gocovering happy path, fallback on"invalid Prefix", and non-decode-error propagation)dda inv test --targets=./pkg/util/docker/...,./pkg/collector/corechecks/containers/docker/...,./pkg/util/containers/metrics/docker/...,./comp/core/workloadmeta/collectors/internal/docker/...— 71/71 PASSbazelbinary on the dev box); BUILD.bazel was updated manually to addsafe_info.go/safe_info_test.gosrcs and drop the now-unused//pkg/util/docker/faketest dep — CI will validate.Additional Notes
The fallback's URL construction mirrors the SDK's
getAPIPath(preserves the configured base path andcli.ClientVersion()prefix) so daemons behind reverse proxies that route onHostheader or/v<version>/path receive the request the same way the SDK does. Alog.Debugfline fires whenever the fallback engages so operators can see when the workaround is active.The match on
"Error reading remote info"is a string prefix on the SDK's wrapped JSON-decode error. It is stable today (moby/moby/client@v0.4.1/system_info.go:32) and unique to JSON-decode failures of/info, but is the most fragile part of the patch. If moby ever exposes a typed sentinel for this case, we should switch toerrors.As/errors.Is.