[incident-54830] docker: tolerate invalid CIDRs in /info DefaultAddressPools#51235
Conversation
…ssPools Some Docker daemons emit "invalid Prefix" (the String() output of an invalid netip.Prefix) for DefaultAddressPools[].Base in the /info response. The moby v29 client (adopted in #48777) decodes Base into a typed netip.Prefix and rejects such values, aborting the entire /info JSON decode. This cascaded into DockerUtil init failures, hostname provider failures, missing host tags and metadata, and in containerized environments without DD_HOSTNAME could prevent the Agent from starting. Add a safeInfo helper that falls back to a raw HTTP /info request and decodes into a mirror struct shadowing DefaultAddressPools with a json.RawMessage. The fallback is gated on the SDK's JSON-decode error prefix and only engages when needed; the SDK's happy path is unchanged. Refactor host_tags.go to extract a pure buildSwarmTags helper so the test no longer needs to mock the client.SystemAPIClient interface. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@codex review |
|
Codex Review: Didn't find any major issues. Hooray! ℹ️ About Codex in GitHubCodex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback". |
|
Files inventory check summaryFile checks results against ancestor 92a16ba9: Results for datadog-agent_7.81.0~devel.git.184.dd4a5d0.pipeline.114793458-1_amd64.deb:No change detected |
docker: tolerate invalid CIDRs in /info DefaultAddressPools
Static quality checks✅ Please find below the results from static quality gates Successful checksInfo
11 successful checks with minimal change (< 2 KiB)
|
Lead with the user-visible symptom (the verbatim Agent log line) and the cascading effects on tags, host metadata and Agent startup. Drop the implementation details (struct/field names, moby v29 internals, CVE references) that are not useful to end users reading the changelog. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@codex review |
Regression DetectorRegression Detector ResultsMetrics dashboard Baseline: 4e8cbb8 Optimization Goals: ✅ No significant changes detected
|
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | -2.49 | [-5.39, +0.41] | 1 | Logs |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | quality_gate_logs | % cpu utilization | +1.39 | [+0.38, +2.41] | 1 | Logs bounds checks dashboard |
| ➖ | quality_gate_metrics_logs | memory utilization | +0.65 | [+0.39, +0.90] | 1 | Logs bounds checks dashboard |
| ➖ | ddot_metrics_sum_cumulativetodelta_exporter | memory utilization | +0.50 | [+0.26, +0.73] | 1 | Logs |
| ➖ | ddot_metrics | memory utilization | +0.24 | [+0.04, +0.44] | 1 | Logs |
| ➖ | docker_containers_memory | memory utilization | +0.18 | [+0.08, +0.29] | 1 | Logs |
| ➖ | file_to_blackhole_1000ms_latency | egress throughput | +0.04 | [-0.40, +0.48] | 1 | Logs |
| ➖ | file_to_blackhole_100ms_latency | egress throughput | +0.03 | [-0.11, +0.16] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api | ingress throughput | +0.01 | [-0.19, +0.22] | 1 | Logs |
| ➖ | file_to_blackhole_0ms_latency | egress throughput | +0.01 | [-0.48, +0.49] | 1 | Logs |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | +0.01 | [-0.09, +0.10] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api_v3 | ingress throughput | -0.01 | [-0.23, +0.21] | 1 | Logs |
| ➖ | file_to_blackhole_500ms_latency | egress throughput | -0.03 | [-0.44, +0.37] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulative | memory utilization | -0.13 | [-0.29, +0.03] | 1 | Logs |
| ➖ | ddot_metrics_sum_delta | memory utilization | -0.17 | [-0.36, +0.02] | 1 | Logs |
| ➖ | otlp_ingest_logs | memory utilization | -0.26 | [-0.35, -0.16] | 1 | Logs |
| ➖ | quality_gate_idle | memory utilization | -0.38 | [-0.42, -0.33] | 1 | Logs bounds checks dashboard |
| ➖ | file_tree | memory utilization | -0.38 | [-0.42, -0.34] | 1 | Logs |
| ➖ | uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | -0.41 | [-0.46, -0.36] | 1 | Logs |
| ➖ | tcp_syslog_to_blackhole | ingress throughput | -0.49 | [-0.68, -0.30] | 1 | Logs |
| ➖ | ddot_logs | memory utilization | -0.50 | [-0.57, -0.44] | 1 | Logs |
| ➖ | quality_gate_idle_all_features | memory utilization | -0.53 | [-0.59, -0.47] | 1 | Logs bounds checks dashboard |
| ➖ | otlp_ingest_metrics | memory utilization | -0.64 | [-0.80, -0.49] | 1 | Logs |
| ➖ | docker_containers_cpu | % cpu utilization | -2.49 | [-5.39, +0.41] | 1 | Logs |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | observed_value | links |
|---|---|---|---|---|---|
| ✅ | docker_containers_cpu | simple_check_run | 10/10 | 722 ≥ 26 | |
| ✅ | docker_containers_memory | memory_usage | 10/10 | 246.16MiB ≤ 370MiB | |
| ✅ | docker_containers_memory | simple_check_run | 10/10 | 724 ≥ 26 | |
| ✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | 0.16GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_0ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | 0.20GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_1000ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | 0.17GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_100ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | 0.19GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_500ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | quality_gate_idle | intake_connections | 10/10 | 3 ≤ 4 | bounds checks dashboard |
| ✅ | quality_gate_idle | memory_usage | 10/10 | 143.49MiB ≤ 147MiB | bounds checks dashboard |
| ✅ | quality_gate_idle | total_bytes_received | 10/10 | 736.17KiB ≤ 819.20KiB | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | intake_connections | 10/10 | 2 ≤ 4 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | 423.81MiB ≤ 495MiB | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | total_bytes_received | 10/10 | 1.11MiB ≤ 1.25MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | intake_connections | 10/10 | 3 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_logs | memory_usage | 10/10 | 175.38MiB ≤ 195MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
| ✅ | quality_gate_logs | total_bytes_received | 10/10 | 264.24MiB ≤ 292MiB | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | cpu_usage | 10/10 | 344.14 ≤ 2000 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | intake_connections | 10/10 | 3 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | memory_usage | 10/10 | 377.16MiB ≤ 430MiB | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | total_bytes_received | 10/10 | 0.94GiB ≤ 1.04GiB | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
-
Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
-
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
-
Its configuration does not mark it "erratic".
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check total_bytes_received: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check total_bytes_received: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check total_bytes_received: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check total_bytes_received: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
|
Codex Review: Didn't find any major issues. 🎉 ℹ️ About Codex in GitHubCodex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback". |
Switch from context.Background() to t.Context() so the context is auto-cancelled when the test finishes (Go 1.24+). Matches the modern idiom already used in 21 other files across the repo. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@codex review |
|
Codex Review: Didn't find any major issues. What shall we delve into next? ℹ️ About Codex in GitHubCodex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback". |
| // to verify availability. safeInfo tolerates daemons that emit invalid | ||
| // CIDRs in /info's DefaultAddressPools, which would otherwise fail the | ||
| // strict netip.Prefix decoding introduced by the moby v29 client. | ||
| if _, err := safeInfo(ctx, cli); err != nil { |
There was a problem hiding this comment.
nit: Just in case there's any issues with /info still, do you think it could make sense to include the changes from #51128 to instead just ping docker to establish the API connection version. Having the ping did help resolve some of the e2e tests and allowed container integrations to keep working, just host tags were failing to resolve if the customer doesn't have one set.
There was a problem hiding this comment.
Indeed, @ajgajg1134 confirmed that both are needed.
| // The moby client wraps JSON-decode failures of /info with this prefix | ||
| // (see github.com/moby/moby/client/system_info.go). Network, HTTP-status | ||
| // and other connection errors do not, and the tolerant fallback would not | ||
| // help in those cases — propagate the original error. |
There was a problem hiding this comment.
is this worth trying to push to moby repo a new error dedicated to JSON parsing ?
|
This usually happens when the cherry-pick has merge conflicts and needs manual resolution. To backport manually, run: git fetch
git worktree add .worktrees/backport-7.79.x 7.79.x
cd .worktrees/backport-7.79.x
git switch --create backport-51235-to-7.79.x
git cherry-pick -x --mainline 1 309b25fa0a2d7382a728a49b52a4c22de97ff5fc
git push --set-upstream origin backport-51235-to-7.79.xWorkflow logs: https://github.com/DataDog/datadog-agent/actions/runs/26527318546 |
…n `/info` `DefaultAddressPools` (#51382) Backport 309b25f from #51235. ___ ### What does this PR do? Adds a tolerant decoder for the Docker daemon's `/info` response in `pkg/util/docker/`, so that daemons emitting invalid CIDRs in `DefaultAddressPools[].Base` no longer break `DockerUtil` and its downstream callers. Concretely: - New helper `safeInfo(ctx, cli)` (`pkg/util/docker/safe_info.go`): tries `cli.Info(...)` first; on the SDK's JSON-decode failure prefix only, reissues `GET /info` via `cli.Dialer()` and decodes into a mirror of `system.Info` where `DefaultAddressPools` is shadowed by `json.RawMessage`, so the strict `netip.Prefix` field stays at zero value and the rest of the struct still parses. - 5 callsites of `cli.Info(ctx, client.InfoOptions{})` switched to `safeInfo(...)`: `ConnectToDocker`, `DockerUtil.GetHostname`, `DockerUtil.GetStorageStats`, `GetMetadata`, `GetTags`. - `host_tags.go` refactored: pure `buildSwarmTags(info system.Info) []string` extracted; `host_tags_test.go` tests it directly without the `client.SystemAPIClient` mock. - `BUILD.bazel` updated: new files added, unused `//pkg/util/docker/fake` test dep removed. - Release note added. ### Motivation [Incident #54830](https://dd.slack.com/archives/C0B55994JGL/p1779448036913569). Reported symptom (from agent debug logs): ``` PROCESS | DEBUG | (pkg/util/hostname/container.go:34 in callContainerProvider) | GetHostname trying provider 'docker' ... PROCESS | DEBUG | (pkg/util/docker/global.go:40 in GetDockerUtilWithRetrier) | Docker init error: temporary failure in dockerutil, will retry later: Error reading remote info: netip.ParsePrefix("invalid Prefix"): no '/' ``` Root cause: PR #48777 migrated the Docker SDK to `moby/moby` v29 to fix CVE-2026-34040 / CVE-2026-33997. The new `system.Info.DefaultAddressPools[].Base` field is typed as `netip.Prefix`, whose `UnmarshalText` is strict and rejects any string without `/`. Some Docker daemons emit the literal string `"invalid Prefix"` (exactly the output of `netip.Prefix.String()` for the zero/invalid value) for that field, which makes the entire `/info` JSON decode fail. That single decode failure cascaded into: - `DockerUtil` init failure → workloadmeta Docker collector failure → Docker core check failure → missing container/image tags on metrics, traces, logs. - Hostname provider `docker` failure → in containerized environments without an explicit `DD_HOSTNAME`, the agent could refuse to start if other providers (kubelet, EC2/GCE metadata, OS) also failed. - Host metadata (`docker_version`, `docker_swarm`) and host tags (`docker_swarm_node_role`) silently missing. This PR makes the SDK's strict decoding tolerant of the offending field while preserving the strict typing of every other field. Relates to incident #54830. Functionally supersedes #51128 (which patches only the init probe via `Ping`); the two approaches can be discussed and merged into one. ### Describe how you validated your changes - `dda inv agent.build --build-exclude=systemd` — PASS - `dda inv cluster-agent.build` — PASS - `dda inv test --targets=./pkg/util/docker/...` — PASS (30 tests, 3 new in `safe_info_test.go` covering happy path, fallback on `"invalid Prefix"`, and non-decode-error propagation) - `dda inv test --targets=./pkg/util/docker/...,./pkg/collector/corechecks/containers/docker/...,./pkg/util/containers/metrics/docker/...,./comp/core/workloadmeta/collectors/internal/docker/...` — 71/71 PASS - Bazel verification: not run locally (no `bazel` binary on the dev box); BUILD.bazel was updated manually to add `safe_info.go`/`safe_info_test.go` srcs and drop the now-unused `//pkg/util/docker/fake` test dep — CI will validate. ### Additional Notes The fallback's URL construction mirrors the SDK's `getAPIPath` (preserves the configured base path and `cli.ClientVersion()` prefix) so daemons behind reverse proxies that route on `Host` header or `/v<version>/` path receive the request the same way the SDK does. A `log.Debugf` line fires whenever the fallback engages so operators can see when the workaround is active. The match on `"Error reading remote info"` is a string prefix on the SDK's wrapped JSON-decode error. It is stable today (`moby/moby/client@v0.4.1/system_info.go:32`) and unique to JSON-decode failures of `/info`, but is the most fragile part of the patch. If moby ever exposes a typed sentinel for this case, we should switch to `errors.As`/`errors.Is`. Co-authored-by: sabrina.lu <sabrina.lu@datadoghq.com>
What does this PR do?
Adds a tolerant decoder for the Docker daemon's
/inforesponse inpkg/util/docker/, so that daemons emitting invalid CIDRs inDefaultAddressPools[].Baseno longer breakDockerUtiland its downstream callers.Concretely:
safeInfo(ctx, cli)(pkg/util/docker/safe_info.go): triescli.Info(...)first; on the SDK's JSON-decode failure prefix only, reissuesGET /infoviacli.Dialer()and decodes into a mirror ofsystem.InfowhereDefaultAddressPoolsis shadowed byjson.RawMessage, so the strictnetip.Prefixfield stays at zero value and the rest of the struct still parses.cli.Info(ctx, client.InfoOptions{})switched tosafeInfo(...):ConnectToDocker,DockerUtil.GetHostname,DockerUtil.GetStorageStats,GetMetadata,GetTags.host_tags.gorefactored: purebuildSwarmTags(info system.Info) []stringextracted;host_tags_test.gotests it directly without theclient.SystemAPIClientmock.BUILD.bazelupdated: new files added, unused//pkg/util/docker/faketest dep removed.Motivation
Incident #54830. Reported symptom (from agent debug logs):
Root cause: PR #48777 migrated the Docker SDK to
moby/mobyv29 to fix CVE-2026-34040 / CVE-2026-33997. The newsystem.Info.DefaultAddressPools[].Basefield is typed asnetip.Prefix, whoseUnmarshalTextis strict and rejects any string without/. Some Docker daemons emit the literal string"invalid Prefix"(exactly the output ofnetip.Prefix.String()for the zero/invalid value) for that field, which makes the entire/infoJSON decode fail.That single decode failure cascaded into:
DockerUtilinit failure → workloadmeta Docker collector failure → Docker core check failure → missing container/image tags on metrics, traces, logs.dockerfailure → in containerized environments without an explicitDD_HOSTNAME, the agent could refuse to start if other providers (kubelet, EC2/GCE metadata, OS) also failed.docker_version,docker_swarm) and host tags (docker_swarm_node_role) silently missing.This PR makes the SDK's strict decoding tolerant of the offending field while preserving the strict typing of every other field.
Relates to incident #54830. Functionally supersedes #51128 (which patches only the init probe via
Ping); the two approaches can be discussed and merged into one.Describe how you validated your changes
dda inv agent.build --build-exclude=systemd— PASSdda inv cluster-agent.build— PASSdda inv test --targets=./pkg/util/docker/...— PASS (30 tests, 3 new insafe_info_test.gocovering happy path, fallback on"invalid Prefix", and non-decode-error propagation)dda inv test --targets=./pkg/util/docker/...,./pkg/collector/corechecks/containers/docker/...,./pkg/util/containers/metrics/docker/...,./comp/core/workloadmeta/collectors/internal/docker/...— 71/71 PASSbazelbinary on the dev box); BUILD.bazel was updated manually to addsafe_info.go/safe_info_test.gosrcs and drop the now-unused//pkg/util/docker/faketest dep — CI will validate.Additional Notes
The fallback's URL construction mirrors the SDK's
getAPIPath(preserves the configured base path andcli.ClientVersion()prefix) so daemons behind reverse proxies that route onHostheader or/v<version>/path receive the request the same way the SDK does. Alog.Debugfline fires whenever the fallback engages so operators can see when the workaround is active.The match on
"Error reading remote info"is a string prefix on the SDK's wrapped JSON-decode error. It is stable today (moby/moby/client@v0.4.1/system_info.go:32) and unique to JSON-decode failures of/info, but is the most fragile part of the patch. If moby ever exposes a typed sentinel for this case, we should switch toerrors.As/errors.Is.