8 changes: 8 additions & 0 deletions roles/container-engine/containerd/templates/config.toml.j2
@@ -1,5 +1,7 @@
version = 3

imports = ["/etc/containerd/conf.d/*.toml"]

Check failure on line 3 in roles/container-engine/containerd/templates/config.toml.j2

Claude / Claude Code Review

imports hardcodes /etc/containerd/conf.d, ignoring containerd_cfg_dir

The new `imports` line hardcodes `/etc/containerd/conf.d/*.toml` while the rest of the template and role uses the configurable `containerd_cfg_dir` (default `/etc/containerd`). If a user overrides `containerd_cfg_dir` (e.g. to `/opt/containerd`), `config.toml` will be written under the override but containerd will still scan the hardcoded `/etc/containerd/conf.d/`, silently dropping their drop-ins. Fix: `imports = ["{{ containerd_cfg_dir }}/conf.d/*.toml"]`.

Extended reasoning...

What the bug is

The new line

imports = ["/etc/containerd/conf.d/*.toml"]

hardcodes the path /etc/containerd/conf.d/*.toml, but everywhere else in the role the configurable variable containerd_cfg_dir is used to derive containerd-config paths.

Why this is inconsistent with the rest of the role

  • roles/container-engine/containerd/defaults/main.yml:81 declares containerd_cfg_dir: /etc/containerd (overridable).
  • roles/container-engine/containerd/tasks/main.yml:
    • line 36 creates {{ containerd_cfg_dir }} itself,
    • line 68 templates config.toml to {{ containerd_cfg_dir }}/config.toml,
    • lines 79/86 manage {{ containerd_cfg_dir }}/certs.d/ hosts files.
  • Within this same template, containerd_cfg_dir is already used at:
    • line 58 for base_runtime_spec ("{{ containerd_cfg_dir }}/{{ runtime.base_runtime_spec }}"),
    • line 88 for [plugins."io.containerd.cri.v1.images".registry] config_path = "{{ containerd_cfg_dir }}/certs.d".

The new imports line is the only place that bakes in /etc/containerd literally.

Concrete proof / step-by-step

  1. User sets containerd_cfg_dir: /opt/containerd in their inventory.
  2. The role creates /opt/containerd/ (tasks/main.yml:36).
  3. The template renders config.toml and Ansible writes it to /opt/containerd/config.toml (tasks/main.yml:68).
  4. The rendered file contains imports = ["/etc/containerd/conf.d/*.toml"] — pointing at a directory the role never creates or manages under the override.
  5. The user (or another role, e.g. nvidia toolkit) drops 99-nvidia.toml into /opt/containerd/conf.d/ (mirroring where everything else lives).
  6. containerd starts and never sees the drop-in: it reads only the empty/non-existent /etc/containerd/conf.d/. The PR description specifically motivates imports for picking up post-bringup drop-ins, so this silently defeats the feature for the override case.
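
The override scenario in the steps above can be reproduced with a small Python sketch. It is only an illustration: `render` is a hypothetical stand-in for Jinja2 that substitutes a single variable, `imports_dir` is a helper invented for this example, and the two template strings are the hardcoded line from this PR and the proposed fix.

```python
import posixpath

# Hypothetical stand-in for Jinja2 rendering: it substitutes only the one
# variable this example needs. The real role renders through Ansible/Jinja2.
def render(template_line: str, containerd_cfg_dir: str) -> str:
    return template_line.replace("{{ containerd_cfg_dir }}", containerd_cfg_dir)

# Illustrative helper: pull the glob out of `imports = ["<glob>"]`
# and return the directory that glob points at.
def imports_dir(rendered_line: str) -> str:
    glob = rendered_line.split('"')[1]
    return posixpath.dirname(glob)

HARDCODED = 'imports = ["/etc/containerd/conf.d/*.toml"]'
TEMPLATED = 'imports = ["{{ containerd_cfg_dir }}/conf.d/*.toml"]'

override = "/opt/containerd"
# Where step 5 places 99-nvidia.toml under the override.
dropin_dir = posixpath.join(override, "conf.d")

hardcoded_dir = imports_dir(render(HARDCODED, override))
templated_dir = imports_dir(render(TEMPLATED, override))

print(hardcoded_dir == dropin_dir)  # False: imports still points at /etc/containerd/conf.d
print(templated_dir == dropin_dir)  # True: imports follows the override
```

With the default `containerd_cfg_dir` of `/etc/containerd` both variants resolve to the same directory, which is why the mismatch only shows up when the variable is overridden.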

Why existing code doesn’t prevent it

Nothing else in the role enforces the /etc/containerd literal — defaults match the hardcoded path, so for default deployments the bug is invisible. There is no validation that containerd_cfg_dir equals /etc/containerd, and no symlink or fallback creating /etc/containerd/conf.d/ when an override is in use.

Impact

Limited but real: any operator who overrides containerd_cfg_dir gets a config that silently disagrees with itself. Drop-ins land in {{ containerd_cfg_dir }}/conf.d/ (the natural location, and where the rest of the role keeps subdirectories like certs.d) but imports looks elsewhere. Failures are silent — containerd just doesn’t apply the drop-in.

Fix

One-line change at roles/container-engine/containerd/templates/config.toml.j2:3:

imports = ["{{ containerd_cfg_dir }}/conf.d/*.toml"]

This matches the convention already used at lines 58 and 88 of the same template and at all containerd_cfg_dir callsites in tasks/main.yml.


root = "{{ containerd_storage_dir }}"
state = "{{ containerd_state_dir }}"
oom_score = {{ containerd_oom_score }}
@@ -88,6 +90,12 @@
[plugins."io.containerd.nri.v1.nri"]
disable = {{ 'false' if nri_enabled else 'true' }}

[plugins."io.containerd.transfer.v1.local"]
[[plugins."io.containerd.transfer.v1.local".unpack_config]]
differ = ""
platform = "linux/amd64"
snapshotter = "{{ containerd_snapshotter }}"

Check failure on line 97 in roles/container-engine/containerd/templates/config.toml.j2

Claude / Claude Code Review

Hardcoded platform="linux/amd64" regresses arm64/arm hosts

The new `unpack_config` block hardcodes `platform = "linux/amd64"`, which regresses arm64/arm hosts: on aarch64 nodes the declared unpack platform won't match the host and containerd will reproduce the very "no unpack platforms defined" failure this PR is meant to fix. Replace with `platform = "linux/{{ image_arch }}"` (or `linux/{{ host_architecture }}`) to stay consistent with how the rest of the codebase handles arch-aware values.
Comment on lines +93 to +97

Extended reasoning...

What the bug is

The new [[plugins."io.containerd.transfer.v1.local".unpack_config]] block at roles/container-engine/containerd/templates/config.toml.j2:93-97 hardcodes:

platform = "linux/amd64"

This is the same template that is rendered on every containerd node regardless of CPU architecture.

Why it's wrong in this codebase

Kubespray is explicitly multi-arch. The defaults already provide arch-aware variables:

  • roles/kubespray_defaults/defaults/main/main.yml:734-743 defines the host_architecture mapping (x86_64 → amd64, aarch64 → arm64, armv7l → arm).
  • roles/kubespray_defaults/defaults/main/download.yml:75 defines image_arch defaulting to host_architecture.
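
As a rough illustration of the mapping listed above, the following Python sketch derives the platform string that the proposed `linux/{{ image_arch }}` line would render for each machine name. The function name `unpack_platform` is hypothetical, and the sketch assumes `image_arch` is left at its default of `host_architecture`; it also simply raises `KeyError` for unmapped machine names, which may differ from how the real defaults behave.

```python
# Mapping quoted in the review: machine name reported by the host -> arch label.
HOST_ARCHITECTURE = {
    "x86_64": "amd64",
    "aarch64": "arm64",
    "armv7l": "arm",
}

def unpack_platform(ansible_architecture: str) -> str:
    """Return the platform string for a host, assuming image_arch == host_architecture."""
    # Sketch-only behaviour: an unmapped machine name raises KeyError.
    image_arch = HOST_ARCHITECTURE[ansible_architecture]
    return f"linux/{image_arch}"

for machine in ("x86_64", "aarch64", "armv7l"):
    print(machine, "->", unpack_platform(machine))
```

Only the `x86_64` case yields `linux/amd64`, which is the value the PR currently hardcodes for every host.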

These are used pervasively throughout the project for arch-aware download URLs, checksums, and container image refs (kubelet, kubectl, etcd, cni, containerd itself, etc.). Hardcoding linux/amd64 in this template breaks that pattern.

The failure mode this re-introduces

containerd 2.x's transfer.v1.local selects an unpacker by matching the image's manifest platform against the configured unpack_config entries. If none match the host platform, containerd logs "Unpack configuration not supported, skipping" and pulls fail with "unable to initialize unpacker: no unpack platforms defined" — which is exactly the symptom the PR description cites as the motivation for this change. So on arm64/arm workers this PR replaces a previously-implicit working default with an explicitly-wrong configuration that resurfaces that exact failure.

Step-by-step proof on an aarch64 worker

  1. Operator runs the playbook against an aarch64 ZFS worker. Ansible facts set ansible_architecture = aarch64, so host_architecture = arm64 and image_arch = arm64.
  2. The containerd role renders /etc/containerd/config.toml from this template. The hardcoded line emits platform = "linux/amd64".
  3. containerd starts and registers a single unpacker for linux/amd64 with the configured snapshotter.
  4. kubelet asks containerd to pull a multi-arch sandbox/pod image. The CRI plugin resolves the manifest list and selects the linux/arm64 manifest for the host.
  5. transfer.v1.local walks its unpack_config list looking for an entry whose platform matches linux/arm64. None match (linux/amd64 ≠ linux/arm64).
  6. containerd logs "Unpack configuration not supported, skipping" and the pull fails with "unable to initialize unpacker: no unpack platforms defined" — the precise failure the PR set out to fix, now triggered on arm64 instead of being avoided.
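
The lookup described in steps 5 and 6 can be modelled with a deliberately simplified Python sketch. This is not containerd's code: `select_unpack_config` is a hypothetical function that compares platform strings for exact equality, whereas real platform matching may normalize values or consider variants. The `zfs` snapshotter value is only a placeholder for whatever `containerd_snapshotter` renders to.

```python
from typing import Optional

def select_unpack_config(unpack_configs: list[dict], image_platform: str) -> Optional[dict]:
    """Return the first entry whose platform equals image_platform, else None."""
    for entry in unpack_configs:
        if entry["platform"] == image_platform:
            return entry
    return None

# Single entry as rendered by the template in this PR (snapshotter is a placeholder).
hardcoded = [{"platform": "linux/amd64", "snapshotter": "zfs"}]
# Single entry as the proposed fix would render it on an arm64 host.
fixed_for_arm64 = [{"platform": "linux/arm64", "snapshotter": "zfs"}]

# Step 4 selects the linux/arm64 manifest on an aarch64 worker.
print(select_unpack_config(hardcoded, "linux/arm64"))        # None
print(select_unpack_config(fixed_for_arm64, "linux/arm64"))  # {'platform': 'linux/arm64', 'snapshotter': 'zfs'}
```

In this model the hardcoded entry never matches on an arm64 host, which corresponds to the "no unpack platforms defined" outcome described in step 6.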

Fix

One-line change:

platform = "linux/{{ image_arch }}"

(Or linux/{{ host_architecture }} — both resolve to the right value, and either matches the convention already used by the rest of the role.)


{% if containerd_tracing_enabled %}
[plugins."io.containerd.tracing.processor.v1.otlp"]
endpoint = "{{ containerd_tracing_endpoint }}"