Might be a bug: AllowedCPUs set to a fixed range due to endianness mismatch #986

@zehongwong

Description

Background

We experienced significant performance degradation for workloads running on sysbox after upgrading the OS on our k8s nodes from Ubuntu 20.04 (cgroup v1) to Ubuntu 22.04 (cgroup v2).

Potential Root Cause

  • When creating containers, sysbox writes AllowedCPUs into the transient systemd scope unit and its drop-in via the D-Bus APIs:
    • /run/systemd/transient/crio-<unit>.scope - created via StartTransientUnit
    • /run/systemd/transient/crio-<unit>.scope.d/50-AllowedCPUs.conf - set via SetUnitProperties
  • When building the above D-Bus requests, sysbox packs the CPU range into a byte stream using big-endian ordering.
  • However, based on systemd’s D-Bus CPU mask handling (see cpu-set-util.c), the D-Bus CPU mask representation appears to expect a little-endian byte stream. Using big-endian ordering therefore reverses the byte order and can cause the interpreted CPU range to drift. For example:
# big-endian encoding (current behavior)
CPU Range: 2-5      Bytes: 3c         Actual CPUs: [2 3 4 5]
CPU Range: 6-9      Bytes: 03 c0      Actual CPUs: [0 1 14 15]
CPU Range: 10-13    Bytes: 3c 00      Actual CPUs: [2 3 4 5]
CPU Range: 14-17    Bytes: 03 c0 00   Actual CPUs: [0 1 14 15]

# little-endian encoding (systemd expectation)
CPU Range: 2-5      Bytes: 3c         Actual CPUs: [2 3 4 5]
CPU Range: 6-9      Bytes: c0 03      Actual CPUs: [6 7 8 9]
CPU Range: 10-13    Bytes: 00 3c      Actual CPUs: [10 11 12 13]
CPU Range: 14-17    Bytes: 00 c0 03   Actual CPUs: [14 15 16 17]
  • Only with cgroup v2 is EffectiveCPUs constrained by AllowedCPUs (ref) once systemd reloads. The wrong range then leaves CPUs idle and forces workloads to contend for an overlapping CPU range, resulting in performance degradation.
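
The byte tables above can be reproduced with a short Go sketch. It is an illustration only, not the sysbox or systemd code: the helper names (encodeBigEndian, reversed, decodeLittleEndian) are hypothetical, and the decoder assumes the little-endian layout described above, where bit j of byte i stands for CPU i*8+j.

```go
package main

import (
	"fmt"
	"math/big"
)

// encodeBigEndian packs CPUs first..last (inclusive) into a bitmask and
// returns it most-significant byte first (hypothetical helper).
func encodeBigEndian(first, last int) []byte {
	mask := new(big.Int)
	for cpu := first; cpu <= last; cpu++ {
		mask.SetBit(mask, cpu, 1)
	}
	return mask.Bytes() // big.Int.Bytes returns big-endian bytes
}

// reversed returns a byte-reversed copy, turning the big-endian stream
// into a little-endian one.
func reversed(b []byte) []byte {
	out := make([]byte, len(b))
	for i, v := range b {
		out[len(b)-1-i] = v
	}
	return out
}

// decodeLittleEndian reads the stream under the assumed little-endian
// layout: bit j of byte i maps to CPU i*8+j.
func decodeLittleEndian(b []byte) []int {
	var cpus []int
	for i, v := range b {
		for j := 0; j < 8; j++ {
			if (v>>uint(j))&1 == 1 {
				cpus = append(cpus, i*8+j)
			}
		}
	}
	return cpus
}

func main() {
	for _, r := range [][2]int{{2, 5}, {6, 9}, {10, 13}, {14, 17}} {
		be := encodeBigEndian(r[0], r[1])
		le := reversed(be)
		fmt.Printf("range %d-%d: big-endian % x -> %v, little-endian % x -> %v\n",
			r[0], r[1], be, decodeLittleEndian(be), le, decodeLittleEndian(le))
	}
}
```

Only the single-byte range 2-5 survives the big-endian encoding unchanged, because reversing a one-byte stream is a no-op.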

Direct Proof

Feeding the byte streams from the sysbox unit tests (cpuset_test.go) into the systemd D-Bus function that parses the CPU set shows the drift (see C program). The correct CPU set is recovered only after reversing the byte string.

Reproduce

  1. Enable the static CPU manager policy on kubelet
--cpu-cfs-quota=false --cpu-manager-policy=static
  2. Create 3 runners on sysbox, each with 4 static CPUs
template:
  metadata:
    annotations:
      io.kubernetes.cri-o.userns-mode: auto:size=65536
  spec:
    runtimeClassName: sysbox-runc
    containers:
      - name: runner-x
        resources:
          limits:
            cpu: "4"
            memory: 256Mi
          requests:
            cpu: "4"
            memory: 256Mi
  3. Check the initial state: the effective CPU set matches the kubelet assignment for runner-3 (10-13), although AllowedCPUs already reads 2-5
cat /var/lib/kubelet/cpu_manager_state
{"policyName":"static","defaultCpuSet":"0-1,34-63","entries":{"xx":{"runner-1":"2-5","runner-2":"6-9","runner-3":"10-13"}},"checksum":2693667676}

systemctl show crio-<unit>.scope -p EffectiveCPUs,AllowedCPUs
EffectiveCPUs=10-13
AllowedCPUs=2-5

cat /sys/fs/cgroup/kubepods.slice/kubepods-pod<uid>.slice/crio-<unit>.scope/cpuset.cpus
10-13
cat /sys/fs/cgroup/kubepods.slice/kubepods-pod<uid>.slice/crio-<unit>.scope/cpuset.cpus.effective
10-13

# Set by dbus StartTransientUnit
cat /run/systemd/transient/crio-<unit>.scope | grep AllowedCPUs
AllowedCPUs=2-5

# Set by dbus SetUnitProperties
cat /run/systemd/transient/crio-<unit>.scope.d/50-AllowedCPUs.conf  | grep AllowedCPUs
AllowedCPUs=2-5
  4. Trigger a systemd reload
systemctl daemon-reload
  5. The effective CPU set then drifts to the wrong range 2-5
systemctl show crio-<unit>.scope -p EffectiveCPUs,AllowedCPUs
EffectiveCPUs=2-5
AllowedCPUs=2-5

cat /sys/fs/cgroup/kubepods.slice/kubepods-pod<uid>.slice/crio-<unit>.scope/cpuset.cpus
2-5
cat /sys/fs/cgroup/kubepods.slice/kubepods-pod<uid>.slice/crio-<unit>.scope/cpuset.cpus.effective
2-5
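
The manual comparison in the steps above could be automated with a small Go check. This is a rough sketch under assumptions: parseCPUList and drifted are hypothetical helper names, and the inputs are the CPU list strings copied from the outputs above, where in practice they would be read from cpu_manager_state and cpuset.cpus.effective.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUList expands a kernel-style CPU list such as "0-1,34-63"
// into a set of CPU numbers (hypothetical helper).
func parseCPUList(s string) (map[int]bool, error) {
	cpus := make(map[int]bool)
	s = strings.TrimSpace(s)
	if s == "" {
		return cpus, nil
	}
	for _, part := range strings.Split(s, ",") {
		lo, hi, isRange := strings.Cut(part, "-")
		first, err := strconv.Atoi(strings.TrimSpace(lo))
		if err != nil {
			return nil, fmt.Errorf("bad CPU list %q: %w", s, err)
		}
		last := first
		if isRange {
			if last, err = strconv.Atoi(strings.TrimSpace(hi)); err != nil {
				return nil, fmt.Errorf("bad CPU list %q: %w", s, err)
			}
		}
		if last < first {
			return nil, fmt.Errorf("bad CPU range %q", part)
		}
		for cpu := first; cpu <= last; cpu++ {
			cpus[cpu] = true
		}
	}
	return cpus, nil
}

// drifted reports whether the effective CPU list differs from the list
// the kubelet assigned to the container.
func drifted(assigned, effective string) (bool, error) {
	a, err := parseCPUList(assigned)
	if err != nil {
		return false, err
	}
	e, err := parseCPUList(effective)
	if err != nil {
		return false, err
	}
	if len(a) != len(e) {
		return true, nil
	}
	for cpu := range a {
		if !e[cpu] {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// Values for runner-3 copied from the outputs above.
	before, _ := drifted("10-13", "10-13") // before daemon-reload
	after, _ := drifted("10-13", "2-5")    // after daemon-reload
	fmt.Println("drift before reload:", before, "drift after reload:", after)
}
```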

Potential Fix

zehongwong/sysbox-runc#1
