Background
We experienced significant performance degradation for workloads running on sysbox after upgrading the Kubernetes node OS from Ubuntu 20.04 (cgroup v1) to Ubuntu 22.04 (cgroup v2).
Potential Root Cause
- When creating containers, sysbox writes AllowedCPUs into the transient systemd scope unit and its drop-in via the D-Bus APIs:
  - /run/systemd/transient/crio-<unit>.scope - created via StartTransientUnit
  - /run/systemd/transient/crio-<unit>.scope.d/50-AllowedCPUs.conf - set via SetUnitProperties
- When building these D-Bus requests, sysbox packs the CPU range into a byte stream using big-endian ordering.
- However, based on systemd's D-Bus CPU mask handling (see cpu-set-util.c), the D-Bus CPU mask representation appears to expect a little-endian byte stream. Using big-endian ordering therefore reverses the byte order and can cause the interpreted CPU range to drift. For example (a small Go sketch reproducing both encodings follows the table below):
# big-endian encoding (current behavior)
CPU Range: 2-5 Bytes: 3c Actual CPUs: [2 3 4 5]
CPU Range: 6-9 Bytes: 03 c0 Actual CPUs: [0 1 14 15]
CPU Range: 10-13 Bytes: 3c 00 Actual CPUs: [2 3 4 5]
CPU Range: 14-17 Bytes: 03 c0 00 Actual CPUs: [0 1 14 15]
# little-endian encoding (systemd expectation)
CPU Range: 2-5 Bytes: 3c Actual CPUs: [2 3 4 5]
CPU Range: 6-9 Bytes: c0 03 Actual CPUs: [6 7 8 9]
CPU Range: 10-13 Bytes: 00 3c Actual CPUs: [10 11 12 13]
CPU Range: 14-17 Bytes: 00 c0 03 Actual CPUs: [14 15 16 17]
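The drift above can be reproduced with a few lines of Go. This is a minimal sketch, not sysbox's actual packing code: it builds the bitmask for an inclusive CPU range and emits the bytes in either order.

```go
package main

import "fmt"

// packCPUs builds a CPU bitmask for the inclusive range [lo, hi]: bit i of the
// mask corresponds to CPU i, with byte 0 covering CPUs 0-7, byte 1 covering
// CPUs 8-15, and so on. bigEndian=true reverses the byte order, mimicking the
// current behavior described above.
func packCPUs(lo, hi int, bigEndian bool) []byte {
	mask := make([]byte, hi/8+1)
	for cpu := lo; cpu <= hi; cpu++ {
		mask[cpu/8] |= 1 << (cpu % 8)
	}
	if bigEndian {
		for i, j := 0, len(mask)-1; i < j; i, j = i+1, j-1 {
			mask[i], mask[j] = mask[j], mask[i]
		}
	}
	return mask
}

func main() {
	for _, r := range [][2]int{{2, 5}, {6, 9}, {10, 13}, {14, 17}} {
		fmt.Printf("CPU Range: %d-%d  big-endian: % x  little-endian: % x\n",
			r[0], r[1], packCPUs(r[0], r[1], true), packCPUs(r[0], r[1], false))
	}
}
```

The big-endian column matches the byte streams described above as the current behavior; the little-endian column is what systemd decodes back into the intended range.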
- Only with cgroup v2 is EffectiveCPUs constrained by AllowedCPUs (ref) after systemd reloads, which causes significant CPU idling and forces workloads to contend for an overlapping CPU range, resulting in performance degradation.
Direct Proof
Using the byte streams from the sysbox unit tests (cpuset_test.go) as input and the systemd D-Bus function to retrieve the actual CPU set, we can observe the drift (see the C program). Only when reversing the byte stream do we retrieve the correct CPU set.
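As a companion to the C program, here is a minimal Go sketch of the same check. It assumes systemd's interpretation of the mask (bit b of byte i selects CPU i*8+b); the example bytes are those produced for CPUs 10-13 by the current big-endian packing.

```go
package main

import "fmt"

// decodeCPUs interprets a CPU-mask byte stream the way systemd's
// cpu-set-util.c does: bit b of byte i selects CPU i*8+b.
func decodeCPUs(mask []byte) []int {
	var cpus []int
	for i, b := range mask {
		for bit := 0; bit < 8; bit++ {
			if b&(1<<bit) != 0 {
				cpus = append(cpus, i*8+bit)
			}
		}
	}
	return cpus
}

// reverse returns a copy of b with the byte order flipped.
func reverse(b []byte) []byte {
	out := make([]byte, len(b))
	for i := range b {
		out[len(b)-1-i] = b[i]
	}
	return out
}

func main() {
	// Byte stream the current big-endian packing produces for CPUs 10-13.
	sent := []byte{0x3c, 0x00}
	fmt.Println("as sent:  ", decodeCPUs(sent))          // [2 3 4 5]     - drifted
	fmt.Println("reversed: ", decodeCPUs(reverse(sent))) // [10 11 12 13] - correct
}
```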
Reproduce
- Enable the static CPU manager policy on the kubelet
--cpu-cfs-quota=false --cpu-manager-policy=static
- Create 3 runners with 4 static CPUs each, running on sysbox
template:
  metadata:
    annotations:
      io.kubernetes.cri-o.userns-mode: auto:size=65536
  spec:
    runtimeClassName: sysbox-runc
    containers:
      - name: runner-x
        resources:
          limits:
            cpu: "4"
            memory: 256Mi
          requests:
            cpu: "4"
            memory: 256Mi
- Get the initial state: the effective CPUs correctly match the kubelet setup
cat /var/lib/kubelet/cpu_manager_state
{"policyName":"static","defaultCpuSet":"0-1,34-63","entries":{"xx":{"runner-1":"2-5","runner-2":"6-9","runner-3":"10-13"}},"checksum":2693667676}
systemctl show crio-<unit>.scope -p EffectiveCPUs,AllowedCPUs
EffectiveCPUs=10-13
AllowedCPUs=2-5
cat /sys/fs/cgroup/kubepods.slice/kubepods-pod<uid>.slice/crio-<unit>.scope/cpuset.cpus
10-13
cat /sys/fs/cgroup/kubepods.slice/kubepods-pod<uid>.slice/crio-<unit>.scope/cpuset.cpus.effective
10-13
# Set by dbus StartTransientUnit
cat /run/systemd/transient/crio-<unit>.scope | grep AllowedCPUs
AllowedCPUs=2-5
# Set by dbus SetUnitProperties
cat /run/systemd/transient/crio-<unit>.scope.d/50-AllowedCPUs.conf | grep AllowedCPUs
AllowedCPUs=2-5
- Trigger a systemd reload
systemctl daemon-reload
- The effective CPUs then drift into the wrong range
systemctl show crio-<unit>.scope -p EffectiveCPUs,AllowedCPUs
EffectiveCPUs=2-5
AllowedCPUs=2-5
cat /sys/fs/cgroup/kubepods.slice/kubepods-pod<uid>.slice/crio-<unit>.scope/cpuset.cpus
2-5
cat /sys/fs/cgroup/kubepods.slice/kubepods-pod<uid>.slice/crio-<unit>.scope/cpuset.cpus.effective
2-5
Potential Fix
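Based on the analysis above, one direction would be to pack (or re-pack) the AllowedCPUs byte stream in little-endian order before handing it to systemd over D-Bus. Below is a minimal sketch of such a packer; the function name is hypothetical and the actual sysbox helper exercised by cpuset_test.go may look different.

```go
package main

import (
	"fmt"
	"strings"
)

// rangeToLittleEndianBits is a hypothetical stand-in for sysbox's packing
// helper. It converts a cpuset string such as "2-5,8" into the byte stream
// systemd expects for AllowedCPUs: byte 0 covers CPUs 0-7, byte 1 covers
// CPUs 8-15, and so on -- no byte reversal.
func rangeToLittleEndianBits(cpus string) ([]byte, error) {
	var mask []byte
	for _, part := range strings.Split(cpus, ",") {
		lo, hi := 0, 0
		if strings.Contains(part, "-") {
			if _, err := fmt.Sscanf(part, "%d-%d", &lo, &hi); err != nil {
				return nil, err
			}
		} else {
			if _, err := fmt.Sscanf(part, "%d", &lo); err != nil {
				return nil, err
			}
			hi = lo
		}
		for cpu := lo; cpu <= hi; cpu++ {
			// Grow the mask as needed, then set the bit for this CPU.
			for len(mask) <= cpu/8 {
				mask = append(mask, 0)
			}
			mask[cpu/8] |= 1 << (cpu % 8)
		}
	}
	return mask, nil
}

func main() {
	mask, _ := rangeToLittleEndianBits("10-13")
	fmt.Printf("% x\n", mask) // 00 3c -> systemd decodes this back to CPUs 10-13
}
```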