[bpf-ci-bot] fd_array_cnt/referenced_btfs flaky test: insufficient wait timeout #484

@kernel-patches-review-bot

Summary

The check_fd_array_cnt__referenced_btfs subtest has a polling loop that waits for a BTF object to be freed after closing the owning BPF program. The loop comment states "max ~1 second", but the loop iterates only about 100 times with 1ms sleeps (roughly 100ms in practice). This causes flaky failures under CI load whenever the asynchronous BPF program free path (RCU + workqueue) takes longer than 100ms.

Failure Details

  • Test / Component: fd_array_cnt/referenced_btfs (test #115/5 in test_progs)
  • Frequency: Rare — observed in 1 out of ~30 CI runs examined (aarch64 cpuv4, May 10 2026)
  • Failure mode: Flaky — timing-dependent false failure when workqueue is slow
  • Affected architectures: All (observed on aarch64, theoretically any under load)
  • CI runs observed:

Root Cause Analysis

In tools/testing/selftests/bpf/prog_tests/fd_array.c:330:

/* The program is freed by a workqueue, so no reliable
 * way to sync, so just wait a bit (max ~1 second). */
for (tries = 100; tries >= 0; tries--) {
    usleep(1000);
    ...
}
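The gap between the comment and the loop bound can be checked with a little arithmetic. The sketch below is illustrative only: worst_case_wait_us is a hypothetical helper, not part of the selftest, and it assumes the loop sleeps on every iteration and never exits early. It also ignores scheduler overshoot, so the result is a lower bound on the real wait.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not in fd_array.c): minimum total sleep time, in
 * microseconds, of a loop shaped like
 *
 *     for (tries = start; tries >= 0; tries--)
 *             usleep(sleep_us);
 *
 * when the exit condition is never satisfied.
 */
static long worst_case_wait_us(long start, long sleep_us)
{
	/* tries counts from start down to 0 inclusive: start + 1 passes */
	long iterations = start + 1;

	return iterations * sleep_us;
}

/* Print the budget in milliseconds for a given starting value of tries. */
static void print_budget(long start)
{
	printf("tries = %ld: at least %ld ms\n",
	       start, worst_case_wait_us(start, 1000) / 1000);
}
```

Calling print_budget(100) reports a budget of at least 101 ms, while print_budget(1000) reports at least 1001 ms, which is the value the "max ~1 second" comment describes.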

The BPF program free path after close(prog_fd) is:

  1. bpf_prog_put → __bpf_prog_put → (potentially schedule_work)
  2. __bpf_prog_put_noref → call_rcu_tasks_trace or call_rcu
  3. RCU callback → __bpf_prog_put_rcu → schedule_work
  4. Workqueue → bpf_prog_free_deferred → bpf_free_used_btfs

This involves at least one RCU grace period plus workqueue scheduling. Under load, RCU grace periods alone can exceed 100ms, making the 100ms total timeout insufficient.

The comment says "max ~1 second" which matches the pattern in prog_tests/exe_ctx.c:45:

usleep(1000); /* Wait 1ms per iteration, up to 1 sec total */

...where the loop runs 1000 iterations (not 100).

Proposed Fix

Change tries = 100 to tries = 1000 so the actual maximum wait matches the documented intent of ~1 second. This is a 10x increase in maximum wait time, which costs nothing in the success path (the loop breaks immediately once BTF is freed) but prevents false failures under load.
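One way to keep the comment and the bound from drifting apart again is to derive both from named constants. The following is a minimal userspace sketch of that idea, not the proposed patch: poll_until, countdown_done, POLL_INTERVAL_US and POLL_MAX_TRIES are all hypothetical names, and the predicate is a stand-in for the real "is the BTF object gone yet" check in the test.

```c
#include <stdbool.h>
#include <unistd.h>

#define POLL_INTERVAL_US 1000	/* sleep 1ms between checks */
#define POLL_MAX_TRIES   1000	/* 1000 x 1ms = ~1 second in total */

/*
 * Hypothetical helper: call done(ctx) until it returns true or the
 * budget of POLL_MAX_TRIES sleeps is used up. Returns true on success.
 * The success path pays nothing extra: it returns on the first check
 * that passes.
 */
static bool poll_until(bool (*done)(void *ctx), void *ctx)
{
	int tries;

	for (tries = 0; tries < POLL_MAX_TRIES; tries++) {
		if (done(ctx))
			return true;
		usleep(POLL_INTERVAL_US);
	}
	return done(ctx);	/* one last check after the final sleep */
}

/*
 * Stand-in predicate for illustration: reports "not done" for the first
 * *ctx calls, then "done". It models an asynchronous free that completes
 * after a given number of polling intervals.
 */
static bool countdown_done(void *ctx)
{
	int *remaining = ctx;

	if (*remaining > 0) {
		(*remaining)--;
		return false;
	}
	return true;
}
```

With a counter of 150, poll_until(countdown_done, &counter) succeeds under the 1000-try budget, whereas the same simulated delay would exhaust a 100-try budget.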

Patch file: 0001-selftests-bpf-Fix-insufficient-wait-timeout-in-fd_a.patch

Impact

Without this fix, the test will continue to flake under CI load. While currently rare, the failure rate will increase as CI workloads grow or KASAN/debug options slow the kernel. The test provides important coverage for fd_array BTF reference counting behavior, so it should be reliable rather than denylisted.

References

  • tools/testing/selftests/bpf/prog_tests/fd_array.c:330 (the bug)
  • tools/testing/selftests/bpf/prog_tests/exe_ctx.c:45 (correct pattern for comparison)
  • Commit 1c593d7402b1 ("selftests/bpf: Add tests for fd_array_cnt") — original introduction
  • kernel/bpf/syscall.c:2404 (__bpf_prog_put — the async free path)
