[bpf-ci-bot] tc_tunnel/udp_mpls times out on aarch64 due to insufficient connect timeout #475

@kernel-patches-review-bot

Summary

The tc_tunnel/udp_mpls subtest fails intermittently on aarch64: the client connect() returns -1 with errno EINPROGRESS (115) because the 1000ms timeout is too short for the FOU+MPLS+ipip tunnel path to become operational under QEMU emulation. Raising the timeout to 2000ms and using the TIMEOUT_MS macro on both the client and server side fixes the issue.

Failure Details

Root Cause Analysis

The tc_tunnel test suite validates BPF-based tunnel encapsulation/decapsulation. The udp_mpls subtest has the most complex kernel decapsulation path, requiring:

  1. FOU (Foo-over-UDP) rx port registration (ip fou add port 6635 ipproto 137)
  2. An ipip tunnel in mode any ttl 255
  3. MPLS label table setup (65536 entries, route, loopback, testtun0 input enabled)
  4. Reverse path filter disabled

After configure_kernel_decapsulation() completes all setup commands and brings testtun0 up, the test immediately tries to establish a TCP connection through the tunnel. On aarch64 QEMU, the kernel needs more time after the interface comes up before the full data path (BPF encap → FOU/MPLS encap → veth → FOU decap → MPLS routing → server) is operational.

The client connect() call in connect_client_to_server() (test_tc_tunnel.c:171) runs with SO_SNDTIMEO set to 1000ms. When the tunnel path isn't ready within this window, connect() returns -1 with errno set to EINPROGRESS and the test fails.
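
To illustrate the failure mode, here is a minimal sketch of how a send timeout bounds a blocking connect() and how the resulting error can be told apart from a refused or unreachable connection. The helper names (ms_to_timeval, set_connect_timeout_ms, connect_timed_out) are hypothetical and are not taken from test_tc_tunnel.c; the sketch assumes a blocking socket on Linux, where socket(7) documents EINPROGRESS as the errno for a connect() that hits SO_SNDTIMEO.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <sys/socket.h>
#include <sys/time.h>

/* Hypothetical helper: convert a millisecond budget into the
 * struct timeval that SO_SNDTIMEO expects.
 */
static struct timeval ms_to_timeval(long ms)
{
	struct timeval tv;

	assert(ms >= 0);
	tv.tv_sec = ms / 1000;
	tv.tv_usec = (ms % 1000) * 1000;
	return tv;
}

/* Hypothetical helper: bound blocking sends on fd, and therefore a
 * blocking connect(), to timeout_ms. Returns 0 on success, -1 on error.
 */
static int set_connect_timeout_ms(int fd, long timeout_ms)
{
	struct timeval tv = ms_to_timeval(timeout_ms);

	return setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv));
}

/* Hypothetical helper: true when a failed blocking connect() means
 * "the SO_SNDTIMEO budget expired" rather than ECONNREFUSED,
 * ENETUNREACH, or another hard error.
 */
static bool connect_timed_out(int ret, int err)
{
	return ret < 0 && err == EINPROGRESS;
}
```

A caller would invoke set_connect_timeout_ms() before connect() and pass the return value together with the saved errno to connect_timed_out(). Distinguishing the timeout case this way makes it clear in a log whether the path was merely slow to come up or actually broken.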

Contributing factors:

  • Commit 2790db208b44 increased the timeout from 500ms to 1000ms, but the change was only tested on x86_64, not on aarch64 QEMU
  • connect_client_to_server() used a hardcoded 1000 instead of the TIMEOUT_MS macro, meaning the timeout wasn't consolidated with the server-side timeout
  • udp_mpls is the only subtest combining all three of: FOU, MPLS, and kernel decapsulation (other MPLS variants either skip kernel decap via expect_kern_decap_failure or don't use FOU)

Proposed Fix

The patch (0001-selftests-bpf-Fix-tc_tunnel-udp_mpls-timeout-on-aarc.patch) makes two changes:

  1. Increase TIMEOUT_MS from 1000 to 2000 — provides sufficient margin for tunnel path convergence on emulated architectures while still detecting genuine connectivity failures promptly
  2. Use TIMEOUT_MS in connect_client_to_server() instead of the hardcoded 1000 — ensures server and client timeouts are consistent and controlled by a single macro
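
The shape of the two changes can be sketched as follows. This is not the patch itself: apply_shared_timeout() is a hypothetical helper invented for illustration, and the actual test may pass the value through a different mechanism. Only the TIMEOUT_MS name and the 2000 value come from this issue.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/time.h>

/* Value proposed in this issue; the previous value was 1000. */
#define TIMEOUT_MS 2000

static_assert(TIMEOUT_MS > 0, "timeout budget must be positive");

/* Hypothetical helper: derive both the receive and the send timeout
 * from the single TIMEOUT_MS macro so that client and server sockets
 * cannot drift apart. Returns 0 on success, -1 on error.
 */
static int apply_shared_timeout(int fd)
{
	struct timeval tv = {
		.tv_sec = TIMEOUT_MS / 1000,
		.tv_usec = (TIMEOUT_MS % 1000) * 1000,
	};

	if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)))
		return -1;
	if (setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv)))
		return -1;
	return 0;
}
```

With a single definition, a later adjustment for slower emulated targets only has to touch one line.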

Impact

When this test fails, it is the sole failure in the CI run (1 FAILED out of 6202 subtests), causing the entire aarch64 test_progs job to report failure. This creates noise for unrelated patch submissions and wastes reviewer time investigating spurious failures.
