[bpf-ci-bot] tc_tunnel/ip6gre subtests flake on s390x due to connect timeout #483

@kernel-patches-review-bot

Summary

The tc_tunnel test's IPv6 GRE subtests (ip6gre_none, ip6gre_eth, ip6gre_mpls) intermittently fail on s390x with EINPROGRESS (errno 115). The 1000ms connect timeout is too short for IPv6 neighbor resolution on s390x QEMU after repeated address flush/re-add cycles.

Failure Details

Root Cause Analysis

The tc_tunnel test iterates through 18 tunnel subtests. Each subtest cycle calls subtest_cleanup(), which flushes all addresses (ip a flush), and then subtest_setup(), which re-adds the IPv6 addresses with nodad. The ip6gre subtests (indices 10-12 in the subtest array) run after 9 prior subtests have each performed this flush/re-add cycle.

On s390x QEMU, after this repeated churn of the kernel's IPv6 neighbor state, resolving the link-layer address for the veth peer (fd::2) via Neighbor Solicitation occasionally takes longer than 1000ms. The connect_client_to_server() function (test_tc_tunnel.c:171) uses a hardcoded 1000 instead of the TIMEOUT_MS macro, and the macro itself is only 1000ms.
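
One plausible reading of the EINPROGRESS errno: socket(7) documents that a blocking connect() on a socket with SO_SNDTIMEO set fails with EINPROGRESS once the timeout expires. The sketch below assumes the selftest's connect path applies its timeout that way. connect_with_timeout() is a hypothetical helper written for illustration, not code from test_tc_tunnel.c.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/time.h>

/*
 * Hypothetical helper (not from the selftest): blocking connect() bounded
 * by a send timeout. Returns 0 on success or a negative errno value.
 */
static int connect_with_timeout(int fd, const struct sockaddr *addr,
				socklen_t addrlen, int timeout_ms)
{
	struct timeval tv;

	assert(timeout_ms >= 0);
	tv.tv_sec = timeout_ms / 1000;
	tv.tv_usec = (timeout_ms % 1000) * 1000;

	/* SO_SNDTIMEO also bounds how long a blocking connect() may wait. */
	if (setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv)) < 0)
		return -errno;

	/*
	 * If the peer does not answer before the timeout expires, connect()
	 * fails and errno is EINPROGRESS, per socket(7).
	 */
	if (connect(fd, addr, addrlen) < 0)
		return -errno;

	return 0;
}
```

Under that assumption, a return of -EINPROGRESS means the peer did not answer within timeout_ms, which matches the symptom seen when neighbor resolution for fd::2 is slow.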

Key observations:

  • The failure always occurs in the "connect without any encap" phase (before any tunnel is configured), confirming it's a basic IPv6 connectivity timing issue, not a tunnel convergence problem
  • In each failing run, exactly one of the three ip6gre subtests fails (which one varies) and the other two pass, indicating that the issue resolves itself within the time it takes to run the other subtests
  • Earlier IPv6 subtests (ip6tnl at index 2, ip6vxlan at index 6) never fail because fewer flush/re-add cycles have occurred by that point
  • Later IPv6 subtests (ip6udp at indices 16-18) never fail because by then neighbor state has stabilized

The existing commit 2790db208b44 ("selftests/bpf: Improve tc_tunnel test reliability") increased TIMEOUT_MS from 500ms to 1000ms but left connect_client_to_server() using a separate hardcoded value and didn't account for s390x QEMU overhead.

Proposed Fix

The patch (0001-selftests-bpf-Fix-tc_tunnel-ip6gre-timeout-on-s390x.patch) makes two changes:

  1. Increase TIMEOUT_MS from 1000ms to 2000ms — provides sufficient margin for IPv6 neighbor resolution on emulated architectures after repeated address churn
  2. Use TIMEOUT_MS in connect_client_to_server() instead of hardcoded 1000 — ensures all timeout values are consolidated and can be tuned via a single macro
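
The two changes can be sketched as follows. This is an illustration rather than the patch itself: struct conn_opts, default_conn_opts(), apply_timeouts() and send_timeout_ms() are hypothetical names, and only TIMEOUT_MS and its proposed value of 2000 come from the description above.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/time.h>

/* Proposed value from this issue: one macro for every timeout in the test. */
#define TIMEOUT_MS 2000

/* Hypothetical stand-in for the options passed to the connect helper. */
struct conn_opts {
	int timeout_ms;
};

/* Build connect options from the macro instead of a hardcoded 1000. */
static struct conn_opts default_conn_opts(void)
{
	struct conn_opts opts = { .timeout_ms = TIMEOUT_MS };

	return opts;
}

/* Apply the same timeout to both directions of a socket. */
static int apply_timeouts(int fd, int timeout_ms)
{
	struct timeval tv;

	assert(timeout_ms >= 0);
	tv.tv_sec = timeout_ms / 1000;
	tv.tv_usec = (timeout_ms % 1000) * 1000;

	if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0)
		return -1;
	if (setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv)) < 0)
		return -1;
	return 0;
}

/* Read back the send timeout in milliseconds, or -1 on error. */
static int send_timeout_ms(int fd)
{
	struct timeval tv;
	socklen_t len = sizeof(tv);

	if (getsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, &len) < 0)
		return -1;
	return (int)(tv.tv_sec * 1000 + tv.tv_usec / 1000);
}
```

With every caller deriving its timeout from TIMEOUT_MS, a later adjustment for slow emulated targets touches one line instead of several call sites.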

Impact

Without the fix, the ip6gre subtests will continue to fail randomly on s390x, causing roughly one spurious test failure in every three s390x CI runs. This noise masks real regressions and wastes developer time on investigating false failures.

References
