Summary
The tc_tunnel test's IPv6 GRE subtests (ip6gre_none, ip6gre_eth, ip6gre_mpls) intermittently fail on s390x with EINPROGRESS (errno 115) because the 1000ms connect timeout is insufficient for IPv6 neighbor resolution on s390x QEMU after repeated address flush/re-add cycles.
Failure Details
- Test / Component: tc_tunnel/ip6gre_none, tc_tunnel/ip6gre_eth, tc_tunnel/ip6gre_mpls
- Frequency: Occasional on s390x; observed in 3 out of 3 independent s390x test_progs/test_progs_cpuv4 runs on May 3, 2026
- Failure mode: Flaky. connect() returns EINPROGRESS (errno 115) due to SO_SNDTIMEO expiry during the initial "connect without any encap" phase
- Affected architectures: s390x (QEMU emulation)
- CI runs observed:
Root Cause Analysis
The tc_tunnel test iterates through 18 tunnel subtests. Each subtest cycle calls subtest_cleanup() which flushes all addresses (ip a flush), then subtest_setup() which re-adds IPv6 addresses with nodad. The ip6gre subtests (indices 10-12 in the subtest array) run after 9 prior subtests have each performed this flush/re-add cycle.
On s390x QEMU, after this repeated churn of the kernel's IPv6 neighbor state, resolving the link-layer address for the veth peer (fd::2) via Neighbor Solicitation occasionally takes longer than 1000ms. The connect_client_to_server() function (test_tc_tunnel.c:171) used a hardcoded 1000 instead of the TIMEOUT_MS macro, and the macro itself was only 1000ms.
Key observations:
- The failure always occurs in the "connect without any encap" phase (before any tunnel is configured), confirming it's a basic IPv6 connectivity timing issue, not a tunnel convergence problem
- Exactly one of the three ip6gre subtests fails per run (random selection), and the others pass — indicating the issue self-resolves within the time it takes to run the other subtests
- Earlier IPv6 subtests (ip6tnl at index 2, ip6vxlan at index 6) never fail because fewer flush/re-add cycles have occurred by that point
- Later IPv6 subtests (ip6udp at indices 16-18) never fail because by then neighbor state has stabilized
The existing commit 2790db208b44 ("selftests/bpf: Improve tc_tunnel test reliability") increased TIMEOUT_MS from 500ms to 1000ms but left connect_client_to_server() using a separate hardcoded value and didn't account for s390x QEMU overhead.
Proposed Fix
The patch (0001-selftests-bpf-Fix-tc_tunnel-ip6gre-timeout-on-s390x.patch) makes two changes:
- Increase TIMEOUT_MS from 1000ms to 2000ms: provides sufficient margin for IPv6 neighbor resolution on emulated architectures after repeated address churn
- Use TIMEOUT_MS in connect_client_to_server() instead of the hardcoded 1000: ensures all timeout values are consolidated and can be tuned via a single macro
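The consolidation idea behind the two changes can be sketched as follows. This is an illustration under stated assumptions, not the patch itself: timeout_tv() and apply_timeouts() are hypothetical helpers, and only the TIMEOUT_MS name and its proposed 2000ms value come from this report.

```c
#include <sys/socket.h>
#include <sys/time.h>

#define TIMEOUT_MS 2000 /* value proposed in this report */

/* Hypothetical helper: derive the timeval from the single macro. */
static struct timeval timeout_tv(void)
{
	struct timeval tv = {
		.tv_sec = TIMEOUT_MS / 1000,
		.tv_usec = (TIMEOUT_MS % 1000) * 1000,
	};

	return tv;
}

/* Hypothetical helper: apply the shared timeout to send and receive. */
static int apply_timeouts(int fd)
{
	struct timeval tv = timeout_tv();

	if (setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv)) < 0)
		return -1;

	return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
}
```

Because every timeout is derived from TIMEOUT_MS, a later adjustment for slow emulated targets is a one-line change and no call site can drift out of sync with the macro.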
Impact
Without the fix, the ip6gre subtests will continue to randomly fail on s390x, causing approximately 1 spurious test failure per 3 s390x CI runs. This creates noise that masks real regressions and wastes developer time investigating false failures.
References
- tc_tunnel/udp_mpls on aarch64 (same root cause, different arch/subtest)
- 2790db208b44: Previous reliability improvement that raised timeout from 500ms to 1000ms