test: Remove badNonce retries and increase nonce maxConnectionAge #8661
jsha left a comment:
Looks good in general but there is some weirdness with the diff showing changes from other PRs that were already merged.
@beautifulentropy, this PR appears to contain configuration and/or SQL schema changes. Please ensure that a corresponding deployment ticket has been filed with the new values.
All fixed, apologies for the force-push.
No need, we don't plan to make a corresponding change to this value in staging/production.
The nonce service's maxConnectionAge (30s) periodically results in a GOAWAY being sent to the WFE's gRPC connections, causing affected SubConns to briefly leave READY state while reconnecting. Due to jitter on maxConnectionAge, the getNonceService and redeemNonceService connections to the same backend can GOAWAY at slightly different times, creating a window where the WFE can still issue nonces from a backend it can no longer redeem against. The chisel2.py retry logic was added to paper over this, but retries mask real failures.
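To give a rough sense of how large that window can be, here is a minimal sketch that models the two connections' jittered lifetimes. It assumes the +/-10% jitter that grpc-go documents for `MaxConnectionAge`, and that both connections were established at about the same moment. The function names are hypothetical and are not part of Boulder or chisel2.py.

```python
import random


def jittered_age(max_connection_age_s, jitter_fraction=0.10, rng=random):
    """Return one connection's effective lifetime in seconds.

    Assumption: the server picks a lifetime uniformly within
    +/- jitter_fraction of the configured maxConnectionAge.
    """
    low = max_connection_age_s * (1 - jitter_fraction)
    high = max_connection_age_s * (1 + jitter_fraction)
    return rng.uniform(low, high)


def goaway_skew(max_connection_age_s, jitter_fraction=0.10, rng=random):
    """Sample the gap between the GOAWAYs on the get and redeem connections.

    Assumption: both connections to the backend start at the same time,
    so the gap is just the difference of their jittered lifetimes.
    """
    get_age = jittered_age(max_connection_age_s, jitter_fraction, rng)
    redeem_age = jittered_age(max_connection_age_s, jitter_fraction, rng)
    return abs(get_age - redeem_age)


def max_goaway_skew(max_connection_age_s, jitter_fraction=0.10):
    """Worst-case gap: one connection at -jitter, the other at +jitter."""
    return 2 * jitter_fraction * max_connection_age_s
```

Under these assumptions, `max_goaway_skew(30)` bounds the first-cycle gap at roughly six seconds. Later cycles can drift further apart because each reconnect draws a fresh jittered lifetime.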
Note: no corresponding change is made (or possible) in the Go integration tests because badNonce retries are handled internally by github.com/eggsampler/acme.
Since integration test runs complete well within 30 minutes, increasing maxConnectionAge to 30m ensures nonce connections are never cycled during a CI run, which should eliminate the flake.
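The arithmetic behind this can be sketched as follows. This is an illustrative upper bound only: it assumes the connection is established at the start of the run and that grpc-go's documented +/-10% jitter applies, so the shortest possible lifetime is 90% of maxConnectionAge. `expected_goaways` is a hypothetical helper, and the 10-minute run length in the usage note is an example value, not a measured one.

```python
def expected_goaways(run_duration_s, max_connection_age_s, jitter_fraction=0.10):
    """Upper bound on GOAWAYs a single connection can see during a run.

    Assumption: every lifetime is as short as the jitter allows, and
    reconnection is instantaneous.
    """
    shortest_lifetime_s = max_connection_age_s * (1 - jitter_fraction)
    return int(run_duration_s // shortest_lifetime_s)
```

For a hypothetical 10-minute run, `expected_goaways(600, 30)` allows a couple of dozen connection cycles with the old 30s setting, while `expected_goaways(600, 30 * 60)` allows none with the new 30m setting.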
A follow-up PR will address the underlying issue.
Part of #8662