Configure retries/backoff for gitlab-runner k8s API requests #1176
Open
mvandenburgh wants to merge 1 commit into main
jjnesbitt reviewed on Aug 1, 2025
Comment on lines +104 to +115

    retry_backoff_max = 30000

    # This is the default retry limit. We override this for specific classes of
    # errors below.
    retry_limit = 5

    [runners.kubernetes.retry_limits]
      # Retry this type of error 10 times instead of 5.
      # This error usually occurs when the EKS API server times out or
      # is unreachable. Presumably the server will eventually become
      # available again, so we want to give the pod plenty of time to retry.
      "tls: internal error" = 10

Collaborator
retry_backoff_max seems to just control the maximum value the retry interval can reach. Do you know what value the retry interval starts at? And how the backoff is incremented? Is it doubled each time, etc.?
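
To make the question concrete, here is a minimal sketch of what a capped exponential backoff would look like with the values in this PR. It is not gitlab-runner's actual implementation: the 500 ms starting interval is my reading of the linked docs, the doubling factor and the absence of jitter are assumptions, and `backoff_schedule` is a hypothetical helper name.

```python
def backoff_schedule(retries, floor_ms=500, cap_ms=30000, factor=2):
    """Return the delay (in ms) waited before each retry.

    Assumptions (not confirmed against gitlab-runner's source):
    - the first delay equals floor_ms (500 ms),
    - each subsequent delay is multiplied by `factor` (doubling),
    - every delay is clamped to cap_ms (retry_backoff_max),
    - no random jitter is applied.
    """
    delays = []
    delay = floor_ms
    for _ in range(retries):
        delays.append(min(delay, cap_ms))
        delay *= factor
    return delays


if __name__ == "__main__":
    for limit in (5, 10):  # retry_limit and the "tls: internal error" override
        schedule = backoff_schedule(limit)
        print(limit, schedule, f"total={sum(schedule) / 1000:.1f}s")
```

Under these assumptions, the default `retry_limit = 5` would wait about 15.5 seconds in total and never reach the 30-second cap, while the 10-retry override would hit the cap on the seventh retry and wait about 151.5 seconds in total.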
Docs for these two settings: https://docs.gitlab.com/runner/executors/kubernetes/#configure-the-number-of-request-attempts-to-the-kubernetes-api
Job system failures like this one, i.e. an error that looks like `error dialing backend: remote error: tls: internal error`, indicate that the pipeline pod failed to receive a response from the k8s/EKS API server. It's still unclear why this is happening, but one potential explanation is that the default timeout for EKS API requests (2 seconds) is getting exceeded.

Long term, I would like to set up https://docs.aws.amazon.com/eks/latest/best-practices/control_plane_monitoring.html so we can get more insight into what's going on with the control plane.
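
For reviewers, a rough sketch of how the per-error override in `retry_limits` is expected to interact with the default `retry_limit`. This assumes the runner matches each key as a substring of the error message, which I have not verified against the gitlab-runner source; `retry_limit_for` is a hypothetical helper for illustration only.

```python
def retry_limit_for(error_message, default_limit, overrides):
    """Pick the retry limit for an error message.

    Assumption: an override applies when its key appears anywhere in the
    error message (substring match). Falls back to default_limit otherwise.
    """
    for pattern, limit in overrides.items():
        if pattern in error_message:
            return limit
    return default_limit


# Values taken from the config in this PR.
DEFAULT_RETRY_LIMIT = 5
RETRY_LIMITS = {"tls: internal error": 10}

tls_error = "error dialing backend: remote error: tls: internal error"
other_error = "connection refused"  # hypothetical example of an unrelated error

print(retry_limit_for(tls_error, DEFAULT_RETRY_LIMIT, RETRY_LIMITS))
print(retry_limit_for(other_error, DEFAULT_RETRY_LIMIT, RETRY_LIMITS))
```

If the substring assumption holds, the full `error dialing backend: remote error: tls: internal error` message would get 10 retries and everything else would keep the default of 5.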