template: wait for pod to teardown (if container is present) during delete #353
base: master
Conversation
Force-pushed from 8f8befa to 3e3d553

Force-pushed from 3e3d553 to 37218ef
Created a better version of this:
I didn't get a chance to test this yet.
petr-muller left a comment
The structure is better, but I think the fundamental problem stays (more inline).
Plus, I caught myself wondering - what problem does this actually solve? If we wait for a pod to be deleted, what good is waiting for its containers to terminate? Can you describe the problem that this PR would prevent?
ci-operator would wait for only 300 seconds. If teardown didn't finish by that time, the pod would be removed: leftover artifacts (usually Route53 records) would remain and cause issues on the next retest.
/hold I can't come up with a way to test this yet
/cc @stevekuznetsov |
stevekuznetsov left a comment
I think I'm missing something here -- the pod actually being deleted and gone from the API server is a stronger requirement than the teardown container inside of it being terminated. Why are we making this change?
When a test gets cancelled (a new commit pushed in rehearse tests, for instance), ci-operator sends a termination signal and waits 5 minutes for the pod to be gone. In most install tests, teardown plus artifacts take longer than 5 minutes. This change would wait longer if the pod has a teardown container. See also #353 (comment).
OK, makes sense. Why not start a watch?
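For illustration, a minimal sketch of the watch-based approach being suggested here, not the PR's actual code. It assumes the client-go vintage used elsewhere in this PR (where `podClient.Watch` takes `ListOptions` directly and `watch.Until` takes a timeout, a watcher, and a condition); the package and function names are hypothetical.

```go
package steps

import (
	"time"

	meta "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/apimachinery/pkg/watch"
	coreclientset "k8s.io/client-go/kubernetes/typed/core/v1"
)

// waitForPodGone is a sketch only: wait for a single pod to be deleted using a
// watch instead of a poll.
func waitForPodGone(podClient coreclientset.PodInterface, name string, timeout time.Duration) error {
	// Watch only the pod we care about.
	watcher, err := podClient.Watch(meta.ListOptions{
		FieldSelector: fields.OneTermEqualSelector("metadata.name", name).String(),
	})
	if err != nil {
		return err
	}
	defer watcher.Stop()

	// watch.Until returns once the condition reports done or the timeout expires.
	_, err = watch.Until(timeout, watcher, func(event watch.Event) (bool, error) {
		return event.Type == watch.Deleted, nil
	})
	return err
}
```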
Force-pushed from abb16d4 to 3448245
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: vrutkovs

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Force-pushed from 3448245 to 3211bf0
Reworked this to leverage polls and watches:
Force-pushed from 54c3b8a to 4993c4a

Force-pushed from 71b5c95 to 1f2f01f

Force-pushed from 1f2f01f to b879a2a

Force-pushed from b879a2a to 6adf813
LGTM, let's give @stevekuznetsov a chance to review
stevekuznetsov left a comment
Can we add unit tests for these functions?
pkg/steps/template.go (outdated)

```go
	return nil
}

// Check that pod with this name exists and has the same UID
```
this is misleading
pkg/steps/template.go (outdated)

```go
	time.Sleep(2 * time.Second)

	for _, status := range append(append([]coreapi.ContainerStatus{}, pod.Status.InitContainerStatuses...), pod.Status.ContainerStatuses...) {
		if status.Name == "teardown" && status.State.Terminated != nil {
```
Is teardown ever an initcontainer?
pkg/steps/template.go (outdated)

```go
	timeout := 5 * time.Minute

	log.Printf("Waiting for pod %s to complete teardown ...", name)
	wait.Poll(10*time.Second, timeout, func() (done bool, err error) {
```
We used to poll every 2s -- why change?
pkg/steps/template.go (outdated)

```go
func waitForPodDeletion(podClient coreclientset.PodInterface, name string, uid types.UID) error {
	timeout := 5 * time.Minute
	pod, err := checkPodExistsAndValid(podClient, name, uid)
	if err != nil || pod == nil {
```
In the case that err == nil but pod == nil, why do you return a nil err here? Please leave a comment.
pkg/steps/template.go (outdated)

```go
	return pod, nil
}

func waitForPodDeletion(podClient coreclientset.PodInterface, name string, uid types.UID) error {
```
waitForPodDeletion is no longer valid as a name -- were all callers expecting this new behavior?
```go
	}
}

	watcher, err := podClient.Watch(meta.ListOptions{
```
If you're setting up a watch, why not just use it for all of the interaction? Why the poll?
I don't think container terminate status can be watched, can it?
Why not? You'd get any changes to PodStatus if I understand Watches correctly
(watch the Pod, not the container)
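To make the suggestion concrete, a hedged sketch of what "watch the Pod, not the container" could look like, reusing a `watcher` and `timeout` set up as in the earlier sketch; `coreapi` is `k8s.io/api/core/v1`, and "teardown" is the container name this template already checks for.

```go
// Each Added/Modified event delivers the full Pod object, so container state
// (including termination of a container named "teardown") is visible from the
// pod watch without any extra polling. Sketch only.
teardownDone := func(event watch.Event) (bool, error) {
	pod, ok := event.Object.(*coreapi.Pod)
	if !ok {
		return false, nil
	}
	for _, status := range pod.Status.ContainerStatuses {
		if status.Name == "teardown" && status.State.Terminated != nil {
			return true, nil // the teardown container has exited
		}
	}
	return false, nil
}
_, err = watch.Until(timeout, watcher, teardownDone)
```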
```go
	return fmt.Errorf("waited for pod %s deletion for %ds, was not deleted", name, timeout)
	log.Printf("Waiting for pod %s to be deleted in %d seconds", name, timeout)
	_, err = watch.Until(timeout, watcher, func(event watch.Event) (done bool, err error) {
```
Why a 5-minute watch for deletion after a 5-minute retry on the container step?
Artifacts upload also takes time to complete.
pkg/steps/template.go (outdated)

```go
	for _, status := range append(append([]coreapi.ContainerStatus{}, pod.Status.InitContainerStatuses...), pod.Status.ContainerStatuses...) {
		names = append(names, status.Name)
	}
	sort.Strings(names)
```
Why?
pkg/steps/template.go (outdated)

```go
	// Attempts to wait for teardown to complete
	containerNames := podContainerNames(pod)
	if sort.SearchStrings(containerNames, "teardown") < len(containerNames) {
```
nit: I like `if sets.NewString(containerNames).Has("teardown")` a lot more than these types of manipulations
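For reference, a sketch of the suggested check (with the variadic spread that the inline shorthand omits), using `k8s.io/apimachinery/pkg/util/sets`:

```go
// Replaces the sort.SearchStrings manipulation with a simple set membership test.
if sets.NewString(containerNames...).Has("teardown") {
	// the pod has a teardown container; wait for it before waiting for deletion
}
```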
Force-pushed from f758b7d to 697f542
Simplified this:
pkg/steps/template.go (outdated)

```go
		// pod was deleted
		return true, nil
	case watch.Added, watch.Modified:
		if hasTeardownContainer {
```
I don't understand this logic. If we have a teardown container, we will exit out early every time. If we don't have a teardown container, we set this boolean to false. The comment says that will avoid re-checking, but in reality that means we check every time. Then, if the teardown container is terminated, you signal you are done, so the watch ends. I think we just need a dead-simple watch, or two watches. If you want to have one watch with a variable timeout, just wait for the deletion. If you want to wait for the teardown container completion and the pod deletion separately, you will want separate watches.
In general, if the issue was a too-short timeout that cut the teardown container short, why not just make this watch go on for an hour? In what cases do we not want to wait for the Pod to really be gone?
Also, if you look at the implementations in the build utils, we want a list, then a watch with retries, to handle transient errors.
> If we have a teardown container, we will exit out early every time

Fixed by introducing a teardownFinished var.

> If you want to wait for the teardown container completion and the pod deletion separately, you will want separate watches.

That was my initial idea (see f758b7d); however, there is a short window (between the teardown watch and the pod-deletion watch) where the pod may be destroyed and replaced with a new pod. Two watches don't seem reliable to me.

> why not just make this watch go on for an hour? In what cases do we not want to wait for the Pod to really be gone?

That would hide potential issues in teardown.

> we want a list then a watch with retries to handle transient errors.

Using `event, ok := <-watcher.ResultChan()`? It doesn't seem to have any kind of timeout.
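For context, a hedged sketch of the single-watch flow described here (the actual code in 2a10196 may differ): one condition that first flips a `teardownFinished` flag when the teardown container terminates, and only reports done once the pod is deleted. `hasTeardownContainer`, `watcher`, and `timeout` are assumed to be set up as in the earlier sketches.

```go
// Sketch only: one watch covers both phases, so there is no window between
// separate watches in which the pod could be replaced.
teardownFinished := !hasTeardownContainer // nothing to wait for if the container is absent
_, err = watch.Until(timeout, watcher, func(event watch.Event) (bool, error) {
	switch event.Type {
	case watch.Deleted:
		// The pod is gone from the API server; that is always sufficient to finish.
		return true, nil
	case watch.Added, watch.Modified:
		if teardownFinished {
			return false, nil // already saw teardown exit; keep waiting for deletion
		}
		pod, ok := event.Object.(*coreapi.Pod)
		if !ok {
			return false, nil
		}
		for _, status := range pod.Status.ContainerStatuses {
			if status.Name == "teardown" && status.State.Terminated != nil {
				teardownFinished = true
			}
		}
	}
	return false, nil
})
return err
```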
…elete

Pod teardown may take longer than 5 mins (default ci-operator timeout). This commit would ensure the timeout is extended to wait for teardown container to complete.

This is useful for rehearse jobs, which reuse the namespace when testing a new commit.
Force-pushed from 697f542 to 2a10196
Which? Before we spend more time working on an implementation, can we determine why the (stupid simple) approach of doing more retries over a 10-, 20-, or 30-minute period would not be appropriate? Do we have some SLA for teardown time?
Extending the timeout is the simplest approach, and it's valid; however, it would apply to all ci-operator pods. e2e-aws's teardown is the only one I know of that takes longer than 5 minutes at the moment; other types of tests may rely on the existing timeout. This PR is just one possible way, of course. If it looks overcomplicated, then let's just bump the timeout on teardown to fix rehearse failures at least.
Of course it would hit all pods, but we poll every 2 seconds right now, so the only case where increasing the timeout would actually increase the time taken for the test to run is when the pod is not gone within the current timeout, and then it would only increase it by the time taken to finish tearing down, right?
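Illustratively, the "stupid simple" alternative being argued for keeps the existing 2-second poll but with a much larger ceiling. The 30-minute figure and the surrounding shape below are placeholders, not values from this PR or from #358; `wait` is `k8s.io/apimachinery/pkg/util/wait` and `errors` is `k8s.io/apimachinery/pkg/api/errors`.

```go
// Sketch only: the poll still returns within ~2s of the pod disappearing, so a
// bigger ceiling only costs extra time when the pod genuinely is not gone yet.
err := wait.PollImmediate(2*time.Second, 30*time.Minute, func() (bool, error) {
	_, getErr := podClient.Get(name, meta.GetOptions{})
	if errors.IsNotFound(getErr) {
		return true, nil // pod is gone
	}
	return false, getErr // keep polling on nil, abort on unexpected errors
})
```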
Right, it appears a larger timeout would act the same. Created #358 instead. I'll keep this open for now.
Pod teardown may take longer than 5 mins (default ci-operator timeout).
This commit would ensure the same timeout is applied to the teardown
container - and then applied to the pod again.
This is useful for rehearse jobs, which reuse the namespace when testing
a new commit.
TODO:
Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1707486