template: wait for pod to teardown (if container is present) during delete #353
base: master
Conversation
Force-pushed from 8f8befa to 3e3d553

Force-pushed from 3e3d553 to 37218ef
Created a better version of this:
I didn't get a chance to test this yet.
petr-muller left a comment
The structure is better, but I think the fundamental problem stays (more inline).
Plus, I caught myself wondering - what problem does this actually solve? If we wait for a pod to be deleted, what good is waiting for its containers to terminate? Can you describe the problem that this PR would prevent?
ci-operator would wait for only 300 seconds. If teardown didn't finish by that time, the pod would be removed: leftover artifacts (usually Route53 records) would remain and cause issues on the next retest.
/hold I can't come up with a way to test this yet
/cc @stevekuznetsov |
stevekuznetsov left a comment
I think I'm missing something here -- the pod actually being deleted and gone from the API server is a stronger requirement than the teardown container inside of it being terminated. Why are we making this change?
When a test gets cancelled (a new commit pushed in rehearse tests, for instance), ci-operator sends a termination signal and waits 5 minutes for the pod to be gone. In most install tests, teardown plus artifacts take longer than 5 minutes. This change would wait longer if the pod has a teardown container. See also #353 (comment).
OK, makes sense. Why not start a watch?
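For illustration, a minimal sketch of the watch-based approach being suggested here, not the PR's actual code. It assumes the client-go vintage used elsewhere in this PR (where `podClient.Watch` takes `ListOptions` directly and `watch.Until` takes a timeout, a watcher, and a condition); the package and function names are hypothetical.

```go
package steps

import (
	"time"

	meta "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/apimachinery/pkg/watch"
	coreclientset "k8s.io/client-go/kubernetes/typed/core/v1"
)

// waitForPodGone is a sketch only: wait for a single pod to be deleted using a
// watch instead of a poll.
func waitForPodGone(podClient coreclientset.PodInterface, name string, timeout time.Duration) error {
	// Watch only the pod we care about.
	watcher, err := podClient.Watch(meta.ListOptions{
		FieldSelector: fields.OneTermEqualSelector("metadata.name", name).String(),
	})
	if err != nil {
		return err
	}
	defer watcher.Stop()

	// watch.Until returns once the condition reports done or the timeout expires.
	_, err = watch.Until(timeout, watcher, func(event watch.Event) (bool, error) {
		return event.Type == watch.Deleted, nil
	})
	return err
}
```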
Force-pushed from abb16d4 to 3448245
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: vrutkovs

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Force-pushed from 3448245 to 3211bf0
Reworked this to leverage polls and watches:
Force-pushed from 54c3b8a to 4993c4a

Force-pushed from 71b5c95 to 1f2f01f

Force-pushed from 1f2f01f to b879a2a

Force-pushed from b879a2a to 6adf813
LGTM, let's give @stevekuznetsov a chance to review
stevekuznetsov left a comment
Can we add unit tests for these functions?
pkg/steps/template.go (outdated)

```go
	return nil
}

// Check that pod with this name exists and has the same UID
```
this is misleading
pkg/steps/template.go (outdated)

```go
	time.Sleep(2 * time.Second)

	for _, status := range append(append([]coreapi.ContainerStatus{}, pod.Status.InitContainerStatuses...), pod.Status.ContainerStatuses...) {
		if status.Name == "teardown" && status.State.Terminated != nil {
```
Is teardown ever an initcontainer?
pkg/steps/template.go (outdated)

```go
	timeout := 5 * time.Minute

	log.Printf("Waiting for pod %s to complete teardown ...", name)
	wait.Poll(10*time.Second, timeout, func() (done bool, err error) {
```
We used to poll every 2s -- why change?
pkg/steps/template.go (outdated)

```go
func waitForPodDeletion(podClient coreclientset.PodInterface, name string, uid types.UID) error {
	timeout := 5 * time.Minute
	pod, err := checkPodExistsAndValid(podClient, name, uid)
	if err != nil || pod == nil {
```
In the case that err == nil but pod == nil, why do you return a nil err here? Please leave a comment.
pkg/steps/template.go (outdated)

```go
	return pod, nil
}

func waitForPodDeletion(podClient coreclientset.PodInterface, name string, uid types.UID) error {
```
waitForPodDeletion is no longer valid as a name -- were all callers expecting this new behavior?
```go
	}
}

	watcher, err := podClient.Watch(meta.ListOptions{
```
If you're setting up a watch, why not just use it for all of the interaction? Why the poll?
I don't think container terminate status can be watched, can it?
Why not? You'd get any changes to PodStatus if I understand Watches correctly
(watch the Pod, not the container)
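To make the suggestion concrete, a hedged sketch of what "watch the Pod, not the container" could look like, reusing a `watcher` and `timeout` set up as in the earlier sketch; `coreapi` is `k8s.io/api/core/v1`, and "teardown" is the container name this template already checks for.

```go
// Each Added/Modified event delivers the full Pod object, so container state
// (including termination of a container named "teardown") is visible from the
// pod watch without any extra polling. Sketch only.
teardownDone := func(event watch.Event) (bool, error) {
	pod, ok := event.Object.(*coreapi.Pod)
	if !ok {
		return false, nil
	}
	for _, status := range pod.Status.ContainerStatuses {
		if status.Name == "teardown" && status.State.Terminated != nil {
			return true, nil // the teardown container has exited
		}
	}
	return false, nil
}
_, err = watch.Until(timeout, watcher, teardownDone)
```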
```go
	return fmt.Errorf("waited for pod %s deletion for %ds, was not deleted", name, timeout)
	log.Printf("Waiting for pod %s to be deleted in %d seconds", name, timeout)
	_, err = watch.Until(timeout, watcher, func(event watch.Event) (done bool, err error) {
```
Why a 5-minute watch for deletion after a 5-minute retry on the container step?
Artifacts upload also takes time to complete.
pkg/steps/template.go (outdated)

```go
	for _, status := range append(append([]coreapi.ContainerStatus{}, pod.Status.InitContainerStatuses...), pod.Status.ContainerStatuses...) {
		names = append(names, status.Name)
	}
	sort.Strings(names)
```
Why?
pkg/steps/template.go (outdated)

```go
	// Attempts to wait for teardown to complete
	containerNames := podContainerNames(pod)
	if sort.SearchStrings(containerNames, "teardown") < len(containerNames) {
```
nit: I like `if sets.NewString(containerNames).Has("teardown")` a lot more than these types of manipulations
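For reference, a sketch of the suggested check (with the variadic spread that the inline shorthand omits), using `k8s.io/apimachinery/pkg/util/sets`:

```go
// Replaces the sort.SearchStrings manipulation with a simple set membership test.
if sets.NewString(containerNames...).Has("teardown") {
	// the pod has a teardown container; wait for it before waiting for deletion
}
```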
Force-pushed from f758b7d to 697f542
Simplified this:
pkg/steps/template.go (outdated)

```go
		// pod was deleted
		return true, nil
	case watch.Added, watch.Modified:
		if hasTeardownContainer {
```
I don't understand this logic. If we have a teardown container, we will exit out early every time. If we don't have a teardown container, we set this boolean to false. The comment says that will avoid re-checking, but in reality that means we check every time. Then, if the teardown container is terminated, you signal you are done, so the watch ends. I think we just need a dead-simple watch, or two watches. If you want to have one watch with a variable timeout, just wait for the deletion. If you want to wait for the teardown container completion and the pod deletion separately, you will want separate watches.
In general, if the issue was a too-short timeout that cut the teardown container short, why not just make this watch go on for an hour? In what cases do we not want to wait for the Pod to really be gone?
Also, if you look at the implementations in the build utils, we want a list, then a watch with retries, to handle transient errors.
> If we have a teardown container, we will exit out early every time

Fixed by introducing a teardownFinished var.

> If you want to wait for the teardown container completion and the pod deletion separately, you will want separate watches.

That was my initial idea (see f758b7d); however, there is a short window (between the teardown watch and the pod-deletion watch) where the pod may be destroyed and replaced with a new pod. Two watches don't seem reliable to me.

> why not just make this watch go on for an hour? In what cases do we not want to wait for the Pod to really be gone?

That would hide potential issues in teardown.

> we want a list then a watch with retries to handle transient errors.

Using `event, ok := <-watcher.ResultChan()`? It doesn't seem to have any kind of timeout.
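For context, a hedged sketch of the single-watch flow described here (the actual code in 2a10196 may differ): one condition that first flips a `teardownFinished` flag when the teardown container terminates, and only reports done once the pod is deleted. `hasTeardownContainer`, `watcher`, and `timeout` are assumed to be set up as in the earlier sketches.

```go
// Sketch only: one watch covers both phases, so there is no window between
// separate watches in which the pod could be replaced.
teardownFinished := !hasTeardownContainer // nothing to wait for if the container is absent
_, err = watch.Until(timeout, watcher, func(event watch.Event) (bool, error) {
	switch event.Type {
	case watch.Deleted:
		// The pod is gone from the API server; that is always sufficient to finish.
		return true, nil
	case watch.Added, watch.Modified:
		if teardownFinished {
			return false, nil // already saw teardown exit; keep waiting for deletion
		}
		pod, ok := event.Object.(*coreapi.Pod)
		if !ok {
			return false, nil
		}
		for _, status := range pod.Status.ContainerStatuses {
			if status.Name == "teardown" && status.State.Terminated != nil {
				teardownFinished = true
			}
		}
	}
	return false, nil
})
return err
```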
…elete

Pod teardown may take longer than 5 mins (default ci-operator timeout). This commit would ensure the timeout is extended to wait for teardown container to complete.

This is useful for rehearse jobs, which reuse the namespace when testing a new commit.
Force-pushed from 697f542 to 2a10196
Which? Before we spend more time working on an implementation, can we determine why the (stupid simple) approach of doing more retries over a 10-, 20-, or 30-minute period would not be appropriate? Do we have some SLA for teardown time?
Extending the timeout is the simplest approach, and it's valid; however, it would apply to all ci-operator pods. e2e-aws's teardown is the only one I know of that takes longer than 5 minutes at the moment; other types of tests may rely on the existing timeout. This PR is just one possible way, of course. If it looks overcomplicated, then let's just bump the timeout on teardown to fix rehearse failures at least.
Of course it would hit all pods, but we poll every 2 seconds right now, so the only case where increasing the timeout would actually increase the time taken for the test to run is when the pod is not gone within the current timeout, and then it would only increase it by the time taken to finish tearing down, right?
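Illustratively, the "stupid simple" alternative being argued for keeps the existing 2-second poll but with a much larger ceiling. The 30-minute figure and the surrounding shape below are placeholders, not values from this PR or from #358; `wait` is `k8s.io/apimachinery/pkg/util/wait` and `errors` is `k8s.io/apimachinery/pkg/api/errors`.

```go
// Sketch only: the poll still returns within ~2s of the pod disappearing, so a
// bigger ceiling only costs extra time when the pod genuinely is not gone yet.
err := wait.PollImmediate(2*time.Second, 30*time.Minute, func() (bool, error) {
	_, getErr := podClient.Get(name, meta.GetOptions{})
	if errors.IsNotFound(getErr) {
		return true, nil // pod is gone
	}
	return false, getErr // keep polling on nil, abort on unexpected errors
})
```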
Right, it appears a larger timeout would act the same. Created #358 instead. I'll keep this open for now.
Pod teardown may take longer than 5 mins (default ci-operator timeout).
This commit would ensure the same timeout is applied to the teardown
container - and then applied to the pod again.
This is useful for rehearse jobs, which reuse the namespace when testing
a new commit.
TODO:
Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1707486