cmd/ci-operator: Add --grace-period #264
base: master
Conversation
Because while two seconds may be sufficient teardown time for unit tests and such, it's too short for e2e teardown (where cluster logs should be collected, cluster resources need to be reaped, and collected assets need to be uploaded). The two-second timeout is from c2ac3ab (When interrupted, mark any in progress pods/templates as deleted, 2018-05-04, openshift#16). And actually, I'm not sure how the 2-second default worked even for unit tests and such, because Kubernetes pods have a 30-second grace period by default [1].

A recent aborted e2e job that did not run teardown to completion or collect any cluster assets is [2,3]. A recent successful e2e job showing current teardown timing is [4,5]. From [4]:

    2019/02/14 18:19:54 Container setup in pod e2e-aws completed successfully
    2019/02/14 18:46:08 Container test in pod e2e-aws completed successfully
    2019/02/14 18:51:38 Container teardown in pod e2e-aws completed successfully

So 5.5 minutes to teardown. And from [5]:

    time="2019-02-14T18:47:40Z" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"openshiftClusterID\":\"1341ccc2-66ac-46bb-ae15-0295d4a126ba\"}"
    ...
    time="2019-02-14T18:51:38Z" level=debug msg="Purging asset \"Cluster\" from disk"

So about 4 minutes of that is resource reaping (with the previous 1.5 minutes being log collection).

[1]: https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods
[2]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1255/pull-ci-openshift-installer-master-e2e-aws/3707/build-log.txt
[3]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_installer/1255/pull-ci-openshift-installer-master-e2e-aws/3707/artifacts/
[4]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1045/pull-ci-openshift-installer-master-e2e-aws/3706/build-log.txt
[5]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1045/pull-ci-openshift-installer-master-e2e-aws/3706/artifacts/e2e-aws/installer/.openshift_install.log
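As a rough illustration of the idea behind the option (not ci-operator's actual code; the handler wiring and the markPodsDeleted helper below are hypothetical), the point is to make the post-interrupt wait configurable instead of hard-coding two seconds:

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// Hypothetical flag wiring; the default shown here is the current
	// hard-coded two seconds that this PR argues is too short for e2e jobs.
	gracePeriod := flag.Duration("grace-period", 2*time.Second,
		"how long to wait for pod/template teardown after an interrupt before exiting")
	flag.Parse()

	interrupt := make(chan os.Signal, 1)
	signal.Notify(interrupt, syscall.SIGINT, syscall.SIGTERM)

	go func() {
		<-interrupt
		fmt.Fprintln(os.Stderr, "interrupt received, marking pods/templates as deleted")
		// markPodsDeleted() // hypothetical: this is where deletions would be issued
		// With no completion signal from the deleter, the handler can only
		// sleep for the configured grace period before the process exits.
		time.Sleep(*gracePeriod)
		os.Exit(1)
	}()

	// ... normal job execution would run here ...
	select {}
}
```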
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: wking. If they are not already assigned, you can assign the PR to them by writing the appropriate command in a comment. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing the appropriate command in a comment.
Force-pushed from 1f42ba4 to 434ef85
ci-operator invocation is part of Prow job configuration, which is roughly 90% Prowgen-generated and 10% hand-crafted. Prowgen already has separate paths for container-based jobs and template-based jobs, so it could generate different parameters for both types. Or as you suggest, we could make this configurable in config, but then people would need to know what to set there, and it would probably end up very similar to "containers->short, template->long".
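For illustration only (a hypothetical helper, not Prowgen's real API, and the durations are illustrative rather than agreed defaults), the generator could pick the argument per job type roughly like this:

```go
package main

import "fmt"

// ciOperatorArgs sketches how a generator like Prowgen could choose a
// different grace period for container-based vs template-based jobs.
func ciOperatorArgs(usesTemplate bool) []string {
	args := []string{ /* ...the other ci-operator flags the job needs... */ }
	if usesTemplate {
		// Template-based e2e jobs need time on abort to gather cluster logs,
		// reap cloud resources, and upload collected assets.
		return append(args, "--grace-period=6m")
	}
	// Container-based jobs (unit tests, builds) can exit much sooner.
	return append(args, "--grace-period=15s")
}

func main() {
	fmt.Println(ciOperatorArgs(true))  // [--grace-period=6m]
	fmt.Println(ciOperatorArgs(false)) // [--grace-period=15s]
}
```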
/hold Will review today, I also wanted this but there are some subtleties that can be an issue |
stevekuznetsov left a comment:
Is a globally long period bad? Why does this need to be configurable per job?
If you use CI for fixing your bugs, you may not want to have to wait 7 minutes (or whatever we want to use to accommodate e2e) after a force-push before your aborted unit test job cleans up and you can launch the new one. I'm actually not sure how Prow handles triggering a new round after an abort. Does it hold jobs on the new tip until the old jobs have all been reaped?
Isn't that just the grace period though, so if your test exited faster we'd move on to cleanup faster, right?
No, because there's not currently a channel from the pod-deletion code back to the handler manager, so it just sleeps to give the pod-deleter the time it might need.
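A minimal sketch of the difference (illustrative names, not the current ci-operator code): with a completion channel from the pod deleter, the grace period would become an upper bound rather than a fixed sleep.

```go
package main

import "time"

// waitForCleanup sketches the channel-based alternative: return as soon as the
// pod deleter finishes, or after the grace period, whichever comes first.
// Today's behavior is effectively just time.Sleep(gracePeriod).
func waitForCleanup(deletePods func(), gracePeriod time.Duration) {
	done := make(chan struct{})
	go func() {
		deletePods() // issue the deletions and wait for them to complete
		close(done)
	}()
	select {
	case <-done: // cleanup finished early; exit immediately
	case <-time.After(gracePeriod): // otherwise give up after the grace period
	}
}

func main() {
	waitForCleanup(func() { time.Sleep(500 * time.Millisecond) }, 5*time.Second)
}
```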
How does that work with recycling jobs on the same namespace? Maybe if someone runs …
Wow, here I am learning about how …
Uh -- individual jobs will be killed, yes. But there is no interdependency between jobs. When a ProwJob is created, if a ProwJob already exists that is running for the same org/repo#pull and started earlier, we issue a pod deletion... does default k8s client do blocking deletions? If not, this might be problematic and we may have multiple pods running.
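On the blocking question: as far as I know, client-go's pod Delete returns as soon as the API server accepts the request, and the pod then terminates asynchronously within its grace period, so anything that needs the pod to actually be gone has to wait explicitly. A hedged sketch using current client-go signatures (the actual Prow/plank code may do this differently):

```go
package podutil

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// deleteAndWait issues a pod deletion (non-blocking: the call returns once the
// API server accepts it) and then polls until the pod object is actually gone.
func deleteAndWait(ctx context.Context, client kubernetes.Interface, ns, name string, timeout time.Duration) error {
	if err := client.CoreV1().Pods(ns).Delete(ctx, name, metav1.DeleteOptions{}); err != nil && !apierrors.IsNotFound(err) {
		return err
	}
	return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		_, err := client.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return true, nil // pod fully removed
		}
		return false, err // keep polling while the pod still exists
	})
}
```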
@smarterclayton this is outstanding for a review
No need for review yet, I was going to reroll to get better abort linking between the goroutines first.
Note that we don't have to kill the pod with a grace period or wait for it to exit for teardown to happen. We "fire and forget" pod deletion because the pod itself manages teardown within its grace period. It is possible for a pod being shut down to be evicted, but it should be vanishingly rare.
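For completeness, a minimal sketch of that fire-and-forget deletion with an explicit grace period, again assuming current client-go signatures rather than quoting the actual Prow code:

```go
package podutil

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deletePodFireAndForget asks the API server to delete the pod and returns
// immediately; the pod's containers then have gracePeriodSeconds to finish
// their own teardown (log collection, resource reaping, artifact upload)
// before the kubelet kills them.
func deletePodFireAndForget(ctx context.Context, client kubernetes.Interface, ns, name string, gracePeriodSeconds int64) error {
	return client.CoreV1().Pods(ns).Delete(ctx, name, metav1.DeleteOptions{
		GracePeriodSeconds: &gracePeriodSeconds,
	})
}
```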
@wking: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I'm still not sure where we set up the ci-operator invocation, which is where we'd set the new command-line option. And if we aren't comfortable setting this to six minutes or similar for all jobs, we may want to drop --grace-period and instead load this from the job's YAML configuration. @smarterclayton, thoughts?
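If the flag were dropped in favor of per-job configuration, a hedged sketch of what the Go side could look like (the field name and fallback are hypothetical, not an existing ci-operator option, and real config would also need its own duration parsing):

```go
package config

import "time"

// JobConfig sketches a hypothetical per-job field for the teardown grace
// period. A zero value could fall back to a global default (today's two
// seconds, or whatever a --grace-period flag would provide).
type JobConfig struct {
	GracePeriod time.Duration `json:"grace_period,omitempty"`
}

// gracePeriodFor picks the per-job value when set, otherwise the default.
func gracePeriodFor(job JobConfig, defaultPeriod time.Duration) time.Duration {
	if job.GracePeriod > 0 {
		return job.GracePeriod
	}
	return defaultPeriod
}
```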