cmd/ci-operator: Add --grace-period #264
base: master
Conversation
Because while two seconds may be sufficient teardown time for unit tests and such, it's too short for e2e teardown (where cluster logs should be collected, cluster resources need to be reaped, and collected assets need to be uploaded). The two-second timeout is from c2ac3ab (When interrupted, mark any in progress pods/templates as deleted, 2018-05-04, openshift#16). And actually, I'm not sure how the 2-second default worked even for unit tests and such, because Kubernetes pods have a 30-second grace period by default [1].

A recent aborted e2e job that did not run teardown to completion or collect any cluster assets is [2,3]. A recent successful e2e job showing current teardown timing is [4,5]. From [4]:

    2019/02/14 18:19:54 Container setup in pod e2e-aws completed successfully
    2019/02/14 18:46:08 Container test in pod e2e-aws completed successfully
    2019/02/14 18:51:38 Container teardown in pod e2e-aws completed successfully

So 5.5 minutes to teardown. And from [5]:

    time="2019-02-14T18:47:40Z" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"openshiftClusterID\":\"1341ccc2-66ac-46bb-ae15-0295d4a126ba\"}"
    ...
    time="2019-02-14T18:51:38Z" level=debug msg="Purging asset \"Cluster\" from disk"

So about 4 minutes of that is resource reaping (with the previous 1.5 minutes being log collection).

[1]: https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods
[2]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1255/pull-ci-openshift-installer-master-e2e-aws/3707/build-log.txt
[3]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_installer/1255/pull-ci-openshift-installer-master-e2e-aws/3707/artifacts/
[4]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1045/pull-ci-openshift-installer-master-e2e-aws/3706/build-log.txt
[5]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1045/pull-ci-openshift-installer-master-e2e-aws/3706/artifacts/e2e-aws/installer/.openshift_install.log
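As a rough illustration of the idea behind the option (not ci-operator's actual code; the handler wiring and the markPodsDeleted helper below are hypothetical), the point is to make the post-interrupt wait configurable instead of hard-coding two seconds:

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// Hypothetical flag wiring; the default shown here is the current
	// hard-coded two seconds that this PR argues is too short for e2e jobs.
	gracePeriod := flag.Duration("grace-period", 2*time.Second,
		"how long to wait for pod/template teardown after an interrupt before exiting")
	flag.Parse()

	interrupt := make(chan os.Signal, 1)
	signal.Notify(interrupt, syscall.SIGINT, syscall.SIGTERM)

	go func() {
		<-interrupt
		fmt.Fprintln(os.Stderr, "interrupt received, marking pods/templates as deleted")
		// markPodsDeleted() // hypothetical: this is where deletions would be issued
		// With no completion signal from the deleter, the handler can only
		// sleep for the configured grace period before the process exits.
		time.Sleep(*gracePeriod)
		os.Exit(1)
	}()

	// ... normal job execution would run here ...
	select {}
}
```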
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: wking. If they are not already assigned, you can assign the PR to them by writing the appropriate command in a comment. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing the appropriate command in a comment.
Force-pushed from 1f42ba4 to 434ef85
ci-operator invocation is part of Prow job configuration, which is roughly 90% Prowgen-generated and 10% hand-crafted. Prowgen already has separate paths for container-based jobs and template-based jobs, so it could generate different parameters for both types. Or as you suggest, we could make this configurable in config, but then people would need to know what to set there, and it would probably end up very similar to "containers->short, template->long".
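For illustration only (a hypothetical helper, not Prowgen's real API, and the durations are illustrative rather than agreed defaults), the generator could pick the argument per job type roughly like this:

```go
package main

import "fmt"

// ciOperatorArgs sketches how a generator like Prowgen could choose a
// different grace period for container-based vs template-based jobs.
func ciOperatorArgs(usesTemplate bool) []string {
	args := []string{ /* ...the other ci-operator flags the job needs... */ }
	if usesTemplate {
		// Template-based e2e jobs need time on abort to gather cluster logs,
		// reap cloud resources, and upload collected assets.
		return append(args, "--grace-period=6m")
	}
	// Container-based jobs (unit tests, builds) can exit much sooner.
	return append(args, "--grace-period=15s")
}

func main() {
	fmt.Println(ciOperatorArgs(true))  // [--grace-period=6m]
	fmt.Println(ciOperatorArgs(false)) // [--grace-period=15s]
}
```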
/hold Will review today, I also wanted this but there are some subtleties that can be an issue |
stevekuznetsov left a comment:
Is a globally long period bad? Why does this need to be configurable per job?
If you use CI for fixing your bugs, you may not want to have to wait 7 minutes (or whatever we want to use to accommodate e2e) after a force-push before your aborted unit test job cleans up and you can launch the new one. I'm actually not sure how Prow handles triggering a new round after an abort. Does it hold jobs on the new tip until the old jobs have all been reaped?
Isn't that just the grace period though, so if your test exited faster we'd move on to cleanup faster, right?
No, because there's not currently a channel from the pod-deletion code back to the handler manager, so it just sleeps to give the pod-deleter the time it might need.
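A minimal sketch of the difference (illustrative names, not the current ci-operator code): with a completion channel from the pod deleter, the grace period would become an upper bound rather than a fixed sleep.

```go
package main

import "time"

// waitForCleanup sketches the channel-based alternative: return as soon as the
// pod deleter finishes, or after the grace period, whichever comes first.
// Today's behavior is effectively just time.Sleep(gracePeriod).
func waitForCleanup(deletePods func(), gracePeriod time.Duration) {
	done := make(chan struct{})
	go func() {
		deletePods() // issue the deletions and wait for them to complete
		close(done)
	}()
	select {
	case <-done: // cleanup finished early; exit immediately
	case <-time.After(gracePeriod): // otherwise give up after the grace period
	}
}

func main() {
	waitForCleanup(func() { time.Sleep(500 * time.Millisecond) }, 5*time.Second)
}
```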
How does that work with recycling jobs on the same namespace? Maybe if someone runs …
Wow, here I am learning about how …
Uh -- individual jobs will be killed, yes. But there is no interdependency between jobs. When a ProwJob is created, if a ProwJob already exists that is running for the same org/repo#pull and started earlier, we issue a pod deletion... does default k8s client do blocking deletions? If not, this might be problematic and we may have multiple pods running.
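On the blocking question: as far as I know, client-go's pod Delete returns as soon as the API server accepts the request, and the pod then terminates asynchronously within its grace period, so anything that needs the pod to actually be gone has to wait explicitly. A hedged sketch using current client-go signatures (the actual Prow/plank code may do this differently):

```go
package podutil

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// deleteAndWait issues a pod deletion (non-blocking: the call returns once the
// API server accepts it) and then polls until the pod object is actually gone.
func deleteAndWait(ctx context.Context, client kubernetes.Interface, ns, name string, timeout time.Duration) error {
	if err := client.CoreV1().Pods(ns).Delete(ctx, name, metav1.DeleteOptions{}); err != nil && !apierrors.IsNotFound(err) {
		return err
	}
	return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		_, err := client.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return true, nil // pod fully removed
		}
		return false, err // keep polling while the pod still exists
	})
}
```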
@smarterclayton this is outstanding for a review
No need for review yet, I was going to reroll to get better abort linking between the goroutines first.
Note that we don't have to kill the pod with a grace period or wait for it to exit for teardown to happen. We "fire and forget" pod deletion because the pod itself manages teardown within its grace period. It is possible for a pod being shut down to be evicted, but it should be vanishingly rare.
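For completeness, a minimal sketch of that fire-and-forget deletion with an explicit grace period, again assuming current client-go signatures rather than quoting the actual Prow code:

```go
package podutil

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deletePodFireAndForget asks the API server to delete the pod and returns
// immediately; the pod's containers then have gracePeriodSeconds to finish
// their own teardown (log collection, resource reaping, artifact upload)
// before the kubelet kills them.
func deletePodFireAndForget(ctx context.Context, client kubernetes.Interface, ns, name string, gracePeriodSeconds int64) error {
	return client.CoreV1().Pods(ns).Delete(ctx, name, metav1.DeleteOptions{
		GracePeriodSeconds: &gracePeriodSeconds,
	})
}
```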
@wking: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I'm still not sure where we set up the ci-operator invocation, which is where we'd set the new command-line option. And if we aren't comfortable setting this to six minutes or similar for all jobs, we may want to drop --grace-period and instead load this from the job's YAML configuration. @smarterclayton, thoughts?
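If the flag were dropped in favor of per-job configuration, a hedged sketch of what the Go side could look like (the field name and fallback are hypothetical, not an existing ci-operator option, and real config would also need its own duration parsing):

```go
package config

import "time"

// JobConfig sketches a hypothetical per-job field for the teardown grace
// period. A zero value could fall back to a global default (today's two
// seconds, or whatever a --grace-period flag would provide).
type JobConfig struct {
	GracePeriod time.Duration `json:"grace_period,omitempty"`
}

// gracePeriodFor picks the per-job value when set, otherwise the default.
func gracePeriodFor(job JobConfig, defaultPeriod time.Duration) time.Duration {
	if job.GracePeriod > 0 {
		return job.GracePeriod
	}
	return defaultPeriod
}
```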