This repository was archived by the owner on Jun 14, 2019. It is now read-only.

Conversation

@wking (Member) commented Feb 15, 2019

Because while two seconds may be sufficient teardown time for unit tests and such, it's too short for e2e teardown (where cluster logs should be collected, cluster resources need to be reaped, and collected assets need to be uploaded). The two-second timeout is from 2ac3abf (#16). And actually, I'm not sure how the 2-second default worked even for unit tests and such, because Kubernetes pods have a 30-second grace period by default.

A recent aborted e2e job that did not run teardown to completion or collect any cluster assets is here and here. A recent successful e2e job showing current teardown timing is here and here. From the build log:

2019/02/14 18:19:54 Container setup in pod e2e-aws completed successfully
2019/02/14 18:46:08 Container test in pod e2e-aws completed successfully
2019/02/14 18:51:38 Container teardown in pod e2e-aws completed successfully

So 5.5 minutes to teardown. And from the installer's destroy cluster logs:

time="2019-02-14T18:47:40Z" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"openshiftClusterID\":\"1341ccc2-66ac-46bb-ae15-0295d4a126ba\"}"
...
time="2019-02-14T18:51:38Z" level=debug msg="Purging asset \"Cluster\" from disk"

So about 4 minutes of that is resource reaping (with the previous 1.5 minutes being log collection).

I'm still not sure where we set up the ci-operator invocation, which is where we'd set the new command-line option. And if we aren't comfortable setting this to six minutes or similar for all jobs, we may want to drop --grace-period and instead load this from the job's YAML configuration. @smarterclayton, thoughts?

Because while two seconds may be sufficient teardown time for unit
tests and such, it's too short for e2e teardown (where cluster logs
should be collected, cluster resources need to be reaped, and
collected assets need to be uploaded).  The two-second timeout is from
c2ac3ab (When interrupted, mark any in progress pods/templates as
deleted, 2018-05-04, openshift#16).  And actually, I'm not sure how the
2-second default worked even for unit tests and such, because
Kubernetes pods have a 30-second grace period by default [1].

A recent aborted e2e job that did not run teardown to completion or
collect any cluster assets is [2,3].  A recent successful e2e job
showing current teardown timing is [4,5].  From [4]:

  2019/02/14 18:19:54 Container setup in pod e2e-aws completed successfully
  2019/02/14 18:46:08 Container test in pod e2e-aws completed successfully
  2019/02/14 18:51:38 Container teardown in pod e2e-aws completed successfully

So 5.5 minutes to teardown.  And from [5]:

  time="2019-02-14T18:47:40Z" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"openshiftClusterID\":\"1341ccc2-66ac-46bb-ae15-0295d4a126ba\"}"
  ...
  time="2019-02-14T18:51:38Z" level=debug msg="Purging asset \"Cluster\" from disk"

So about 4 minutes of that is resource reaping (with the previous 1.5
minutes being log collection).

[1]: https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods
[2]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1255/pull-ci-openshift-installer-master-e2e-aws/3707/build-log.txt
[3]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_installer/1255/pull-ci-openshift-installer-master-e2e-aws/3707/artifacts/
[4]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1045/pull-ci-openshift-installer-master-e2e-aws/3706/build-log.txt
[5]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1045/pull-ci-openshift-installer-master-e2e-aws/3706/artifacts/e2e-aws/installer/.openshift_install.log
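
For illustration, here is a minimal Go sketch of the shape of change being proposed: a --grace-period flag (the name used above) whose value replaces a hard-coded two-second wait in the interrupt handler. The package layout, variable names, and handler body are assumptions for the sketch, not the actual ci-operator code.

  package main

  import (
      "flag"
      "log"
      "os"
      "os/signal"
      "syscall"
      "time"
  )

  func main() {
      // Hypothetical flag; the 2s default mirrors the current hard-coded timeout.
      gracePeriod := flag.Duration("grace-period", 2*time.Second,
          "time allowed for teardown (log collection, resource reaping, artifact upload) after an interrupt")
      flag.Parse()

      interrupt := make(chan os.Signal, 1)
      signal.Notify(interrupt, syscall.SIGINT, syscall.SIGTERM)

      go func() {
          <-interrupt
          log.Printf("interrupted, allowing %s for cleanup", *gracePeriod)
          // Fire off pod/template deletions here (not shown), then give them
          // the configured window instead of a fixed two seconds.
          time.Sleep(*gracePeriod)
          os.Exit(1)
      }()

      time.Sleep(time.Hour) // stand-in for the real job execution
  }
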
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: wking
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: stevekuznetsov

If they are not already assigned, you can assign the PR to them by writing /assign @stevekuznetsov in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot added the size/S label (Denotes a PR that changes 10-29 lines, ignoring generated files) on Feb 15, 2019
@wking force-pushed the configurable-grace-period branch from 1f42ba4 to 434ef85 on February 15, 2019 06:49
@petr-muller (Member)

I'm still not sure where we set up the ci-operator invocation, which is where we'd set the new command-line option.

ci-operator invocation is part of Prow job configuration, which is roughly 90% Prowgen-generated and 10% hand-crafted. Prowgen already has separate paths for container-based jobs and template-based jobs, so it could generate different parameters for both types. Or as you suggest, we could make this configurable in config, but then people would need to know what to set there, and it would probably end up very similar to "containers->short, template->long".
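
A rough Go sketch of the Prowgen direction described here, with container-based jobs keeping a short window and template-based jobs getting roughly the six-minute budget measured above. The helper name and exact durations are assumptions, not actual Prowgen code.

  package prowgen

  import (
      "fmt"
      "time"
  )

  // gracePeriodArg returns a hypothetical --grace-period argument for a
  // generated ci-operator invocation. Template-based jobs (e.g. e2e) need
  // minutes for log collection, resource reaping, and artifact upload;
  // container-based jobs can keep the current short default.
  func gracePeriodArg(usesTemplate bool) string {
      period := 2 * time.Second
      if usesTemplate {
          period = 6 * time.Minute
      }
      return fmt.Sprintf("--grace-period=%s", period)
  }
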

@smarterclayton (Contributor)

/hold

Will review today. I also wanted this, but there are some subtleties that can be an issue.

@openshift-ci-robot added the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command) on Feb 15, 2019
@stevekuznetsov (Contributor) left a comment


Is a globally long period bad? Why does this need to be configurable per job?

@wking (Member, Author) commented Feb 15, 2019

Is a globally long period bad?

If you use CI for fixing your bugs, you may not want to have to wait 7 minutes (or whatever we want to use to accommodate e2e) after a force-push before your aborted unit test job cleans up and you can launch the new one. I'm actually not sure how Prow handles triggering a new round after an abort. Does it hold jobs on the new tip until the old jobs have all been reaped?

@stevekuznetsov (Contributor)

Isn't that just the grace period though, so if your test exited faster we'd move on to cleanup faster, right?

@stevekuznetsov (Contributor)

Does it hold jobs on the new tip until the old jobs have all been reaped?

No

@wking (Member, Author) commented Feb 16, 2019

Isn't that just the grace period though, so if your test exited faster we'd move on to cleanup faster, right?

No, because there's not currently a channel from the pod-deletion code back to the handler manager, so it just sleeps to give the pod-deleter the time it might need.

Does it hold jobs on the new tip until the old jobs have all been reaped?

No

How does that work with recycling jobs on the same namespace? Maybe if someone runs /test e2e-aws while there's an existing run in progress? Does that abort the earlier job?

@stevekuznetsov (Contributor)

No, because there's not currently a channel from the pod-deletion code back to the handler manager, so it just sleeps to give the pod-deleter the time it might need.

Wow, here I am learning about how ci-operator works. Ew. Can we fix this?

How does that work with recycling jobs on the same namespace? Maybe if someone runs /test e2e-aws while there's an existing run in progress? Does that abort the earlier job?

Uh -- individual jobs will be killed, yes. But there is no interdependency between jobs. When a ProwJob is created, if a ProwJob already exists that is running for the same org/repo#pull and started earlier, we issue a pod deletion... does the default k8s client do blocking deletions? If not, this might be problematic and we may have multiple pods running.
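
On the blocking-deletion question: client-go's pod Delete returns once the API server accepts the request, and the pod then terminates asynchronously within its grace period, so a caller that needs the pod to actually be gone has to wait for it separately. A hedged sketch of that wait (modern client-go signatures shown; 2019-era client-go took (name, *metav1.DeleteOptions) without a context):

  package teardown

  import (
      "context"
      "time"

      k8serrors "k8s.io/apimachinery/pkg/api/errors"
      metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
      "k8s.io/apimachinery/pkg/util/wait"
      "k8s.io/client-go/kubernetes"
  )

  // deleteAndWait issues a (non-blocking) pod deletion and then polls until
  // the pod is actually gone or the timeout expires.
  func deleteAndWait(ctx context.Context, client kubernetes.Interface, namespace, name string, timeout time.Duration) error {
      err := client.CoreV1().Pods(namespace).Delete(ctx, name, metav1.DeleteOptions{})
      if err != nil && !k8serrors.IsNotFound(err) {
          return err
      }
      return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
          _, err := client.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
          if k8serrors.IsNotFound(err) {
              return true, nil // pod fully removed
          }
          return false, err // still terminating (err == nil) or a real error
      })
  }
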

@stevekuznetsov (Contributor)

@smarterclayton this is outstanding for a review

@wking (Member, Author) commented Mar 6, 2019

No need for review yet, I was going to reroll to get better abort linking between the goroutines first.
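
As a sketch of what that "better abort linking" could look like (an assumption about the shape of the reroll, not the actual patch): a completion channel lets the handler return as soon as cleanup finishes, while the grace period only acts as an upper bound instead of a blind sleep.

  package interrupt

  import "time"

  // waitForCleanup gives cleanup up to gracePeriod, but returns as soon as
  // the cleanup goroutine signals completion instead of always sleeping for
  // the full window.
  func waitForCleanup(cleanup func(), gracePeriod time.Duration) {
      done := make(chan struct{})
      go func() {
          cleanup() // delete pods, collect logs, upload artifacts, ...
          close(done)
      }()

      select {
      case <-done:
          // Cleanup finished early; no need to burn the rest of the grace period.
      case <-time.After(gracePeriod):
          // Grace period exhausted; give up so the process can exit.
      }
  }
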

@smarterclayton (Contributor)

Note that we don’t have to kill the pod with a grace period or wait for it to exit for teardown to happen.

We “fire and forget” pod deletion because the pod itself manages teardown within its grace period. It is possible for a pod being shut down to be evicted, but it should be vanishingly rare.
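
For concreteness, a sketch of the fire-and-forget deletion being described, passing an explicit grace period so the pod gets time to run its own teardown (again modern client-go signatures; an illustration, not the ci-operator code):

  package teardown

  import (
      "context"

      metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
      "k8s.io/client-go/kubernetes"
  )

  // fireAndForgetDelete asks the API server to delete a pod with a generous
  // grace period and returns immediately; the pod's own containers then
  // handle teardown within that window.
  func fireAndForgetDelete(ctx context.Context, client kubernetes.Interface, namespace, name string, graceSeconds int64) error {
      return client.CoreV1().Pods(namespace).Delete(ctx, name, metav1.DeleteOptions{
          GracePeriodSeconds: &graceSeconds,
      })
  }
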

@openshift-ci-robot

@wking: PR needs rebase.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot added the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD) on Apr 24, 2019