-
Notifications
You must be signed in to change notification settings - Fork 39
Description
I am forwarding an issue from Red Hat internal Jira that I was debugging in June 2024 but that I have not resolved yet.
Current Behavior:
If the scan results tarball is too big, the OSH task ends up in the INTERRUPTED state despite it was successful:
- The scan results tarball is successfully uploaded and unpacked.
- A scan notification about a CLOSED task is sent.
- Then for some reason the worker tries to finish the task again, which breaks the state machine in kobo.
Expected Behavior:
The task status remains CLOSED.
Impact Statement:
Scan results are successfully uploaded and unpacked. Only the task status is misleading.
worker.log contains:
2024-06-20 14:49:50 [INFO ] Task #586227 exited with status 0
2024-06-20 14:49:50 [INFO ] Task has finished: 586227
2024-06-20 14:49:50 [INFO ] kill_group: Process (pgrp 1747847) exited
2024-06-20 14:49:53 [INFO ] Attempting to take task #586418 [MockBuild]
2024-06-20 14:49:53 [INFO ] Task forked #586418 [MockBuild]: pid=2020326
2024-06-20 14:50:14 [INFO ] Waking up task #585987 [ErrataDiffBuild].
2024-06-20 14:50:14 [WARNING ] Closing interrupted tasks: [586227]
hub.log contains:
INFO 2024-06-20 16:46:47,047 worker.py:86 finish_task Finishing task 586227
INFO 2024-06-20 16:46:47,067 results_loader.py:33 __init__ Processing tasks '#586227 [method: MockBuild, state: OPEN, worker: osh-worker-012.osh-001.prod.iad2.dc.redhat.com]' and 'None'
WARNING 2024-06-20 16:46:47,070 path.py:60 get_json_results json results not found: '[]', task #586227 [method: MockBuild, state: OPEN, worker: osh-worker-012.osh-001.prod.iad2.dc.redhat.com]
DEBUG 2024-06-20 16:46:47,070 results_loader.py:49 unpack_results Unpacking /var/lib/osh/hub/tasks/0/580000/586227/virt-cdi-uploadproxy-container-v4.12.12-5.tar.xz
DEBUG 2024-06-20 16:46:47,070 csmock_parser.py:57 extract_tarball Running command ['tar', '-xf', '/var/lib/osh/hub/tasks/0/580000/586227/virt-cdi-uploadproxy-container-v4.12.12-5.tar.xz', '-C', '/var/lib/osh/hub/tasks/0/580000/586227', '--wildcards', '--wildcards-match-slash', '--exclude=*cov', '--exclude=*debug']
[...]
INFO 2024-06-20 16:48:17,470 worker.py:86 finish_task Finishing task 586227
INFO 2024-06-20 16:48:17,486 results_loader.py:33 __init__ Processing tasks '#586227 [method: MockBuild, state: OPEN, worker: osh-worker-012.osh-001.prod.iad2.dc.redhat.com]' and 'None'
WARNING 2024-06-20 16:48:17,489 path.py:60 get_json_results json results not found: '[]', task #586227 [method: MockBuild, state: OPEN, worker: osh-worker-012.osh-001.prod.iad2.dc.redhat.com]
DEBUG 2024-06-20 16:48:17,489 results_loader.py:49 unpack_results Unpacking /var/lib/osh/hub/tasks/0/580000/586227/virt-cdi-uploadproxy-container-v4.12.12-5.tar.xz
DEBUG 2024-06-20 16:48:17,489 csmock_parser.py:57 extract_tarball Running command ['tar', '-xf', '/var/lib/osh/hub/tasks/0/580000/586227/virt-cdi-uploadproxy-container-v4.12.12-5.tar.xz', '-C', '/var/lib/osh/hub/tasks/0/580000/586227', '--wildcards', '--wildcards-match-slash', '--exclude=*cov', '--exclude=*debug']
[...]
INFO 2024-06-20 16:49:47,876 worker.py:86 finish_task Finishing task 586227
INFO 2024-06-20 16:49:47,885 results_loader.py:33 __init__ Processing tasks '#586227 [method: MockBuild, state: OPEN, worker: osh-worker-012.osh-001.prod.iad2.dc.redhat.com]' and 'None'
INFO 2024-06-20 16:49:47,887 results_loader.py:46 unpack_results Results are already unpacked for task #586227 [method: MockBuild, state: OPEN, worker: osh-worker-012.osh-001.prod.iad2.dc.redhat.com]
INFO 2024-06-20 16:49:51,340 worker.py:86 finish_task Finishing task 586314
INFO 2024-06-20 16:49:51,352 results_loader.py:33 __init__ Processing tasks '#586314 [method: MockBuild, state: OPEN, worker: osh-worker-015.osh-001.prod.iad2.dc.redhat.com]' and 'None'
WARNING 2024-06-20 16:49:51,355 path.py:60 get_json_results json results not found: '[]', task #586314 [method: MockBuild, state: OPEN, worker: osh-worker-015.osh-001.prod.iad2.dc.redhat.com]
DEBUG 2024-06-20 16:49:51,356 results_loader.py:49 unpack_results Unpacking /var/lib/osh/hub/tasks/0/580000/586314/ose-clusterresourceoverride-container-v4.12.0-202406191416.p0.ge85aaf7.assembly.stream.el8.tar.xz
DEBUG 2024-06-20 16:49:51,356 csmock_parser.py:57 extract_tarball Running command ['tar', '-xf', '/var/lib/osh/hub/tasks/0/580000/586314/ose-clusterresourceoverride-container-v4.12.0-202406191416.p0.ge85aaf7.assembly.stream.el8.tar.xz', '-C', '/var/lib/osh/hub/tasks/0/580000/586314', '--wildcards', '--wildcards-match-slash', '--exclude=*cov', '--exclude=*debug']
DEBUG 2024-06-20 16:49:51,478 worker.py:80 email_task_notification email_task_notification for 586227
[...]
DEBUG 2024-06-20 16:50:16,023 worker.py:80 email_task_notification email_task_notification for 586227
Cause:
The worker calls finish_task() over XML-RPC each 90 seconds again if the previous call does not finish till then. This is caused by retry_request_decorator:
Line 462 in 1ac4034
| def retry_request_decorator(transport_class): |
I guess the XML-RPC connection times out after 60 seconds, then the decorator sleeps for 30 seconds. The messages from kobo can be seen at the and of stdout.log (rather than in worker.log where I looked for them initially). I think we need to distinguish network-related timeouts from the time actually spent by a remote operation.
It is also a bad idea to repeat an XML-RPC call that has already been (at least partially) handled by the remote service.
As figured out by @lzaoral, this is a regression introduced by #173.