fix(agent): include GlobalTimeout during allocate phase#928
fix(agent): include GlobalTimeout during allocate phase#928rene-oromtz wants to merge 6 commits intomainfrom
Conversation
16b0278 to
3614347
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #928 +/- ##
==========================================
+ Coverage 73.85% 74.07% +0.21%
==========================================
Files 108 108
Lines 10313 10322 +9
Branches 886 888 +2
==========================================
+ Hits 7617 7646 +29
+ Misses 2508 2488 -20
Partials 188 188
*This pull request uses carry forward flags. Click here to find out more.
🚀 New features to boost your workflow:
|
This reverts commit 3614347.
878591e to
8a3e71c
Compare
|
|
||
| parent_job_id = self.job_data.get("parent_job_id") | ||
| if not parent_job_id: | ||
| # TODO: Remove this path once Spread testing use cases |
There was a problem hiding this comment.
Do we have any sort of future ticket we can reference in this TODO to make sure this happens?
| logger.info("Parent job completed, exiting...") | ||
| break | ||
| except TFServerError: | ||
| logger.warning("Failed to get allocated job status, retrying") |
There was a problem hiding this comment.
There are two jobs that could be responsible for this error -- parent and self.job_id; should we have two try statements to isolate them and provide a better error message and handling?
| if not parent_job_id: | ||
| # TODO: Remove this path once Spread testing use cases | ||
| # are migrated to reserve phase instead of allocate phase. | ||
| logger.warning("No parent job ID found while allocated") |
There was a problem hiding this comment.
What will remedy this situation? We seem to get into this while loop, and if we don't have a parent_job_id, how will we get one? Will we just spin around in this loop and never leave?
| return exitcode | ||
|
|
||
| def allocate(self): | ||
| def allocate(self, _): |
There was a problem hiding this comment.
Why aren't we naming this new arg? Is allocate intended to be an abstract method for multi-device sub-classes?
Description
This PR adds a a schema validation forallocate_data.This PR adds GlobalTimeout to the allocate phase, without a proper parent job id, the agent can wait indefinitively in the allocate phase until job is manually cancelled. Ideally, the best approach is to enforce the schema in server side but for now, just adding this timeout should allow no agent exceeds the allowed amount of time.
The rationale on not including a schema on server side is that there seems to be cases where users that rely on Spread testing are using the allocate phase to fetch for the IP and then use that IP to run manual tests, this is not ideal as the reserve stage should serve for that purpose but we should at least give a proper warning before enforcing the restriction on server side.
Resolved issues
Resolves #799
Resolves CERTTF-713
Documentation
Web service API changes
Tests
Added couple of unit tests for testing agent exits on allocate status