ekalosak commented Dec 14, 2021
- made servo issue a startup event when a run gets cancelled;
- made the PrometheusConnector.startup() idempotent
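As an illustration of the second bullet, here is a minimal sketch of an idempotent `startup()`, assuming a simple "already started" flag. `ExampleConnector`, `_started`, and `setup_calls` are hypothetical names for this sketch and are not taken from the actual `PrometheusConnector` code.

```python
import asyncio


class ExampleConnector:
    """Hypothetical connector; NOT the real PrometheusConnector."""

    def __init__(self) -> None:
        self._started = False   # assumption: a plain flag tracks lifecycle state
        self.setup_calls = 0    # counts how often the real setup work ran

    async def startup(self) -> None:
        # Guard: a repeated call (e.g. after a cancelled run) is a no-op.
        if self._started:
            return
        self.setup_calls += 1   # stands in for opening channels, tasks, etc.
        self._started = True

    async def shutdown(self) -> None:
        # Resetting the flag lets a later startup() do the setup work again.
        self._started = False


async def demo_idempotent_startup() -> int:
    connector = ExampleConnector()
    await connector.startup()
    await connector.startup()   # second call must not repeat the setup
    return connector.setup_calls


if __name__ == "__main__":
    print(asyncio.run(demo_idempotent_startup()))
```

With a guard like this, dispatching a second startup event after a cancellation does not repeat the setup work.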
…PrometheusConnector.startup() idempotent
ENG-700 Don't hang on cancellation
This issue is mainly about the HPA+'s ServoX Connector failing to recover after the servo processes a … However, because debugging this issue requires generating Abort failure replication instructions
Replication instructions
Objective
Don't hang on cleaning up HPA+ from the servo when a Cancellation event occurs.
Follow-on issues
…ould have multiple sources
…line developer docs
… servo attribute is not typed as Optional
…opsani/servox into ek/issue-startup-after-run-cancellation
def run_main_loop(self) -> None:
    if self._main_loop_task:
        self._main_loop_task.cancel()
    loop = asyncio.get_event_loop()
    loop.create_task(self.servo.dispatch_event(servo.Events.startup))
- def run_main_loop(self) -> None:
-     if self._main_loop_task:
-         self._main_loop_task.cancel()
-     loop = asyncio.get_event_loop()
-     loop.create_task(self.servo.dispatch_event(servo.Events.startup))
+ async def run_main_loop(self) -> None:
+     if self._main_loop_task:
+         self._main_loop_task.cancel()
+     await self.servo.startup()
+     self.logger.info(
+         f"Servo started with {len(self.servo.connectors)} active connectors [{self.optimizer.id} @ {self.optimizer.url or self.optimizer.base_url}]"
+     )
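The practical difference between the two versions above is ordering: scheduling startup with `create_task` returns before startup has run, while `await` returns only after it has finished. A small sketch of that difference follows. `ExampleRunner` and its methods are hypothetical stand-ins written for this illustration, not servox code.

```python
import asyncio
from typing import List, Optional, Tuple


class ExampleRunner:
    """Hypothetical stand-in for the runner; NOT the servox implementation."""

    def __init__(self) -> None:
        self.events: List[str] = []
        self.startup_task: Optional[asyncio.Task] = None

    async def startup(self) -> None:
        await asyncio.sleep(0)  # yield once, like real asynchronous startup work
        self.events.append("startup finished")

    async def restart_fire_and_forget(self) -> None:
        # Schedules startup on the loop but returns before it has run.
        loop = asyncio.get_running_loop()
        self.startup_task = loop.create_task(self.startup())
        self.events.append("restart returned")

    async def restart_awaited(self) -> None:
        # Returns only after startup has completed.
        await self.startup()
        self.events.append("restart returned")


async def demo_ordering() -> Tuple[List[str], List[str]]:
    scheduled = ExampleRunner()
    await scheduled.restart_fire_and_forget()
    order_scheduled = list(scheduled.events)  # snapshot before the task runs
    await scheduled.startup_task              # let the pending task finish

    awaited = ExampleRunner()
    await awaited.restart_awaited()
    return order_scheduled, awaited.events


if __name__ == "__main__":
    print(asyncio.run(demo_ordering()))
```

In the awaited variant, anything that runs after the restart (such as the "Servo started" log line) can rely on startup having completed.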
I agree we should update the startup/shutdown lifecycle given the implementation of cancel responses. However, we need to update a few more places for the sake of completeness:
What are we doing with this? Seems pretty reasonable... Except I don't love that each connector has to be aware of potentially being restarted and guard against it. Seems like a lifecycle teardown of the channel, or even a blanket teardown of all channels, would keep it more straightforward. It shouldn't be the connector's responsibility to handle this state; the guard would have to be replicated everywhere.
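A rough sketch of the alternative described above, in which the runner tears down all channels before restarting so that connectors need no restart guard. Everything here (`ChannelRegistry`, `MetricsConnector`, `Runner`, the `"metrics"` channel name) is hypothetical and only illustrates the design choice; it does not reflect the servox channel API.

```python
import asyncio
from typing import Dict, List


class ChannelRegistry:
    """Hypothetical registry that owns every open channel by name."""

    def __init__(self) -> None:
        self._channels: Dict[str, List[str]] = {}

    def open(self, name: str) -> None:
        # Opening the same channel twice is treated as an error.
        if name in self._channels:
            raise ValueError(f"channel {name!r} already exists")
        self._channels[name] = []

    def close_all(self) -> None:
        # Blanket teardown: drop every channel in one place.
        self._channels.clear()

    def names(self) -> List[str]:
        return sorted(self._channels)


class MetricsConnector:
    """Hypothetical connector whose startup has no restart guard."""

    def __init__(self, registry: ChannelRegistry) -> None:
        self.registry = registry

    async def startup(self) -> None:
        self.registry.open("metrics")


class Runner:
    """Hypothetical runner that owns the lifecycle for all connectors."""

    def __init__(self) -> None:
        self.registry = ChannelRegistry()
        self.connectors = [MetricsConnector(self.registry)]

    async def startup(self) -> None:
        for connector in self.connectors:
            await connector.startup()

    async def restart_after_cancel(self) -> None:
        # Teardown lives here, outside the connectors.
        self.registry.close_all()
        await self.startup()


async def demo_central_teardown() -> List[str]:
    runner = Runner()
    await runner.startup()
    await runner.restart_after_cancel()  # would raise without close_all()
    return runner.registry.names()


if __name__ == "__main__":
    print(asyncio.run(demo_central_teardown()))
```

The trade-off is that the runner must know about every resource that needs teardown, but no connector has to replicate its own restart check.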
Looking into the relevant internal project management, it would seem @ekalosak reached a similar conclusion: this should be handled externally. I'm still looking into it to see whether it has an impact on the new startup behavior for the k8s connector.