Description
Not sure how to classify this one. The symptoms are:
- our TRE was working fine until this morning. Now, any and all interactions to create/delete, start/stop, or enable/disable resources get stuck in 'pending'.
- The service bus is receiving the requests, I see them accumulating in the workspacequeue. All messages are Active, none are in the dead-letter queue.
- The Resource Processor is healthy, running one instance, and that instance appears to be well configured:
- It can resolve the address of the SB to its private IP address.
- It can connect to the SB on port 443 (but not 5761 or 5762, in case that matters).
- The RP docker container has all the right environment variables set, the full list from the cloud-init.yaml in the codebase.
- The only environment variable which is not set is RP_BUNDLE_key_store_id, this is empty. I don't know if this is normal or not.
- The docker container logs show repeated failures to connect to the SB. Screenshots below.
- The managed identity the container is running with is correct, and still has all the correct rights to read/send-to the SB.
- Killing & restarting the VMSS instance doesn't help.
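
For anyone who wants to repeat the name-resolution and port checks listed above from inside the RP VM or container, here is a minimal Python sketch. The hostname is a hypothetical placeholder, not the real namespace, and the port list is an assumption: it includes 443 plus 5671 and 5672, which are the ports Azure Service Bus normally uses for AMQP, alongside the two ports mentioned above. It only tests plain TCP reachability, not TLS or authentication.

```python
import socket


def resolve(host):
    """Return the sorted, de-duplicated IP addresses that `host` resolves to."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


if __name__ == "__main__":
    # Hypothetical namespace name: replace with the real Service Bus FQDN.
    host = "example-namespace.servicebus.windows.net"
    try:
        print(f"{host} resolves to {resolve(host)}")
    except socket.gaierror as exc:
        print(f"{host} does not resolve: {exc}")
    # 443 is HTTPS / AMQP over WebSockets; 5671 and 5672 are the AMQP ports.
    # 5761 and 5762 are the ports named in the list above.
    for port in (443, 5671, 5672, 5761, 5762):
        state = "open" if can_connect(host, port) else "unreachable"
        print(f"port {port}: {state}")
```

With a private endpoint in place, the resolved address would be expected to be the private IP rather than a public one.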
I've tried restarting the API, restarting the VMSS instance, running make tre-stop / make tre-start, but nothing makes any difference.
Our production TRE is running version 0.21.0, but I've also stood up a new TRE this morning from the head of this repo, and it behaves identically. Yesterday, both the production SDE and another dev SDE I booted were working fine.
The fact that it's happening on two TREs today, where it didn't happen on two TREs yesterday, points at the tenancy or subscription, but I'm not aware of anything that could change there that would cause this problem.
So, I don't believe this is a bug in the TRE, but whatever's happening, I need to get my TRE running. Any suggestions?
