Summary
On QAnet, after deploying a VM plus a name web-gateway, the public gateway URL returns 502 (and intermittent connection timeouts) continuously for over 80 minutes, even though the VM is fully up and reachable over mycelium from other nodes. The gateway node cannot establish a mycelium route to the freshly deployed VM's overlay IP. A different VM on the same compute node and the same gateway node, deployed earlier, works fine. Deleting and redeploying the gateway does not fix it.
Environment
- Network: QAnet (qa.grid.tf), zone gent01.qa.grid.tf
- Deploying twin: 703
- Compute node: node 5 (node twin 12) — the VM runs here
- Gateway node: node 2 (twin 9), public IPv4 185.69.167.80 (serves the gent01.qa.grid.tf zone)
- Affected VM mycelium IP: 434:4c2f:45aa:ef20:ff0f:ccc6:4f60:ed58
- Public hostname: despiegk.gent01.qa.grid.tf
- Name contract: 85745 (state Created, single, no orphans)
- Workloads deployed via the grid (deploy_vm / deploy_webgateway)
The affected deployment is still live (VM and gateway up) and available for inspection.
Symptom
curl https://despiegk.gent01.qa.grid.tf/ returns 502 Bad Gateway, intermittently timing out with no response, continuously for 80+ minutes after deploy. DNS resolves correctly to the gateway (185.69.167.80).
Evidence the VM is healthy and reachable (just not from the gateway)
- From a second VM on a different rented node: curl http://[434:4c2f:45aa:ef20:ff0f:ccc6:4f60:ed58]:9997/ returns HTTP 302 (the app's expected redirect). The VM's service is up and reachable over mycelium from another node.
- From an external workstation: both ICMP ping and SSH to the VM's mycelium IP succeed.
So the VM, its service, and its mycelium IP are all healthy. The only path that fails is gateway-node-2 to this VM.
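The comparison above can be scripted so that others can repeat it. This is a minimal sketch, not tooling we actually used: probe and diagnose are hypothetical helper names, and treating any status below 500 as "healthy" is our assumption. The overlay probe only makes sense when run from a host that has mycelium connectivity.

```python
import http.client
from urllib.parse import urlsplit


def probe(url, timeout=10.0):
    """Issue one GET and return the HTTP status, or None on timeout/connection failure.

    Redirects are not followed, so the app's 302 is reported as 302.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.hostname, parts.port, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.hostname, parts.port, timeout=timeout)
    try:
        conn.request("GET", parts.path or "/")
        return conn.getresponse().status
    except (OSError, http.client.HTTPException):
        # Covers timeouts, refused connections, TLS errors and malformed responses.
        return None
    finally:
        conn.close()


def diagnose(public_status, overlay_status):
    """Classify the two observations (assumption: status < 500 means healthy)."""
    vm_ok = overlay_status is not None and overlay_status < 500
    public_ok = public_status is not None and public_status < 500
    if not vm_ok:
        return "VM or its service not reachable over mycelium"
    if not public_ok:
        return "VM healthy over mycelium; gateway-to-VM path failing"
    return "public URL and VM both healthy"
```

For this report, the inputs would be probe("https://despiegk.gent01.qa.grid.tf/") and probe("http://[434:4c2f:45aa:ef20:ff0f:ccc6:4f60:ed58]:9997/"). The observed 502 (or timeout) and 302 fall into the second category.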
Control case (works)
A different VM, flowtest8.gent01.qa.grid.tf, on the SAME compute node (node 5) and the SAME gateway node (node 2), deployed about 2.5 hours earlier, returns HTTP 302 publicly and has worked consistently. It came up publicly within roughly 9 minutes of deploy. The only material difference is age: the gateway's route to the older VM's overlay IP converged; its route to the affected VM's overlay IP has not converged after 80+ minutes.
What we ruled out
- DNS: resolves to the correct gateway IP.
- Name contract: a single clean Created contract (85745); no orphan or duplicate contracts.
- The VM / application: healthy and reachable from other nodes over mycelium (302 on :9997).
- The gateway workload: we deleted the name gateway and redeployed it (including with an explicit fqdn); still 502. So a stale or half-bound gateway workload is not the cause. This matches the expectation that the gateway proxies per request, so a redeploy does not change the route.
Related observation (possibly separate)
deploy_webgateway returns state=ready but with an empty fqdn for this VM, across all deploy attempts for this name. For the working VM (flowtest8) the same call returned a populated fqdn. This may be a separate read-back quirk, but the correlation (the empty-fqdn deploys all map to the VM that does not route) is worth noting.
Hypothesis
The mycelium route from the gateway node (node 2) to a freshly assigned VM overlay prefix (here 434:4c2f:45aa:ef20::/64 on node 5) is slow to converge or never converges. The route appears to be per overlay address: the gateway node has a working route to an older VM's prefix on the same compute node, but not to the new VM's prefix. Redeploying the gateway does not trigger reconvergence.
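For anyone inspecting the route table on the gateway node, the /64 quoted above can be derived from the VM address, and a candidate route entry can be checked against it with Python's standard ipaddress module. This is only a sketch: overlay_prefix and covers are hypothetical helper names, and the /64 prefix length is our assumption about how the overlay prefix is assigned.

```python
import ipaddress


def overlay_prefix(addr, prefix_len=64):
    """Return the network containing addr (assumption: /64 per VM overlay prefix)."""
    return ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)


def covers(route, addr):
    """True if the route prefix (e.g. taken from a route dump) contains addr."""
    return ipaddress.ip_address(addr) in ipaddress.ip_network(route, strict=False)


vm_ip = "434:4c2f:45aa:ef20:ff0f:ccc6:4f60:ed58"
prefix = overlay_prefix(vm_ip)
print(prefix)  # 434:4c2f:45aa:ef20::/64
```

If the hypothesis holds, no entry in the gateway node's route table would satisfy covers(entry, vm_ip) for the affected VM, while the older VM's address would be covered.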
Impact
Onboarding a fresh VM intermittently produces a public URL that is unreachable for a long or indefinite time, while the VM itself is perfectly healthy. This makes fresh provisioning unreliable from an end-user perspective.
Questions for the grid team
- Is there a known issue with gateway-node to fresh-VM mycelium route convergence on QAnet, and can it be sped up or forced?
- Is there any tenant-side action (a particular deploy order, a route re-announce, choosing a different gateway node) that helps convergence, or is this purely host-side (Zero-OS) and outside tenant control?
- Is the empty-fqdn return from deploy_webgateway (state=ready, fqdn empty) a known or related issue?
Reproduction
- Deploy a VM on a QAnet node in the gent01 zone and confirm it is reachable over mycelium from another node (curl its mycelium IP on the app port returns 302).
- Deploy a name web-gateway for it on the zone's gateway node, backend http://[vm_mycelium]:9997.
- Observe the public URL returns 502 / times out for an extended period (80+ min here) while the VM stays reachable from other nodes.
- Delete and redeploy the gateway: still 502.
- Compare with an older VM on the same nodes whose public URL works.
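To put a number on "extended period" when reproducing, the public URL can be polled until it stops failing. The sketch below is an illustration under assumptions, not part of the grid tooling: wait_until_routable is a hypothetical name, check is any caller-supplied function that issues one request and returns an HTTP status (or None on timeout), and a status below 500 is taken to mean the route has converged.

```python
import time


def wait_until_routable(check, interval=30.0, deadline=5400.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll check() until it returns a status below 500.

    Returns the elapsed seconds at the first healthy response, or None if
    the deadline (in seconds) passes first. sleep and clock are injectable
    so the loop can be exercised without waiting.
    """
    start = clock()
    while True:
        status = check()
        elapsed = clock() - start
        if status is not None and status < 500:
            return elapsed
        if elapsed >= deadline:
            return None
        sleep(interval)
```

With the defaults this gives up after 90 minutes. For the control VM a result near 540 seconds would match the roughly 9 minutes we saw; for the affected VM we would expect None.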