QAnet: public name-gateway URL stays 502 for 80+ min; gateway node cannot reach a freshly deployed VM over mycelium (VM reachable from other nodes; gateway redeploy does not help)

## Summary

On QAnet, after deploying a VM plus a name web-gateway, the public gateway URL has returned 502 (with intermittent connection timeouts) continuously for over 80 minutes, even though the VM is fully up and reachable over mycelium from other nodes. The gateway node cannot establish a mycelium route to the freshly deployed VM's overlay IP. A different VM, deployed earlier on the same compute node and served by the same gateway node, works fine. Deleting and redeploying the gateway does not fix it.

## Environment

- Network: QAnet (qa.grid.tf), zone `gent01.qa.grid.tf`
- Deploying twin: 703
- Compute node: node 5 (node twin 12) — the VM runs here
- Gateway node: node 2 (twin 9), public IPv4 `185.69.167.80` — serves the `gent01.qa.grid.tf` zone
- Affected VM mycelium IP: `434:4c2f:45aa:ef20:ff0f:ccc6:4f60:ed58`
- Public hostname: `despiegk.gent01.qa.grid.tf`
- Name contract: `85745` (state Created, single, no orphans)
- Workloads deployed via the grid (`deploy_vm` / `deploy_webgateway`)

The affected deployment is still live (VM and gateway up) and available for inspection.

## Symptom

`curl https://despiegk.gent01.qa.grid.tf/` returns `502 Bad Gateway` or, intermittently, times out with no response. This has persisted continuously for 80+ minutes since deploy. DNS resolves correctly to the gateway (`185.69.167.80`).
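The symptom can be logged over time with a small probe. This is a minimal sketch using only the Python standard library; the function names `probe` and `classify` and the label strings are illustrative choices, not part of any grid tooling. Redirects are deliberately not followed, so that the app's expected 302 is visible as such.

```python
import socket
import urllib.error
import urllib.request


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following 3xx so the raw status (e.g. 302) is visible."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # urllib then raises HTTPError carrying the 3xx code


def probe(url, timeout=10.0):
    """Fetch url once; return (status, error), exactly one of which is None."""
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.status, None
    except urllib.error.HTTPError as exc:  # 3xx/4xx/5xx responses, including 502
        return exc.code, None
    except urllib.error.URLError as exc:  # DNS, connect, or TLS failure
        reason = exc.reason
        return None, reason if isinstance(reason, BaseException) else exc
    except OSError as exc:  # e.g. a read timeout raised outside URLError
        return None, exc


def classify(status=None, error=None):
    """Map one probe result to a coarse label (labels are illustrative)."""
    if error is not None:
        if isinstance(error, (socket.timeout, TimeoutError)):
            return "timeout"
        return "error"
    if status == 502:
        return "bad_gateway"
    if status is not None and 200 <= status < 400:
        return "ok"
    return "other"
```

Calling `classify(*probe("https://despiegk.gent01.qa.grid.tf/"))` at a fixed interval and recording the label with a timestamp gives a log that distinguishes the 502 responses from the timeouts.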

## Evidence the VM is healthy and reachable (just not from the gateway)

- From a second VM on a different rented node: `curl http://[434:4c2f:45aa:ef20:ff0f:ccc6:4f60:ed58]:9997/` returns HTTP 302 (the app's expected redirect). The VM's service is up and reachable over mycelium from another node.
- From an external workstation: both ICMP ping and SSH to the VM's mycelium IP succeed.

So the VM, its service, and its mycelium IP are all healthy. The only path that fails is gateway-node-2 to this VM.
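The checks above can be repeated from any mycelium-connected vantage point with a sketch like the following. It assumes the overlay address falls in mycelium's `400::/7` range (our understanding of mycelium addressing, not something confirmed in this report), and the helper names are hypothetical.

```python
import ipaddress
import socket

# Assumption: mycelium overlay addresses are allocated from 400::/7.
MYCELIUM_RANGE = ipaddress.ip_network("400::/7")


def in_mycelium_range(addr):
    """Return True if addr is an IPv6 address inside the assumed overlay range."""
    return ipaddress.IPv6Address(addr) in MYCELIUM_RANGE


def backend_url(addr, port, path="/"):
    """Build the bracketed IPv6 backend URL; raises ValueError on a bad address."""
    ip = ipaddress.IPv6Address(addr)
    return f"http://[{ip}]:{port}{path}"


def tcp_reachable(addr, port, timeout=5.0):
    """Attempt one TCP connect to addr:port; True on success, False on any OS error."""
    try:
        with socket.create_connection((addr, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Running `tcp_reachable("434:4c2f:45aa:ef20:ff0f:ccc6:4f60:ed58", 9997)` on a second VM and, if access is possible, on the gateway node itself would show directly whether the failure is limited to the gateway's path.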

## Control case (works)

A different VM, `flowtest8.gent01.qa.grid.tf`, on the SAME compute node (node 5) and served by the SAME gateway node (node 2), was deployed about 2.5 hours earlier. It returns HTTP 302 publicly and has worked consistently; it came up publicly within roughly 9 minutes of deploy. The only material difference is age: the gateway's route to the older VM's overlay IP has converged, while its route to the affected VM's overlay IP has not, even after 80+ minutes.
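To compare convergence times between deployments more precisely than "roughly 9 minutes" versus "80+ minutes", the time until the first successful public response can be measured. This is a sketch only: `time_to_first_ok` is a hypothetical helper, the probe is passed in as a callable, and the measurement starts when the helper is called, not at deploy time.

```python
import time


def time_to_first_ok(probe, deadline_s, interval_s=30.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Call probe() repeatedly until it returns True.

    Returns the elapsed seconds at the first success, or None if deadline_s
    passes without one. clock and sleep are injectable so the loop can be
    exercised without waiting in real time.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= deadline_s:
            return None
        sleep(interval_s)
```

Started immediately after the gateway deploy returns, with a probe that checks for the app's 302 on the public hostname, this gives one number per deployment that can be compared across VMs.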

## What we ruled out

- DNS: resolves to the correct gateway IP.
- Name contract: a single clean `Created` contract (`85745`); no orphan or duplicate contracts.
- The VM / application: healthy and reachable from other nodes over mycelium (302 on `:9997`).
- The gateway workload: we deleted the name gateway and redeployed it (including with an explicit `fqdn`); still 502. So a stale or half-bound gateway workload is not the cause. This matches the expectation that the gateway proxies per request, so a redeploy does not change the route.

## Related observation (possibly separate)

`deploy_webgateway` returns `state=ready` but with an empty `fqdn` for this VM, across all deploy attempts for this name. For the working VM (`flowtest8`) the same call returned a populated `fqdn`. This may be a separate read-back quirk, but the correlation (the empty-fqdn deploys all map to the VM that does not route) is worth noting.

## Hypothesis

The mycelium route from the gateway node (node 2) to a freshly assigned VM overlay prefix (here `434:4c2f:45aa:ef20::/64` on node 5) is slow to converge or is stuck. The route appears to be per overlay prefix: the gateway node has a working route to an older VM's prefix on the same compute node, but not to the new VM's prefix. Redeploying the gateway does not trigger reconvergence.
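For reference, the `/64` prefix quoted above follows from the VM's overlay address by masking the low 64 bits. A minimal sketch with Python's `ipaddress` module; the helper names are illustrative, and the `/64` length is the one assumed in this hypothesis.

```python
import ipaddress


def overlay_prefix(addr, prefix_len=64):
    """Return the network containing addr at the given prefix length."""
    return ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)


def same_prefix(a, b, prefix_len=64):
    """True if both addresses fall in the same prefix (so would share a route)."""
    return overlay_prefix(a, prefix_len) == overlay_prefix(b, prefix_len)


print(overlay_prefix("434:4c2f:45aa:ef20:ff0f:ccc6:4f60:ed58"))
# prints 434:4c2f:45aa:ef20::/64
```

Applying `same_prefix` to the affected VM's address and the working VM's address would confirm whether the two VMs really sit in different prefixes, which is what the per-prefix explanation requires.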

## Impact

Onboarding a fresh VM intermittently produces a public URL that is unreachable for a long or indefinite time, while the VM itself is perfectly healthy. This makes fresh provisioning unreliable from an end-user perspective.

## Questions for the grid team

1. Is there a known issue with gateway-node to fresh-VM mycelium route convergence on QAnet, and can it be sped up or forced?
2. Is there any tenant-side action (a particular deploy order, a route re-announce, choosing a different gateway node) that helps convergence, or is this purely host-side (Zero-OS) and outside tenant control?
3. Is the empty-`fqdn` return from `deploy_webgateway` (`state=ready`, `fqdn` empty) a known or related issue?

## Reproduction

1. Deploy a VM on a QAnet node in the `gent01` zone and confirm it is reachable over mycelium from another node (curl its mycelium IP on the app port returns 302).
2. Deploy a name web-gateway for it on the zone's gateway node, backend `http://[vm_mycelium]:9997`.
3. Observe that the public URL returns 502 or times out for an extended period (80+ min here) while the VM stays reachable from other nodes.
4. Delete and redeploy the gateway: still 502.
5. Compare with an older VM on the same nodes whose public URL works.

