QAnet: public name-gateway URL stays 502 for 80+ min; gateway node cannot reach a freshly deployed VM over mycelium (VM reachable from other nodes; gateway redeploy does not help) #458

@mik-tf

Summary

On QAnet, after deploying a VM plus a name web-gateway, the public gateway URL returns 502, with intermittent connection timeouts, continuously for over 80 minutes, even though the VM is fully up and reachable over mycelium from other nodes. The gateway node cannot establish a mycelium route to the freshly deployed VM's overlay IP. A different VM on the same compute node, served through the same gateway node and deployed earlier, works fine. Deleting and redeploying the gateway does not fix it.

Environment

  • Network: QAnet (qa.grid.tf), zone gent01.qa.grid.tf
  • Deploying twin: 703
  • Compute node: node 5 (node twin 12); the VM runs here
  • Gateway node: node 2 (twin 9), public IPv4 185.69.167.80; serves the gent01.qa.grid.tf zone
  • Affected VM mycelium IP: 434:4c2f:45aa:ef20:ff0f:ccc6:4f60:ed58
  • Public hostname: despiegk.gent01.qa.grid.tf
  • Name contract: 85745 (state Created, single, no orphans)
  • Workloads deployed via the grid (deploy_vm / deploy_webgateway)

The affected deployment is still live (VM and gateway up) and available for inspection.

Symptom

curl https://despiegk.gent01.qa.grid.tf/ returns 502 Bad Gateway, or intermittently times out with no response, continuously for 80+ minutes after deploy. DNS resolves correctly to the gateway (185.69.167.80).

Evidence the VM is healthy and reachable (just not from the gateway)

  • From a second VM on a different rented node: curl http://[434:4c2f:45aa:ef20:ff0f:ccc6:4f60:ed58]:9997/ returns HTTP 302 (the app's expected redirect). The VM's service is up and reachable over mycelium from another node.
  • From an external workstation: both ICMP ping and SSH to the VM's mycelium IP succeed.

So the VM, its service, and its mycelium IP are all healthy. The only path that fails is gateway-node-2 to this VM.
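The comparison above can be scripted. This is a minimal sketch, not part of the grid tooling: probe_status and localize are hypothetical helper names, and treating any status below 500 as "reachable" is an assumption that fits this app because its expected answer is a 302. http.client is used because it does not follow redirects, so the 302 is reported as such.

```python
import http.client
from typing import Optional


def probe_status(host: str, port: int, use_tls: bool = False,
                 path: str = "/", timeout: float = 10.0) -> Optional[int]:
    """Return the HTTP status code, or None on timeout or connection failure."""
    conn_cls = http.client.HTTPSConnection if use_tls else http.client.HTTPConnection
    conn = conn_cls(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        return conn.getresponse().status
    except (OSError, http.client.HTTPException):
        # Covers timeouts, refused connections, TLS errors and malformed replies.
        return None
    finally:
        conn.close()


def localize(public: Optional[int], direct: Optional[int]) -> str:
    """Name the failing path from one public probe and one direct probe.

    public: status seen through the gateway URL.
    direct: status seen when hitting the VM's mycelium IP from another node.
    Assumption: any status below 500 counts as reachable.
    """
    backend_ok = direct is not None and direct < 500
    public_ok = public is not None and public < 500
    if backend_ok and not public_ok:
        return "gateway-to-backend path"
    if not backend_ok:
        return "backend or vantage-point path"
    return "no failure observed"
```

For this deployment the public probe would be probe_status("despiegk.gent01.qa.grid.tf", 443, use_tls=True) from anywhere, and the direct probe probe_status("434:4c2f:45aa:ef20:ff0f:ccc6:4f60:ed58", 9997) from a host that has mycelium. A 502 or timeout on the first with a 302 on the second gives "gateway-to-backend path", which is the state reported here.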

Control case (works)

A different VM, flowtest8.gent01.qa.grid.tf, running on the SAME compute node (node 5) and served through the SAME gateway node (node 2), was deployed about 2.5 hours earlier. It returns HTTP 302 publicly and has worked consistently, and it came up publicly within roughly 9 minutes of deploy. The only material difference is age: the gateway node's route to the older VM's overlay IP converged, while its route to the affected VM's overlay IP has not converged after 80+ minutes.
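To put a number on "time to come up publicly" for both VMs, a probe log can be reduced to the first minute at which the public URL answered. A minimal sketch: first_success_minute is a hypothetical helper, samples are (minutes since deploy, status) pairs with None for a timeout, and counting any status below 500 as success is an assumption.

```python
from typing import Iterable, Optional, Tuple


def first_success_minute(
    samples: Iterable[Tuple[float, Optional[int]]]
) -> Optional[float]:
    """Return the earliest minute with a non-5xx answer, or None if there is none."""
    for minute, status in sorted(samples, key=lambda s: s[0]):
        # None means the probe timed out; 5xx means the gateway could not proxy.
        if status is not None and status < 500:
            return minute
    return None
```

A result of None over a log that spans 80+ minutes corresponds to the affected VM; a small number corresponds to the control case.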

What we ruled out

  • DNS: resolves to the correct gateway IP.
  • Name contract: a single clean Created contract (85745); no orphan or duplicate contracts.
  • The VM / application: healthy and reachable from other nodes over mycelium (302 on :9997).
  • The gateway workload: we deleted the name gateway and redeployed it (including with an explicit fqdn); still 502. So a stale or half-bound gateway workload is not the cause. This matches the expectation that the gateway proxies per request, so a redeploy does not change the route.

Related observation (possibly separate)

deploy_webgateway returns state=ready but with an empty fqdn for this VM, across all deploy attempts for this name. For the working VM (flowtest8) the same call returned a populated fqdn. This may be a separate read-back quirk, but the correlation (the empty-fqdn deploys all map to the VM that does not route) is worth noting.
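A deploy script could flag this condition instead of trusting state=ready alone. This is a sketch only: the exact return shape of deploy_webgateway is not shown in this issue, so a mapping with "state" and "fqdn" keys is an assumption, and gateway_result_problems is a hypothetical helper.

```python
from typing import Any, List, Mapping


def gateway_result_problems(result: Mapping[str, Any]) -> List[str]:
    """List what looks wrong in a gateway deploy result (assumed shape)."""
    problems: List[str] = []
    state = result.get("state")
    if state != "ready":
        problems.append(f"state is {state!r}, expected 'ready'")
    # Treat a missing, None, or whitespace-only fqdn as empty.
    if not str(result.get("fqdn") or "").strip():
        problems.append("fqdn is empty")
    return problems
```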

Hypothesis

The mycelium route from the gateway node (node 2) to a freshly assigned VM overlay prefix (here 434:4c2f:45aa:ef20::/64 on node 5) is slow to converge or gets stuck. The route appears to be per overlay address: the gateway node has a working route to an older VM's prefix on the same compute node, but not to the new VM's prefix. Redeploying the gateway does not trigger reconvergence.
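The prefix arithmetic behind this hypothesis can be checked with Python's ipaddress module. A small sketch: same_prefix is a hypothetical helper, the /64 boundary comes from the prefix quoted above, and the 400::/7 overlay range is stated here as an assumption about mycelium's address allocation.

```python
import ipaddress

VM_IP = ipaddress.ip_address("434:4c2f:45aa:ef20:ff0f:ccc6:4f60:ed58")
VM_PREFIX = ipaddress.ip_network("434:4c2f:45aa:ef20::/64")
# Assumption: mycelium overlay addresses are allocated from 400::/7.
MYCELIUM_RANGE = ipaddress.ip_network("400::/7")


def same_prefix(a, b, prefix_len: int = 64) -> bool:
    """True if addresses a and b fall inside the same prefix of the given length."""
    net = ipaddress.ip_network(f"{a}/{prefix_len}", strict=False)
    return ipaddress.ip_address(b) in net


print(VM_IP in VM_PREFIX)       # True
print(VM_IP in MYCELIUM_RANGE)  # True
```

Calling same_prefix(VM_IP, older_vm_ip) with the working VM's overlay address would show whether the two VMs on node 5 sit in different /64s, which is what a per-prefix route would require.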

Impact

Onboarding a fresh VM intermittently produces a public URL that is unreachable for a long or indefinite time, while the VM itself is perfectly healthy. This makes fresh provisioning unreliable from an end-user perspective.

Questions for the grid team

  1. Is there a known issue with gateway-node to fresh-VM mycelium route convergence on QAnet, and can it be sped up or forced?
  2. Is there any tenant-side action (a particular deploy order, a route re-announce, choosing a different gateway node) that helps convergence, or is this purely host-side (Zero-OS) and outside tenant control?
  3. Is the empty-fqdn return from deploy_webgateway (state=ready, fqdn empty) a known or related issue?

Reproduction

  1. Deploy a VM on a QAnet node in the gent01 zone and confirm it is reachable over mycelium from another node (curl its mycelium IP on the app port returns 302).
  2. Deploy a name web-gateway for it on the zone's gateway node, backend http://[vm_mycelium]:9997.
  3. Observe the public URL returns 502 / times out for an extended period (80+ min here) while the VM stays reachable from other nodes.
  4. Delete and redeploy the gateway: still 502.
  5. Compare with an older VM on the same nodes whose public URL works.
