Add graceful Fly deploy helper #250 (Open)

nexiumbiz-debug wants to merge 2 commits into algora-io:main from nexiumbiz-debug:codex/graceful-fly-deployments


Conversation

@nexiumbiz-debug

/claim #78

Bounty issue: algora-io/tv#78. This targets algora-io/algora per the maintainer note that the graceful deploy work is still needed in the new app.

Summary

  • Adds Algora.DeploymentHealth and makes /health return 503 while a node is draining so Fly routes around old machines.
  • Adds Algora.Release.prepare_for_deploy/0 for release RPCs to mark a node unhealthy and pause local Oban queues.
  • Adds deploy.exs to warm replacement Fly machines with a supplied image, drain old machines, stop/destroy them, and restore the original process count.
  • Extends Fly/Phoenix shutdown windows and documents the deployment flow.
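For reviewers, here is a rough sketch of how the draining flag and the health endpoint described above could fit together. The module names Algora.DeploymentHealth and AlgoraWeb.HealthController come from this PR; the Agent-based internals and function names shown are illustrative assumptions, not the actual diff.

```elixir
# Sketch only: process-local health state flipped to :draining during deploys.
defmodule Algora.DeploymentHealth do
  use Agent

  def start_link(_opts), do: Agent.start_link(fn -> :healthy end, name: __MODULE__)

  # Called from the release RPC before the node is stopped.
  def mark_draining, do: Agent.update(__MODULE__, fn _ -> :draining end)

  def status, do: Agent.get(__MODULE__, & &1)
end

defmodule AlgoraWeb.HealthController do
  use AlgoraWeb, :controller

  # Fly's health checks see the 503 and route traffic away from this machine.
  def index(conn, _params) do
    case Algora.DeploymentHealth.status() do
      :healthy -> send_resp(conn, 200, "ok")
      :draining -> send_resp(conn, 503, "draining")
    end
  end
end
```

In this shape, Algora.Release.prepare_for_deploy/0 would call Algora.DeploymentHealth.mark_draining/0 and then pause the local Oban queues before the machine receives SIGTERM.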

Verification

  • Ran git diff --cached --check before commit.
  • Added AlgoraWeb.HealthControllerTest covering healthy and draining responses.
  • Not run locally: mix test or a live Fly demo, since this Windows environment does not have Elixir/Mix, Docker, flyctl, or Fly app credentials installed.

Demo plan for a Fly app with access

  1. Run fly deploy --build-only --push and copy the produced registry.fly.io/algora:deployment-xxxx image ref.
  2. Run mix run deploy.exs registry.fly.io/algora:deployment-xxxx.
  3. Confirm replacement machines pass Fly checks before old machines are marked draining.
  4. Confirm old machines return 503 from /health, pause local Oban queues, receive SIGTERM, and are destroyed after shutdown.

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@nagiexplorer88

deploy.exs drains the old machines before it has verified the replacements are healthy on the new image. The script only waits for the machine count before the update (deploy.exs:44-47, deploy.exs:107-115), then updates the replacement machines (deploy.exs:51-56) and immediately runs prepare_for_deploy on the old machines (deploy.exs:58-61). If the new image boots slowly or fails /health, this can mark the old healthy machines unhealthy and later stop them, which is exactly the livestream interruption this bounty is trying to avoid. Consider waiting for each replacement machine to pass the Fly /health check after machine update before preparing/stopping the old machines.

@nexiumbiz-debug
Author

Thanks, that is a good catch. I pushed 8c02889c to add a replacement-machine health gate before any old machines are marked draining.

After updating the replacement machines to the new image, the deploy helper now polls fly machine status <id> --json for each replacement and requires the machine to be started with passing Fly checks before it runs prepare_for_deploy on any old machine. If a replacement does not become healthy within the timeout, the script raises and leaves the old healthy machines alone.
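The health gate could look roughly like the following. This is a sketch of the approach, not the actual 8c02889c diff: the fly machine status <id> --json call is the one mentioned above, but the module name, helper names, and the assumed JSON fields ("state", "checks", "status") are illustrative, and Jason is assumed as the JSON dependency.

```elixir
# Sketch only: block until a replacement machine is started with passing checks.
defmodule Deploy.HealthGate do
  @poll_interval_ms 5_000

  def await_healthy!(machine_id, timeout_ms \\ 300_000) do
    deadline = System.monotonic_time(:millisecond) + timeout_ms
    poll(machine_id, deadline)
  end

  defp poll(machine_id, deadline) do
    # A non-zero flyctl exit raises a MatchError, which also aborts the deploy.
    {out, 0} = System.cmd("fly", ["machine", "status", machine_id, "--json"])

    if started_with_passing_checks?(Jason.decode!(out)) do
      :ok
    else
      if System.monotonic_time(:millisecond) > deadline do
        # Raising here leaves the old, still-healthy machines untouched.
        raise "machine #{machine_id} did not become healthy in time"
      end

      Process.sleep(@poll_interval_ms)
      poll(machine_id, deadline)
    end
  end

  defp started_with_passing_checks?(%{"state" => "started", "checks" => checks}) do
    Enum.all?(checks, &(&1["status"] == "passing"))
  end

  defp started_with_passing_checks?(_), do: false
end
```

The script would call Deploy.HealthGate.await_healthy!/1 for every replacement machine before any prepare_for_deploy RPC is issued against the old machines.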

I also documented that ordering in the README. I still cannot run a live Fly demo from this environment because it does not have fly, Elixir/Mix, or app credentials installed.
