local emulator image optimization#1326

Closed
BilalG1 wants to merge 9 commits into dev from emulator-arm64-ubuntu-runner

Conversation

@BilalG1
Collaborator

@BilalG1 BilalG1 commented Apr 10, 2026

Summary

Cross-arch TCG on ubicloud-standard-8 (amd64 host → arm64 guest) either SIGTRAPs during migrations (old QEMU) or hangs in wait-for-deps with no progress. GitHub's ubuntu-24.04-arm runner is an Azure arm64 VM — same-arch TCG, no KVM (nested virt not exposed on Azure arm64) — but empirically completes migrations, the dep setup, and image packaging end-to-end. The only failure on the diagnostics branch was the backend smoke test hitting its 300s timeout, which the parent commit on this branch already skips on arm64.

Matrix is now per-runner:

  • amd64: ubicloud-standard-8 — keeps KVM acceleration.
  • arm64: ubuntu-24.04-arm — same-arch TCG, no smoke test.

Evidence backing this change

From the emulator-arm64-kvm-diagnostics run on ubuntu-24.04-arm:

  • Migrations passed without SIGTRAP (no -cpu cortex-a72 was even needed there — same-arch TCG is immune to the V8 JIT translation bug that breaks cross-arch TCG).
  • Postgres, Redis, ClickHouse, MinIO, Svix all came up.
  • init-services.sh completed cleanly.
  • The slim image built successfully.
  • Only the smoke test failed — and that was because Next.js never bound to :8102 within 300s, which is a fundamental limitation of running the backend under any TCG configuration.

On this branch the smoke test is skipped on arm64, so that last failure mode doesn't apply.

Test plan

  • amd64 build still passes unchanged on ubicloud-standard-8 under KVM.
  • arm64 build completes on ubuntu-24.04-arm, skips smoke test with the clear log message, and uploads the qcow2 artifact.
  • Total runtime for arm64 is roughly 60-75 minutes (TCG is slow, but this run has no backend startup step).

BilalG1 added 7 commits April 9, 2026 14:21
Provisioning used to silently wait out the full 6000s timeout on any
guest-side failure because the cleanup trap only logged the error. Now
it writes STACK_CLOUD_INIT_FAILED and shuts the VM down, and the host
waiter breaks on that marker and reports it distinctly.
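
The host side of this marker protocol can be sketched as follows. This is a minimal sketch, not the project's actual waiter: it assumes the guest's marker ends up in a log file the host can read (for example a serial console log), and the success marker `STACK_CLOUD_INIT_DONE`, the function name, and the return codes are hypothetical. Only `STACK_CLOUD_INIT_FAILED` comes from the commit message.

```shell
#!/bin/sh
# Sketch only. STACK_CLOUD_INIT_FAILED is named in the commit message;
# STACK_CLOUD_INIT_DONE and the function name are hypothetical.

# wait_for_provision LOG TIMEOUT_SECONDS [POLL_SECONDS]
# Returns 0 on the success marker, 2 on the failure marker, 1 on timeout.
wait_for_provision() {
  wfp_log=$1
  wfp_timeout=$2
  wfp_poll=${3:-5}
  wfp_elapsed=0
  while [ "$wfp_elapsed" -lt "$wfp_timeout" ]; do
    # Check the failure marker first so a failed guest is reported
    # distinctly instead of being mistaken for a timeout.
    if grep -q 'STACK_CLOUD_INIT_FAILED' "$wfp_log" 2>/dev/null; then
      echo "provisioning failed inside the guest (seen after ${wfp_elapsed}s)" >&2
      return 2
    fi
    if grep -q 'STACK_CLOUD_INIT_DONE' "$wfp_log" 2>/dev/null; then
      echo "provisioning finished after ${wfp_elapsed}s"
      return 0
    fi
    sleep "$wfp_poll"
    wfp_elapsed=$((wfp_elapsed + wfp_poll))
  done
  echo "provisioning timed out after ${wfp_timeout}s" >&2
  return 1
}
```

Returning a different code for the failure marker than for the timeout lets the caller report the two cases separately, which is the behaviour the commit describes.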

Also bump smoke test timeout 120s->300s, dump docker ps / container
logs / free -m / verbose curl when it fails, log the qemu accel path,
and enable /dev/kvm on the CI runner so the VM isn't stuck in TCG.

The arm64 matrix entry cross-compiles on the amd64 CI runner, so the
guest runs under QEMU TCG. Under -cpu max, V8 emits armv8.5+ JIT code
that TCG mistranslates and node crashes with SIGTRAP (exit 133)
during migrations. Three changes together get it working:

- Drop to -cpu cortex-a72 for TCG arm64 guests. Limits V8 to
  armv8.0-a which TCG handles cleanly. Native paths (HVF/KVM) keep
  -cpu max for full performance.
- Run migrations with NODE_OPTIONS=--jitless as belt-and-suspenders.
  Migrations are I/O-bound so the perf hit is negligible.
- Skip the in-guest smoke test on arm64. A full Next.js backend under
  cross-arch TCG either SIGTRAPs or times out; the amd64 build still
  runs the smoke test, which covers every non-arch-specific code
  path. Arch is propagated into the guest via a new build-arch.env
  marker in the stack-bundle ISO.
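
The arch-based skip in the last bullet could look roughly like this inside the guest. It is a hedged sketch: the commit only says that a `build-arch.env` marker carries the arch into the guest, so the `BUILD_ARCH` key, the function name, and the fallback of running the test when the file is missing are all assumptions.

```shell
#!/bin/sh
# Sketch only. The BUILD_ARCH key and this function are hypothetical;
# the source names the build-arch.env file but not its contents.

# should_run_smoke_test ENV_FILE
# Returns 0 when the smoke test should run, 1 when it should be skipped.
should_run_smoke_test() {
  sst_env=$1
  sst_arch=""
  if [ -r "$sst_env" ]; then
    # Read only the BUILD_ARCH=... line rather than sourcing the file.
    sst_arch=$(sed -n 's/^BUILD_ARCH=//p' "$sst_env" | head -n 1)
  fi
  case "$sst_arch" in
    arm64|aarch64)
      echo "skipping smoke test: arm64 image is built under TCG" >&2
      return 1
      ;;
    *)
      # Unknown or missing arch: keep the stricter behaviour and run it.
      return 0
      ;;
  esac
}
```

A caller would guard the smoke test with `if should_run_smoke_test /path/to/build-arch.env; then ...; fi`, where the path depends on where the stack-bundle ISO is mounted.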
The previous commit set NODE_OPTIONS=--jitless on the migration
docker exec. That was wrong for two reasons:
- --jitless disables eval and new Function, which some code in the
  migration path uses, so it broke amd64 builds that had been passing.
- --jitless is a V8 feature gate, not a TCG workaround. If it breaks
  one arch it breaks both — it could never have helped arm64 either.

Revert the --jitless flag and rely on -cpu cortex-a72 (added in the
parent commit) as the root-cause fix for the arm64 TCG SIGTRAP.

Keep the stdout/stderr capture for the migration exec so the next
failure dumps the actual node error through log-provision instead of
being swallowed by the serial-only stream.
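
One way to capture a command's output and surface it only on failure is sketched below. The `log_provision` function here is a hypothetical stand-in for the project's log-provision helper, and `run_logged` is an illustrative name, not something taken from the repository.

```shell
#!/bin/sh
# Sketch only. log_provision stands in for the project's log-provision
# helper, whose real interface is not shown in the source.
log_provision() {
  while IFS= read -r lp_line || [ -n "$lp_line" ]; do
    printf '[provision] %s\n' "$lp_line"
  done
}

# run_logged LABEL COMMAND [ARGS...]
# Runs the command with stdout and stderr captured together, replays the
# capture through log_provision if it fails, and preserves the exit code.
run_logged() {
  rl_label=$1
  shift
  rl_out=$(mktemp)
  rl_rc=0
  "$@" >"$rl_out" 2>&1 || rl_rc=$?
  if [ "$rl_rc" -ne 0 ]; then
    echo "$rl_label failed with exit code $rl_rc" | log_provision
    log_provision <"$rl_out"
  fi
  rm -f "$rl_out"
  return "$rl_rc"
}
```

Wrapping the migration step as `run_logged migrations docker exec ...` would then put the node error next to the exit code (for example 133 for SIGTRAP) in the provisioning log.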
Cross-arch TCG on ubicloud-standard-8 either SIGTRAPs during migrations
(old QEMU) or hangs in wait-for-deps with no progress. GitHub's
ubuntu-24.04-arm runner is an Azure arm64 VM — same-arch TCG, no KVM
(no nested virt exposed) — but empirically completes migrations, the
dep setup, and image packaging end-to-end (verified on the diagnostics
branch run). Only failure there was the backend smoke test hitting its
300s timeout, which the parent commit on this branch already skips on
arm64.

Keep amd64 on ubicloud-standard-8 for its KVM acceleration.

@vercel

vercel Bot commented Apr 10, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| stack-auth-hosted-components | Error | | Apr 11, 2026 0:09am |
| stack-backend | Ready | Preview, Comment | Apr 11, 2026 0:09am |
| stack-dashboard | Ready | Preview, Comment | Apr 11, 2026 0:09am |
| stack-demo | Ready | Preview, Comment | Apr 11, 2026 0:09am |
| stack-docs | Ready | Preview, Comment | Apr 11, 2026 0:09am |
| stack-preview-backend | Ready | Preview, Comment | Apr 11, 2026 0:09am |
| stack-preview-dashboard | Ready | Preview, Comment | Apr 11, 2026 0:09am |

@coderabbitai
Contributor

coderabbitai Bot commented Apr 10, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b7891809-8755-4403-b9fa-9287f04e8447


@greptile-apps
Contributor

greptile-apps Bot commented Apr 10, 2026

Greptile Summary

This PR moves the arm64 emulator build from a cross-arch TCG setup on ubicloud-standard-8 (amd64 host) to ubuntu-24.04-arm, a native arm64 runner that avoids V8 JIT translation crashes while still using software emulation (no KVM on Azure arm64). The test matrix is kept amd64-only since the backend cannot start within a reasonable window under TCG.

Confidence Score: 5/5

Safe to merge — focused CI change with no logic regressions and well-documented trade-offs.

The only substantive change is substituting the arm64 runner and excluding arm64 from the smoke-test matrix, both of which are intentional and well-explained. KVM setup degrades gracefully on the arm64 runner. No P0/P1 findings identified.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| .github/workflows/qemu-emulator-build.yaml | Adds ubuntu-24.04-arm as the runner for arm64 builds; keeps amd64 on ubicloud-standard-8 with KVM; arm64 is excluded from the smoke-test matrix. KVM-setup step degrades gracefully on the arm64 runner. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[push / PR / workflow_dispatch] --> B{Matrix: arch}

    B -->|amd64| C[ubicloud-standard-8\nKVM acceleration]
    B -->|arm64| D[ubuntu-24.04-arm\nSame-arch TCG, no KVM]

    C --> E[Build QEMU image\namd64]
    D --> F[Build QEMU image\narm64]

    E --> G[Start emulator\nVerify services]
    F --> H[Start emulator\nVerify services]

    G --> I[Upload qcow2 artifact\namd64]
    H --> J[Upload qcow2 artifact\narm64]

    I --> K[test job\namd64 only on ubicloud-standard-8]
    J --> K

    K --> L[Smoke tests\nbackend · dashboard · MinIO · Inbucket]

    L --> M{publish condition?}
    M -->|main/dev/dispatch| N[Create GitHub Release\nboth qcow2 artifacts]
    M -->|otherwise| O[Skip publish]

Reviews (1): Last reviewed commit: "ci: run arm64 emulator build on ubuntu-2..."

BilalG1 added 2 commits April 10, 2026 17:01
wait-for-deps used to loop forever on each service, so any single
dep that failed to start (e.g. a service crash-looping under TCG)
hung the build until the outer 6000s provision timeout.

Rewrite as a wait_for helper with:
- Hard 1500s budget across the full dep wait (overridable via
  STACK_DEPS_TIMEOUT). On timeout, dump docker ps -a, last 300 lines
  of the deps container, and per-service reachability, then exit 1
  so provision-build's cleanup trap fires and the VM shuts down fast.
- "<service> ready (Ns)" log lines on each service so successful
  runs show which service was the bottleneck.
- 30s heartbeat per service so long-running waits don't look frozen.
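
A helper with these three properties might be sketched like this. It is an approximation under stated assumptions, not the script from the PR: `STACK_DEPS_TIMEOUT`, the 1500s default, the 30s heartbeat, and the "ready (Ns)" line come from the commit message, while `HEARTBEAT_SECONDS`, `POLL_SECONDS`, `dump_diagnostics`, and the check-command calling convention are hypothetical. The sketch returns 1 instead of exiting so the caller decides when to `exit 1`.

```shell
#!/bin/sh
# Sketch only. One budget is shared by every service, measured from the
# moment this file is loaded; override it with STACK_DEPS_TIMEOUT.
DEPS_TIMEOUT=${STACK_DEPS_TIMEOUT:-1500}
DEPS_START=$(date +%s)
HEARTBEAT_SECONDS=${HEARTBEAT_SECONDS:-30}  # hypothetical knob
POLL_SECONDS=${POLL_SECONDS:-1}             # hypothetical knob

# Hypothetical diagnostics hook. The commit also dumps the last 300 log
# lines of the deps container and per-service reachability; the container
# name is not given in the source, so that part is omitted here.
dump_diagnostics() {
  if command -v docker >/dev/null 2>&1; then
    docker ps -a || true
  fi
}

# wait_for NAME CHECK_COMMAND [ARGS...]
# Polls CHECK_COMMAND until it succeeds or the shared budget runs out.
wait_for() {
  wf_name=$1
  shift
  wf_start=$(date +%s)
  wf_last_beat=$wf_start
  while ! "$@" >/dev/null 2>&1; do
    wf_now=$(date +%s)
    if [ $(($wf_now - $DEPS_START)) -ge "$DEPS_TIMEOUT" ]; then
      echo "timed out waiting for $wf_name ($(($wf_now - $DEPS_START))s total)" >&2
      dump_diagnostics >&2
      return 1
    fi
    if [ $(($wf_now - $wf_last_beat)) -ge "$HEARTBEAT_SECONDS" ]; then
      echo "still waiting for $wf_name ($(($wf_now - $wf_start))s)"
      wf_last_beat=$wf_now
    fi
    sleep "$POLL_SECONDS"
  done
  wf_now=$(date +%s)
  echo "$wf_name ready ($(($wf_now - $wf_start))s)"
}
```

A call such as `wait_for postgres some_check_command || exit 1` (the check command being whatever probe the real script uses) keeps the failure path short: the non-zero exit is what lets provision-build's cleanup trap fire.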

amd64 is unaffected — services come up in ~1s each under KVM, which
is well inside any threshold here.

Same-arch TCG (e.g. arm64 guest on the arm64 ubuntu-24.04-arm runner
that has no nested virt) was falling through to -cpu cortex-a72 too.
Empirically that hangs wait-for-deps indefinitely — services never
reach a ready state — probably because QEMU's TCG emulation of named
CPU models is less well-tested than -cpu max, especially for the LSE
atomic fallback paths the dep services exercise.

The cortex-a72 workaround is only needed for cross-arch TCG, where V8
emits JIT instructions the amd64 host's TCG mistranslates. Restrict
it to that case; same-arch TCG now gets -cpu max, matching the known
working config from the diagnostics branch run on ubuntu-24.04-arm.
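
The resulting CPU-model rule can be summarised in a short sketch. The function names and the arch normalisation are hypothetical, and the accelerator probe is a simplification (a readable and writable /dev/kvm means KVM, macOS means HVF); only the mapping itself follows the commit messages: cross-arch TCG with an arm64 guest gets `cortex-a72`, everything else gets `max`.

```shell
#!/bin/sh
# Sketch only. Function names are hypothetical; the cortex-a72 / max
# mapping follows the commit messages.

normalize_arch() {
  case "$1" in
    aarch64|arm64) echo "arm64" ;;
    x86_64|amd64)  echo "amd64" ;;
    *)             echo "$1" ;;
  esac
}

# detect_accel GUEST_ARCH HOST_ARCH
# Simplified probe: hardware acceleration needs matching architectures.
detect_accel() {
  da_guest=$(normalize_arch "$1")
  da_host=$(normalize_arch "$2")
  if [ "$da_guest" != "$da_host" ]; then
    echo "tcg"
  elif [ -r /dev/kvm ] && [ -w /dev/kvm ]; then
    echo "kvm"
  elif [ "$(uname -s)" = "Darwin" ]; then
    echo "hvf"
  else
    echo "tcg"
  fi
}

# select_cpu_model GUEST_ARCH HOST_ARCH ACCEL
# cortex-a72 only for an arm64 guest under cross-arch TCG; max otherwise.
select_cpu_model() {
  scm_guest=$(normalize_arch "$1")
  scm_host=$(normalize_arch "$2")
  scm_accel=$3
  if [ "$scm_accel" = "tcg" ] && [ "$scm_guest" = "arm64" ] \
      && [ "$scm_host" != "arm64" ]; then
    echo "cortex-a72"
  else
    echo "max"
  fi
}
```

For example, `select_cpu_model arm64 "$(uname -m)" "$(detect_accel arm64 "$(uname -m)")"` yields the value to pass to QEMU's `-cpu` option. The sketch leaves other cross-arch combinations (such as an amd64 guest on an arm64 host) at `max`, which the source does not discuss.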
@BilalG1 BilalG1 changed the base branch from emulator-tcg-arm64-fixes to dev April 13, 2026 16:36
@BilalG1 BilalG1 changed the title from "ci: run arm64 emulator build on ubuntu-24.04-arm (same-arch TCG)" to "local emulator image optimization" Apr 13, 2026
@BilalG1 BilalG1 closed this Apr 13, 2026