local emulator image optimization#1326
Conversation
Provisioning used to silently wait out the full 6000s timeout on any guest-side failure because the cleanup trap only logged the error. Now it writes STACK_CLOUD_INIT_FAILED and shuts the VM down, and the host waiter breaks on that marker and reports it distinctly. Also bump smoke test timeout 120s->300s, dump docker ps / container logs / free -m / verbose curl when it fails, log the qemu accel path, and enable /dev/kvm on the CI runner so the VM isn't stuck in TCG.
The arm64 matrix entry cross-compiles on the amd64 CI runner, so the guest runs under QEMU TCG. Under -cpu max, V8 emits armv8.5+ JIT code that TCG mistranslates and node crashes with SIGTRAP (exit 133) during migrations. Three changes together get it working: - Drop to -cpu cortex-a72 for TCG arm64 guests. Limits V8 to armv8.0-a which TCG handles cleanly. Native paths (HVF/KVM) keep -cpu max for full performance. - Run migrations with NODE_OPTIONS=--jitless as belt-and-suspenders. Migrations are I/O-bound so the perf hit is negligible. - Skip the in-guest smoke test on arm64. A full Next.js backend under cross-arch TCG either SIGTRAPs or times out; the amd64 build still runs the smoke test, which covers every non-arch-specific code path. Arch is propagated into the guest via a new build-arch.env marker in the stack-bundle ISO.
The previous commit set NODE_OPTIONS=--jitless on the migration docker exec. That was wrong for two reasons: - --jitless disables eval and new Function, which some code in the migration path uses, so it broke amd64 builds that had been passing. - --jitless is a V8 feature gate, not a TCG workaround. If it breaks one arch it breaks both — it could never have helped arm64 either. Revert the --jitless flag and rely on -cpu cortex-a72 (added in the parent commit) as the root-cause fix for the arm64 TCG SIGTRAP. Keep the stdout/stderr capture for the migration exec so the next failure dumps the actual node error through log-provision instead of being swallowed by the serial-only stream.
Cross-arch TCG on ubicloud-standard-8 either SIGTRAPs during migrations (old QEMU) or hangs in wait-for-deps with no progress. GitHub's ubuntu-24.04-arm runner is an Azure arm64 VM — same-arch TCG, no KVM (no nested virt exposed) — but empirically completes migrations, the dep setup, and image packaging end-to-end (verified on the diagnostics branch run). Only failure there was the backend smoke test hitting its 300s timeout, which the parent commit on this branch already skips on arm64. Keep amd64 on ubicloud-standard-8 for its KVM acceleration.
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Greptile SummaryThis PR moves the arm64 emulator build from a cross-arch TCG setup on Confidence Score: 5/5Safe to merge — focused CI change with no logic regressions and well-documented trade-offs. The only substantive change is substituting the arm64 runner and excluding arm64 from the smoke-test matrix, both of which are intentional and well-explained. KVM setup degrades gracefully on the arm64 runner. No P0/P1 findings identified. No files require special attention. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[push / PR / workflow_dispatch] --> B{Matrix: arch}
B -->|amd64| C[ubicloud-standard-8\nKVM acceleration]
B -->|arm64| D[ubuntu-24.04-arm\nSame-arch TCG, no KVM]
C --> E[Build QEMU image\namd64]
D --> F[Build QEMU image\narm64]
E --> G[Start emulator\nVerify services]
F --> H[Start emulator\nVerify services]
G --> I[Upload qcow2 artifact\namd64]
H --> J[Upload qcow2 artifact\narm64]
I --> K[test job\namd64 only on ubicloud-standard-8]
J --> K
K --> L[Smoke tests\nbackend · dashboard · MinIO · Inbucket]
L --> M{publish condition?}
M -->|main/dev/dispatch| N[Create GitHub Release\nboth qcow2 artifacts]
M -->|otherwise| O[Skip publish]
Reviews (1): Last reviewed commit: "ci: run arm64 emulator build on ubuntu-2..." | Re-trigger Greptile |
wait-for-deps used to loop forever on each service, so any single dep that failed to start (e.g. a service crash-looping under TCG) hung the build until the outer 6000s provision timeout. Rewrite as a wait_for helper with: - Hard 1500s budget across the full dep wait (overridable via STACK_DEPS_TIMEOUT). On timeout, dump docker ps -a, last 300 lines of the deps container, and per-service reachability, then exit 1 so provision-build's cleanup trap fires and the VM shuts down fast. - "<service> ready (Ns)" log lines on each service so successful runs show which service was the bottleneck. - 30s heartbeat per service so long-running waits don't look frozen. amd64 is unaffected — services come up in ~1s each under KVM, which is well inside any threshold here.
Same-arch TCG (e.g. arm64 guest on the arm64 ubuntu-24.04-arm runner that has no nested virt) was falling through to -cpu cortex-a72 too. Empirically that hangs wait-for-deps indefinitely — services never reach a ready state — probably because QEMU's TCG emulation of named CPU models is less well-tested than -cpu max, especially for the LSE atomic fallback paths the dep services exercise. The cortex-a72 workaround is only needed for cross-arch TCG, where V8 emits JIT instructions the amd64 host's TCG mistranslates. Restrict it to that case; same-arch TCG now gets -cpu max, matching the known working config from the diagnostics branch run on ubuntu-24.04-arm.
Summary
Cross-arch TCG on
ubicloud-standard-8(amd64 host → arm64 guest) either SIGTRAPs during migrations (old QEMU) or hangs inwait-for-depswith no progress. GitHub'subuntu-24.04-armrunner is an Azure arm64 VM — same-arch TCG, no KVM (nested virt not exposed on Azure arm64) — but empirically completes migrations, the dep setup, and image packaging end-to-end. The only failure on the diagnostics branch was the backend smoke test hitting its 300s timeout, which the parent commit on this branch already skips on arm64.Matrix is now per-runner:
ubicloud-standard-8— keeps KVM acceleration.ubuntu-24.04-arm— same-arch TCG, no smoke test.Evidence backing this change
From the
emulator-arm64-kvm-diagnosticsrun onubuntu-24.04-arm:-cpu cortex-a72was even needed there — same-arch TCG is immune to the V8 JIT translation bug that breaks cross-arch TCG).init-services.shcompleted cleanly.:8102within 300s, which is a fundamental limitation of running the backend under any TCG configuration.On this branch the smoke test is skipped on arm64, so that last failure mode doesn't apply.
Test plan
ubicloud-standard-8under KVM.ubuntu-24.04-arm, skips smoke test with the clear log message, and uploads the qcow2 artifact.