
emulator: bounded dep wait with per-service diagnostics #1325

Closed

BilalG1 wants to merge 1 commit into emulator-tcg-arm64-fixes from emulator-bounded-dep-wait

emulator: bounded dep wait with per-service diagnostics#1325
BilalG1 wants to merge 1 commit into
emulator-tcg-arm64-fixesfrom
emulator-bounded-dep-wait

Conversation

BilalG1 (Collaborator) commented Apr 10, 2026

Summary

The arm64 CI build hung in wait-for-deps for 20+ minutes with no way to tell which dep was stuck, because the original script used unbounded until … do sleep 1; done loops. Eventually it would either burn the full 6000s outer timeout or block forever on a crash-looping service.

Rewrite the wait as a wait_for helper with:

  • Hard 1500s budget across the full dep wait (overridable via STACK_DEPS_TIMEOUT). On timeout we dump docker ps -a, the last 300 lines of the deps container's logs, and per-service reachability probes, then exit 1 so provision-build's cleanup trap fires and the VM shuts down fast instead of idling until the outer 6000s timeout.
  • "<service> ready (Ns)" log lines for each service, so successful runs show which one was the bottleneck (useful both for amd64 baselines and for arm64 diagnosis).
  • A 30s heartbeat per service, so long-running waits don't look frozen to the host-side waiter.

amd64 is unaffected — services come up in ~1s each under KVM, well inside any threshold here. The change is scoped to wait-for-deps only; no workflow changes, no docker image changes.
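Roughly, the helper has the shape sketched below. This is an illustrative sketch only, not the exact script that lands in the cloud-init user-data; the probe commands and the dump_diagnostics body are simplified stand-ins.

```sh
# Illustrative sketch of the wait_for helper described above; not the exact
# script in docker/local-emulator/qemu/cloud-init/emulator/user-data.
DEPS_TIMEOUT="${STACK_DEPS_TIMEOUT:-1500}"   # hard budget across the whole dep wait
start=$SECONDS

dump_diagnostics() {
  # Simplified stand-in; the real dump also captures the deps container
  # logs and per-service reachability probes.
  docker ps -a || true
}

wait_for() {
  local name="$1" probe="$2"
  local svc_start=$SECONDS
  local next_heartbeat=$((svc_start + 30))
  while true; do
    if eval "$probe"; then
      echo "$name ready ($((SECONDS - svc_start))s)"
      return 0
    fi
    if [ "$SECONDS" -ge "$next_heartbeat" ]; then
      echo "still waiting for $name ($((SECONDS - svc_start))s)..."
      next_heartbeat=$((next_heartbeat + 30))
    fi
    if [ $((SECONDS - start)) -ge "$DEPS_TIMEOUT" ]; then
      echo "TIMEOUT after ${DEPS_TIMEOUT}s while waiting for $name"
      dump_diagnostics
      exit 1
    fi
    sleep 2
  done
}

wait_for "postgres" 'nc -z 127.0.0.1 5432'
# clickhouse, svix, minio, and qstash follow the same pattern (see the
# probe strings in the review comment below).
```

With the budget read as `${STACK_DEPS_TIMEOUT:-1500}`, a slower run can raise it by exporting, e.g., STACK_DEPS_TIMEOUT=3000 before the dep wait starts.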

Stacked on top of emulator-tcg-arm64-fixes so it inherits the cortex-a72 + migration-output-capture + smoke-test-skip work.

Test plan

  • amd64 build still passes, and the new "postgres/clickhouse/svix/minio/qstash ready (Ns)" lines appear in the provision log.
  • arm64 build either (a) completes with per-service timings showing which service was the TCG bottleneck, or (b) times out cleanly at ~1500s with docker ps + deps container logs dumped, so we can diagnose the stuck service from actual data instead of guessing.

wait-for-deps used to loop forever on each service, so any single
dep that failed to start (e.g. a service crash-looping under TCG)
hung the build until the outer 6000s provision timeout.

Rewrite as a wait_for helper with:
- Hard 1500s budget across the full dep wait (overridable via
  STACK_DEPS_TIMEOUT). On timeout, dump docker ps -a, last 300 lines
  of the deps container, and per-service reachability, then exit 1
  so provision-build's cleanup trap fires and the VM shuts down fast.
- "<service> ready (Ns)" log lines on each service so successful
  runs show which service was the bottleneck.
- 30s heartbeat per service so long-running waits don't look frozen.

amd64 is unaffected — services come up in ~1s each under KVM, which
is well inside any threshold here.

vercel Bot commented Apr 10, 2026

The latest updates on your projects.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| stack-auth-hosted-components | Building | Preview | Apr 10, 2026 7:22pm |
| stack-backend | Ready | Preview, Comment | Apr 10, 2026 7:22pm |
| stack-dashboard | Ready | Preview, Comment | Apr 10, 2026 7:22pm |
| stack-demo | Ready | Preview, Comment | Apr 10, 2026 7:22pm |
| stack-docs | Ready | Preview, Comment | Apr 10, 2026 7:22pm |
| stack-preview-backend | Ready | Preview, Comment | Apr 10, 2026 7:22pm |
| stack-preview-dashboard | Ready | Preview, Comment | Apr 10, 2026 7:22pm |


coderabbitai Bot commented Apr 10, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 557919b0-2f92-48c3-a77d-fa16c91f3183



greptile-apps Bot commented Apr 10, 2026

Greptile Summary

This PR rewrites the wait-for-deps script in the emulator cloud-init user-data to replace unbounded until … sleep 1 loops with a wait_for helper that enforces a hard 1500 s global timeout (STACK_DEPS_TIMEOUT), emits per-service "ready (Ns)" log lines, fires a 30 s heartbeat per service, and dumps docker ps, container logs, and per-service reachability probes on timeout.

  • P1 – probe curls block indefinitely: wait_for probes for clickhouse, svix, minio, and qstash call curl without --max-time. If any service has its port open but stalls mid-HTTP-handshake, the eval "$probe" call hangs forever inside the while true loop, silently preventing both the heartbeat and the DEPS_TIMEOUT check from ever firing — the exact failure mode this PR was written to fix. dump_diagnostics already uses --max-time 3; the same guard should be added to the probes (e.g. --max-time 5).

Confidence Score: 4/5

Safe to merge after adding --max-time to the four curl probe strings; the bug would only manifest if a service holds a TCP port open while hanging on the HTTP layer, but that scenario is precisely what this PR is guarding against.

One P1 finding: the wait_for probe curls lack --max-time, which allows a partially-started service to block the loop indefinitely and defeat the bounded-wait guarantee. The fix is a one-liner per probe and is clearly correct. All other logic (global timer, per-service timing, heartbeat, diagnostic dump, qstash 401 probe) is correct.

docker/local-emulator/qemu/cloud-init/emulator/user-data — specifically the four curl probe strings on lines 209–212

Important Files Changed

| Filename | Overview |
| --- | --- |
| docker/local-emulator/qemu/cloud-init/emulator/user-data | Rewrites wait-for-deps with a 1500 s global timeout, per-service heartbeats, and diagnostic dumps on timeout; one P1 bug: probe curl calls lack --max-time, so a service that has a port open but hangs on HTTP blocks the timeout mechanism indefinitely. |

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A([wait-for-deps starts]) --> B[start = SECONDS\ntimeout = DEPS_TIMEOUT 1500s]
    B --> C[wait_for postgres]
    C --> D[wait_for clickhouse]
    D --> E[wait_for svix]
    E --> F[wait_for minio]
    F --> G[wait_for qstash]
    G --> H([log 'all deps ready', exit 0])

    subgraph wait_for [wait_for name probe]
        direction TB
        W1[svc_start = SECONDS\nnext_heartbeat = svc_start + 30] --> W2{eval probe\nsucceeds?}
        W2 -- yes --> W3([log 'name ready Ns'\nreturn 0])
        W2 -- no --> W4{SECONDS >= next_heartbeat?}
        W4 -- yes --> W5[log 'still waiting...'\nnext_heartbeat += 30]
        W5 --> W6
        W4 -- no --> W6{SECONDS - start >= DEPS_TIMEOUT?}
        W6 -- no --> W7[sleep 2] --> W2
        W6 -- yes --> W8[log TIMEOUT\ndump_diagnostics\nexit 1]
    end

    subgraph dump_diagnostics [dump_diagnostics]
        direction TB
        D1[docker ps -a] --> D2[docker logs --tail 300 deps-container]
        D2 --> D3[nc -z postgres:5432\ncurl --max-time 3 clickhouse/ping\ncurl --max-time 3 svix/health\ncurl --max-time 3 minio/health\ncurl --max-time 3 qstash 401?]
    end

    W8 --> dump_diagnostics
```
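For reference, a minimal sketch of what the dump_diagnostics step in the flowchart could look like; the deps container name ("stack-deps") is a placeholder, and the ports and endpoints are taken from the probe strings in this PR.

```sh
# Minimal sketch of dump_diagnostics; "stack-deps" is a placeholder for
# whatever deps container name the user-data script actually uses.
dump_diagnostics() {
  echo "== docker ps -a =="
  docker ps -a || true

  echo "== last 300 lines of deps container logs =="
  docker logs --tail 300 stack-deps 2>&1 || true

  echo "== per-service reachability =="
  nc -z 127.0.0.1 5432 && echo "postgres: port open" || echo "postgres: unreachable"
  curl -sf --max-time 3 http://127.0.0.1:8123/ping >/dev/null \
    && echo "clickhouse: ok" || echo "clickhouse: unreachable"
  curl -sf --max-time 3 http://127.0.0.1:8071/api/v1/health/ >/dev/null \
    && echo "svix: ok" || echo "svix: unreachable"
  curl -sf --max-time 3 http://127.0.0.1:9090/minio/health/live >/dev/null \
    && echo "minio: ok" || echo "minio: unreachable"
  code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 3 http://127.0.0.1:8080/ || true)"
  echo "qstash: HTTP $code (401 expected once ready)"
}
```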

Comment on lines +209 to +212
wait_for "clickhouse" 'curl -sf http://127.0.0.1:8123/ping'
wait_for "svix" 'curl -sf http://127.0.0.1:8071/api/v1/health/'
wait_for "minio" 'curl -sf http://127.0.0.1:9090/minio/health/live'
wait_for "qstash" '[ "$(curl -s -o /dev/null -w "%{http_code}" http://127.0.0.1:8080/ 2>/dev/null || true)" = "401" ]'

P1 Probe curls block indefinitely without --max-time

The curl calls in wait_for probes have no --max-time, so if a service has its port open but stalls on the HTTP handshake (e.g., clickhouse in the middle of startup), the probe hangs indefinitely. When this happens, the while true loop is blocked inside eval "$probe" and neither the 30 s heartbeat nor the DEPS_TIMEOUT check can ever fire — exactly the bounded-wait guarantee this PR is meant to provide is silently broken.

dump_diagnostics correctly uses --max-time 3 on every probe; the same guard is needed in the wait_for probes:

Suggested change:

```diff
-wait_for "clickhouse" 'curl -sf http://127.0.0.1:8123/ping'
-wait_for "svix" 'curl -sf http://127.0.0.1:8071/api/v1/health/'
-wait_for "minio" 'curl -sf http://127.0.0.1:9090/minio/health/live'
-wait_for "qstash" '[ "$(curl -s -o /dev/null -w "%{http_code}" http://127.0.0.1:8080/ 2>/dev/null || true)" = "401" ]'
+wait_for "postgres" 'nc -z 127.0.0.1 5432'
+wait_for "clickhouse" 'curl -sf --max-time 5 http://127.0.0.1:8123/ping'
+wait_for "svix" 'curl -sf --max-time 5 http://127.0.0.1:8071/api/v1/health/'
+wait_for "minio" 'curl -sf --max-time 5 http://127.0.0.1:9090/minio/health/live'
+wait_for "qstash" '[ "$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 http://127.0.0.1:8080/ 2>/dev/null || true)" = "401" ]'
```

BilalG1 closed this Apr 11, 2026
