
emulator: bounded dep wait with per-service diagnostics #1325

Closed

BilalG1 wants to merge 1 commit into emulator-tcg-arm64-fixes from emulator-bounded-dep-wait

emulator: bounded dep wait with per-service diagnostics#1325
BilalG1 wants to merge 1 commit into
emulator-tcg-arm64-fixesfrom
emulator-bounded-dep-wait

Conversation

BilalG1 (Collaborator) commented Apr 10, 2026

Summary

The arm64 CI build hung in wait-for-deps for 20+ minutes with no way to tell which dep was stuck, because the original script used unbounded until … do sleep 1; done loops. Eventually it would either burn the full 6000s outer timeout or block forever on a crash-looping service.

Rewrite the wait as a wait_for helper with:

  • Hard 1500s budget across the full dep wait (overridable via STACK_DEPS_TIMEOUT). On timeout we dump docker ps -a, the last 300 lines of the deps container's logs, and per-service reachability probes, then exit 1 so provision-build's cleanup trap fires and the VM shuts down fast instead of idling until the outer 6000s timeout.
  • "<service> ready (Ns)" log lines for each service, so successful runs show which one was the bottleneck (useful both for amd64 baselines and for arm64 diagnosis).
  • A 30s heartbeat per service, so long-running waits don't look frozen to the host-side waiter.

amd64 is unaffected — services come up in ~1s each under KVM, well inside any threshold here. The change is scoped to wait-for-deps only; no workflow changes, no docker image changes.
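Roughly, the helper has the shape sketched below. This is an illustrative sketch only, not the exact script that lands in the cloud-init user-data; the probe commands and the dump_diagnostics body are simplified stand-ins.

```sh
# Illustrative sketch of the wait_for helper described above; not the exact
# script in docker/local-emulator/qemu/cloud-init/emulator/user-data.
DEPS_TIMEOUT="${STACK_DEPS_TIMEOUT:-1500}"   # hard budget across the whole dep wait
start=$SECONDS

dump_diagnostics() {
  # Simplified stand-in; the real dump also captures the deps container
  # logs and per-service reachability probes.
  docker ps -a || true
}

wait_for() {
  local name="$1" probe="$2"
  local svc_start=$SECONDS
  local next_heartbeat=$((svc_start + 30))
  while true; do
    if eval "$probe"; then
      echo "$name ready ($((SECONDS - svc_start))s)"
      return 0
    fi
    if [ "$SECONDS" -ge "$next_heartbeat" ]; then
      echo "still waiting for $name ($((SECONDS - svc_start))s)..."
      next_heartbeat=$((next_heartbeat + 30))
    fi
    if [ $((SECONDS - start)) -ge "$DEPS_TIMEOUT" ]; then
      echo "TIMEOUT after ${DEPS_TIMEOUT}s while waiting for $name"
      dump_diagnostics
      exit 1
    fi
    sleep 2
  done
}

wait_for "postgres" 'nc -z 127.0.0.1 5432'
# clickhouse, svix, minio, and qstash follow the same pattern (see the
# probe strings in the review comment below).
```

With the budget read as `${STACK_DEPS_TIMEOUT:-1500}`, a slower run can raise it by exporting, e.g., STACK_DEPS_TIMEOUT=3000 before the dep wait starts.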

Stacked on top of emulator-tcg-arm64-fixes so it inherits the cortex-a72 + migration-output-capture + smoke-test-skip work.

Test plan

  • amd64 build still passes, and the new "postgres/clickhouse/svix/minio/qstash ready (Ns)" lines appear in the provision log.
  • arm64 build either (a) completes with per-service timings showing which service was the TCG bottleneck, or (b) times out cleanly at ~1500s with docker ps + deps container logs dumped, so we can diagnose the stuck service from actual data instead of guessing.

wait-for-deps used to loop forever on each service, so any single
dep that failed to start (e.g. a service crash-looping under TCG)
hung the build until the outer 6000s provision timeout.

Rewrite as a wait_for helper with:
- Hard 1500s budget across the full dep wait (overridable via
  STACK_DEPS_TIMEOUT). On timeout, dump docker ps -a, last 300 lines
  of the deps container, and per-service reachability, then exit 1
  so provision-build's cleanup trap fires and the VM shuts down fast.
- "<service> ready (Ns)" log lines on each service so successful
  runs show which service was the bottleneck.
- 30s heartbeat per service so long-running waits don't look frozen.

amd64 is unaffected — services come up in ~1s each under KVM, which
is well inside any threshold here.

vercel Bot commented Apr 10, 2026

The latest updates on your projects.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| stack-auth-hosted-components | Building | Preview | Apr 10, 2026 7:22pm |
| stack-backend | Ready | Preview, Comment | Apr 10, 2026 7:22pm |
| stack-dashboard | Ready | Preview, Comment | Apr 10, 2026 7:22pm |
| stack-demo | Ready | Preview, Comment | Apr 10, 2026 7:22pm |
| stack-docs | Ready | Preview, Comment | Apr 10, 2026 7:22pm |
| stack-preview-backend | Ready | Preview, Comment | Apr 10, 2026 7:22pm |
| stack-preview-dashboard | Ready | Preview, Comment | Apr 10, 2026 7:22pm |


coderabbitai Bot commented Apr 10, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 557919b0-2f92-48c3-a77d-fa16c91f3183



greptile-apps Bot commented Apr 10, 2026

Greptile Summary

This PR rewrites the wait-for-deps script in the emulator cloud-init user-data to replace unbounded until … sleep 1 loops with a wait_for helper that enforces a hard 1500 s global timeout (STACK_DEPS_TIMEOUT), emits per-service "ready (Ns)" log lines, fires a 30 s heartbeat per service, and dumps docker ps, container logs, and per-service reachability probes on timeout.

  • P1 – probe curls block indefinitely: wait_for probes for clickhouse, svix, minio, and qstash call curl without --max-time. If any service has its port open but stalls mid-HTTP-handshake, the eval "$probe" call hangs forever inside the while true loop, silently preventing both the heartbeat and the DEPS_TIMEOUT check from ever firing — the exact failure mode this PR was written to fix. dump_diagnostics already uses --max-time 3; the same guard should be added to the probes (e.g. --max-time 5).

Confidence Score: 4/5

Safe to merge after adding --max-time to the four curl probe strings; the bug would only manifest if a service holds a TCP port open while hanging on the HTTP layer, but that scenario is precisely what this PR is guarding against.

One P1 finding: the wait_for probe curls lack --max-time, which allows a partially-started service to block the loop indefinitely and defeat the bounded-wait guarantee. The fix is a one-liner per probe and is clearly correct. All other logic (global timer, per-service timing, heartbeat, diagnostic dump, qstash 401 probe) is correct.

docker/local-emulator/qemu/cloud-init/emulator/user-data — specifically the four curl probe strings on lines 209–212

Important Files Changed

| Filename | Overview |
| --- | --- |
| docker/local-emulator/qemu/cloud-init/emulator/user-data | Rewrites wait-for-deps with a 1500 s global timeout, per-service heartbeats, and diagnostic dumps on timeout; one P1 bug: probe curl calls lack --max-time, so a service that has a port open but hangs on HTTP blocks the timeout mechanism indefinitely. |

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A([wait-for-deps starts]) --> B[start = SECONDS\ntimeout = DEPS_TIMEOUT 1500s]
    B --> C[wait_for postgres]
    C --> D[wait_for clickhouse]
    D --> E[wait_for svix]
    E --> F[wait_for minio]
    F --> G[wait_for qstash]
    G --> H([log 'all deps ready', exit 0])

    subgraph wait_for [wait_for name probe]
        direction TB
        W1[svc_start = SECONDS\nnext_heartbeat = svc_start + 30] --> W2{eval probe\nsucceeds?}
        W2 -- yes --> W3([log 'name ready Ns'\nreturn 0])
        W2 -- no --> W4{SECONDS >= next_heartbeat?}
        W4 -- yes --> W5[log 'still waiting...'\nnext_heartbeat += 30]
        W5 --> W6
        W4 -- no --> W6{SECONDS - start >= DEPS_TIMEOUT?}
        W6 -- no --> W7[sleep 2] --> W2
        W6 -- yes --> W8[log TIMEOUT\ndump_diagnostics\nexit 1]
    end

    subgraph dump_diagnostics [dump_diagnostics]
        direction TB
        D1[docker ps -a] --> D2[docker logs --tail 300 deps-container]
        D2 --> D3[nc -z postgres:5432\ncurl --max-time 3 clickhouse/ping\ncurl --max-time 3 svix/health\ncurl --max-time 3 minio/health\ncurl --max-time 3 qstash 401?]
    end

    W8 --> dump_diagnostics
```
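For reference, a minimal sketch of what the dump_diagnostics step in the flowchart could look like; the deps container name ("stack-deps") is a placeholder, and the ports and endpoints are taken from the probe strings in this PR.

```sh
# Minimal sketch of dump_diagnostics; "stack-deps" is a placeholder for
# whatever deps container name the user-data script actually uses.
dump_diagnostics() {
  echo "== docker ps -a =="
  docker ps -a || true

  echo "== last 300 lines of deps container logs =="
  docker logs --tail 300 stack-deps 2>&1 || true

  echo "== per-service reachability =="
  nc -z 127.0.0.1 5432 && echo "postgres: port open" || echo "postgres: unreachable"
  curl -sf --max-time 3 http://127.0.0.1:8123/ping >/dev/null \
    && echo "clickhouse: ok" || echo "clickhouse: unreachable"
  curl -sf --max-time 3 http://127.0.0.1:8071/api/v1/health/ >/dev/null \
    && echo "svix: ok" || echo "svix: unreachable"
  curl -sf --max-time 3 http://127.0.0.1:9090/minio/health/live >/dev/null \
    && echo "minio: ok" || echo "minio: unreachable"
  code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 3 http://127.0.0.1:8080/ || true)"
  echo "qstash: HTTP $code (401 expected once ready)"
}
```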

Comment on lines +209 to +212
wait_for "clickhouse" 'curl -sf http://127.0.0.1:8123/ping'
wait_for "svix" 'curl -sf http://127.0.0.1:8071/api/v1/health/'
wait_for "minio" 'curl -sf http://127.0.0.1:9090/minio/health/live'
wait_for "qstash" '[ "$(curl -s -o /dev/null -w "%{http_code}" http://127.0.0.1:8080/ 2>/dev/null || true)" = "401" ]'

P1 Probe curls block indefinitely without --max-time

The curl calls in wait_for probes have no --max-time, so if a service has its port open but stalls on the HTTP handshake (e.g., clickhouse in the middle of startup), the probe hangs indefinitely. When this happens, the while true loop is blocked inside eval "$probe" and neither the 30 s heartbeat nor the DEPS_TIMEOUT check can ever fire — exactly the bounded-wait guarantee this PR is meant to provide is silently broken.

dump_diagnostics correctly uses --max-time 3 on every probe; the same guard is needed in the wait_for probes:

Suggested change:

```diff
-wait_for "clickhouse" 'curl -sf http://127.0.0.1:8123/ping'
-wait_for "svix" 'curl -sf http://127.0.0.1:8071/api/v1/health/'
-wait_for "minio" 'curl -sf http://127.0.0.1:9090/minio/health/live'
-wait_for "qstash" '[ "$(curl -s -o /dev/null -w "%{http_code}" http://127.0.0.1:8080/ 2>/dev/null || true)" = "401" ]'
+wait_for "postgres" 'nc -z 127.0.0.1 5432'
+wait_for "clickhouse" 'curl -sf --max-time 5 http://127.0.0.1:8123/ping'
+wait_for "svix" 'curl -sf --max-time 5 http://127.0.0.1:8071/api/v1/health/'
+wait_for "minio" 'curl -sf --max-time 5 http://127.0.0.1:9090/minio/health/live'
+wait_for "qstash" '[ "$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 http://127.0.0.1:8080/ 2>/dev/null || true)" = "401" ]'
```

BilalG1 closed this Apr 11, 2026
