emulator: bounded dep wait with per-service diagnostics #1325
Conversation
`wait-for-deps` used to loop forever on each service, so any single dep that failed to start (e.g. a service crash-looping under TCG) hung the build until the outer 6000s provision timeout. Rewrite as a `wait_for` helper with:

- Hard 1500s budget across the full dep wait (overridable via `STACK_DEPS_TIMEOUT`). On timeout, dump `docker ps -a`, the last 300 lines of the deps container, and per-service reachability, then exit 1 so `provision-build`'s cleanup trap fires and the VM shuts down fast.
- "`<service>` ready (Ns)" log lines on each service so successful runs show which service was the bottleneck.
- 30s heartbeat per service so long-running waits don't look frozen.

amd64 is unaffected — services come up in ~1s each under KVM, which is well inside any threshold here.
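The helper described above can be sketched in bash roughly as follows. This is a sketch of the described behavior, not the committed script; the final probe and the `dump_diagnostics` body are stand-ins:

```shell
#!/usr/bin/env bash
# Sketch of the bounded dep wait: one hard budget shared by all services,
# a per-service "ready (Ns)" line, and a 30s heartbeat while waiting.
set -u

DEPS_TIMEOUT="${STACK_DEPS_TIMEOUT:-1500}"   # hard budget across ALL deps
start=$SECONDS

dump_diagnostics() {
  # Stand-in: the real script dumps `docker ps -a`, the last 300 lines of
  # the deps container, and per-service reachability probes here.
  echo "--- diagnostics would be dumped here ---"
}

wait_for() {
  local name=$1 probe=$2
  local svc_start=$SECONDS
  local next_heartbeat=$((svc_start + 30))
  while true; do
    if eval "$probe" >/dev/null 2>&1; then
      echo "$name ready ($((SECONDS - svc_start))s)"
      return 0
    fi
    if (( SECONDS >= next_heartbeat )); then
      echo "still waiting for $name ($((SECONDS - svc_start))s)"
      next_heartbeat=$((next_heartbeat + 30))
    fi
    if (( SECONDS - start >= DEPS_TIMEOUT )); then
      echo "TIMEOUT waiting for $name after ${DEPS_TIMEOUT}s" >&2
      dump_diagnostics
      exit 1
    fi
    sleep 2
  done
}

# Illustrative probe only; the real ones target postgres, clickhouse,
# svix, minio, and qstash, and should carry --max-time per the review.
wait_for "example" 'true'
```

Note that the timeout check compares against the global `start`, not `svc_start`, so the 1500s budget is shared across all services rather than granted per service.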
Greptile Summary

This PR rewrites the `wait-for-deps` loop as a bounded `wait_for` helper with per-service timing, heartbeats, and timeout diagnostics.
Confidence Score: 4/5

Safe to merge after adding `--max-time` to the four curl probe strings; the bug would only manifest if a service holds a TCP port open while hanging on the HTTP layer, but that scenario is precisely what this PR is guarding against.

One P1 finding: the `wait_for` probe curls lack `--max-time`, which allows a partially-started service to block the loop indefinitely and defeat the bounded-wait guarantee. The fix is a one-liner per probe and is clearly correct. All other logic (global timer, per-service timing, heartbeat, diagnostic dump, qstash 401 probe) is correct.

Important Files Changed: `docker/local-emulator/qemu/cloud-init/emulator/user-data` — specifically the four curl probe strings on lines 209–212.
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A([wait-for-deps starts]) --> B[start = SECONDS\ntimeout = DEPS_TIMEOUT 1500s]
    B --> C[wait_for postgres]
    C --> D[wait_for clickhouse]
    D --> E[wait_for svix]
    E --> F[wait_for minio]
    F --> G[wait_for qstash]
    G --> H([log 'all deps ready', exit 0])
    subgraph wait_for [wait_for name probe]
        direction TB
        W1[svc_start = SECONDS\nnext_heartbeat = svc_start + 30] --> W2{eval probe\nsucceeds?}
        W2 -- yes --> W3([log 'name ready Ns'\nreturn 0])
        W2 -- no --> W4{SECONDS >= next_heartbeat?}
        W4 -- yes --> W5[log 'still waiting...'\nnext_heartbeat += 30]
        W5 --> W6
        W4 -- no --> W6{SECONDS - start >= DEPS_TIMEOUT?}
        W6 -- no --> W7[sleep 2] --> W2
        W6 -- yes --> W8[log TIMEOUT\ndump_diagnostics\nexit 1]
    end
    subgraph dump_diagnostics [dump_diagnostics]
        direction TB
        D1[docker ps -a] --> D2[docker logs --tail 300 deps-container]
        D2 --> D3[nc -z postgres:5432\ncurl --max-time 3 clickhouse/ping\ncurl --max-time 3 svix/health\ncurl --max-time 3 minio/health\ncurl --max-time 3 qstash 401?]
    end
    W8 --> dump_diagnostics
```
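The `dump_diagnostics` box in the flowchart can be sketched as below. This is a sketch, not the committed script: the container name `deps` is assumed, and ports/paths follow the probes in the flowchart:

```shell
#!/usr/bin/env bash
# Sketch of dump_diagnostics: container state, recent deps-container logs,
# then one bounded reachability probe per service. Every probe carries a
# timeout so the diagnostics themselves can never hang.

dump_diagnostics() {
  echo "=== docker ps -a ==="
  docker ps -a 2>/dev/null || echo "(docker unavailable)"

  echo "=== last 300 lines of deps container ==="
  docker logs --tail 300 deps 2>&1 || echo "(no deps container logs)"

  echo "=== per-service reachability ==="
  nc -z -w 3 127.0.0.1 5432 2>/dev/null \
    && echo "postgres: reachable" || echo "postgres: unreachable"
  curl -sf --max-time 3 http://127.0.0.1:8123/ping >/dev/null \
    && echo "clickhouse: reachable" || echo "clickhouse: unreachable"
  curl -sf --max-time 3 http://127.0.0.1:8071/api/v1/health/ >/dev/null \
    && echo "svix: reachable" || echo "svix: unreachable"
  curl -sf --max-time 3 http://127.0.0.1:9090/minio/health/live >/dev/null \
    && echo "minio: reachable" || echo "minio: unreachable"
  # qstash answers 401 to unauthenticated requests when healthy.
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 3 http://127.0.0.1:8080/ 2>/dev/null || true)
  [ "$code" = "401" ] && echo "qstash: reachable (401)" || echo "qstash: unreachable (got '$code')"
}

dump_diagnostics
```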
Prompt To Fix All With AI

This is a comment left during a code review.
Path: docker/local-emulator/qemu/cloud-init/emulator/user-data
Line: 209-212
Comment:
**Probe curls block indefinitely without `--max-time`**
The `curl` calls in `wait_for` probes have no `--max-time`, so if a service has its port open but stalls on the HTTP handshake (e.g., clickhouse in the middle of startup), the probe hangs indefinitely. When this happens, the `while true` loop is blocked inside `eval "$probe"` and neither the 30 s heartbeat nor the `DEPS_TIMEOUT` check can ever fire — exactly the bounded-wait guarantee this PR is meant to provide is silently broken.
`dump_diagnostics` correctly uses `--max-time 3` on every probe; the same guard is needed in the `wait_for` probes:
```suggestion
wait_for "postgres" 'nc -z 127.0.0.1 5432'
wait_for "clickhouse" 'curl -sf --max-time 5 http://127.0.0.1:8123/ping'
wait_for "svix" 'curl -sf --max-time 5 http://127.0.0.1:8071/api/v1/health/'
wait_for "minio" 'curl -sf --max-time 5 http://127.0.0.1:9090/minio/health/live'
wait_for "qstash" '[ "$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 http://127.0.0.1:8080/ 2>/dev/null || true)" = "401" ]'
```
How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "emulator: bounded dep wait with per-serv..."
For reference, the probes as committed, without `--max-time`:

```shell
wait_for "clickhouse" 'curl -sf http://127.0.0.1:8123/ping'
wait_for "svix" 'curl -sf http://127.0.0.1:8071/api/v1/health/'
wait_for "minio" 'curl -sf http://127.0.0.1:9090/minio/health/live'
wait_for "qstash" '[ "$(curl -s -o /dev/null -w "%{http_code}" http://127.0.0.1:8080/ 2>/dev/null || true)" = "401" ]'
```
Summary

The arm64 CI build hung in `wait-for-deps` for 20+ minutes with no way to tell which dep was stuck, because the original script used unbounded `until … do sleep 1; done` loops. Eventually it would either burn the full 6000s outer timeout or block forever on a crash-looping service.

Rewrite the wait as a `wait_for` helper with:

- Hard 1500s budget across the full dep wait (overridable via `STACK_DEPS_TIMEOUT`). On timeout we dump `docker ps -a`, the last 300 lines of the deps container, and per-service reachability probes, then exit 1 so `provision-build`'s cleanup trap fires and the VM shuts down fast instead of idling to the outer 6000s timeout.
- "`<service>` ready (Ns)" log lines on each service so successful runs show which one was the bottleneck (useful for both amd64 baselines and arm64 diagnosis).
- 30s heartbeat per service so long-running waits don't look frozen.

amd64 is unaffected — services come up in ~1s each under KVM, well inside any threshold here. The change is scoped to `wait-for-deps` only; no workflow changes, no docker image changes.

Stacked on top of `emulator-tcg-arm64-fixes` so it inherits the cortex-a72 + migration-output-capture + smoke-test-skip work.

Test plan