Commit c76c8da

emulator: bounded dep wait with per-service diagnostics
wait-for-deps used to loop forever on each service, so any single dep that failed to start (e.g. a service crash-looping under TCG) hung the build until the outer 6000s provision timeout.

Rewrite as a wait_for helper with:

- Hard 1500s budget across the full dep wait (overridable via STACK_DEPS_TIMEOUT). On timeout, dump docker ps -a, last 300 lines of the deps container, and per-service reachability, then exit 1 so provision-build's cleanup trap fires and the VM shuts down fast.
- "<service> ready (Ns)" log lines on each service so successful runs show which service was the bottleneck.
- 30s heartbeat per service so long-running waits don't look frozen.

amd64 is unaffected: services come up in ~1s each under KVM, which is well inside any threshold here.
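The key property described above is that the budget is shared: every service is measured against one start time, so a slow first service leaves less time for the rest. A minimal, self-contained sketch of that idea follows. It is not the committed code: the names `budget`, `poll`, and `wait_until` are hypothetical, the demo budget is a couple of seconds rather than 1500, and the probe is passed as a command plus arguments instead of an eval'd string. It also returns 1 on timeout rather than calling `exit 1`, so the outcome can be inspected.

```shell
#!/bin/bash
# Sketch only: one deadline shared by several probes.
# budget, poll and wait_until are hypothetical names, not from the commit.
set -uo pipefail

budget="${DEMO_TIMEOUT:-2}"   # total seconds allowed across ALL waits
poll="${DEMO_POLL:-1}"        # seconds between probe attempts
start=$SECONDS                # shared start time; every wait measures against it

# wait_until NAME COMMAND [ARGS...]
# Returns 0 once COMMAND succeeds, 1 once the shared budget is used up.
wait_until() {
  local name="$1"; shift
  local svc_start=$SECONDS
  while true; do
    if "$@" >/dev/null 2>&1; then
      echo "${name} ready ($((SECONDS - svc_start))s)"
      return 0
    fi
    if [ "$((SECONDS - start))" -ge "$budget" ]; then
      echo "TIMEOUT waiting for ${name} (hard cap ${budget}s)"
      return 1
    fi
    sleep "$poll"
  done
}

fast_rc=0
wait_until "always-up" true || fast_rc=$?   # probe succeeds on the first try

slow_rc=0
wait_until "never-up" false || slow_rc=$?   # probe never succeeds; hits the cap

echo "fast_rc=${fast_rc} slow_rc=${slow_rc}"
```

In the committed script the timeout branch exits the whole script after dumping diagnostics; returning a status, as here, is only a convenience for experimenting with the pattern.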
1 parent 6c5615b commit c76c8da

1 file changed

Lines changed: 55 additions & 6 deletions

docker/local-emulator/qemu/cloud-init/emulator/user-data

@@ -155,13 +155,62 @@ write_files:
     permissions: '0755'
     content: |
       #!/bin/bash
-      set -euo pipefail
+      set -uo pipefail
+
+      # Hard upper bound across the whole dep wait. Under TCG every service
+      # init is 5-20x slower than native, so we allow a generous budget, but
+      # if we cross it something is genuinely stuck and we need to surface it.
+      DEPS_TIMEOUT="${STACK_DEPS_TIMEOUT:-1500}"
+      DEPS_CONTAINER="${STACK_DEPS_CONTAINER:-stack-build-init}"
+      start=$SECONDS
+      log() { /usr/local/bin/log-provision "wait-for-deps: $*"; }
+
+      dump_diagnostics() {
+        log "dumping diagnostics for stuck dep wait..."
+        log "--- docker ps -a ---"
+        docker ps -a 2>&1 | while IFS= read -r line; do log "ps: $line"; done || true
+        log "--- docker logs ${DEPS_CONTAINER} (last 300 lines) ---"
+        docker logs --tail 300 "$DEPS_CONTAINER" 2>&1 | while IFS= read -r line; do log "deps: $line"; done || true
+        log "--- per-service probes ---"
+        nc -z 127.0.0.1 5432 >/dev/null 2>&1 && log "postgres:5432 reachable" || log "postgres:5432 NOT reachable"
+        curl -sf --max-time 3 http://127.0.0.1:8123/ping >/dev/null 2>&1 && log "clickhouse:8123 reachable" || log "clickhouse:8123 NOT reachable"
+        curl -sf --max-time 3 http://127.0.0.1:8071/api/v1/health/ >/dev/null 2>&1 && log "svix:8071 reachable" || log "svix:8071 NOT reachable"
+        curl -sf --max-time 3 http://127.0.0.1:9090/minio/health/live >/dev/null 2>&1 && log "minio:9090 reachable" || log "minio:9090 NOT reachable"
+        code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 3 http://127.0.0.1:8080/ 2>/dev/null || true)
+        [ "$code" = "401" ] && log "qstash:8080 reachable (401)" || log "qstash:8080 NOT reachable (code=${code:-none})"
+      }
+
+      wait_for() {
+        local name="$1" probe="$2" elapsed
+        local svc_start=$SECONDS
+        local next_heartbeat=$((svc_start + 30))
+        while true; do
+          if eval "$probe" >/dev/null 2>&1; then
+            elapsed=$((SECONDS - svc_start))
+            log "${name} ready (${elapsed}s)"
+            return 0
+          fi
+          if [ "$SECONDS" -ge "$next_heartbeat" ]; then
+            log "still waiting for ${name} ($((SECONDS - svc_start))s elapsed)"
+            next_heartbeat=$((SECONDS + 30))
+          fi
+          if [ "$((SECONDS - start))" -ge "$DEPS_TIMEOUT" ]; then
+            elapsed=$((SECONDS - start))
+            log "TIMEOUT waiting for ${name} after ${elapsed}s (hard cap ${DEPS_TIMEOUT}s)"
+            dump_diagnostics
+            exit 1
+          fi
+          sleep 2
+        done
+      }

-      until nc -z 127.0.0.1 5432 >/dev/null 2>&1; do sleep 1; done
-      until curl -sf http://127.0.0.1:8123/ping >/dev/null 2>&1; do sleep 1; done
-      until curl -sf http://127.0.0.1:8071/api/v1/health/ >/dev/null 2>&1; do sleep 1; done
-      until curl -sf http://127.0.0.1:9090/minio/health/live >/dev/null 2>&1; do sleep 1; done
-      until [ "$(curl -s -o /dev/null -w '%{http_code}' http://127.0.0.1:8080/ 2>/dev/null || true)" = "401" ]; do sleep 1; done
+      log "starting dep wait (timeout=${DEPS_TIMEOUT}s)"
+      wait_for "postgres" 'nc -z 127.0.0.1 5432'
+      wait_for "clickhouse" 'curl -sf http://127.0.0.1:8123/ping'
+      wait_for "svix" 'curl -sf http://127.0.0.1:8071/api/v1/health/'
+      wait_for "minio" 'curl -sf http://127.0.0.1:9090/minio/health/live'
+      wait_for "qstash" '[ "$(curl -s -o /dev/null -w "%{http_code}" http://127.0.0.1:8080/ 2>/dev/null || true)" = "401" ]'
+      log "all deps ready ($((SECONDS - start))s total)"

   - path: /etc/stack-build-computed.env
     content: |
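dump_diagnostics in the hunk above pipes command output through a `while IFS= read -r line` loop so that every line goes through the logger with a tag, and appends `|| true` so a failing diagnostic command cannot abort the dump. The sketch below isolates that pattern so it can be tried without docker. The `log` function here is a hypothetical stand-in for /usr/local/bin/log-provision, and `prefix_output` is an illustrative helper that does not exist in the commit.

```shell
#!/bin/bash
# Sketch only: tag every output line of a command, tolerating command failure.
set -uo pipefail

# Hypothetical stand-in for /usr/local/bin/log-provision.
log() { printf '[demo] %s\n' "$*"; }

# prefix_output TAG COMMAND [ARGS...]
# Runs COMMAND with stderr merged into stdout and logs each line as "TAG: line".
# IFS= keeps leading whitespace; -r keeps backslashes literal.
# "|| true" swallows the pipeline status, which pipefail would otherwise
# make non-zero whenever COMMAND fails.
prefix_output() {
  local tag="$1"; shift
  "$@" 2>&1 | while IFS= read -r line; do log "${tag}: ${line}"; done || true
}

# Two lines, the second with leading spaces that must survive.
out="$(prefix_output ps printf 'row one\n  row two\n')"
printf '%s\n' "$out"

# A command that writes to stderr and fails; the helper still returns 0.
fail_rc=0
prefix_output deps sh -c 'echo boom >&2; exit 3' || fail_rc=$?
echo "fail_rc=${fail_rc}"
```

One caveat of this loop shape: `read` returns non-zero on a final line that lacks a trailing newline, so such a line would be skipped unless the loop is extended to handle it.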
