Closed
61 changes: 55 additions & 6 deletions docker/local-emulator/qemu/cloud-init/emulator/user-data
@@ -155,13 +155,62 @@ write_files:
     permissions: '0755'
     content: |
       #!/bin/bash
-      set -euo pipefail
+      set -uo pipefail

+      # Hard upper bound across the whole dep wait. Under TCG every service
+      # init is 5-20x slower than native, so we allow a generous budget, but
+      # if we cross it something is genuinely stuck and we need to surface it.
+      DEPS_TIMEOUT="${STACK_DEPS_TIMEOUT:-1500}"
+      DEPS_CONTAINER="${STACK_DEPS_CONTAINER:-stack-build-init}"
+      start=$SECONDS
+      log() { /usr/local/bin/log-provision "wait-for-deps: $*"; }
+
+      dump_diagnostics() {
+        log "dumping diagnostics for stuck dep wait..."
+        log "--- docker ps -a ---"
+        docker ps -a 2>&1 | while IFS= read -r line; do log "ps: $line"; done || true
+        log "--- docker logs ${DEPS_CONTAINER} (last 300 lines) ---"
+        docker logs --tail 300 "$DEPS_CONTAINER" 2>&1 | while IFS= read -r line; do log "deps: $line"; done || true
+        log "--- per-service probes ---"
+        nc -z 127.0.0.1 5432 >/dev/null 2>&1 && log "postgres:5432 reachable" || log "postgres:5432 NOT reachable"
+        curl -sf --max-time 3 http://127.0.0.1:8123/ping >/dev/null 2>&1 && log "clickhouse:8123 reachable" || log "clickhouse:8123 NOT reachable"
+        curl -sf --max-time 3 http://127.0.0.1:8071/api/v1/health/ >/dev/null 2>&1 && log "svix:8071 reachable" || log "svix:8071 NOT reachable"
+        curl -sf --max-time 3 http://127.0.0.1:9090/minio/health/live >/dev/null 2>&1 && log "minio:9090 reachable" || log "minio:9090 NOT reachable"
+        code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 3 http://127.0.0.1:8080/ 2>/dev/null || true)
+        [ "$code" = "401" ] && log "qstash:8080 reachable (401)" || log "qstash:8080 NOT reachable (code=${code:-none})"
+      }
+
+      wait_for() {
+        local name="$1" probe="$2" elapsed
+        local svc_start=$SECONDS
+        local next_heartbeat=$((svc_start + 30))
+        while true; do
+          if eval "$probe" >/dev/null 2>&1; then
+            elapsed=$((SECONDS - svc_start))
+            log "${name} ready (${elapsed}s)"
+            return 0
+          fi
+          if [ "$SECONDS" -ge "$next_heartbeat" ]; then
+            log "still waiting for ${name} ($((SECONDS - svc_start))s elapsed)"
+            next_heartbeat=$((SECONDS + 30))
+          fi
+          if [ "$((SECONDS - start))" -ge "$DEPS_TIMEOUT" ]; then
+            elapsed=$((SECONDS - start))
+            log "TIMEOUT waiting for ${name} after ${elapsed}s (hard cap ${DEPS_TIMEOUT}s)"
+            dump_diagnostics
+            exit 1
+          fi
+          sleep 2
+        done
+      }
+
-      until nc -z 127.0.0.1 5432 >/dev/null 2>&1; do sleep 1; done
-      until curl -sf http://127.0.0.1:8123/ping >/dev/null 2>&1; do sleep 1; done
-      until curl -sf http://127.0.0.1:8071/api/v1/health/ >/dev/null 2>&1; do sleep 1; done
-      until curl -sf http://127.0.0.1:9090/minio/health/live >/dev/null 2>&1; do sleep 1; done
-      until [ "$(curl -s -o /dev/null -w '%{http_code}' http://127.0.0.1:8080/ 2>/dev/null || true)" = "401" ]; do sleep 1; done
+      log "starting dep wait (timeout=${DEPS_TIMEOUT}s)"
+      wait_for "postgres"   'nc -z 127.0.0.1 5432'
+      wait_for "clickhouse" 'curl -sf http://127.0.0.1:8123/ping'
+      wait_for "svix"       'curl -sf http://127.0.0.1:8071/api/v1/health/'
+      wait_for "minio"      'curl -sf http://127.0.0.1:9090/minio/health/live'
+      wait_for "qstash"     '[ "$(curl -s -o /dev/null -w "%{http_code}" http://127.0.0.1:8080/ 2>/dev/null || true)" = "401" ]'
Comment on lines +209 to +212
P1 Probe curls block indefinitely without --max-time

The curl calls in the wait_for probes have no --max-time, so if a service accepts the TCP connection but stalls before sending an HTTP response (e.g., clickhouse in the middle of startup), the probe hangs indefinitely. While it hangs, the while true loop is blocked inside eval "$probe", so neither the 30 s heartbeat nor the DEPS_TIMEOUT check can fire. This silently breaks the bounded-wait guarantee the PR is meant to provide.

dump_diagnostics correctly uses --max-time 3 on every probe; the same guard is needed in the wait_for probes:

Suggested change
-      wait_for "clickhouse" 'curl -sf http://127.0.0.1:8123/ping'
-      wait_for "svix"       'curl -sf http://127.0.0.1:8071/api/v1/health/'
-      wait_for "minio"      'curl -sf http://127.0.0.1:9090/minio/health/live'
-      wait_for "qstash"     '[ "$(curl -s -o /dev/null -w "%{http_code}" http://127.0.0.1:8080/ 2>/dev/null || true)" = "401" ]'
+      wait_for "postgres"   'nc -z 127.0.0.1 5432'
+      wait_for "clickhouse" 'curl -sf --max-time 5 http://127.0.0.1:8123/ping'
+      wait_for "svix"       'curl -sf --max-time 5 http://127.0.0.1:8071/api/v1/health/'
+      wait_for "minio"      'curl -sf --max-time 5 http://127.0.0.1:9090/minio/health/live'
+      wait_for "qstash"     '[ "$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 http://127.0.0.1:8080/ 2>/dev/null || true)" = "401" ]'
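
A complementary option, sketched here under assumptions rather than taken from the PR: bound every probe from the outside with GNU coreutils `timeout`, so that any command passed in (including `nc`) cannot block the loop. The names `wait_for_bounded` and `PROBE_TIMEOUT` are hypothetical, `echo` stands in for the PR's `log` helper, and it is assumed that `timeout` is installed in the guest image.

```shell
#!/bin/bash
# Sketch only: wait_for_bounded and PROBE_TIMEOUT are hypothetical names,
# and echo stands in for the PR's log helper.
set -uo pipefail

DEPS_TIMEOUT="${STACK_DEPS_TIMEOUT:-1500}"   # overall budget in seconds
PROBE_TIMEOUT="${PROBE_TIMEOUT:-5}"          # per-probe cap in seconds
start=$SECONDS

wait_for_bounded() {
  local name="$1" probe="$2"
  while true; do
    # timeout(1) kills the probe after PROBE_TIMEOUT seconds, so a stalled
    # connection cannot block the loop and the deadline check still runs.
    if timeout "$PROBE_TIMEOUT" bash -c "$probe" >/dev/null 2>&1; then
      echo "${name} ready"
      return 0
    fi
    if [ "$((SECONDS - start))" -ge "$DEPS_TIMEOUT" ]; then
      echo "TIMEOUT waiting for ${name}" >&2
      return 1
    fi
    sleep 1
  done
}
```

Unlike `eval`, `bash -c` runs the probe in a child shell, so this only suits probes that are self-contained commands, as the five probes in this diff are. It can be combined with `--max-time` rather than replacing it.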
+      log "all deps ready ($((SECONDS - start))s total)"

   - path: /etc/stack-build-computed.env
     content: |