fix(eth-indexer): make /eth/health O(1) #853

Merged: raymondjacobson merged 1 commit into main from api/eth-health-o1 on May 22, 2026

@raymondjacobson
Member

Summary

https://api.audius.co/eth/health was hanging on prod because the handler ran:

```sql
SELECT COUNT(*) FROM (
  SELECT LOWER(wallet) FROM users
  WHERE wallet IS NOT NULL AND wallet <> ''
  UNION
  SELECT LOWER(wallet) FROM associated_wallets
  WHERE chain = 'eth' AND is_delete = FALSE
) t
```

On prod that's ~3.15M rows through a seq scan + dedup sort and takes long enough to keep the HTTP request open indefinitely (no statement timeout, no handler timeout). Cheap locally with one seeded user, lethal in prod.

These counts (tracked_wallets, cached_wallets) were nice-to-have stats, not liveness signals. They don't belong on a health endpoint.

Changes

  • GetHealth: drop the COUNT subqueries and the corresponding fields from the response. What remains is all O(1):
    • connected, rpc_configured, last_block_seen, last_event_at — in-memory atomics
    • checkpoint_block — single-row PK lookup on eth_indexer_checkpoints
  • /eth/health handler: wrap GetHealth in a 2s context timeout. Even if a future query turns slow, the request fails fast instead of hanging the ingress.
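
The timeout wrapper described above could look roughly like the following. This is a minimal sketch, not the code from the PR: the `Health` struct types, the `healthFunc` signature, the `probe` helper, and the choice of 504 for a deadline error are all assumptions made for illustration. It relies on the wrapped function honouring context cancellation, as `database/sql`'s context-aware query methods do.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// Health carries a subset of the response fields; the Go types are assumptions.
type Health struct {
	Connected       bool   `json:"connected"`
	CheckpointBlock uint64 `json:"checkpoint_block"`
}

// healthFunc stands in for GetHealth; the real signature may differ.
type healthFunc func(ctx context.Context) (Health, error)

// healthHandler bounds getHealth with a deadline so a slow query
// cannot hold the HTTP request open.
func healthHandler(getHealth healthFunc, timeout time.Duration) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), timeout)
		defer cancel()

		h, err := getHealth(ctx)
		if err != nil {
			status := http.StatusInternalServerError
			if errors.Is(err, context.DeadlineExceeded) {
				status = http.StatusGatewayTimeout // assumed mapping
			}
			http.Error(w, err.Error(), status)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(h)
	}
}

// fastHealth simulates the O(1) path.
func fastHealth(ctx context.Context) (Health, error) {
	return Health{Connected: true}, nil
}

// slowHealth simulates a query that outlives the deadline but
// returns as soon as the context is cancelled.
func slowHealth(ctx context.Context) (Health, error) {
	select {
	case <-time.After(5 * time.Second):
		return Health{}, nil
	case <-ctx.Done():
		return Health{}, ctx.Err()
	}
}

// probe (hypothetical helper) runs the handler once against an
// in-memory recorder and returns the HTTP status code.
func probe(getHealth healthFunc, timeoutMs int) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/eth/health", nil)
	timeout := time.Duration(timeoutMs) * time.Millisecond
	healthHandler(getHealth, timeout).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(probe(fastHealth, 2000)) // 200
	fmt.Println(probe(slowHealth, 50))   // 504 after about 50ms
}
```

The deadline only helps if every call inside the wrapped function receives `ctx`; a query issued without it would still block past the timeout.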

After merge

Wait for the auto-upgrader to pick up the new image (every 3 min, see Pulumi.prod-api.yaml's autoUpgradeSchedule), or roll the deployment manually:

```bash
kubectl -n api rollout restart deployment/eth-indexer
```

Then:
```bash
time curl -s https://api.audius.co/eth/health | jq
# Expect: <1s, JSON returned
```

If you want population stats, query directly:
```bash
kubectl -n api exec -i deploy/bridge -- psql "$writeDbUrl" -c \
"SELECT COUNT(*) FROM eth_wallet_balances;"
```

Test plan

  • go build ./... clean
  • go vet ./eth/... clean
  • After deploy, curl -m 5 https://api.audius.co/eth/health returns JSON in well under 1s
  • Field set: errors, connected, rpc_configured, last_block_seen, checkpoint_block, last_event_at
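
One way to check the field set mechanically is to decode the response with unknown fields disallowed, so a leftover `tracked_wallets` or `cached_wallets` fails the check. The sketch below is illustrative only: the Go types in `health` are guesses (the real payload may use different types, in which case decoding would fail for that reason instead), and both sample bodies are made up rather than captured from the endpoint.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// health lists the six fields from the test plan; the Go types are assumptions.
type health struct {
	Errors          []string `json:"errors"`
	Connected       bool     `json:"connected"`
	RPCConfigured   bool     `json:"rpc_configured"`
	LastBlockSeen   uint64   `json:"last_block_seen"`
	CheckpointBlock uint64   `json:"checkpoint_block"`
	LastEventAt     string   `json:"last_event_at"`
}

// Made-up sample payloads, for illustration only.
const newBody = `{"errors":[],"connected":true,"rpc_configured":true,` +
	`"last_block_seen":100,"checkpoint_block":99,"last_event_at":"2026-05-22T23:00:00Z"}`

const oldBody = `{"errors":[],"connected":true,"rpc_configured":true,` +
	`"last_block_seen":100,"checkpoint_block":99,"last_event_at":"2026-05-22T23:00:00Z",` +
	`"tracked_wallets":1,"cached_wallets":1}`

// checkFieldSet returns an error if body carries a field outside the
// expected set. It does not detect missing fields.
func checkFieldSet(body string) error {
	dec := json.NewDecoder(strings.NewReader(body))
	dec.DisallowUnknownFields()
	var h health
	return dec.Decode(&h)
}

func main() {
	fmt.Println(checkFieldSet(newBody))
	fmt.Println(checkFieldSet(oldBody))
}
```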

The previous GetHealth ran a UNION/COUNT across users +
associated_wallets to populate tracked_wallets, plus a COUNT(*) on
eth_wallet_balances for cached_wallets. On prod that's ~3.15M rows
through a seq scan + dedup sort and consistently times out (or hangs
the handler — there was no statement timeout). Cheap locally, lethal
in prod.

Drop both counts from the response. They were nice-to-have stats, not
liveness signals — a health endpoint that takes 30s to tell you the
indexer is alive is worse than no endpoint. If you need population
stats, query eth_wallet_balances directly.

What's left is all O(1):
- connected, rpc_configured, last_block_seen, last_event_at: in-memory
- checkpoint_block: single-row PK lookup on eth_indexer_checkpoints
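
To illustrate why the in-memory fields are constant-time, here is a sketch of state held in `sync/atomic` typed values (Go 1.19+). The `indexerState` type, its field names, and the `observe`/`snapshot` methods are hypothetical and are not taken from the indexer's code; they only show that each read is one atomic load with no lock and no database round trip.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// indexerState is a hypothetical holder for the in-memory health signals.
type indexerState struct {
	connected     atomic.Bool
	lastBlockSeen atomic.Uint64
	lastEventAt   atomic.Int64 // Unix seconds; 0 means no event yet
}

// observe records an event: lastBlockSeen only moves forward,
// and lastEventAt is overwritten with the given time.
func (s *indexerState) observe(block uint64, unixSec int64) {
	for {
		cur := s.lastBlockSeen.Load()
		if block <= cur || s.lastBlockSeen.CompareAndSwap(cur, block) {
			break
		}
	}
	s.lastEventAt.Store(unixSec)
}

// snapshot reads each field with a single atomic load.
func (s *indexerState) snapshot() (bool, uint64, int64) {
	return s.connected.Load(), s.lastBlockSeen.Load(), s.lastEventAt.Load()
}

// demo writes from 100 goroutines and returns the final snapshot.
func demo() (bool, uint64, int64) {
	var s indexerState
	s.connected.Store(true)

	var wg sync.WaitGroup
	for b := uint64(1); b <= 100; b++ {
		wg.Add(1)
		go func(b uint64) {
			defer wg.Done()
			s.observe(b, 1700000000)
		}(b)
	}
	wg.Wait()
	return s.snapshot()
}

func main() {
	fmt.Println(demo()) // true 100 1700000000
}
```

The three loads in `snapshot` are individually atomic but not a consistent cut across fields, which is usually acceptable for a liveness report.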

Also add a 2s context timeout to the handler. Even if a future query is
added that turns slow, the request fails fast instead of hanging the
ingress.
@raymondjacobson raymondjacobson merged commit fe386b9 into main May 22, 2026
5 checks passed
@raymondjacobson raymondjacobson deleted the api/eth-health-o1 branch May 22, 2026 23:55