Add ops cheat sheet and smoke/uptime monitoring scripts#96
Conversation
📝 WalkthroughWalkthroughAdds ops runbook documentation, a shell smoke-test runner for post-deploy validation, and an UptimeRobot provisioning script to create liveness and keyword monitors for the rest-api service. ChangesOperational Monitoring and Smoke Testing Infrastructure
Sequence DiagramsequenceDiagram
participant Cron as Cron Scheduler
participant Smoke as Smoke Test
participant API as REST API
participant Monitor as UptimeRobot
participant Alert as Alert System
Cron->>Smoke: Execute smoke tests
Smoke->>API: Probe /health and /health/detailed
API-->>Smoke: Responses + latency
Smoke->>Smoke: Validate responses & latency
Smoke-->>Alert: Fire on test FAIL
Monitor->>API: HTTP health checks and keyword validation
API-->>Monitor: Status responses
alt Monitor detects failure
Monitor-->>Alert: Trigger alert
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
9a6514a to
8ff4b51
Compare
How to use the Graphite Merge QueueAdd the label add-to-gt-merge-queue to this PR to add it to the merge queue. You must have a Graphite account in order to use the merge queue. Sign up using this link. An organization admin has enabled the Graphite Merge Queue in this repository. Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue. This stack of pull requests is managed by Graphite. Learn more about stacking. |
8ff4b51 to
abee3da
Compare
abee3da to
26859c3
Compare
dc10129 to
6e0d829
Compare
## Motivation `GET /health` only proves that the API process is up. It does not show whether raindex local DB sync is configured, actively syncing, ready for reads, or failing. The detailed endpoint gives monitoring and operators an on-demand snapshot without this API needing to know raindex local DB internals. ## Solution - Add `GET /health/detailed` while keeping the existing `GET /health` liveness response unchanged. - Bump `lib/rain.orderbook` to `0f2f3d19212e6682c70e52a335d66988a10083a6` and use the new `get_local_db_sync_snapshot()` API for raindex local DB sync status. - Report app DB health plus raindex sync state, including configured/healthy flags, scheduler state, network readiness, orderbook readiness, sync phase, last synced block, update timestamp, and errors. - Refactor the detailed health implementation to call `raindex.client().get_local_db_sync_snapshot().await` instead of opening/querying SQLite directly. - Remove the direct `rusqlite` dependency from this crate. - Add OpenAPI response types for raindex sync networks and orderbooks. - Map raindex component status as `active`, `syncing`, `failure`, or `not_configured`; top-level API health remains `ok`, `degraded`, or `error`. - Update local test fixtures for orderbook YAML spec version 5 and adjust small SDK API changes from the submodule bump. ## Chained PRs - **#96** (`alastair/ops-tooling`) is stacked on this PR and should be merged after this PR. ## Dependent PRs - **rainlanguage/raindex#2563** must merge before this PR. This PR bumps `lib/rain.orderbook` to a commit that depends on that raindex change. ## Checks - [x] `nix develop -c cargo fmt` - [x] `nix develop -c cargo test` - [x] `nix develop -c rainix-rs-static` Part of the deployed-state split tracked in #94. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added a detailed health endpoint providing structured status information about the application's database connectivity and synchronization components, including component-level metrics and error details. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
26859c3 to
24c9c0c
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
scripts/smoke.sh (1)
51-53: 💤 Low value
auth_headeras an unquoted string breaks ifAPI_KEYorAPI_SECRETcontains spaces or special characters; prefer an array.The
# shellcheck disable=SC2086suppression sidesteps the warning but doesn't eliminate the risk. Sinceargsis already an array, extending that pattern here keeps the suppression comment unnecessary and handles pathological key values correctly.♻️ Proposed refactor
- local auth_header="" + local auth_args=() if [[ -n "$API_KEY" && -n "$API_SECRET" ]]; then - auth_header="-u $API_KEY:$API_SECRET" + auth_args=(-u "$API_KEY:$API_SECRET") fi local tmp tmp=$(mktemp) - # shellcheck disable=SC2086 local result - result=$(curl -sS -X "$method" $auth_header \ + result=$(curl -sS -X "$method" "${auth_args[@]}" \🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/smoke.sh` around lines 51 - 53, The auth header is built as an unquoted string (auth_header) which breaks on spaces/special chars; change it to an array form consistent with args (e.g., build auth_args as an array or append to the existing args array) and use that array when invoking curl/command so each credential is a separate element; update the conditional that currently sets auth_header="-u $API_KEY:$API_SECRET" to instead add ("-u" "$API_KEY:$API_SECRET") to the args array (or create auth_args with those elements) to avoid word-splitting and remove the SC2086 suppression.scripts/uptimerobot-setup.sh (1)
75-76: ⚡ Quick winAdd a
--max-timeto the UptimeRobot APIcurlcall.Without it the script hangs indefinitely if the API is unreachable or slow to respond.
♻️ Proposed fix
- resp=$(curl -sS -X POST "${args[@]}" \ - https://api.uptimerobot.com/v2/newMonitor) + resp=$(curl -sS -X POST --max-time 15 "${args[@]}" \ + https://api.uptimerobot.com/v2/newMonitor)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/uptimerobot-setup.sh` around lines 75 - 76, The curl invocation that sets resp (the POST to https://api.uptimerobot.com/v2/newMonitor) can hang indefinitely; add a timeout option (e.g., --max-time 30) to the curl arguments so the request fails after a bounded number of seconds, making sure to include the option in the same args array or inline curl invocation that constructs resp.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/ops.md`:
- Around line 62-67: The grep regex in the docs snippet currently requires a
decimal point ("duration_ms\":[0-9]+\.[0-9]+") so integer-only durations are
missed; update the pattern to allow optional fractional parts (e.g. use
"duration_ms\":[0-9]+(?:\.[0-9]+)?" or "\d+(?:\.\d+)?") so matches include both
"duration_ms":5001 and "duration_ms":5001.23; adjust the grep expression in the
journalctl pipeline accordingly (leave the awk $2 > 5000 check unchanged).
In `@scripts/smoke.sh`:
- Line 128: The jq filter in the probe call probe "GET /v1/orders/token/{usdc}"
... uses ".orders | type == \"array\" and .pagination" which evaluates
.pagination against the .orders array and always fails; fix it by making the
type check apply only to .orders (wrap that expression in parentheses) and
separately test that the top-level .pagination exists (use a null-check like
.pagination != null), i.e. change the filter so it reads: (.orders | type ==
"array") and (.pagination != null) so .pagination is evaluated on the root
object.
---
Nitpick comments:
In `@scripts/smoke.sh`:
- Around line 51-53: The auth header is built as an unquoted string
(auth_header) which breaks on spaces/special chars; change it to an array form
consistent with args (e.g., build auth_args as an array or append to the
existing args array) and use that array when invoking curl/command so each
credential is a separate element; update the conditional that currently sets
auth_header="-u $API_KEY:$API_SECRET" to instead add ("-u"
"$API_KEY:$API_SECRET") to the args array (or create auth_args with those
elements) to avoid word-splitting and remove the SC2086 suppression.
In `@scripts/uptimerobot-setup.sh`:
- Around line 75-76: The curl invocation that sets resp (the POST to
https://api.uptimerobot.com/v2/newMonitor) can hang indefinitely; add a timeout
option (e.g., --max-time 30) to the curl arguments so the request fails after a
bounded number of seconds, making sure to include the option in the same args
array or inline curl invocation that constructs resp.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 53a31c5f-4c07-4f79-9891-310e4278ad0c
📒 Files selected for processing (3)
docs/ops.mdscripts/smoke.shscripts/uptimerobot-setup.sh
| journalctl -u rest-api --since '1 hour ago' --no-pager \ | ||
| | grep 'request completed' \ | ||
| | grep -oE 'duration_ms":[0-9]+\.[0-9]+' \ | ||
| | awk -F: '$2 > 5000 { print }' \ | ||
| | wc -l | ||
| ``` |
There was a problem hiding this comment.
grep pattern misses integer-only duration_ms values — slow requests will be silently under-counted.
The pattern duration_ms":[0-9]+\.[0-9]+ requires a decimal point, so requests like "duration_ms":5001 (no fractional part) won't match, and the slow-request count will be lower than the real value.
🐛 Proposed fix
- | grep -oE 'duration_ms":[0-9]+\.[0-9]+' \
- | awk -F: '$2 > 5000 { print }' \
+ | grep -oE 'duration_ms":[0-9]+(\.[0-9]+)?' \
+ | awk -F: '$2 > 5000 { print }' \📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| journalctl -u rest-api --since '1 hour ago' --no-pager \ | |
| | grep 'request completed' \ | |
| | grep -oE 'duration_ms":[0-9]+\.[0-9]+' \ | |
| | awk -F: '$2 > 5000 { print }' \ | |
| | wc -l | |
| ``` | |
| journalctl -u rest-api --since '1 hour ago' --no-pager \ | |
| | grep 'request completed' \ | |
| | grep -oE 'duration_ms":[0-9]+(\.[0-9]+)?' \ | |
| | awk -F: '$2 > 5000 { print }' \ | |
| | wc -l |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/ops.md` around lines 62 - 67, The grep regex in the docs snippet
currently requires a decimal point ("duration_ms\":[0-9]+\.[0-9]+") so
integer-only durations are missed; update the pattern to allow optional
fractional parts (e.g. use "duration_ms\":[0-9]+(?:\.[0-9]+)?" or
"\d+(?:\.\d+)?") so matches include both "duration_ms":5001 and
"duration_ms":5001.23; adjust the grep expression in the journalctl pipeline
accordingly (leave the awk $2 > 5000 check unchanged).
| # 3. Authenticated endpoints — only run if creds are set | ||
| if [[ -n "$API_KEY" && -n "$API_SECRET" ]]; then | ||
| probe "GET /v1/tokens" GET "/v1/tokens" 200 '.tokens | type == "array"' | ||
| probe "GET /v1/orders/token/{usdc}" GET "/v1/orders/token/$USDC_BASE" 200 '.orders | type == "array" and .pagination' |
There was a problem hiding this comment.
Broken jq filter — this probe will always FAIL even on a healthy endpoint.
After the pipe |, the input is the .orders array. Evaluating .pagination on an array returns null, so type == "array" and .pagination becomes true and null which is false in jq. jq -e exits non-zero for false, marking the check as a shape mismatch regardless of the actual response.
🐛 Proposed fix
- probe "GET /v1/orders/token/{usdc}" GET "/v1/orders/token/$USDC_BASE" 200 '.orders | type == "array" and .pagination'
+ probe "GET /v1/orders/token/{usdc}" GET "/v1/orders/token/$USDC_BASE" 200 '(.orders | type == "array") and (.pagination != null)'To confirm the behaviour locally:
#!/bin/bash
# Expect: first prints "false", second prints "true"
echo '{"orders":[{"id":1}],"pagination":{"page":1}}' \
| jq '.orders | type == "array" and .pagination'
echo '{"orders":[{"id":1}],"pagination":{"page":1}}' \
| jq '(.orders | type == "array") and (.pagination != null)'🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/smoke.sh` at line 128, The jq filter in the probe call probe "GET
/v1/orders/token/{usdc}" ... uses ".orders | type == \"array\" and .pagination"
which evaluates .pagination against the .orders array and always fails; fix it
by making the type check apply only to .orders (wrap that expression in
parentheses) and separately test that the top-level .pagination exists (use a
null-check like .pagination != null), i.e. change the filter so it reads:
(.orders | type == "array") and (.pagination != null) so .pagination is
evaluated on the root object.
Merge activity
|
## Summary - New `docs/ops.md` — journalctl + curl recipes for the deployed service, cache_warmer status interpretation, smoke-test usage - New `scripts/smoke.sh` — end-to-end correctness + latency probe; asserts on response shape (not just status), reports per-check latency, exits non-zero on FAIL - New `scripts/uptimerobot-setup.sh` — one-shot creation of the 3 baseline UptimeRobot monitors (liveness, /health/detailed status=ok, cache_warmer running=true) ## Why Ops knowledge for the preview environment was living outside the repo. This pulls it inline so: - New team members can debug the live service from journalctl recipes - Smoke tests can run post-deploy or on a cron - UptimeRobot setup is reproducible per-environment via one script ## Forward-looking note - The smoke + UptimeRobot scripts probe `/health/detailed.cache_warmer`, which lights up only after **#95** (detailed health endpoint) and the cache-warmer PR (TBD, see #94) merge. Until then those checks return FAIL/keyword-miss against `main` — that's expected. - No application code touched in this PR; safe to merge ahead of the dependent endpoints. ## Test plan - [ ] `bash -n scripts/smoke.sh` (syntax check) - [ ] Run `./scripts/smoke.sh` against preview once detailed-health lands; expect PASS on liveness + 401 checks - [ ] Read `docs/ops.md` end-to-end Part of the deployed-state split tracked in #94. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Added comprehensive operations guide with service health check commands, troubleshooting queries, and monitoring recommendations. * **Chores** * Added post-deployment smoke test script for validating service functionality and latency. * Added automated monitoring setup script for configuring external health checks. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
24c9c0c to
e441fa2
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (1)
scripts/smoke.sh (1)
128-128:⚠️ Potential issue | 🟠 Major | ⚡ Quick winBroken
jqfilter — probe will alwaysFAILon a healthy endpoint.This was flagged in a previous review and remains unresolved.
🐛 Proposed fix
- probe "GET /v1/orders/token/{usdc}" GET "/v1/orders/token/$USDC_BASE" 200 '.orders | type == "array" and .pagination' + probe "GET /v1/orders/token/{usdc}" GET "/v1/orders/token/$USDC_BASE" 200 '(.orders | type == "array") and (.pagination != null)'🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/smoke.sh` at line 128, The jq filter in the probe call for GET "/v1/orders/token/$USDC_BASE" is malformed and always fails; update the expression so it evaluates the two conditions separately and combines them, e.g. change the filter from '.orders | type == "array" and .pagination' to '(.orders | type == "array") and (.pagination != null)' in the probe invocation so it correctly checks that .orders is an array and .pagination is present.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@scripts/smoke.sh`:
- Line 116: The probe for "GET /health/detailed" uses a truthy check '.status'
which accepts any non-null value (e.g., "degraded"); change the assertion to an
explicit equality check '.status == "ok"' so the probe only passes when the
detailed health status is exactly "ok" (update the probe entry for "GET
/health/detailed" accordingly).
- Around line 51-53: The current assignment auth_header="-u
$API_KEY:$API_SECRET" exposes credentials in process listings; instead, create a
temporary netrc file (use mktemp) containing a machine entry with login set to
$API_KEY and password set to $API_SECRET, update the curl invocation to remove
auth_header and pass --netrc-file <tempfile>, and ensure the tempfile is
securely permissioned and removed after use (trap cleanup on EXIT); modify any
references to auth_header in the script (e.g., the curl call that used
$auth_header) to use the --netrc-file option and remove the SC2086 suppression.
---
Duplicate comments:
In `@scripts/smoke.sh`:
- Line 128: The jq filter in the probe call for GET
"/v1/orders/token/$USDC_BASE" is malformed and always fails; update the
expression so it evaluates the two conditions separately and combines them, e.g.
change the filter from '.orders | type == "array" and .pagination' to '(.orders
| type == "array") and (.pagination != null)' in the probe invocation so it
correctly checks that .orders is an array and .pagination is present.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1466c004-63fa-404f-a6d3-9a9020df49ad
📒 Files selected for processing (3)
docs/ops.mdscripts/smoke.shscripts/uptimerobot-setup.sh
✅ Files skipped from review due to trivial changes (1)
- docs/ops.md
🚧 Files skipped from review as they are similar to previous changes (1)
- scripts/uptimerobot-setup.sh
| if [[ -n "$API_KEY" && -n "$API_SECRET" ]]; then | ||
| auth_header="-u $API_KEY:$API_SECRET" | ||
| fi |
There was a problem hiding this comment.
API credentials exposed in the process table.
auth_header="-u $API_KEY:$API_SECRET" expands into the curl command line, making credentials visible in /proc/<pid>/cmdline and via ps aux on multi-user or shared CI systems. Prefer passing credentials through --netrc-file with a temporary file, or via a --config file written to a mktemp path.
🔒 Proposed fix using a temporary netrc file
- local auth_header=""
- if [[ -n "$API_KEY" && -n "$API_SECRET" ]]; then
- auth_header="-u $API_KEY:$API_SECRET"
- fi
+ local auth_args=()
+ local netrc_tmp=""
+ if [[ -n "$API_KEY" && -n "$API_SECRET" ]]; then
+ netrc_tmp=$(mktemp)
+ printf 'default login %s password %s\n' "$API_KEY" "$API_SECRET" > "$netrc_tmp"
+ auth_args=(--netrc-file "$netrc_tmp")
+ fiThen update the curl invocation and cleanup:
- result=$(curl -sS -X "$method" $auth_header \
+ result=$(curl -sS -X "$method" "${auth_args[@]}" \
-o "$tmp" \
-w '%{http_code} %{time_total}\n' \
--max-time 30 \
"$API_URL$path" 2>&1) || true
+ [[ -n "$netrc_tmp" ]] && rm -f "$netrc_tmp"This also eliminates the SC2086 word-splitting suppression comment.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/smoke.sh` around lines 51 - 53, The current assignment
auth_header="-u $API_KEY:$API_SECRET" exposes credentials in process listings;
instead, create a temporary netrc file (use mktemp) containing a machine entry
with login set to $API_KEY and password set to $API_SECRET, update the curl
invocation to remove auth_header and pass --netrc-file <tempfile>, and ensure
the tempfile is securely permissioned and removed after use (trap cleanup on
EXIT); modify any references to auth_header in the script (e.g., the curl call
that used $auth_header) to use the --netrc-file option and remove the SC2086
suppression.
|
|
||
| # 1. Public endpoints (no auth) | ||
| probe "GET /health" GET "/health" 200 '.status == "ok"' | ||
| probe "GET /health/detailed" GET "/health/detailed" 200 '.status' |
There was a problem hiding this comment.
/health/detailed assertion is weaker than intended — passes for any non-null .status.
Line 115 uses .status == "ok" (boolean), but line 116 uses .status (truthy check only). A response with "status": "degraded" would still pass. If the intent is to verify the endpoint is healthy, use the same equality check.
🐛 Proposed fix
-probe "GET /health/detailed" GET "/health/detailed" 200 '.status'
+probe "GET /health/detailed" GET "/health/detailed" 200 '.status == "ok"'📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| probe "GET /health/detailed" GET "/health/detailed" 200 '.status' | |
| probe "GET /health/detailed" GET "/health/detailed" 200 '.status == "ok"' |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/smoke.sh` at line 116, The probe for "GET /health/detailed" uses a
truthy check '.status' which accepts any non-null value (e.g., "degraded");
change the assertion to an explicit equality check '.status == "ok"' so the
probe only passes when the detailed health status is exactly "ok" (update the
probe entry for "GET /health/detailed" accordingly).

Summary
docs/ops.md— journalctl + curl recipes for the deployed service, cache_warmer status interpretation, smoke-test usagescripts/smoke.sh— end-to-end correctness + latency probe; asserts on response shape (not just status), reports per-check latency, exits non-zero on FAILscripts/uptimerobot-setup.sh— one-shot creation of the 3 baseline UptimeRobot monitors (liveness, /health/detailed status=ok, cache_warmer running=true)Why
Ops knowledge for the preview environment was living outside the repo. This pulls it inline so:
Forward-looking note
/health/detailed.cache_warmer, which lights up only after Add detailed health endpoint with raindex sync status #95 (detailed health endpoint) and the cache-warmer PR (TBD, see [Tracking] Deployed preview state — split-source, do not merge #94) merge. Until then those checks return FAIL/keyword-miss againstmain— that's expected.Test plan
bash -n scripts/smoke.sh(syntax check)./scripts/smoke.shagainst preview once detailed-health lands; expect PASS on liveness + 401 checksdocs/ops.mdend-to-endPart of the deployed-state split tracked in #94.
Summary by CodeRabbit
Documentation
Chores