Standards for monitoring, logging, metrics, and alerting in DevRail-managed services. These complement the Output & Logging section in DEVELOPMENT.md (which covers build-time logging) and the Container Standards health check requirements.
Every service exposes health endpoints for orchestrators and load balancers:
| Endpoint | Purpose | What It Checks |
|---|---|---|
| `/healthz` | Liveness | Process is running, not deadlocked. Does not check external dependencies. |
| `/readyz` | Readiness | Process can serve traffic. Checks database connections, cache availability, required downstream services. |
- Both endpoints return HTTP 200 for healthy, HTTP 503 for unhealthy. Response body includes a JSON status and optional detail.
- `/healthz` is cheap. No database queries, no network calls. If the process can respond, it is alive.
- `/readyz` checks critical dependencies only. Do not check every downstream service -- only those without which the service cannot function.
- Health checks are unauthenticated. Orchestrators and load balancers must be able to probe without credentials.
- Report dependency status individually in readiness checks so operators can identify which dependency is down:
{ "status": "unhealthy", "checks": { "database": "ok", "cache": "timeout", "queue": "ok" } }
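
The liveness and readiness rules above can be sketched in a framework-neutral way. This is a minimal illustration, not a prescribed implementation: the function names, the `Check` callable type, and the `"healthy"` status string are assumptions (the source only shows `"unhealthy"`), and each function returns a status code and JSON body that a web framework handler would send.

```python
import json
from typing import Callable, Dict, Tuple

# Hypothetical convention: a check returns a short status string, "ok" when healthy.
Check = Callable[[], str]


def liveness() -> Tuple[int, str]:
    """Liveness does no I/O: if this code runs, the process is alive."""
    return 200, json.dumps({"status": "healthy"})  # "healthy" is an assumed value


def readiness(checks: Dict[str, Check]) -> Tuple[int, str]:
    """Run each critical dependency check and report every result individually."""
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            results[name] = check()
        except Exception as exc:  # a failing check must not crash the probe itself
            results[name] = f"error: {type(exc).__name__}"
    healthy = all(value == "ok" for value in results.values())
    body = {"status": "healthy" if healthy else "unhealthy", "checks": results}
    return (200 if healthy else 503), json.dumps(body)


if __name__ == "__main__":
    # Stand-in checks; real ones would ping the database, cache, and queue.
    status, body = readiness({
        "database": lambda: "ok",
        "cache": lambda: "timeout",
        "queue": lambda: "ok",
    })
    print(status, body)
```

Only dependencies the service cannot function without belong in the `checks` mapping; optional downstream services should stay out of it so they cannot take the service out of rotation.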
Build-time logging (Makefile targets, CI jobs, scripts) is covered in DEVELOPMENT.md Output & Logging. This section covers runtime application logging for deployed services.
Application logs are JSON, one object per line:
{"level":"info","msg":"Request handled","method":"GET","path":"/users","status":200,"duration_ms":42,"request_id":"abc-123","ts":"2026-02-25T10:00:00Z"}

| Field | Purpose |
|---|---|
| `level` | Log level: `debug`, `info`, `warn`, `error` |
| `msg` | Human-readable message |
| `ts` | ISO 8601 timestamp with timezone |
| `request_id` | Correlation ID for tracing a request across services |
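
As an illustration of the format above, here is a minimal sketch of a one-object-per-line JSON formatter built on Python's standard `logging` module. The class name `JsonFormatter`, the `build_logger` helper, and the convention of passing additional fields through `extra={"fields": {...}}` are assumptions made for this example; a project using `structlog` would configure this differently.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    # Map stdlib level names onto the four levels used in this document.
    LEVELS = {"DEBUG": "debug", "INFO": "info", "WARNING": "warn",
              "ERROR": "error", "CRITICAL": "error"}

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "level": self.LEVELS.get(record.levelname, "info"),
            "msg": record.getMessage(),
            # ISO 8601 in UTC, with "Z" as the timezone designator.
            "ts": ts.isoformat(timespec="seconds").replace("+00:00", "Z"),
        }
        # Assumed convention: callers pass structured context via extra={"fields": {...}}.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(name: str = "app") -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger()
    log.info("Request handled", extra={"fields": {
        "method": "GET", "path": "/users", "status": 200,
        "duration_ms": 42, "request_id": "abc-123",
    }})
```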
- Generate a unique `request_id` at the edge (API gateway, load balancer, or first service).
- Propagate the ID through all downstream service calls via a header (`X-Request-ID`).
- Include the ID in every log entry for the request lifecycle.
- Return the ID in the API response so clients can reference it in support requests.
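
The propagation steps above can be sketched with two small helpers. The function names are hypothetical, and the use of a UUID for generated IDs is an assumption; the source only requires that the ID be unique.

```python
import uuid
from typing import Dict, Mapping, Optional

REQUEST_ID_HEADER = "X-Request-ID"


def resolve_request_id(headers: Mapping[str, str]) -> str:
    """Reuse the inbound ID if one was sent, otherwise generate one at the edge."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == REQUEST_ID_HEADER.lower() and value.strip():
            return value.strip()
    return str(uuid.uuid4())  # assumed ID format; any unique value works


def outbound_headers(request_id: str,
                     extra: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Build headers for a downstream call (or the API response) carrying the ID."""
    headers = dict(extra or {})
    headers[REQUEST_ID_HEADER] = request_id
    return headers


if __name__ == "__main__":
    rid = resolve_request_id({"x-request-id": "abc-123"})
    print(outbound_headers(rid, {"Accept": "application/json"}))
```

The same `outbound_headers` result can be attached to the response so clients see the ID, and the resolved value is what goes into the `request_id` field of every log entry for that request.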
| Level | When to Use |
|---|---|
| `debug` | Detailed diagnostic information. Disabled in production by default. |
| `info` | Normal operational events. Request handled, job completed, service started. |
| `warn` | Unexpected but recoverable situations. Retry succeeded, deprecated feature used, approaching limit. |
| `error` | Failures that require attention. Unhandled exception, dependency unreachable, data corruption detected. |
- Do not use `print` or `console.log` in production code. Use the language's structured logging library (Python `structlog`/`logging`, Go `slog`, Node `pino`/`winston`).
- Log at the appropriate level. An expected "not found" result is not an error.
- Include context, not just messages. "Failed to connect" is useless. "Failed to connect to database at db.example.com:5432 after 3 retries" is actionable.
Expose metrics in Prometheus format at /metrics (or use a platform-specific agent):
http_requests_total{method="GET",path="/users",status="200"} 1234
http_request_duration_seconds{method="GET",path="/users",quantile="0.99"} 0.250
For every service, instrument the RED signals:
| Signal | Metric | What It Tells You |
|---|---|---|
| Rate | `http_requests_total` | How many requests per second |
| Errors | `http_requests_total{status=~"5.."}` | What fraction of requests are failing |
| Duration | `http_request_duration_seconds` | How long requests take (histogram) |
Follow Prometheus naming conventions:
- Use `snake_case`
- Include the unit as a suffix: `_seconds`, `_bytes`, `_total`
- Counters end with `_total`
- Use labels for dimensions, not separate metric names
- Instrument the RED signals for every service. This is the minimum.
- Add business metrics where relevant. Orders placed, messages processed, cache hit ratio.
- Do not create high-cardinality labels. User IDs, request IDs, and IP addresses as labels will overwhelm your metrics backend.
- Set appropriate histogram buckets. Default buckets may not match your service's latency profile.
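
To make the cardinality rule concrete, the sketch below counts requests by method, route, and status, collapsing ID-like path segments into a placeholder before they become label values. It uses only the standard library to show the idea; a real service would typically rely on a Prometheus client library instead. The `route_label` helper, the `:id` placeholder, and the `RequestCounter` class are all hypothetical names.

```python
import re
from collections import Counter

# Assumed rule: purely numeric segments and UUIDs are identifiers, not routes.
_ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
    r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
)


def route_label(path: str) -> str:
    """Collapse ID-like path segments so the label set stays bounded."""
    parts = [":id" if _ID_SEGMENT.match(part) else part for part in path.split("/")]
    return "/".join(parts)


class RequestCounter:
    """In-memory stand-in for an http_requests_total counter."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def observe(self, method: str, path: str, status: int) -> None:
        self._counts[(method, route_label(path), str(status))] += 1

    def render(self) -> str:
        """Render the counter in Prometheus text exposition format."""
        lines = ["# TYPE http_requests_total counter"]
        for (method, path, status), value in sorted(self._counts.items()):
            lines.append(
                f'http_requests_total{{method="{method}",path="{path}",status="{status}"}} {value}'
            )
        return "\n".join(lines) + "\n"


if __name__ == "__main__":
    counter = RequestCounter()
    counter.observe("GET", "/users/41", 200)
    counter.observe("GET", "/users/42", 200)
    print(counter.render(), end="")
```

Both requests land in the same `path="/users/:id"` series, so the number of series grows with the number of routes rather than the number of users.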
- Alert on symptoms, not causes. "Error rate > 5%" is a symptom. "Database CPU > 80%" is a cause. Alert on the former; investigate the latter during triage.
- Every alert must be actionable. If an alert fires and the correct response is "ignore it", the alert should not exist.
- Every alert links to a runbook. The alert definition includes a URL pointing to the response procedure.
- Tune alerts to minimize noise. False positives erode trust. Use appropriate thresholds, windows, and "for" durations to avoid flapping.
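
The interaction between a threshold and a "for" duration can be illustrated with a small evaluator. This is a sketch under stated assumptions, not how any particular alerting system is implemented: the class name, the 300-second hold, and the runbook URL are hypothetical, and only the 5% error-rate threshold comes from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ErrorRateAlert:
    threshold: float = 0.05       # "Error rate > 5%" from the rules above
    for_seconds: float = 300.0    # assumed hold duration before firing
    runbook_url: str = "https://runbooks.example.com/high-error-rate"  # hypothetical
    _breach_started: Optional[float] = None

    def evaluate(self, now: float, errors: int, total: int) -> bool:
        """Return True when the alert should be firing at time `now` (seconds)."""
        rate = errors / total if total else 0.0
        if rate <= self.threshold:
            self._breach_started = None    # recovered: reset the pending state
            return False
        if self._breach_started is None:
            self._breach_started = now     # first breach: start the pending timer
        # Fire only once the breach has persisted for the full "for" duration.
        return now - self._breach_started >= self.for_seconds


if __name__ == "__main__":
    alert = ErrorRateAlert()
    for now, errors in [(0, 10), (120, 10), (300, 10), (360, 1), (400, 10)]:
        print(now, alert.evaluate(now, errors, total=100))
```

A brief spike that recovers before the hold duration elapses never fires, which is what keeps a noisy symptom from paging on-call.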
| Alert Severity | Response | Example |
|---|---|---|
| Critical | Page on-call immediately | Service down, data loss risk |
| Warning | Investigate during business hours | Error rate elevated, disk filling |
| Info | Review in next planning cycle | Deprecated API usage increasing |
- One dashboard per service at minimum, showing the RED signals.
- Golden signals visible at a glance: latency, traffic, errors, saturation.
- Time range selectable. Default to last 1 hour with options for 6h, 24h, 7d.
- Link from alert to dashboard. When an alert fires, the operator can navigate directly to the relevant dashboard.
| Panel | Metric |
|---|---|
| Request rate | http_requests_total rate |
| Error rate | 5xx responses as a percentage |
| Latency (p50, p95, p99) | http_request_duration_seconds quantiles |
| Resource utilization | CPU, memory, disk, connections |
| Business metrics | Application-specific KPIs |
Certain data must never appear in logs, regardless of log level:
| Category | Examples |
|---|---|
| Secrets | API keys, passwords, tokens, private keys |
| PII | Email addresses, phone numbers, government IDs, full names (unless required and documented) |
| Financial data | Credit card numbers, bank account numbers |
| Health data | Medical records, health status |
| Full request/response bodies | May contain any of the above. Log selectively. |
| Session tokens | Can be used to impersonate users |
- Redact at the source. Use logging middleware that strips sensitive fields before writing.
- Use allowlists, not blocklists. Explicitly list which fields to log rather than trying to exclude all sensitive ones.
- Audit log output. Periodically review production logs for accidental PII or secret exposure.
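
The allowlist approach can be sketched as a filter applied to each log entry before it is written. The `ALLOWED_FIELDS` set and the `redact` function are hypothetical; the field names are taken from the example log line earlier in this document, and each service would define its own list.

```python
from typing import Any, Dict, FrozenSet

# Hypothetical allowlist: only these fields are ever written to the log.
ALLOWED_FIELDS: FrozenSet[str] = frozenset({
    "level", "msg", "ts", "request_id",
    "method", "path", "status", "duration_ms",
})


def redact(entry: Dict[str, Any],
           allowed: FrozenSet[str] = ALLOWED_FIELDS) -> Dict[str, Any]:
    """Keep only allowlisted fields; anything not explicitly listed is dropped."""
    return {key: value for key, value in entry.items() if key in allowed}


if __name__ == "__main__":
    raw = {
        "level": "info",
        "msg": "Login succeeded",
        "request_id": "abc-123",
        "email": "user@example.com",   # PII: not on the allowlist, so dropped
        "password": "hunter2",         # secret: dropped
    }
    print(redact(raw))
```

A new field added by a developer stays out of the logs until someone deliberately adds it to the allowlist, which is the property a blocklist cannot provide. The filter does not inspect the `msg` text itself, so messages still need to avoid interpolating sensitive values.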
- Build-time logging (what `make check` outputs, how CI jobs report results) is defined in DEVELOPMENT.md Output & Logging. This document covers runtime application observability.
- For health check endpoints in containers, see Container Standards.
- Observability tooling choices (Prometheus, Grafana, Datadog, etc.) are project-specific. These standards define what to measure, not which vendor to use.