
feat(observability): Prometheus /metrics endpoint#43

Merged
MorganOnCode merged 1 commit into master from feat/prometheus-metrics on May 15, 2026

@MorganOnCode
Owner

Closes audit #12, the observability gap noted in the 2026-05-15 audit. Sentry covers errors and traces, but there are no latency percentiles, throughput, or queue-depth metrics, which blocks SLA validation and capacity planning.

What

Adds `GET /metrics`, which returns the standard Prometheus text exposition format (v0.0.4) and can be scraped by any Prometheus-compatible system.

Default metrics (from `prom-client`'s `collectDefaultMetrics`):

  • Node.js heap size / used
  • GC duration + counts
  • Event loop lag
  • Active handles / requests
  • CPU user/system seconds

Custom HTTP metrics (added by an `onResponse` hook):

  • `http_requests_total` (counter) labeled by `method`, `route`, `status_code`
  • `http_request_duration_seconds` (histogram) with realistic facilitator-latency buckets: 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s, 10s
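A minimal sketch of the two declarations with `prom-client`, assuming a dedicated `Registry`. The help strings and the `bucketCount` read-back helper are illustrative, not taken from the PR; the bucket list is the latencies above expressed in seconds, the unit the metric name promises.

```typescript
import { Counter, Histogram, Registry } from "prom-client";

const register = new Registry();
const labelNames = ["method", "route", "status_code"];

const httpRequestsTotal = new Counter({
  name: "http_requests_total",
  help: "Total HTTP requests",
  labelNames,
  registers: [register],
});

// Bucket upper bounds in seconds (5ms to 10s); prom-client appends +Inf itself.
const httpRequestDurationSeconds = new Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request duration in seconds",
  labelNames,
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
  registers: [register],
});

// Hypothetical helper: reads one cumulative bucket count back out of the
// text exposition (assumes a single label set has been observed).
async function bucketCount(le: string): Promise<number> {
  const text = await register.metrics();
  const line = text
    .split("\n")
    .find((l) => l.startsWith("http_request_duration_seconds_bucket{") && l.includes(`le="${le}"`));
  return line === undefined ? 0 : Number(line.slice(line.lastIndexOf(" ") + 1));
}
```

Buckets are cumulative: a 42 ms observation is counted in every bucket from `le="0.05"` up to `le="+Inf"`, which is what PromQL's `histogram_quantile` interpolates over when computing percentiles.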

Default label `service="cardano402"` on every series.
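Roughly how the default label and the default Node.js metrics could share one registry. This is a sketch assuming a dedicated `Registry` and a hypothetical `renderMetrics` helper; `setDefaultLabels`, `collectDefaultMetrics`, `metrics()` and `contentType` are existing `prom-client` APIs.

```typescript
import { Registry, collectDefaultMetrics } from "prom-client";

// A dedicated registry instead of prom-client's global default `register`.
const register = new Registry();

// Added to every series this registry exports, default metrics included.
register.setDefaultLabels({ service: "cardano402" });

// Node.js runtime metrics: heap, GC, event loop lag, handles, CPU seconds.
collectDefaultMetrics({ register });

// Hypothetical helper returning what a GET /metrics handler would send.
async function renderMetrics(): Promise<{ contentType: string; body: string }> {
  // For this registry, contentType is the v0.0.4 text format content type.
  return { contentType: register.contentType, body: await register.metrics() };
}
```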

Design choices

  • Route pattern, not raw URL, as the `route` label. `/files/abc123` and `/files/xyz789` both collapse onto `route="/files/:cid"`. This keeps cardinality bounded, so the time-series count won't explode even with many distinct file IDs.
  • /metrics and /health excluded from tracking. Excluding the former avoids recursive accounting (every scrape is itself a request); excluding the latter keeps k8s-style liveness-probe noise from skewing latency percentiles.
  • `prom-client` directly, not a wrapper plugin. The wrappers (`fastify-metrics` etc.) add abstraction and another upstream to track for little gain. ~50 lines of plugin code is clearer and easier to extend with custom metrics later (Redis ping latency, Blockfrost call duration, verify/settle outcomes), which are easy to add when the need shows up.
  • robots.txt updated to `Disallow: /metrics` so it's not crawl-indexed.
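Putting these choices together, a plugin of roughly this shape fits the description. It is a sketch, not the PR's `src/routes/metrics.ts`: it assumes Fastify v5 or a recent v4 release (for `request.routeOptions.url` and `reply.elapsedTime`), and `setupMetrics`, `buildApp`, the stub `/files/:cid` and `/health` handlers, and the `"unmatched"` label for 404s are hypothetical.

```typescript
import Fastify, { type FastifyInstance } from "fastify";
import { Counter, Histogram, Registry, collectDefaultMetrics } from "prom-client";

// Routes whose traffic is not recorded: scrapes and liveness probes.
const UNTRACKED_ROUTES = new Set(["/metrics", "/health"]);

function setupMetrics(app: FastifyInstance): Registry {
  const register = new Registry();
  register.setDefaultLabels({ service: "cardano402" });
  collectDefaultMetrics({ register });

  const labelNames = ["method", "route", "status_code"];
  const requests = new Counter({
    name: "http_requests_total",
    help: "Total HTTP requests",
    labelNames,
    registers: [register],
  });
  const duration = new Histogram({
    name: "http_request_duration_seconds",
    help: "HTTP request duration in seconds",
    labelNames,
    buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
    registers: [register],
  });

  app.addHook("onResponse", async (request, reply) => {
    // Label by the registered route template ("/files/:cid"), never the raw URL.
    // Requests that matched no route collapse into one "unmatched" value.
    const route = request.routeOptions.url ?? "unmatched";
    if (UNTRACKED_ROUTES.has(route)) return;
    const labels = { method: request.method, route, status_code: String(reply.statusCode) };
    requests.inc(labels);
    duration.observe(labels, reply.elapsedTime / 1000); // elapsedTime is in ms
  });

  app.get("/metrics", async (_request, reply) => {
    reply.type(register.contentType);
    return register.metrics();
  });

  return register;
}

// Hypothetical app with stub handlers, to show the wiring order.
function buildApp(): FastifyInstance {
  const app = Fastify();
  setupMetrics(app); // on the root instance, before other routes
  app.get("/files/:cid", async (request) => ({ cid: (request.params as { cid: string }).cid }));
  app.get("/health", async () => ({ status: "ok" }));
  return app;
}
```

Calling `setupMetrics` directly on the root instance (rather than through `app.register`) means the `onResponse` hook is not encapsulated inside a child context, so it covers every route without needing `fastify-plugin`.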

What it doesn't do (yet, deliberately)

  • No authentication on /metrics. For a single-VPS setup with no separate Prometheus server yet, this is acceptable: the metrics don't leak secrets and at worst reveal traffic patterns. If/when you stand up Prometheus (or want to lock this down), the easy paths are a shared-secret `?token=...` query param, binding /metrics to localhost only, or wrapping it with nginx basic-auth. Defer until there's a real scraper.
  • No custom payment-flow metrics. I'd want `verify_total` / `settle_total` labeled by outcome, `blockfrost_call_duration_seconds`, etc. They're quick to add once the framework is in, but out of scope for this PR: the goal here is the infrastructure.
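If the shared-secret option is picked later, the check itself is small. A sketch with a hypothetical `metricsTokenValid` helper; where the expected secret comes from (an environment variable, say) is left open. In Node, `crypto.timingSafeEqual` is the usual tool for the comparison; the loop here just keeps the example dependency-free.

```typescript
// Hypothetical helper: true only if the request URL carries ?token=<expected>.
function metricsTokenValid(requestUrl: string, expected: string): boolean {
  // Never accept anything when no secret has been configured.
  if (expected.length === 0) return false;
  // A base URL is needed because Fastify-style request URLs are path-only.
  const provided = new URL(requestUrl, "http://localhost").searchParams.get("token");
  if (provided === null || provided.length !== expected.length) return false;
  // Compare every character so timing doesn't reveal how long the matching prefix is.
  let diff = 0;
  for (let i = 0; i < expected.length; i++) {
    diff |= provided.charCodeAt(i) ^ expected.charCodeAt(i);
  }
  return diff === 0;
}
```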
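Once the framework is in, an outcome-labelled timing metric takes a few lines with `Histogram.startTimer`, whose returned function records the elapsed seconds and accepts extra labels when it is called. The metric name comes from the wish-list above; its label set, buckets, and the `timed` wrapper are assumptions.

```typescript
import { Histogram, Registry } from "prom-client";

const register = new Registry();

// Name from the PR's wish-list; labels and buckets are illustrative assumptions.
const blockfrostCallDuration = new Histogram({
  name: "blockfrost_call_duration_seconds",
  help: "Duration of outbound Blockfrost API calls in seconds",
  labelNames: ["operation", "outcome"],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
  registers: [register],
});

// Times an async call and records it with outcome="ok" or outcome="error".
// `operation` should come from a fixed set of names to keep cardinality bounded.
async function timed<T>(operation: string, call: () => Promise<T>): Promise<T> {
  const stop = blockfrostCallDuration.startTimer({ operation });
  try {
    const result = await call();
    stop({ outcome: "ok" });
    return result;
  } catch (err) {
    stop({ outcome: "error" });
    throw err;
  }
}
```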

Test plan

  • `pnpm typecheck` clean
  • `pnpm lint` clean
  • `pnpm test` — 34 files / 452 tests passing (+9 new for metrics)
  • CI passes on this branch
  • After deploy: `curl http://localhost:3000/metrics` returns the Prometheus text format
  • After traffic: `curl http://localhost:3000/metrics | grep http_requests_total` shows non-zero counters for hit routes
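The grep check can also be scripted for a post-deploy smoke test. This hypothetical `sumSamples` helper totals every sample of one metric in a text-format scrape (for example, all `http_requests_total` series across routes and status codes). It is a rough sketch, not a full parser of the exposition format.

```typescript
// Hypothetical helper: sum of all samples for one metric name in a
// Prometheus text-format scrape.
function sumSamples(exposition: string, metric: string): number {
  let total = 0;
  for (const rawLine of exposition.split("\n")) {
    const line = rawLine.trim();
    // Skip blank lines and # HELP / # TYPE comments.
    if (line === "" || line.startsWith("#")) continue;
    // The metric name ends at the label block "{" or at the first whitespace.
    const nameEnd = line.search(/[{\s]/);
    if (nameEnd === -1 || line.slice(0, nameEnd) !== metric) continue;
    // The value is the first token after the label block (a timestamp may follow).
    const afterLabels =
      line[nameEnd] === "{" ? line.slice(line.lastIndexOf("}") + 1) : line.slice(nameEnd);
    const value = Number(afterLabels.trim().split(/\s+/)[0]);
    if (Number.isFinite(value)) total += value;
  }
  return total;
}
```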

🤖 Generated with Claude Code

Closes audit #12: the observability gap noted in the 2026-05-15 audit
(Sentry covers errors but not latency percentiles, throughput, or queue
depth, which blocks SLA validation and capacity planning).

Adds a small focused plugin at src/routes/metrics.ts using prom-client
directly (no wrapper plugin):

- GET /metrics returns the standard Prometheus text format (v0.0.4)
- Registers Node.js default metrics: heap, GC, event loop lag, CPU
- Adds two custom metrics labeled by method/route/status_code:
  - http_requests_total (counter)
  - http_request_duration_seconds (histogram with realistic facilitator
    latency buckets: 5ms to 10s)
- Default label `service="cardano402"` on every series
- Route label uses the templated pattern (e.g. "/files/:cid"), not the
  raw URL, so cardinality stays bounded and the time-series count won't explode
- Skips /metrics itself (recursive accounting) and /health (k8s-style
  liveness-probe noise that would skew latency percentiles)

robots.txt updated to Disallow /metrics so it isn't crawl-indexed.

9 new tests cover: content type + Prometheus format, default Node.js
metrics presence, custom counter/histogram presence, request tracking
across multiple calls, route-pattern cardinality boundedness, method
+ status_code labels, and the /metrics + /health exclusions.

Full suite: 34 files / 452 tests passing (was 33 / 443).

No production behaviour change: the plugin only adds a new GET route
and an onResponse hook that does in-memory increments. Memory cost is
trivial (prom-client is ~50KB).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MorganOnCode merged commit a7d1e78 into master on May 15, 2026
5 checks passed
MorganOnCode deleted the feat/prometheus-metrics branch on May 15, 2026 at 10:39