
feat(security): add TLS for all inter-service communication (closes #26) #201

Merged
mvillmow merged 3 commits into main from 26-auto-impl on May 10, 2026


mvillmow commented May 4, 2026

Summary

Addresses the S8 security audit finding from issue #26: all inter-service paths in the argus stack used plain HTTP, transmitting the Grafana admin password and sensitive operational data (agent names, task details, logs) in cleartext.

Two-tier strategy

| Tier | Path | Mechanism |
|---|---|---|
| 1 (high-priority) | exporter → Agamemnon/NATS/Nestor | TLS env var support + Tailscale recommendation (cross-host WSL2 boundary) |
| 2 (best-practice) | Docker-internal services | Self-signed CA + per-service certificates |

Files changed

  • certs/gen-certs.sh: generates a self-signed CA and per-service certs with correct SANs; certs/.gitignore ensures private keys are never committed
  • exporter/exporter.py: AGAMEMNON_TLS_CA / NESTOR_TLS_CA / NATS_TLS_CA env vars; _build_ssl_context() creates ssl.SSLContext when a CA file is configured; TLS_VERIFY=false escape hatch for dev
  • docker-compose.yml: cert volume mounts for all services; Grafana HTTPS via GF_SERVER_PROTOCOL=https; TLS env vars for argus-exporter
  • configs/prometheus.yml: tls_server_config block; self-monitoring job updated to scheme: https
  • configs/loki.yml: server.http_tls_config block
  • configs/promtail.yml: Loki push URL changed to https://; tls_config CA file for client verification
  • configs/grafana/datasources.yml: URLs updated to https://; tlsAuthWithCACert: true
  • justfile: gen-certs recipe; reload-prometheus, test-scrape, import-dashboards use https:// with --cacert certs/ca.crt
  • pixi.toml: adds python, pytest dependencies and test task
  • docs/tls-setup.md: full runbook covering cert generation, Tailscale recommendation for cross-host paths, cert rotation, troubleshooting
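
For readers who have not opened the diff, the `_build_ssl_context()` behaviour listed for exporter/exporter.py could look roughly like the sketch below. It is an illustration under assumptions, not the merged code: the function name `build_ssl_context`, its signature, and the order in which the empty-CA and `TLS_VERIFY=false` cases are checked are all hypothetical. Only the idea of returning an `ssl.SSLContext` when a CA file is configured comes from the description above.

```python
import ssl
from typing import Optional


def build_ssl_context(ca_file: str, verify: bool = True) -> Optional[ssl.SSLContext]:
    """Hypothetical stand-in for the exporter's _build_ssl_context().

    Returns None when no CA file is configured, so the caller can keep
    using plain HTTP for that upstream.
    """
    if not ca_file:
        # Assumption: an empty *_TLS_CA value means "TLS not enabled".
        return None

    if not verify:
        # Assumption: TLS_VERIFY=false still encrypts but skips certificate
        # and hostname checks. Intended for development only.
        ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
        ctx.check_hostname = False  # must be disabled before CERT_NONE
        ctx.verify_mode = ssl.CERT_NONE
        return ctx

    # Passing cafile makes the context trust that CA file instead of the
    # system trust store, which suits a self-signed CA.
    return ssl.create_default_context(cafile=ca_file)
```

Returning `None` rather than a permissive context keeps the "no TLS configured" case explicit at the call site.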

Test plan

  • 13 pytest tests in tests/test_exporter_tls.py covering _build_ssl_context, _fetch, _health_check TLS context threading, and env var wiring — all pass (python3 -m pytest tests/ -v)
  • Run just gen-certs locally — verify certs/ca.crt and per-service certs are generated
  • Run just start — verify all containers start without TLS errors
  • Run just test-scrape — verify up metric returns 1 for all targets (over HTTPS)
  • Open https://localhost:3001 in browser — accept or trust certs/ca.crt
  • Verify Grafana dashboards load data from Prometheus and Loki datasources
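
As a rough idea of what a "TLS context threading" test might check, the sketch below patches `urllib.request.urlopen` and asserts that the context object reaches it unchanged. Everything here is assumed for illustration: the `fetch` helper, its parameters, the use of `urllib`, and the example URL are hypothetical and may not match `_fetch` or the tests in tests/test_exporter_tls.py.

```python
import ssl
import urllib.request
from unittest import mock


def fetch(url: str, ssl_context=None, timeout: float = 5.0) -> bytes:
    """Hypothetical stand-in for the exporter's _fetch()."""
    with urllib.request.urlopen(url, timeout=timeout, context=ssl_context) as resp:
        return resp.read()


def test_fetch_threads_ssl_context() -> None:
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)

    # Fake response usable as a context manager; no network traffic occurs.
    fake_resp = mock.MagicMock()
    fake_resp.__enter__.return_value.read.return_value = b"{}"

    with mock.patch("urllib.request.urlopen", return_value=fake_resp) as urlopen:
        body = fetch("https://nestor.example:8443/health", ssl_context=ctx)

    assert body == b"{}"
    # The exact context object must be forwarded, not a copy or a default.
    assert urlopen.call_args.kwargs["context"] is ctx
```

Under pytest the function is collected by its `test_` prefix; it can also be called directly.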

Backward compatibility

The exporter's AGAMEMNON_TLS_CA="" / NESTOR_TLS_CA="" / NATS_TLS_CA="" defaults preserve plain-HTTP behaviour for upstream services that haven't yet added TLS — avoiding SSL_ERROR_RX_RECORD_TOO_LONG errors. No external services (Agamemnon, Nestor, NATS) are modified.
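
The empty-string defaults can be pictured with the following sketch. The helper names `tls_ca_settings` and `uses_tls` are hypothetical; only the three env var names and their empty defaults come from the description above. The underlying concern is that a TLS client talking to a plain-HTTP listener fails at the handshake, so TLS should only be switched on per upstream once a CA path is supplied.

```python
import os
from typing import Dict, Mapping

# Upstream services named in this PR; each has a <NAME>_TLS_CA variable.
SERVICES = ("AGAMEMNON", "NESTOR", "NATS")


def tls_ca_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Map each upstream to its CA path; '' means stay on plain HTTP."""
    return {name: env.get(f"{name}_TLS_CA", "") for name in SERVICES}


def uses_tls(ca_path: str) -> bool:
    """Assumption: TLS is enabled only when a non-empty CA path is set."""
    return bool(ca_path)
```

With no variables set, every upstream maps to an empty string and `uses_tls` returns `False` for all of them, which matches the plain-HTTP default described above.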

Closes #26

🤖 Generated with Claude Code

mvillmow enabled auto-merge (squash) May 5, 2026 00:53
mvillmow force-pushed the 26-auto-impl branch 5 times, most recently from 6570824 to 870af4d May 10, 2026 02:34
mvillmow and others added 2 commits May 9, 2026 19:54
Addresses the S8 security audit finding (issue #26): all inter-service
paths in the argus stack were using plain HTTP, transmitting credentials
and sensitive operational data in cleartext.

Changes:
- certs/gen-certs.sh: script to generate a self-signed CA and
  per-service certificates (prometheus, loki, grafana, promtail,
  argus-exporter) with correct SANs
- certs/.gitignore: ensures private keys are never committed
- exporter/exporter.py: add AGAMEMNON_TLS_CA / NESTOR_TLS_CA /
  NATS_TLS_CA env vars; _build_ssl_context() creates an ssl.SSLContext
  when a CA file is configured; TLS_VERIFY=false escape hatch for dev
- docker-compose.yml: mount certs/ into all services; add TLS env vars
  for argus-exporter; enable Grafana HTTPS via GF_SERVER_PROTOCOL=https
- configs/prometheus.yml: add tls_server_config block; update
  self-monitoring job to scheme: https
- configs/loki.yml: add server.http_tls_config block
- configs/promtail.yml: change Loki push URL to https://; add tls_config
  with CA file for client verification
- configs/grafana/datasources.yml: update Prometheus and Loki URLs to
  https://; add tlsAuthWithCACert: true
- justfile: add gen-certs recipe; update reload-prometheus, test-scrape,
  import-dashboards to use https:// and --cacert certs/ca.crt
- pixi.toml: add python and pytest dependencies; add test task
- docs/tls-setup.md: runbook covering cert generation, Tailscale
  recommendation for cross-host WSL2 paths, cert rotation, and
  troubleshooting (SSL_ERROR_RX_RECORD_TOO_LONG)
- tests/test_exporter_tls.py: 13 pytest tests covering _build_ssl_context,
  _fetch, _health_check TLS context threading, and env var wiring

Closes #26

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

yamllint [colons] rule fails when key_file has two spaces before the value.
Both configs/loki.yml:8 and configs/prometheus.yml:17 had double spaces.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The collect() function used ThreadPoolExecutor but it was not imported,
causing ruff F821 'Undefined name' lint failure in CI.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

mvillmow merged commit 53ed702 into main May 10, 2026
14 of 17 checks passed
mvillmow deleted the 26-auto-impl branch May 10, 2026 02:55
Development

Successfully merging this pull request may close these issues.

[Audit] S8 Security: No TLS configured for inter-service communication