feat(security): add TLS for all inter-service communication (closes #26) #201
Merged
Addresses the S8 security audit finding (issue #26): all inter-service paths in the argus stack were using plain HTTP, transmitting credentials and sensitive operational data in cleartext.
Changes:
- certs/gen-certs.sh: script to generate a self-signed CA and per-service certificates (prometheus, loki, grafana, promtail, argus-exporter) with correct SANs
- certs/.gitignore: ensures private keys are never committed
- exporter/exporter.py: add AGAMEMNON_TLS_CA / NESTOR_TLS_CA / NATS_TLS_CA env vars; _build_ssl_context() creates an ssl.SSLContext when a CA file is configured; TLS_VERIFY=false escape hatch for dev
- docker-compose.yml: mount certs/ into all services; add TLS env vars for argus-exporter; enable Grafana HTTPS via GF_SERVER_PROTOCOL=https
- configs/prometheus.yml: add tls_server_config block; update self-monitoring job to scheme: https
- configs/loki.yml: add server.http_tls_config block
- configs/promtail.yml: change Loki push URL to https://; add tls_config with CA file for client verification
- configs/grafana/datasources.yml: update Prometheus and Loki URLs to https://; add tlsAuthWithCACert: true
- justfile: add gen-certs recipe; update reload-prometheus, test-scrape, import-dashboards to use https:// and --cacert certs/ca.crt
- pixi.toml: add python and pytest dependencies; add test task
- docs/tls-setup.md: runbook covering cert generation, Tailscale recommendation for cross-host WSL2 paths, cert rotation, and troubleshooting (SSL_ERROR_RX_RECORD_TOO_LONG)
- tests/test_exporter_tls.py: 13 pytest tests covering _build_ssl_context, _fetch, _health_check TLS context threading, and env var wiring
Closes #26
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
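The exporter change in this commit can be pictured with a minimal sketch. The actual body of _build_ssl_context() is not shown on this page, so the function name build_ssl_context_sketch, its signature, and the way the TLS_VERIFY setting reaches it as a boolean are all assumptions made for illustration. The sketch only shows the behaviour the commit message describes: no context without a CA file, a context that trusts the given CA otherwise, and a dev-only switch that disables verification.

```python
import ssl
from typing import Optional


def build_ssl_context_sketch(ca_file: str, verify: bool = True) -> Optional[ssl.SSLContext]:
    """Hypothetical stand-in for the exporter's _build_ssl_context().

    Returns None when no CA file is configured, so the caller can keep
    using an unencrypted connection for that upstream.
    """
    if not ca_file:
        return None

    if not verify:
        # Dev-only escape hatch (TLS_VERIFY=false in the PR): the connection
        # is still encrypted, but the server certificate is not checked.
        ctx = ssl.create_default_context()
        ctx.check_hostname = False  # must be disabled before CERT_NONE is set
        ctx.verify_mode = ssl.CERT_NONE
        return ctx

    # Passing cafile makes the context trust that CA bundle instead of
    # loading the system default certificates.
    return ssl.create_default_context(cafile=ca_file)
```

A context built this way could be handed to a standard-library client, for example through the context argument of urllib.request.urlopen; whether the exporter does exactly that is not confirmed by this page.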
The yamllint [colons] rule fails when key_file has two spaces before the value. Both configs/loki.yml:8 and configs/prometheus.yml:17 had double spaces.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The collect() function used ThreadPoolExecutor without importing it, causing a ruff F821 ('Undefined name') lint failure in CI.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
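To make the F821 fix concrete, here is a small sketch of the pattern involved. The real collect() body is not shown on this page, so collect_sketch, its checks argument, and the float return values are hypothetical; only the import line reflects what the commit message says was missing.

```python
from concurrent.futures import ThreadPoolExecutor  # the import the fix adds
from typing import Callable, Dict


def collect_sketch(checks: Dict[str, Callable[[], float]]) -> Dict[str, float]:
    """Run each named check in a worker thread and gather the results.

    Hypothetical illustration only; the exporter's collect() may differ.
    """
    # max_workers must be at least 1, even when there is nothing to run.
    with ThreadPoolExecutor(max_workers=max(1, len(checks))) as pool:
        futures = {name: pool.submit(fn) for name, fn in checks.items()}
        # result() blocks until that check finishes and re-raises any
        # exception the check raised.
        return {name: fut.result() for name, fut in futures.items()}
```

Without the import, the name ThreadPoolExecutor is undefined at call time, which is what ruff reports statically as F821.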
Summary
Addresses the S8 security audit finding from issue #26: all inter-service paths in the argus stack used plain HTTP, transmitting the Grafana admin password and sensitive operational data (agent names, task details, logs) in cleartext.
Two-tier strategy
Files changed
- certs/gen-certs.sh: generates a self-signed CA and per-service certs with correct SANs; certs/.gitignore ensures private keys are never committed
- exporter/exporter.py: AGAMEMNON_TLS_CA / NESTOR_TLS_CA / NATS_TLS_CA env vars; _build_ssl_context() creates ssl.SSLContext when a CA file is configured; TLS_VERIFY=false escape hatch for dev
- docker-compose.yml: cert volume mounts for all services; Grafana HTTPS via GF_SERVER_PROTOCOL=https; TLS env vars for argus-exporter
- configs/prometheus.yml: tls_server_config block; self-monitoring job updated to scheme: https
- configs/loki.yml: server.http_tls_config block
- configs/promtail.yml: Loki push URL changed to https://; tls_config CA file for client verification
- configs/grafana/datasources.yml: URLs updated to https://; tlsAuthWithCACert: true
- justfile: gen-certs recipe; reload-prometheus, test-scrape, import-dashboards use https:// with --cacert certs/ca.crt
- pixi.toml: adds python, pytest dependencies and test task
- docs/tls-setup.md: full runbook covering cert generation, Tailscale recommendation for cross-host paths, cert rotation, troubleshooting
Test plan
- tests/test_exporter_tls.py: 13 pytest tests covering _build_ssl_context, _fetch, _health_check TLS context threading, and env var wiring; all pass (python3 -m pytest tests/ -v)
- just gen-certs locally: verify certs/ca.crt and per-service certs are generated
- just start: verify all containers start without TLS errors
- just test-scrape: verify up metric returns 1 for all targets (over HTTPS)
- https://localhost:3001 in browser: accept or trust certs/ca.crt
Backward compatibility
The exporter's AGAMEMNON_TLS_CA="" / NESTOR_TLS_CA="" / NATS_TLS_CA="" defaults preserve plain-HTTP behaviour for upstream services that haven't yet added TLS, avoiding SSL_ERROR_RX_RECORD_TOO_LONG errors. No external services (Agamemnon, Nestor, NATS) are modified.
Closes #26
🤖 Generated with Claude Code
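The empty-string defaults described under Backward compatibility can be sketched as follows. The three variable names come from this PR; the helper resolve_ca_files, its return shape, and the whitespace stripping are assumptions for illustration and are not taken from exporter.py.

```python
from typing import Dict, Mapping, Optional

# Env var names as listed in the PR description.
TLS_CA_VARS = ("AGAMEMNON_TLS_CA", "NESTOR_TLS_CA", "NATS_TLS_CA")


def resolve_ca_files(env: Mapping[str, str]) -> Dict[str, Optional[str]]:
    """Map each upstream's env var to a CA path, or None when unset or empty.

    Hypothetical helper: None stands for "no CA configured, keep the
    existing unencrypted connection to this upstream".
    """
    return {name: (env.get(name, "").strip() or None) for name in TLS_CA_VARS}
```

Called as resolve_ca_files(os.environ), an unmodified deployment would get None for all three upstreams, so nothing attempts a TLS handshake against a server that still speaks plain HTTP. Setting only one variable opts that single upstream into TLS while the others stay unchanged.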