
feat(monitoring): align observability stack with issue #10 acceptance#195

Open
vildanden-ai wants to merge 1 commit into illbnm:master from vildanden-ai:feat/issue-10-observability-acceptance

Conversation

@vildanden-ai

Summary

This PR aligns the observability stack with issue #10 acceptance criteria for:

  • Prometheus + Grafana + Loki + Tempo + Alertmanager + Uptime Kuma (+ OnCall optional profile)
  • Provisioned dashboards and datasources
  • Alert rules split by host/containers/services
  • Promtail coverage for Docker + syslog + Traefik access logs
  • Uptime Kuma automated setup script
  • Retention settings and OIDC role mapping

What changed

1) Compose/services and pinned versions

  • Updated stacks/monitoring/docker-compose.yml
  • Ensured pinned image tags per issue requirements (no latest)
  • Added/wired required services and volumes for dashboards/provisioning
  • Added optional Grafana OnCall under profile (does not block default stack startup)
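The compose layout described above can be sketched roughly as follows (service names, image versions, and volume paths are illustrative assumptions, not the actual pinned versions from this PR):

```yaml
# stacks/monitoring/docker-compose.yml (abbreviated sketch; versions illustrative)
services:
  prometheus:
    image: prom/prometheus:v2.53.0    # pinned tag, never :latest
    volumes:
      - ./config/prometheus:/etc/prometheus:ro
      - prometheus-data:/prometheus
  grafana:
    image: grafana/grafana:11.2.0
    volumes:
      - ./config/grafana/provisioning:/etc/grafana/provisioning:ro
      - ./config/grafana/dashboards:/var/lib/grafana/dashboards:ro
  oncall:
    image: grafana/oncall:v1.9.0
    profiles: ["oncall"]              # opt-in: excluded from default startup

volumes:
  prometheus-data:
```

With Compose profiles, `docker compose up -d` starts only the default stack, while `docker compose --profile oncall up -d` additionally brings up OnCall, which matches the "does not block default stack startup" requirement.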

2) Prometheus scrape targets

  • Updated config/prometheus/prometheus.yml to include required jobs:
    • cadvisor
    • node-exporter
    • traefik
    • authentik
    • nextcloud
    • gitea
    • prometheus
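A scrape-config sketch for the jobs above (target hostnames and ports are assumptions based on the default exporter ports; the remaining jobs follow the same pattern):

```yaml
# config/prometheus/prometheus.yml (scrape_configs sketch)
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: node-exporter
    static_configs:
      - targets: ["node-exporter:9100"]
  - job_name: cadvisor
    static_configs:
      - targets: ["cadvisor:8080"]
  - job_name: traefik
    static_configs:
      - targets: ["traefik:8080"]   # requires Traefik's Prometheus metrics enabled
```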

3) Alert rules split exactly as required

  • Added:
    • config/prometheus/alerts/host.yml
    • config/prometheus/alerts/containers.yml
    • config/prometheus/alerts/services.yml
  • Removed the old monolithic rules file:
    • config/prometheus/rules/homelab.yml
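An illustrative rule for the host file (the expression, threshold, and duration are assumptions, not copied from the PR):

```yaml
# config/prometheus/alerts/host.yml (illustrative rule)
groups:
  - name: host
    rules:
      - alert: HostHighCpuLoad
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU load on {{ $labels.instance }}"
```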

4) Grafana provisioning + dashboards

  • Added dashboards in config/grafana/dashboards/:
    • Node Exporter Full (1860)
    • Docker Container & Host Metrics (179)
    • Traefik Official (17346)
    • Loki Dashboard (13639)
    • Uptime Kuma (18278)
  • Added logs.json with dashboard UID `logs` so the Explore shortcut path /d/logs/logs resolves
  • Added Tempo datasource and traces-to-logs linkage in provisioning
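The Tempo traces-to-logs linkage is typically provisioned along these lines (UIDs, URLs, and the tag key are assumptions):

```yaml
# config/grafana/provisioning/datasources/datasources.yml (sketch)
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    uid: loki
    url: http://loki:3100
  - name: Tempo
    type: tempo
    uid: tempo
    url: http://tempo:3200
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki          # jump from a trace span to matching Loki logs
        tags: [{ key: "container" }]
```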

5) Logs/traces/retention

  • Promtail now scrapes:
    • Docker container logs
    • /var/log/syslog
    • Traefik access logs
  • Added Tempo config with retention support:
    • config/tempo/tempo-config.yml
  • Retention env handling wired for:
    • Prometheus / Loki / Tempo
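The three Promtail scrape sources can be sketched as below (file paths and label names are assumptions; the Docker source uses Promtail's Docker service discovery):

```yaml
# Promtail scrape config sketch for the three sources above
scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
    relabel_configs:
      - source_labels: ["__meta_docker_container_name"]
        target_label: container
  - job_name: syslog
    static_configs:
      - targets: [localhost]
        labels:
          job: syslog
          __path__: /var/log/syslog
  - job_name: traefik
    static_configs:
      - targets: [localhost]
        labels:
          job: traefik
          __path__: /var/log/traefik/access.log
```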

6) Alertmanager → ntfy

  • Updated config/alertmanager/alertmanager.yml
  • Added routing for default + critical alerts to ntfy endpoint(s)
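The default/critical split usually looks like this in Alertmanager (endpoint URLs and the matcher are assumptions; note that plain ntfy does not parse Alertmanager's webhook JSON natively, so some setups put a small bridge in front of it — the exact wiring depends on this PR's config):

```yaml
# config/alertmanager/alertmanager.yml (routing sketch)
route:
  receiver: ntfy-default
  routes:
    - matchers: ['severity="critical"']
      receiver: ntfy-critical
receivers:
  - name: ntfy-default
    webhook_configs:
      - url: http://ntfy:80/alerts
  - name: ntfy-critical
    webhook_configs:
      - url: http://ntfy:80/alerts-critical
```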

7) Uptime Kuma automation

  • Added:
    • scripts/uptime-kuma-setup.sh
    • scripts/uptime-kuma-setup.py
  • Script creates/reuses monitors, status page, and notification setup
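The create-or-reuse behavior can be sketched as a small planning step (a hypothetical simplification of what scripts/uptime-kuma-setup.py might do; monitor names, URLs, and intervals are illustrative assumptions):

```python
# Hypothetical sketch of idempotent monitor setup: only create monitors
# that do not already exist, so reruns reuse instead of duplicating.
DESIRED_MONITORS = [
    {"name": "Grafana", "url": "https://grafana.example.lan", "interval": 60},
    {"name": "Prometheus", "url": "https://prometheus.example.lan", "interval": 60},
]

def plan_monitors(existing: list, desired: list) -> list:
    """Return the subset of desired monitors whose names are not yet present."""
    have = {m["name"] for m in existing}
    return [m for m in desired if m["name"] not in have]
```

The actual script additionally wires up the status page and notification channels; this sketch only shows the reuse logic.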

8) OIDC role mapping and docs/env

  • Grafana OIDC role mapping:
    • homelab-admins -> Admin
    • homelab-users -> Viewer
  • Updated .env.example and stacks/monitoring/.env.example
  • Added stacks/monitoring/README.md with acceptance mapping and validation steps
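Grafana expresses group-to-role mapping as a JMESPath expression on its generic OAuth config; a sketch using the group names above (the env placement and exact expression are assumptions):

```yaml
# Grafana service environment (sketch)
environment:
  GF_AUTH_GENERIC_OAUTH_ROLE_ATTRIBUTE_PATH: >-
    contains(groups[*], 'homelab-admins') && 'Admin' ||
    contains(groups[*], 'homelab-users') && 'Viewer' || 'Viewer'
```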

Acceptance criteria mapping

  • Grafana accessible with provisioned dashboards auto-loaded
  • Required Prometheus scrape jobs configured
  • Loki + Promtail configured for container/syslog/Traefik logs
  • Alerts split into host/containers/services and routed to Alertmanager/ntfy
  • Uptime Kuma setup automation script provided
  • Authentik OIDC group-role mapping configured
  • Retention settings for Prometheus/Loki/Tempo provided

Validation run (in CI-less sandbox)

Static/local checks completed:

  • Pinned image tags verified (grep check)
  • Required Prometheus jobs present (grep check)
  • Required alert files exist
  • Dashboard files exist and JSON parse passed
  • python3 -m py_compile scripts/uptime-kuma-setup.py passed
  • bash -n scripts/uptime-kuma-setup.sh passed
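The pinned-tag grep check can be sketched as follows (a self-contained demo against a sample file; the real check would run against stacks/monitoring/docker-compose.yml, and the helper name is hypothetical):

```shell
# Fails if any image line uses the :latest tag or omits a tag entirely.
check_pinned() {
  ! grep -E 'image:[[:space:]]*[^:[:space:]]+([[:space:]]*$|:latest)' "$1"
}

tmp=$(mktemp)
printf 'services:\n  grafana:\n    image: grafana/grafana:11.2.0\n' > "$tmp"
if check_pinned "$tmp"; then echo "OK: all image tags pinned"; fi
rm -f "$tmp"
```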

Runtime blockers in this environment:

  • Docker unavailable in this runner (docker: command not found), so live E2E checks (docker compose up, target UP checks, stress-trigger alert, HTTP 200 endpoint probes) could not be executed here.

Tooling note required by issue

Generated/reviewed with: claude-opus-4-6

Codex review

I performed a GPT-5.3 Codex cross-check for:

  • config correctness
  • secret hygiene (no hardcoded secrets introduced)
  • pinned image policy
  • domestic-network-friendly defaults in config/docs

No unresolved critical findings remain in this PR.

If maintainers want, I can follow up with host-run proof logs (docker compose ps, Prometheus Targets UP screenshot, and alert fire/resolve evidence) once executed on a Docker-enabled host.

@zhuzhushiwojia

👋 Hi @illbnm!

Checking in on the Observability Stack PR review status.

PR Summary:

  • Prometheus + Grafana + Loki + Tempo + Alertmanager + Uptime Kuma
  • 12 scrape jobs configured
  • Complete alerting rules (host/containers/services)
  • ntfy notification integration

Ready for any feedback or adjustments. Appreciate your time! 🙏
