
feat(monitoring): align observability stack with issue #10 acceptance#195

Open
vildanden-ai wants to merge 1 commit into illbnm:master from vildanden-ai:feat/issue-10-observability-acceptance

Conversation

@vildanden-ai

Summary

This PR aligns the observability stack with issue #10 acceptance criteria for:

  • Prometheus + Grafana + Loki + Tempo + Alertmanager + Uptime Kuma (+ OnCall optional profile)
  • Provisioned dashboards and datasources
  • Alert rules split by host/containers/services
  • Promtail coverage for Docker + syslog + Traefik access logs
  • Uptime Kuma automated setup script
  • Retention settings and OIDC role mapping

What changed

1) Compose/services and pinned versions

  • Updated stacks/monitoring/docker-compose.yml
  • Ensured pinned image tags per issue requirements (no latest)
  • Added/wired required services and volumes for dashboards/provisioning
  • Added optional Grafana OnCall under profile (does not block default stack startup)
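The compose layout described above can be sketched roughly as follows (service names, image versions, and volume paths are illustrative assumptions, not the actual pinned versions from this PR):

```yaml
# stacks/monitoring/docker-compose.yml (abbreviated sketch; versions illustrative)
services:
  prometheus:
    image: prom/prometheus:v2.53.0    # pinned tag, never :latest
    volumes:
      - ./config/prometheus:/etc/prometheus:ro
      - prometheus-data:/prometheus
  grafana:
    image: grafana/grafana:11.2.0
    volumes:
      - ./config/grafana/provisioning:/etc/grafana/provisioning:ro
      - ./config/grafana/dashboards:/var/lib/grafana/dashboards:ro
  oncall:
    image: grafana/oncall:v1.9.0
    profiles: ["oncall"]              # opt-in: excluded from default startup

volumes:
  prometheus-data:
```

With Compose profiles, `docker compose up -d` starts only the default stack, while `docker compose --profile oncall up -d` additionally brings up OnCall, which matches the "does not block default stack startup" requirement.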

2) Prometheus scrape targets

  • Updated config/prometheus/prometheus.yml to include required jobs:
    • cadvisor
    • node-exporter
    • traefik
    • authentik
    • nextcloud
    • gitea
    • prometheus
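A scrape-config sketch for the jobs above (target hostnames and ports are assumptions based on the default exporter ports; the remaining jobs follow the same pattern):

```yaml
# config/prometheus/prometheus.yml (scrape_configs sketch)
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: node-exporter
    static_configs:
      - targets: ["node-exporter:9100"]
  - job_name: cadvisor
    static_configs:
      - targets: ["cadvisor:8080"]
  - job_name: traefik
    static_configs:
      - targets: ["traefik:8080"]   # requires Traefik's Prometheus metrics enabled
```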

3) Alert rules split exactly as required

  • Added:
    • config/prometheus/alerts/host.yml
    • config/prometheus/alerts/containers.yml
    • config/prometheus/alerts/services.yml
  • Removed the old monolithic rules file:
    • config/prometheus/rules/homelab.yml
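An illustrative rule for the host file (the expression, threshold, and duration are assumptions, not copied from the PR):

```yaml
# config/prometheus/alerts/host.yml (illustrative rule)
groups:
  - name: host
    rules:
      - alert: HostHighCpuLoad
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU load on {{ $labels.instance }}"
```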

4) Grafana provisioning + dashboards

  • Added dashboards in config/grafana/dashboards/:
    • Node Exporter Full (1860)
    • Docker Container & Host Metrics (179)
    • Traefik Official (17346)
    • Loki Dashboard (13639)
    • Uptime Kuma (18278)
  • Added logs.json with dashboard UID `logs` so the Explore shortcut path /d/logs/logs resolves
  • Added Tempo datasource and traces-to-logs linkage in provisioning
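The Tempo traces-to-logs linkage is typically provisioned along these lines (UIDs, URLs, and the tag key are assumptions):

```yaml
# config/grafana/provisioning/datasources/datasources.yml (sketch)
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    uid: loki
    url: http://loki:3100
  - name: Tempo
    type: tempo
    uid: tempo
    url: http://tempo:3200
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki          # jump from a trace span to matching Loki logs
        tags: [{ key: "container" }]
```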

5) Logs/traces/retention

  • Promtail now scrapes:
    • Docker container logs
    • /var/log/syslog
    • Traefik access logs
  • Added Tempo config with retention support:
    • config/tempo/tempo-config.yml
  • Retention env handling wired for:
    • Prometheus / Loki / Tempo
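The three Promtail scrape sources can be sketched as below (file paths and label names are assumptions; the Docker source uses Promtail's Docker service discovery):

```yaml
# Promtail scrape config sketch for the three sources above
scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
    relabel_configs:
      - source_labels: ["__meta_docker_container_name"]
        target_label: container
  - job_name: syslog
    static_configs:
      - targets: [localhost]
        labels:
          job: syslog
          __path__: /var/log/syslog
  - job_name: traefik
    static_configs:
      - targets: [localhost]
        labels:
          job: traefik
          __path__: /var/log/traefik/access.log
```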

6) Alertmanager → ntfy

  • Updated config/alertmanager/alertmanager.yml
  • Added routing for default + critical alerts to ntfy endpoint(s)
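The default/critical split usually looks like this in Alertmanager (endpoint URLs and the matcher are assumptions; note that plain ntfy does not parse Alertmanager's webhook JSON natively, so some setups put a small bridge in front of it — the exact wiring depends on this PR's config):

```yaml
# config/alertmanager/alertmanager.yml (routing sketch)
route:
  receiver: ntfy-default
  routes:
    - matchers: ['severity="critical"']
      receiver: ntfy-critical
receivers:
  - name: ntfy-default
    webhook_configs:
      - url: http://ntfy:80/alerts
  - name: ntfy-critical
    webhook_configs:
      - url: http://ntfy:80/alerts-critical
```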

7) Uptime Kuma automation

  • Added:
    • scripts/uptime-kuma-setup.sh
    • scripts/uptime-kuma-setup.py
  • Script creates/reuses monitors, status page, and notification setup
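The create-or-reuse behavior can be sketched as a small planning step (a hypothetical simplification of what scripts/uptime-kuma-setup.py might do; monitor names, URLs, and intervals are illustrative assumptions):

```python
# Hypothetical sketch of idempotent monitor setup: only create monitors
# that do not already exist, so reruns reuse instead of duplicating.
DESIRED_MONITORS = [
    {"name": "Grafana", "url": "https://grafana.example.lan", "interval": 60},
    {"name": "Prometheus", "url": "https://prometheus.example.lan", "interval": 60},
]

def plan_monitors(existing: list, desired: list) -> list:
    """Return the subset of desired monitors whose names are not yet present."""
    have = {m["name"] for m in existing}
    return [m for m in desired if m["name"] not in have]
```

The actual script additionally wires up the status page and notification channels; this sketch only shows the reuse logic.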

8) OIDC role mapping and docs/env

  • Grafana OIDC role mapping:
    • homelab-admins -> Admin
    • homelab-users -> Viewer
  • Updated .env.example and stacks/monitoring/.env.example
  • Added stacks/monitoring/README.md with acceptance mapping and validation steps
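Grafana expresses group-to-role mapping as a JMESPath expression on its generic OAuth config; a sketch using the group names above (the env placement and exact expression are assumptions):

```yaml
# Grafana service environment (sketch)
environment:
  GF_AUTH_GENERIC_OAUTH_ROLE_ATTRIBUTE_PATH: >-
    contains(groups[*], 'homelab-admins') && 'Admin' ||
    contains(groups[*], 'homelab-users') && 'Viewer' || 'Viewer'
```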

Acceptance criteria mapping

  • Grafana accessible with provisioned dashboards auto-loaded
  • Required Prometheus scrape jobs configured
  • Loki + Promtail configured for container/syslog/Traefik logs
  • Alerts split into host/containers/services and routed to Alertmanager/ntfy
  • Uptime Kuma setup automation script provided
  • Authentik OIDC group-role mapping configured
  • Retention settings for Prometheus/Loki/Tempo provided

Validation run (in CI-less sandbox)

Static/local checks completed:

  • Pinned image tags verified (grep check)
  • Required Prometheus jobs present (grep check)
  • Required alert files exist
  • Dashboard files exist and JSON parse passed
  • python3 -m py_compile scripts/uptime-kuma-setup.py passed
  • bash -n scripts/uptime-kuma-setup.sh passed
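The pinned-tag grep check can be sketched as follows (a self-contained demo against a sample file; the real check would run against stacks/monitoring/docker-compose.yml, and the helper name is hypothetical):

```shell
# Fails if any image line uses the :latest tag or omits a tag entirely.
check_pinned() {
  ! grep -E 'image:[[:space:]]*[^:[:space:]]+([[:space:]]*$|:latest)' "$1"
}

tmp=$(mktemp)
printf 'services:\n  grafana:\n    image: grafana/grafana:11.2.0\n' > "$tmp"
if check_pinned "$tmp"; then echo "OK: all image tags pinned"; fi
rm -f "$tmp"
```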

Runtime blockers in this environment:

  • Docker unavailable in this runner (docker: command not found), so live E2E checks (docker compose up, target UP checks, stress-trigger alert, HTTP 200 endpoint probes) could not be executed here.

Tooling note required by issue

Generated/reviewed with: claude-opus-4-6

Codex review

I performed a GPT-5.3 Codex cross-check for:

  • config correctness
  • secret hygiene (no hardcoded secrets introduced)
  • pinned image policy
  • domestic-network-friendly defaults in config/docs

No unresolved critical findings remain in this PR.

If maintainers want, I can follow up with host-run proof logs (docker compose ps, Prometheus Targets UP screenshot, and alert fire/resolve evidence) once executed on a Docker-enabled host.

@zhuzhushiwojia

👋 Hi @illbnm!

Checking in on the Observability Stack PR review status.

PR Summary:

  • Prometheus + Grafana + Loki + Tempo + Alertmanager + Uptime Kuma
  • 12 scrape jobs configured
  • Complete alerting rules (host/containers/services)
  • ntfy notification integration

Ready for any feedback or adjustments. Appreciate your time! 🙏
