WIP: Phase 3 - Infra Monitoring Stack #3
- Docker Compose with Prometheus, Grafana, Alertmanager, Node Exporter, Blackbox Exporter, cAdvisor
- Prometheus configuration with scrape configs
- Alert rules for host, container, service, and network monitoring
- Alertmanager configuration with routing and receivers
- Blackbox Exporter modules for HTTP, TCP, ICMP, DNS probes
- Grafana datasources provisioning

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
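For orientation, a minimal sketch of the service layout this description implies. Image tags and port choices are assumptions for illustration (only the Alertmanager tag appears in the compose excerpt quoted later in the review), not the PR's actual file:

```yaml
# Hypothetical skeleton of the stack described above.
services:
  prometheus:
    image: prom/prometheus:v2.48.0        # assumed tag
    ports: ["9090:9090"]
  grafana:
    image: grafana/grafana:10.2.0         # assumed tag
    ports: ["3000:3000"]
  alertmanager:
    image: prom/alertmanager:v0.26.0      # tag from the compose excerpt below
    ports: ["9093:9093"]
  node-exporter:
    image: prom/node-exporter:v1.7.0      # assumed tag
  blackbox-exporter:
    image: prom/blackbox-exporter:v0.24.0 # assumed tag
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.47.0 # assumed tag
```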
Caution: Review failed. The pull request is closed.

Walkthrough

Introduces a complete Docker Compose-based infrastructure monitoring stack comprising Prometheus for metrics collection, Grafana with three monitoring dashboards, Alertmanager for alert routing, and exporters (Node, Blackbox, cAdvisor) for multi-protocol endpoint and system monitoring. Includes configuration files for all components and comprehensive README documentation.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~30 minutes
Recent review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Files selected for processing (11)
Reviewer's Guide

Introduces a complete Docker Compose-based infrastructure monitoring stack (Prometheus, Grafana, Alertmanager, Node Exporter, Blackbox Exporter, cAdvisor) with wiring, alerting, Alertmanager routing, Blackbox modules, Grafana provisioning, and documentation, plus a comprehensive set of Prometheus alert rules and Grafana dashboards.

Sequence diagram for alert lifecycle from metrics to notifications:

```mermaid
sequenceDiagram
    participant NodeExporter
    participant cAdvisor
    participant BlackboxExporter
    participant Prometheus
    participant Alertmanager
    participant EmailReceiver as Email_oncall
    participant WebhookReceiver as Webhook_services
    NodeExporter->>Prometheus: expose_node_metrics
    cAdvisor->>Prometheus: expose_container_metrics
    BlackboxExporter->>Prometheus: expose_probe_metrics
    loop scrape_interval_15s
        Prometheus->>NodeExporter: HTTP_GET_/metrics
        Prometheus->>cAdvisor: HTTP_GET_/metrics
        Prometheus->>BlackboxExporter: HTTP_GET_/probe
        NodeExporter-->>Prometheus: metrics_payload
        cAdvisor-->>Prometheus: metrics_payload
        BlackboxExporter-->>Prometheus: probe_results
    end
    loop evaluation_interval_15s
        Prometheus->>Prometheus: evaluate_alert_rules
    end
    alt alert_fires
        Prometheus->>Alertmanager: push_alerts_via_alerting_config
        Alertmanager->>Alertmanager: route_by_severity_and_labels
        Alertmanager->>Alertmanager: apply_inhibit_rules
        alt severity_critical
            Alertmanager-->>EmailReceiver: send_critical_email
            Alertmanager-->>WebhookReceiver: POST_/webhook/critical
        else severity_warning
            Alertmanager-->>EmailReceiver: send_warning_email
            Alertmanager-->>WebhookReceiver: POST_/webhook/warning
        else severity_info
            Alertmanager-->>WebhookReceiver: POST_/webhook/info
        end
    end
    alt alert_resolved
        Prometheus->>Alertmanager: send_resolved_notification
        Alertmanager-->>EmailReceiver: send_resolved_email_if_configured
        Alertmanager-->>WebhookReceiver: POST_resolved_payload
    end
```
Sequence diagram for Blackbox HTTP and ICMP probing via Prometheus:

```mermaid
sequenceDiagram
    participant Prometheus
    participant BlackboxExporter
    participant HTTPService as HTTP_targets
    participant DNSHosts as ICMP_targets
    rect rgb(235,235,235)
        Note over Prometheus,BlackboxExporter: HTTP probes via job blackbox-http
        loop scrape_interval_15s
            Prometheus->>BlackboxExporter: GET_/probe?module=http_2xx&target=http_endpoint
            BlackboxExporter->>HTTPService: HTTP_request
            HTTPService-->>BlackboxExporter: HTTP_response
            BlackboxExporter-->>Prometheus: probe_success_and_latency
        end
    end
    rect rgb(235,235,235)
        Note over Prometheus,BlackboxExporter: ICMP probes via job blackbox-icmp
        loop scrape_interval_15s
            Prometheus->>BlackboxExporter: GET_/probe?module=icmp&target=ip_address
            BlackboxExporter->>DNSHosts: ICMP_echo_request
            DNSHosts-->>BlackboxExporter: ICMP_echo_reply
            BlackboxExporter-->>Prometheus: probe_success_and_rtt
        end
    end
```
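The diagrams assume the standard Blackbox Exporter scrape pattern, in which Prometheus rewrites each configured target into a `/probe` request against the exporter. A sketch of what such a job typically looks like; the target URL is illustrative, and the exporter address assumes the compose service name and default port:

```yaml
# Fragment that would sit under scrape_configs: in prometheus.yml.
scrape_configs:
  - job_name: blackbox-http
    metrics_path: /probe
    params:
      module: [http_2xx]            # module name as used in the diagram
    static_configs:
      - targets:
          - https://example.com     # illustrative target
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target     # becomes ?target=<url>
      - source_labels: [__param_target]
        target_label: instance           # keep the probed URL as the instance label
      - target_label: __address__
        replacement: blackbox-exporter:9115  # actually scrape the exporter
```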
- Add dashboard provisioning configuration (dashboards.yml)
- Add Server Health dashboard with CPU, memory, disk, network panels
- Add Docker Overview dashboard with container metrics
- Add Network Overview dashboard with HTTP, ICMP, TCP probes
- Add comprehensive bilingual README documentation

Dashboards include:

- Gauge/stat panels for quick status overview
- Time series graphs for trend analysis
- Bar gauges for comparison views
- Table summaries with multiple metrics
- Template variables for filtering

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
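Since the commit adds `dashboards.yml`, here is a sketch of the usual shape of a Grafana file-provider provisioning file; the provider name, folder, and container path are assumptions, not taken from the PR:

```yaml
# Hypothetical dashboards.yml using Grafana's file provider.
apiVersion: 1
providers:
  - name: default                        # assumed provider name
    folder: Monitoring                   # assumed Grafana folder
    type: file
    disableDeletion: false
    options:
      path: /var/lib/grafana/dashboards  # assumed mount path for dashboard JSON
```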
Hey there - I've reviewed your changes and found some issues that need to be addressed.
- The `ContainerDown` alert uses `absent(container_last_seen{name!=""})`, which will fire a single alert without a `name` label (breaking `{{ $labels.name }}` in the annotations) rather than per-container; consider a per-container expression like checking `max_over_time(container_last_seen[...])` or a similar metric that preserves container labels.
- All the healthchecks rely on `wget`, but the Prometheus/Grafana/Alertmanager/Exporter images may not have it installed; switching to `curl` (which some images include) or using a simpler TCP-based check (e.g. `CMD-SHELL nc -z localhost 9090`) would make the healthchecks more robust.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The `ContainerDown` alert uses `absent(container_last_seen{name!=""})`, which will fire a single alert without a `name` label (breaking `{{ $labels.name }}` in the annotations) rather than per-container; consider a per-container expression like checking `max_over_time(container_last_seen[...])` or a similar metric that preserves container labels.
- All the healthchecks rely on `wget`, but the Prometheus/Grafana/Alertmanager/Exporter images may not have it installed; switching to `curl` (which some images include) or using a simpler TCP-based check (e.g. `CMD-SHELL nc -z localhost 9090`) would make the healthchecks more robust.
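A sketch of the TCP-style healthcheck the second point suggests, using Prometheus's port as the example; whether `nc` is actually present in a given image is itself an assumption to verify:

```yaml
# Hypothetical wget-free healthcheck; adjust the port per service.
healthcheck:
  test: ["CMD-SHELL", "nc -z localhost 9090 || exit 1"]
  interval: 30s
  timeout: 10s
  retries: 3
```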
## Individual Comments
### Comment 1
<location> `2-infra-monitoring/prometheus/alerts/alerts.yml:103-23` </location>
<code_context>
+ - name: container_alerts
+ rules:
+ # Container leállt
+ - alert: ContainerDown
+ expr: absent(container_last_seen{name!=""})
+ for: 1m
+ labels:
+ severity: warning
+ annotations:
+ summary: "Container {{ $labels.name }} is down"
+ description: "Container has not been seen for more than 1 minute"
</code_context>
<issue_to_address>
**issue (bug_risk):** The ContainerDown alert uses `absent()` in a way that likely produces a single, unlabeled alert and makes `$labels.name` unusable.
`absent(container_last_seen{name!=""})` will fire a single alert with no `name` label, but the annotations expect `{{ $labels.name }}`. Please switch to a per-container expression (e.g. `max by (name) (time() - container_last_seen{name!=""}) > <threshold>`) so the alert preserves container labels and the template fields resolve correctly.
</issue_to_address>
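A sketch of the per-container rewrite the comment proposes; the 60-second threshold is an assumed value to tune against the scrape interval:

```yaml
# Hypothetical per-container variant that preserves the name label.
- alert: ContainerDown
  expr: max by (name) (time() - container_last_seen{name!=""}) > 60
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: "Container {{ $labels.name }} is down"
    description: "Container {{ $labels.name }} has not been seen for more than 1 minute"
```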
### Comment 2
<location> `2-infra-monitoring/prometheus/alerts/alerts.yml:123` </location>
<code_context>
+ description: "Container CPU usage is above 80% (current: {{ $value | printf \"%.1f\" }}%)"
+
+ # Magas container memória használat
+ - alert: ContainerHighMemory
+ expr: (container_memory_usage_bytes{name!=""} / container_spec_memory_limit_bytes{name!=""} * 100) > 80
+ for: 5m
</code_context>
<issue_to_address>
**issue (bug_risk):** The ContainerHighMemory alert can divide by zero or behave unexpectedly for containers without a memory limit.
This expression assumes `container_spec_memory_limit_bytes` is always set and > 0. For containers without a memory limit (0 or missing), this can yield divide-by-zero or meaningless percentages. Consider excluding unlimited containers (e.g. `> 0` on the denominator) or handling them with a separate alert.
</issue_to_address>
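One way to apply the suggested guard: filter the denominator so containers without a limit drop out of the expression entirely. A sketch, not the PR's actual fix:

```yaml
# Hypothetical guard: the (... > 0) filter removes unlimited containers
# before the division, avoiding divide-by-zero and meaningless percentages.
- alert: ContainerHighMemory
  expr: >
    (container_memory_usage_bytes{name!=""}
      / (container_spec_memory_limit_bytes{name!=""} > 0) * 100) > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Container {{ $labels.name }} memory usage is high"
    description: "Memory usage is above 80% of the configured limit"
```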
### Comment 3
<location> `2-infra-monitoring/prometheus/alerts/alerts.yml:268-275` </location>
<code_context>
+ description: "Rule group {{ $labels.rule_group }} is taking longer than its interval to evaluate"
+
+ # Prometheus target scrape sikertelen
+ - alert: PrometheusTargetScrapeFailed
+ expr: up == 0
+ for: 5m
+ labels:
+ severity: warning
+ annotations:
+ summary: "Prometheus target {{ $labels.job }} scrape failed"
+ description: "Target {{ $labels.instance }} has been down for more than 5 minutes"
+
+ # Prometheus storage magas
</code_context>
<issue_to_address>
**suggestion:** The PrometheusTargetScrapeFailed rule on `up == 0` may create noisy, overlapping alerts with more specific `*Down`/blackbox alerts.
This expression will fire for every failed scrape, including components already covered by `PrometheusDown`, `GrafanaDown`, `AlertmanagerDown` and blackbox checks. To avoid noisy, duplicate alerts, consider excluding those jobs (e.g. `up{job!~"blackbox-.*|prometheus|grafana|alertmanager"} == 0`) or limiting this rule to specific target classes.
```suggestion
- alert: PrometheusTargetScrapeFailed
expr: up{job!~"blackbox-.*|prometheus|grafana|alertmanager"} == 0
for: 5m
labels:
severity: warning
annotations:
summary: "Prometheus target {{ $labels.job }} scrape failed"
description: "Target {{ $labels.instance }} has been down for more than 5 minutes (excluding core and blackbox-monitored jobs)"
```
</issue_to_address>
### Comment 4
<location> `2-infra-monitoring/docker-compose.yml:102-61` </location>
<code_context>
+ # =========================================
+ # ALERTMANAGER - Riasztás kezelés
+ # =========================================
+ alertmanager:
+ image: prom/alertmanager:v0.26.0
+ container_name: alertmanager
+ restart: unless-stopped
+ ports:
+ - "9093:9093"
+ volumes:
+ - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
+ - alertmanager_data:/alertmanager
+ command:
+ - "--config.file=/etc/alertmanager/alertmanager.yml"
+ - "--storage.path=/alertmanager"
+ - "--web.external-url=http://localhost:9093"
+ networks:
+ - monitoring
+ healthcheck:
+ test: ["CMD", "wget", "-q", "--spider", "http://localhost:9093/-/healthy"]
+ interval: 30s
+ timeout: 10s
+ retries: 3
+
+ # =========================================
</code_context>
<issue_to_address>
**issue (bug_risk):** Alertmanager config depends on `SMTP_PASSWORD`, but the container is not wired to receive this environment variable.
`smtp_auth_password` in `alertmanager/alertmanager.yml` uses `${SMTP_PASSWORD}`, but this variable isn't defined in the container. Please wire it through via an `environment:` section (for example, `environment: [SMTP_PASSWORD=${SMTP_PASSWORD}]` or equivalent) so Alertmanager can resolve the password at runtime.
</issue_to_address>
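The wiring the comment asks for, sketched as a compose fragment. Note this only makes the variable visible inside the container; whether the config file is then expanded should be verified separately, since stock Alertmanager does not substitute environment variables in its config without an entrypoint or templating step:

```yaml
# Hypothetical fragment for the alertmanager service.
alertmanager:
  environment:
    - SMTP_PASSWORD=${SMTP_PASSWORD}  # forwarded from the host / .env file
```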
### Comment 5
<location> `2-infra-monitoring/blackbox/blackbox.yml:54-21` </location>
<code_context>
+ body: '{}'
+
+ # HTTP Basic Auth
+ http_basic_auth:
+ prober: http
+ timeout: 5s
+ http:
+ method: GET
+ basic_auth:
</code_context>
<issue_to_address>
**issue (bug_risk):** The http_basic_auth module uses `${HTTP_BASIC_USER}`/`${HTTP_BASIC_PASS}` without ensuring they are provided to the blackbox-exporter container.
These values must be provided as environment variables to the `blackbox-exporter` container (for example via an `environment:` block in `docker-compose.yml` or a wrapper script). Otherwise, the exporter will use the literal strings and basic auth will fail. Please either wire the env vars through or choose a different way to parameterize these credentials.
</issue_to_address>
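A sketch of the corresponding `environment:` block for the exporter, with the same caveat as above: making the variables available in the container is necessary, but the config-expansion step should also be confirmed:

```yaml
# Hypothetical fragment for the blackbox-exporter service.
blackbox-exporter:
  environment:
    - HTTP_BASIC_USER=${HTTP_BASIC_USER}
    - HTTP_BASIC_PASS=${HTTP_BASIC_PASS}
```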
### Comment 6
<location> `2-infra-monitoring/docker-compose.yml:52-55` </location>
<code_context>
+ - "--web.console.libraries=/etc/prometheus/console_libraries"
+ - "--web.console.templates=/etc/prometheus/consoles"
+ - "--web.enable-lifecycle"
+ - "--web.enable-admin-api"
+ networks:
+ - monitoring
</code_context>
<issue_to_address>
**🚨 suggestion (security):** Enabling the Prometheus admin API while publishing the port can introduce unnecessary security risk.
`--web.enable-admin-api` exposes destructive endpoints (e.g., delete series, snapshots). With `ports: "9090:9090"` this may be reachable outside the internal network. If there isn't a strict access control layer in front of Prometheus, consider disabling this flag or binding Prometheus only to a private interface.
```suggestion
- "--web.console.templates=/etc/prometheus/consoles"
- "--web.enable-lifecycle"
networks:
```
</issue_to_address>
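If the admin API must stay enabled, the alternative the comment mentions can be sketched by binding the published port to loopback only; this is an assumption about the desired exposure, not the PR's configuration:

```yaml
# Hypothetical port mapping reachable only from the Docker host.
ports:
  - "127.0.0.1:9090:9090"
```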
### Comment 7
<location> `2-infra-monitoring/README.md:36` </location>
<code_context>
+|------------------------|------|----------------------|
+| Prometheus | 9090 | Metrika gyűjtés és tárolás / Metric collection and storage |
+| Grafana | 3000 | Vizualizáció és dashboardok / Visualization and dashboards |
+| Alertmanager | 9093 | Riasztás kezelés / Alert management |
+| Node Exporter | 9100 | Host metrikák / Host metrics |
+| Blackbox Exporter | 9115 | Endpoint monitoring (HTTP, ICMP, TCP) |
</code_context>
<issue_to_address>
**issue (typo):** Use the correct Hungarian compound noun "Riasztáskezelés" for consistency.
Earlier in the document you already use the correct compound form ("riasztáskezelést"), so please update this entry to "Riasztáskezelés / Alert management" for consistency.
```suggestion
| Alertmanager | 9093 | Riasztáskezelés / Alert management |
```
</issue_to_address>
### Comment 8
<location> `2-infra-monitoring/README.md:61` </location>
<code_context>
+# Naplók megtekintése / View logs
+docker-compose logs -f
+
+# Szolgáltatás állapot / Service status
+docker-compose ps
+```
</code_context>
<issue_to_address>
**issue (typo):** Adjust the Hungarian phrase to the grammatically correct "Szolgáltatás állapota".
Also consider keeping the English label alongside the Hungarian, e.g. "Szolgáltatás állapota / Service status".
```suggestion
# Szolgáltatás állapota / Service status
```
</issue_to_address>
### Comment 9
<location> `2-infra-monitoring/README.md:117` </location>
<code_context>
+- HTTP endpoint státusz és válaszidő
+- ICMP ping latency
+- TCP port elérhetőség
+- SSL tanúsítvány lejárat
+
+## Riasztások / Alerts
</code_context>
<issue_to_address>
**issue (typo):** Use the possessive form in Hungarian: "SSL tanúsítvány lejárata".
```suggestion
- SSL tanúsítvány lejárata
```
</issue_to_address>
🇬🇧 English Version

✅ Completed Tasks

- Docker Compose stack (docker-compose.yml)
- Prometheus configuration (prometheus.yml)
- Alert rules (alerts/alerts.yml)
- Alertmanager configuration (alertmanager.yml)
- Blackbox Exporter modules (blackbox.yml)

⏳ In Progress
❌ Not Started Yet

Issues Encountered

Key Files Changed

🧪 Testing Status
- Prometheus configuration validated (promtool check config)
- Alertmanager configuration validated (amtool check-config)
🤖 Generated with Claude Code
Summary by Sourcery
Add a complete Docker Compose-based infrastructure monitoring stack with Prometheus, Alertmanager, Grafana, exporters, and provisioning.