Add basic dashboards to IaC #170

Open
t-margheim wants to merge 11 commits into main from add-basic-dashboards-to-iac
@t-margheim
Contributor

Description

This PR adds some basic Grafana dashboards and the Pulumi code needed to deploy and manage them. For now, two dashboards are included:

  • Alert-Based Dashboard: a basic dashboard with a panel for each condition that can trigger an alert. Each panel also shows its alerting threshold where one exists.
  • Kubernetes Global View: a slightly modified version of the Kubernetes Global View dashboard from https://github.com/dotdc/grafana-dashboards-kubernetes.

Category of change

  • Bug fix (non-breaking change which fixes an issue)
  • Version upgrade (upgrading the version of a service or product)
  • New feature (non-breaking change which adds functionality)
  • Build: a code change that affects the build system or external dependencies
  • Performance: a code change that improves performance
  • Refactor: a code change that neither fixes a bug nor adds a feature
  • Documentation: documentation changes
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

t-margheim and others added 9 commits March 9, 2026 10:43
…compliance

Dashboard filenames with underscores (e.g., alerts_dashboard.json) were
causing Kubernetes ConfigMap creation to fail because underscores violate
RFC 1123 naming rules. This fix sanitizes dashboard names by replacing
underscores with hyphens for Kubernetes resource names while preserving
the original filename structure for Grafana UIDs and data keys.
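
The split described above (sanitized resource name, untouched data key) can be sketched as follows. This is a minimal illustration, not the code in this PR: the helper name `configmap_for_dashboard` is hypothetical, and the `grafana-<name>-dashboard` naming pattern is inferred from the example name quoted later in this conversation.

```python
from pathlib import Path


def configmap_for_dashboard(path: Path, content: str) -> dict:
    """Build a ConfigMap manifest (as a plain dict) for one dashboard file.

    Hypothetical helper: only the metadata name is sanitized. The data key
    keeps the original filename, because ConfigMap keys may contain
    underscores and dots while resource names may not.
    """
    stem = path.stem  # e.g. "alerts_dashboard"
    resource_name = f"grafana-{stem.replace('_', '-')}-dashboard"
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": resource_name},
        "data": {path.name: content},
    }
```

Calling it with `Path("alerts_dashboard.json")` yields a resource named `grafana-alerts-dashboard-dashboard` whose single data key is still `alerts_dashboard.json`.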

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
…validation

Improves dashboard ConfigMap naming to handle all RFC 1123 edge cases:

1. **Added sanitize_k8s_name() utility function** (lib.py):
   - Converts to lowercase
   - Replaces all non-alphanumeric chars (including underscores, dots) with hyphens
   - Collapses consecutive hyphens into single hyphen
   - Strips leading/trailing hyphens
   - Validates result matches RFC 1123 pattern: ^[a-z0-9]([a-z0-9-]*[a-z0-9])?$

2. **Updated dashboard provisioning** (aws_eks_cluster.py):
   - Uses new sanitize_k8s_name() function instead of inline replace
   - Validation now catches invalid names before Kubernetes API errors

3. **Comprehensive test coverage** (test_lib.py, test_dashboard_configmaps.py):
   - 40+ unit tests for sanitization edge cases
   - Tests for leading/trailing special chars, uppercase, consecutive chars
   - Integration tests for ConfigMap structure and naming patterns
   - RFC 1123 pattern validation tests

Fixes: dashboard names with underscores (alerts_dashboard.json) now correctly
sanitize to Kubernetes-compliant names (grafana-alerts-dashboard-dashboard).

Addresses review feedback: complete RFC 1123 sanitization, validation, and tests.
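
For readers without the diff open, the steps listed for sanitize_k8s_name() could look roughly like the sketch below. It follows the five bullets above and the quoted pattern, but it is a reconstruction under those assumptions, not the actual contents of lib.py. Length limits are not mentioned in the commit message and are not enforced here.

```python
import re

# Pattern quoted in the commit message.
_RFC1123_NAME = re.compile(r"^[a-z0-9]([a-z0-9-]*[a-z0-9])?$")


def sanitize_k8s_name(name: str) -> str:
    """Return an RFC 1123-compatible version of ``name`` or raise ValueError."""
    sanitized = name.lower()
    # One substitution both replaces non-alphanumeric characters with a
    # hyphen and collapses consecutive runs into a single hyphen.
    sanitized = re.sub(r"[^a-z0-9]+", "-", sanitized)
    sanitized = sanitized.strip("-")
    # An input made only of special characters ends up empty and fails here.
    if not _RFC1123_NAME.match(sanitized):
        msg = f"cannot derive an RFC 1123 name from {name!r}"
        raise ValueError(msg)
    return sanitized
```

Under this sketch, `sanitize_k8s_name("alerts_dashboard")` returns `alerts-dashboard`, and an input such as `"___"` raises `ValueError` before any Kubernetes API call is made.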

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
… dashboard

Fixed the k8s-views-global dashboard to properly support multi-cluster deployments:
- Enabled "All" cluster selection (includeAll: true)
- Added default selection to "All" clusters
- Changed all cluster query filters from exact match (=) to regex (=~) (59 occurrences)
- Changed all job query filters from exact match (=) to regex (=~) (19 occurrences)
- Fixed dashboard UID to use hyphens instead of underscores (k8s-views-global)

This aligns with the multi-cluster pattern used in alerts_dashboard.json and enables
users to view metrics across all clusters or select specific clusters.
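
The matcher change matters because Grafana expands a multi-value or "All" selection into a regex alternation, which only matches with `=~`. A bulk rewrite of this kind could be scripted along the lines below. This is only a sketch: the function name is hypothetical, the variable names `$cluster` and `$job` are assumed, and the commit does not say how the edits were actually made.

```python
import re

# Matches label="$variable" for the two labels named in the commit.
# Existing =~ and != matchers are left alone because the pattern requires
# a bare "=" directly before the opening quote.
_EXACT_MATCH = re.compile(r'\b(cluster|job)\s*=\s*"(\$\w+)"')


def use_regex_matchers(expr: str) -> str:
    """Rewrite exact-match filters on cluster/job into regex filters."""
    return _EXACT_MATCH.sub(r'\1=~"\2"', expr)
```

For example, `sum(up{cluster="$cluster", job="$job"})` becomes `sum(up{cluster=~"$cluster", job=~"$job"})`, and running the function a second time leaves the expression unchanged.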

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…w dashboard

Cleaned up unnecessary template variables and annotations that don't apply to PTD:
- Removed datasource variable (now hardcoded to mimir for all queries)
- Removed job variable (only had one entry, not useful)
- Removed terraform and oncall annotation toggles (unused tags)
- Updated all 87 datasource references to use "mimir" directly
- Removed all 19 job filter references from queries

This simplifies the dashboard configuration and removes controls that don't apply
to PTD's infrastructure setup.
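
A cleanup like this can be expressed as a small transformation over the dashboard JSON. The sketch below assumes the upstream dashboard references its datasource as `{"type": ..., "uid": "${datasource}"}` and that "mimir" is used as the datasource UID; both are assumptions about the file, and the function names are hypothetical.

```python
from typing import Any


def hardcode_datasource(node: Any, uid: str = "mimir",
                        placeholder: str = "${datasource}") -> Any:
    """Return a copy of the dashboard with templated datasource refs pinned."""
    if isinstance(node, dict):
        out = {}
        for key, value in node.items():
            if (key == "datasource" and isinstance(value, dict)
                    and value.get("uid") == placeholder):
                out[key] = {**value, "uid": uid}
            else:
                out[key] = hardcode_datasource(value, uid, placeholder)
        return out
    if isinstance(node, list):
        return [hardcode_datasource(item, uid, placeholder) for item in node]
    return node


def drop_template_variables(dashboard: dict, names: set[str]) -> dict:
    """Return a copy of the dashboard without the named template variables."""
    templating = dashboard.get("templating", {})
    kept = [v for v in templating.get("list", []) if v.get("name") not in names]
    return {**dashboard, "templating": {**templating, "list": kept}}
```

Applying `hardcode_datasource` and then `drop_template_variables(dashboard, {"datasource", "job"})` would cover the first two bullets; removing the job filters from the queries themselves is a separate text edit.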

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
… View

Changed the default resolution from 30s to 1m to provide a better balance
between query performance and data granularity.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@t-margheim t-margheim self-assigned this Mar 9, 2026
t-margheim and others added 2 commits March 9, 2026 16:04
Fixed ruff linter warnings:
- B904: Use explicit exception chaining with 'from e'
- TRY003/EM102: Assign error message to variable before raising
- F841: Remove unused 'cluster_name' variable in test
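
As an illustration of the first two fixes, the pattern ruff asks for looks like this. The function below is a made-up example, not code from this PR.

```python
import json


def load_dashboard(raw: str) -> dict:
    """Parse dashboard JSON, re-raising parse failures as ValueError."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError as e:
        # EM102: build the message in a variable instead of passing an
        # f-string straight to the exception constructor.
        msg = f"dashboard JSON is invalid: {e.msg}"
        # B904: chain explicitly so the original traceback is preserved.
        raise ValueError(msg) from e
```

With `from e`, the original `JSONDecodeError` stays available as `__cause__` on the raised `ValueError`.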

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@t-margheim t-margheim marked this pull request as ready for review March 9, 2026 22:20
@t-margheim t-margheim requested a review from a team as a code owner March 9, 2026 22:20