📝 Description:
Implement automated health checks for all critical services and trigger alerts when thresholds are breached. Alerts should be delivered via a Discord Bot in the internal UTMIST Discord Server. Retry logic and escalation rules should ensure alerts are not missed.
🎯 Acceptance Criteria:
- Health checks run automatically and validate service responsiveness and key metrics
- Alerts trigger correctly on threshold breaches and are sent to the Discord Bot channel
- Retry and escalation logic works to ensure alerts are delivered
🛠 Tasks:
- Define health check endpoints
- Set threshold values for metrics
- Configure alert channels, including Discord Bot integration, and escalation rules
- Write tests for health checks, alert triggers, and Discord notifications