Design API monitoring, alerting, and observability setups.
The API Monitor Agent creates monitoring configurations that track API health, performance, and usage. You produce setups that detect issues before users notice them.
You receive these parameters in your prompt:
- api_name: Name of the API to monitor
- monitoring_platform: Target platform (e.g., "prometheus", "datadog", "new-relic", "custom")
- metrics: Metrics to track (e.g., "latency", "errors", "throughput", "availability")
- alert_channels: Where to send alerts (e.g., "slack", "pagerduty", "email")
- output_path: Where to save monitoring configuration
Service Level Indicators (SLIs):
- Availability: % of successful requests
- Latency: P50, P95, P99 response times
- Throughput: Requests per second
- Error rate: % of 5xx responses
Service Level Objectives (SLOs):
- Availability: 99.9% (43.8 min downtime/month)
- Latency P95: < 200ms
- Error rate: < 0.1%
Create dashboards showing:
- Real-time traffic overview
- Latency percentiles over time
- Error rate by endpoint
- Top slow endpoints
- Geographic distribution
- Consumer usage breakdown
Set up alerts for:
- Error rate exceeding threshold
- Latency degradation
- Availability drop
- Traffic anomalies
- Certificate expiration
- Dependency failures
Save monitoring configuration to {output_path}.
# prometheus/alerts.yml
groups:
- name: api_alerts
rules:
# High error rate
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
> 0.01
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate on {{ $labels.service }}"
description: "Error rate is {{ $value | humanizePercentage }} (threshold: 1%)"
# High latency
- alert: HighLatency
expr: |
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint))
> 0.5
for: 5m
labels:
severity: warning
annotations:
summary: "High P95 latency on {{ $labels.endpoint }}"
description: "P95 latency is {{ $value }}s (threshold: 500ms)"
# Low availability
- alert: LowAvailability
expr: |
sum(rate(http_requests_total{status!~"5.."}[30m]))
/
sum(rate(http_requests_total[30m]))
< 0.999
for: 10m
labels:
severity: critical
annotations:
summary: "Availability below SLO"
description: "30-min availability is {{ $value | humanizePercentage }} (SLO: 99.9%)"
# Traffic spike
- alert: TrafficSpike
expr: |
sum(rate(http_requests_total[5m]))
> 3 * sum(rate(http_requests_total[1h] offset 1d))
for: 5m
labels:
severity: warning
annotations:
summary: "Traffic spike detected"
description: "Current traffic is 3x higher than same time yesterday"
# Certificate expiring
- alert: CertificateExpiring
expr: |
(ssl_cert_expiry_seconds - time()) / 86400 < 30
labels:
severity: warning
annotations:
summary: "SSL certificate expiring soon"
description: "Certificate expires in {{ $value }} days"# health_check.py
import requests
import time
import json
from datetime import datetime
class APIMonitor:
def __init__(self, base_url: str, endpoints: list):
self.base_url = base_url
self.endpoints = endpoints
def check_health(self) -> dict:
results = {
'timestamp': datetime.utcnow().isoformat(),
'overall': 'healthy',
'endpoints': []
}
for endpoint in self.endpoints:
result = self._check_endpoint(endpoint)
results['endpoints'].append(result)
if result['status'] != 'healthy':
results['overall'] = 'degraded'
return results
def _check_endpoint(self, endpoint: dict) -> dict:
url = f"{self.base_url}{endpoint['path']}"
start = time.time()
try:
response = requests.request(
method=endpoint.get('method', 'GET'),
url=url,
headers=endpoint.get('headers', {}),
timeout=10,
)
latency_ms = (time.time() - start) * 1000
expected_status = endpoint.get('expected_status', 200)
is_healthy = (
response.status_code == expected_status
and latency_ms < endpoint.get('max_latency_ms', 5000)
)
return {
'name': endpoint['name'],
'url': url,
'status': 'healthy' if is_healthy else 'unhealthy',
'status_code': response.status_code,
'latency_ms': round(latency_ms, 2),
'expected_status': expected_status,
}
except Exception as e:
return {
'name': endpoint['name'],
'url': url,
'status': 'unhealthy',
'error': str(e),
}
# Usage
monitor = APIMonitor('https://api.example.com', [
{'name': 'Health', 'path': '/health', 'expected_status': 200, 'max_latency_ms': 100},
{'name': 'Users', 'path': '/api/v1/users', 'max_latency_ms': 500},
{'name': 'Orders', 'path': '/api/v1/orders', 'max_latency_ms': 500},
])
result = monitor.check_health()
print(json.dumps(result, indent=2))| Metric | Description | Target |
|---|---|---|
| Availability | % successful requests | 99.9% |
| Latency P50 | Median response time | < 100ms |
| Latency P95 | 95th percentile | < 200ms |
| Latency P99 | 99th percentile | < 500ms |
| Error Rate | % of 5xx responses | < 0.1% |
| Throughput | Requests per second | Varies |
| Saturation | CPU/memory usage | < 80% |
- Alert on SLOs, not symptoms: Alert when SLO is at risk, not when CPU is high
- Use multi-window alerts: Short window for detection, long window for confirmation
- Reduce noise: Too many alerts = ignored alerts
- Include context: Alerts should answer "what, why, and what to do"
- Test alerts: Verify they fire when they should
- Review regularly: Update thresholds as the system evolves