| 1 | +# Service Level Agreement (SLA) |
| 2 | + |
| 3 | +## Uptime Target |
| 4 | + |
| 5 | +| Tier | Target | Error Budget (per month, ≈ 30.44 days) | |
| 6 | +|------|--------|---------------------------| |
| 7 | +| Search API | **99.9%** | 43.8 minutes/month | |
| 8 | +| Embedding Sidecar | 99.5% | 3.65 hours/month | |
| 9 | +| Delta Lake ingestion | 99.0% | 7.3 hours/month | |
| 10 | + |
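The budgets in the table follow from the targets by simple arithmetic: allowed downtime is the period length multiplied by one minus the availability target. The sketch below reproduces that calculation. The function name and the `AVERAGE_MONTH_DAYS` constant are illustrative assumptions, not part of the service; the table's figures appear to correspond to an average calendar month (365.25 / 12 days), whereas a strict 30-day window yields slightly smaller values (for example, 43.2 minutes at 99.9%).

```python
def error_budget_minutes(target_pct: float, period_days: float) -> float:
    """Return the allowed downtime, in minutes, for an availability target."""
    if not 0.0 < target_pct <= 100.0:
        raise ValueError("target_pct must be in the range (0, 100]")
    period_minutes = period_days * 24 * 60
    return period_minutes * (1.0 - target_pct / 100.0)


# Assumption: "month" means an average calendar month of about 30.44 days.
AVERAGE_MONTH_DAYS = 365.25 / 12

TARGETS = [
    ("Search API", 99.9),
    ("Embedding Sidecar", 99.5),
    ("Delta Lake ingestion", 99.0),
]

for tier, target in TARGETS:
    minutes = error_budget_minutes(target, AVERAGE_MONTH_DAYS)
    print(f"{tier}: {minutes:.2f} min ({minutes / 60:.3f} h)")
```

Passing `period_days=30` instead shows how much the choice of window changes the budget.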
| 11 | +--- |
| 12 | + |
| 13 | +## Performance SLOs |
| 14 | + |
| 15 | +| Metric | Target | Alert Threshold | Window | |
| 16 | +|--------|--------|-----------------|--------| |
| 17 | +| P50 search latency | < 50 ms | — | 5 min | |
| 18 | +| P99 search latency | < 500 ms | > 500 ms for 2 min | 5 min | |
| 19 | +| P99.9 search latency | < 2 000 ms | — | — | |
| 20 | +| Redis cache hit rate | ≥ 85% | < 70% for 5 min | 5 min | |
| 21 | +| HTTP 5xx error rate | < 0.1% | > 1% for 1 min | 5 min | |
| 22 | +| Circuit breaker open | 0 occurrences | > 0 for 30 s | — | |
| 23 | + |
| 24 | +--- |
| 25 | + |
| 26 | +## Incident Definitions |
| 27 | + |
| 28 | +| Severity | Definition | Response Time | Resolution Time | |
| 29 | +|----------|-----------|---------------|-----------------| |
| 30 | +| **P0 — Critical** | Search API fully unavailable OR circuit breaker open > 5 min | 15 min | 2 hours | |
| 31 | +| **P1 — High** | P99 latency > 1 s sustained OR error rate > 5% | 30 min | 4 hours | |
| 32 | +| **P2 — Medium** | Cache hit rate < 70% OR P99 > 500 ms for > 10 min | 2 hours | 8 hours | |
| 33 | +| **P3 — Low** | Non-critical degradation, single transient errors | Next business day | 48 hours | |
| 34 | + |
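The severity definitions above can be read as an ordered set of checks, evaluated from most to least severe. The following is a minimal sketch of that reading, not the service's actual triage logic. The `Snapshot` class and all of its field names are hypothetical, and because the table does not say how long "sustained" is for P1, the sketch assumes the caller has already applied whatever duration the team uses.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Snapshot:
    """Hypothetical summary of service state; field names are illustrative."""
    api_unavailable: bool = False
    breaker_open_minutes: float = 0.0
    p99_latency_ms: float = 0.0          # assumed to be a sustained value
    p99_over_500ms_minutes: float = 0.0  # how long P99 has exceeded 500 ms
    error_rate_pct: float = 0.0
    cache_hit_rate_pct: float = 100.0
    degraded: bool = False               # any non-critical degradation


def classify(s: Snapshot) -> Optional[str]:
    """Return the highest matching severity, or None if nothing matches."""
    if s.api_unavailable or s.breaker_open_minutes > 5:
        return "P0"
    if s.p99_latency_ms > 1000 or s.error_rate_pct > 5:
        return "P1"
    if s.cache_hit_rate_pct < 70 or s.p99_over_500ms_minutes > 10:
        return "P2"
    if s.degraded:
        return "P3"
    return None


print(classify(Snapshot(error_rate_pct=6.0)))
print(classify(Snapshot(cache_hit_rate_pct=65.0)))
```

Checking the most severe condition first means an incident that satisfies several rows is reported at its highest severity.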
| 35 | +--- |
| 36 | + |
| 37 | +## Alert Rules |
| 38 | + |
| 39 | +Four Prometheus alerts are defined in [`infra/docker/alert_rules.yml`](../infra/docker/alert_rules.yml): |
| 40 | + |
| 41 | +| Alert | Expression | Duration | Severity | |
| 42 | +|-------|-----------|----------|----------| |
| 43 | +| `CircuitBreakerOpen` | `vector_catalog_circuit_breaker_state > 0` | 30 s | critical | |
| 44 | +| `P99LatencyHigh` | `histogram_quantile(0.99, ...) > 500` | 2 min | warning | |
| 45 | +| `CacheHitRateLow` | `vector_catalog_cache_hit_rate < 0.70` | 5 min | warning | |
| 46 | +| `ErrorRateHigh` | 5xx rate > 1% (5-min rolling) | 1 min | critical | |
| 47 | + |
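The Duration column means the expression must hold continuously for that long before the alert fires; a single breaching sample is not enough. The sketch below models that behaviour in plain Python as a simplified illustration. It is not Prometheus itself, which evaluates rules on its own interval and tracks a pending state internally. The function name `is_firing` and the sample format are assumptions made for this example.

```python
from typing import Callable, Iterable, Optional, Tuple


def is_firing(
    samples: Iterable[Tuple[float, float]],
    breaches: Callable[[float], bool],
    for_seconds: float,
) -> bool:
    """Return True if the condition has held continuously for at least
    `for_seconds` up to and including the most recent sample.

    `samples` is a time-ordered sequence of (timestamp_seconds, value).
    """
    pending_since: Optional[float] = None
    last_ts: Optional[float] = None
    for ts, value in samples:
        if breaches(value):
            if pending_since is None:
                pending_since = ts  # condition just became true
        else:
            pending_since = None    # any healthy sample resets the timer
        last_ts = ts
    return pending_since is not None and last_ts - pending_since >= for_seconds


# Breaker state sampled every 15 s; the alert needs 30 s of continuous breach.
open_whole_time = [(0, 1), (15, 1), (30, 1)]
flapping = [(0, 1), (15, 0), (30, 1)]
print(is_firing(open_whole_time, lambda v: v > 0, 30))
print(is_firing(flapping, lambda v: v > 0, 30))
```

Resetting the timer on any healthy sample is what keeps brief spikes from paging anyone.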
| 48 | +--- |
| 49 | + |
| 50 | +## Monitoring Links |
| 51 | + |
| 52 | +| Tool | URL | |
| 53 | +|------|-----| |
| 54 | +| Live health check | `https://vector-catalog-api.politefield-8fe8e6a2.eastus.azurecontainerapps.io/health` | |
| 55 | +| Live Prometheus metrics | `https://vector-catalog-api.politefield-8fe8e6a2.eastus.azurecontainerapps.io/metrics` | |
| 56 | +| Local Prometheus | `http://localhost:9090` | |
| 57 | +| Local Grafana dashboard | `http://localhost:3000` (admin / admin) | |
| 58 | +| GitHub Actions (CI) | `https://github.com/ritunjaym/vector-catalog-service/actions` | |
| 59 | +| Trivy security scan | `https://github.com/ritunjaym/vector-catalog-service/security` | |
| 60 | + |
| 61 | +--- |
| 62 | + |
| 63 | +## Error Budget Burn Rate |
| 64 | + |
| 65 | +With a 99.9% uptime target, the Search API's monthly error budget is **43.8 minutes** (0.1% of an average 30.44-day month). |
| 66 | + |
| 67 | +| Burn rate | Meaning | Incident severity | |
| 68 | +|-----------|---------|--------------| |
| 69 | +| 1× | Consuming budget at exactly the SLO rate | — | |
| 70 | +| 5× | Budget exhausted in ~6 days | P2 | |
| 71 | +| 14.4× | Budget exhausted in ~2 days (about 50 hours) | P1 | |
| 72 | +| 36× | Budget exhausted in ~20 hours | P0 | |
| 73 | + |
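Burn rate is conventionally defined as the speed of budget consumption relative to the rate that would use up the budget exactly at the end of the SLO window, so time to exhaustion is the window length divided by the burn rate. The sketch below works through that arithmetic. The 30-day window and both function names are assumptions for illustration.

```python
def hours_to_exhaustion(burn_rate: float, window_days: float = 30.0) -> float:
    """Hours until the whole error budget is spent at a constant burn rate."""
    if burn_rate <= 0:
        raise ValueError("burn_rate must be positive")
    return window_days * 24 / burn_rate


def budget_consumed_fraction(
    burn_rate: float, elapsed_hours: float, window_days: float = 30.0
) -> float:
    """Fraction of the error budget spent after `elapsed_hours`."""
    return burn_rate * elapsed_hours / (window_days * 24)


for rate in (1, 5, 14.4, 36):
    hours = hours_to_exhaustion(rate)
    per_hour = budget_consumed_fraction(rate, 1)
    print(f"{rate}x: exhausted in {hours:.1f} h, {per_hour:.1%} of budget per hour")
```

Under these assumptions a 14.4× burn spends about 2% of the budget per hour, which is why it is commonly paired with a one-hour alerting window.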
| 74 | +--- |
| 75 | + |
| 76 | +## Exclusions |
| 77 | + |
| 78 | +- Scheduled maintenance windows (announced ≥ 24 hours in advance, max 4 hours/month) |
| 79 | +- Force majeure (Azure region outage, DNS provider failure) |
| 80 | +- Client-side network issues outside Azure |