Skip to content

Commit 02cc24d

Browse files
feat: add real-time alerting MVP (webhook + slack) (#4)
* feat: add real-time alerting mvp * docs: document alerting mvp and roadmap status
1 parent 2773172 commit 02cc24d

16 files changed

Lines changed: 1403 additions & 1 deletion

CHANGELOG.md

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
77

88
## [Unreleased]
99

10+
### Added
11+
12+
- **Phase 3 Step 3 -- Real-Time Alerting MVP (Webhook + Slack)**
13+
- Async fail-open alert dispatcher with queue, retry/backoff, and sink delivery metrics
14+
- Alert event emission for `injection_blocked`, `rate_limit_exceeded`, and `scan_error`
15+
- Generic webhook sink with optional Bearer token and Slack Incoming Webhook sink
16+
- Alerting config/env surface: `alerting.*` and `PIF_ALERTING_*`
17+
1018
## [1.2.0] - 2026-03-07
1119

1220
### Added

README.md

Lines changed: 37 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -111,6 +111,7 @@ PIF addresses this critical gap by providing a **transparent, low-latency detect
111111
- **Health check endpoint** (`/healthz`)
112112
- **Prometheus metrics endpoint** (`/metrics`)
113113
- **Embedded monitoring dashboard + custom rule management** (`/dashboard`, optional)
114+
- **Real-time alerting (Webhook + Slack)** with async fail-open delivery
114115
- **golangci-lint** and race-condition-tested CI
115116

116117
</td>
@@ -527,6 +528,34 @@ dashboard:
527528
# and dashboard.auth.enabled=true.
528529
# - Built-in rule files remain read-only; dashboard mutates only managed custom rules.
529530

531+
# Real-time alerting (optional)
532+
alerting:
533+
enabled: false
534+
queue_size: 1024
535+
events:
536+
block: true
537+
rate_limit: true
538+
scan_error: true
539+
throttle:
540+
window_seconds: 60 # Aggregate rate-limit and scan-error alerts per client/window
541+
webhook:
542+
enabled: false
543+
url: "" # Generic webhook endpoint
544+
timeout: "3s"
545+
max_retries: 3
546+
backoff_initial_ms: 200
547+
auth_bearer_token: "" # Optional outbound bearer token
548+
slack:
549+
enabled: false
550+
incoming_webhook_url: "" # Slack Incoming Webhook URL
551+
timeout: "3s"
552+
max_retries: 3
553+
backoff_initial_ms: 200
554+
555+
# Note:
556+
# - Alert delivery is async and fail-open: request path is never blocked by sink failures.
557+
# - Initial event scope: block, rate-limit, and scan-error.
558+
530559
# Rule file paths
531560
rules:
532561
paths:
@@ -562,6 +591,12 @@ PIF_DASHBOARD_AUTH_ENABLED=true
562591
PIF_DASHBOARD_AUTH_USERNAME=ops
563592
PIF_DASHBOARD_AUTH_PASSWORD=change-me
564593
PIF_DASHBOARD_RULE_MANAGEMENT_ENABLED=true
594+
PIF_ALERTING_ENABLED=true
595+
PIF_ALERTING_WEBHOOK_ENABLED=true
596+
PIF_ALERTING_WEBHOOK_URL=https://alerts.example.com/pif
597+
PIF_ALERTING_WEBHOOK_AUTH_BEARER_TOKEN=replace-me
598+
PIF_ALERTING_SLACK_ENABLED=true
599+
PIF_ALERTING_SLACK_INCOMING_WEBHOOK_URL=https://hooks.slack.com/services/T000/B000/XXX
565600
PIF_LOGGING_LEVEL=debug
566601
```
567602

@@ -704,7 +739,8 @@ Automated quality gates on every push and pull request:
704739

705740
- [x] Web-based read-only dashboard UI for monitoring (MVP)
706741
- [x] Dashboard rule management (write/edit workflows)
707-
- [ ] Real-time alerting (Slack, PagerDuty, webhooks)
742+
- [x] Real-time alerting: Webhook + Slack (MVP)
743+
- [ ] Real-time alerting: PagerDuty sink
708744
- [ ] Multi-tenant support with per-tenant policies
709745
- [ ] Attack replay and forensic analysis tools
710746
- [ ] Community rule marketplace

config.yaml

Lines changed: 23 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -45,6 +45,29 @@ dashboard:
4545
rule_management:
4646
enabled: false
4747

48+
alerting:
49+
enabled: false
50+
queue_size: 1024
51+
events:
52+
block: true
53+
rate_limit: true
54+
scan_error: true
55+
throttle:
56+
window_seconds: 60
57+
webhook:
58+
enabled: false
59+
url: ""
60+
timeout: "3s"
61+
max_retries: 3
62+
backoff_initial_ms: 200
63+
auth_bearer_token: ""
64+
slack:
65+
enabled: false
66+
incoming_webhook_url: ""
67+
timeout: "3s"
68+
max_retries: 3
69+
backoff_initial_ms: 200
70+
4871
webhook:
4972
listen: ":8443"
5073
tls_cert_file: "/etc/pif/webhook/tls.crt"

docs/API_REFERENCE.md

Lines changed: 58 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -33,6 +33,64 @@ Core metric names:
3333
- `pif_injection_detections_total`
3434
- `pif_detection_score`
3535
- `pif_rate_limit_events_total`
36+
- `pif_alert_events_total`
37+
- `pif_alert_sink_deliveries_total`
38+
39+
### Outbound Alerting (Optional)
40+
41+
When `alerting.enabled=true`, PIF emits outbound alerts without exposing new inbound HTTP endpoints.
42+
43+
Initial event types:
44+
45+
- `injection_blocked` (immediate on block action)
46+
- `rate_limit_exceeded` (window-aggregated per client key)
47+
- `scan_error` (window-aggregated per client key)
48+
49+
Delivery model:
50+
51+
- Async queue + worker dispatcher
52+
- Retry with exponential backoff and jitter
53+
- Fail-open behavior (delivery failure never blocks proxy request handling)
54+
- Sink execution order is sequential (`webhook` then `slack` when both are enabled)
55+
56+
Supported sinks:
57+
58+
- Generic webhook (`alerting.webhook.*`)
59+
- Slack Incoming Webhook (`alerting.slack.*`)
60+
61+
Generic webhook sends JSON payloads with the following contract:
62+
63+
```json
64+
{
65+
"event_id": "evt-1741363854757000000-1",
66+
"timestamp": "2026-03-07T12:30:54Z",
67+
"event_type": "injection_blocked",
68+
"action": "block",
69+
"client_key": "203.0.113.10",
70+
"method": "POST",
71+
"path": "/v1/chat/completions",
72+
"target": "https://api.openai.com",
73+
"score": 0.92,
74+
"threshold": 0.50,
75+
"findings_count": 2,
76+
"reason": "blocked_by_policy",
77+
"sample_findings": [
78+
{
79+
"rule_id": "PIF-INJ-001",
80+
"category": "prompt_injection",
81+
"severity": 4,
82+
"match": "ignore all previous instructions"
83+
}
84+
],
85+
"aggregate_count": 1
86+
}
87+
```
88+
89+
Notes:
90+
91+
- `sample_findings` is capped at 3 entries.
92+
- `aggregate_count` is used by aggregated events (`rate_limit_exceeded`, `scan_error`).
93+
- When configured, webhook sink sends `Authorization: Bearer <token>`.
3694

3795
### Embedded Dashboard (Optional)
3896

internal/cli/proxy.go

Lines changed: 46 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -78,6 +78,10 @@ func runProxy(cmd *cobra.Command, args []string) error {
7878
if err != nil {
7979
return fmt.Errorf("parsing proxy.write_timeout: %w", err)
8080
}
81+
alertingOptions, err := parseAlertingOptions(cfg)
82+
if err != nil {
83+
return err
84+
}
8185

8286
detectorFactory := buildProxyDetectorFactory(cfg, modelPath)
8387
ruleManager, err := proxy.NewRuntimeRuleManager(proxy.RuntimeRuleManagerOptions{
@@ -146,6 +150,7 @@ func runProxy(cmd *cobra.Command, args []string) error {
146150
},
147151
RuleInventory: ruleSnapshot.RuleSets,
148152
RuleManager: ruleManager,
153+
Alerting: alertingOptions,
149154
}, ruleManager.Detector())
150155
}
151156

@@ -202,3 +207,44 @@ func buildProxyDetectorFactory(cfg *config.Config, modelPath string) proxy.Detec
202207
return ensemble, nil
203208
}
204209
}
210+
211+
func parseAlertingOptions(cfg *config.Config) (proxy.AlertingOptions, error) {
212+
webhookTimeout, err := time.ParseDuration(cfg.Alerting.Webhook.Timeout)
213+
if err != nil {
214+
return proxy.AlertingOptions{}, fmt.Errorf("parsing alerting.webhook.timeout: %w", err)
215+
}
216+
slackTimeout, err := time.ParseDuration(cfg.Alerting.Slack.Timeout)
217+
if err != nil {
218+
return proxy.AlertingOptions{}, fmt.Errorf("parsing alerting.slack.timeout: %w", err)
219+
}
220+
throttleWindow := time.Duration(cfg.Alerting.Throttle.WindowSeconds) * time.Second
221+
if throttleWindow <= 0 {
222+
throttleWindow = 60 * time.Second
223+
}
224+
225+
return proxy.AlertingOptions{
226+
Enabled: cfg.Alerting.Enabled,
227+
QueueSize: cfg.Alerting.QueueSize,
228+
Events: proxy.AlertingEventOptions{
229+
Block: cfg.Alerting.Events.Block,
230+
RateLimit: cfg.Alerting.Events.RateLimit,
231+
ScanError: cfg.Alerting.Events.ScanError,
232+
},
233+
ThrottleWindow: throttleWindow,
234+
Webhook: proxy.AlertingSinkOptions{
235+
Enabled: cfg.Alerting.Webhook.Enabled,
236+
URL: cfg.Alerting.Webhook.URL,
237+
Timeout: webhookTimeout,
238+
MaxRetries: cfg.Alerting.Webhook.MaxRetries,
239+
BackoffInitial: time.Duration(cfg.Alerting.Webhook.BackoffInitialMs) * time.Millisecond,
240+
AuthBearerToken: cfg.Alerting.Webhook.AuthBearerToken,
241+
},
242+
Slack: proxy.AlertingSinkOptions{
243+
Enabled: cfg.Alerting.Slack.Enabled,
244+
URL: cfg.Alerting.Slack.IncomingWebhookURL,
245+
Timeout: slackTimeout,
246+
MaxRetries: cfg.Alerting.Slack.MaxRetries,
247+
BackoffInitial: time.Duration(cfg.Alerting.Slack.BackoffInitialMs) * time.Millisecond,
248+
},
249+
}, nil
250+
}

internal/cli/proxy_runtime_test.go

Lines changed: 57 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -120,6 +120,63 @@ proxy:
120120
assert.Contains(t, err.Error(), "parsing proxy.write_timeout")
121121
}
122122

123+
func TestParseAlertingOptions(t *testing.T) {
124+
cfg := config.Default()
125+
cfg.Alerting.Enabled = true
126+
cfg.Alerting.QueueSize = 256
127+
cfg.Alerting.Events.Block = true
128+
cfg.Alerting.Events.RateLimit = true
129+
cfg.Alerting.Events.ScanError = false
130+
cfg.Alerting.Throttle.WindowSeconds = 45
131+
cfg.Alerting.Webhook.Enabled = true
132+
cfg.Alerting.Webhook.URL = "https://example.com/hook"
133+
cfg.Alerting.Webhook.Timeout = "5s"
134+
cfg.Alerting.Webhook.MaxRetries = 4
135+
cfg.Alerting.Webhook.BackoffInitialMs = 150
136+
cfg.Alerting.Webhook.AuthBearerToken = "token"
137+
cfg.Alerting.Slack.Enabled = true
138+
cfg.Alerting.Slack.IncomingWebhookURL = "https://hooks.slack.test/abc"
139+
cfg.Alerting.Slack.Timeout = "4s"
140+
cfg.Alerting.Slack.MaxRetries = 2
141+
cfg.Alerting.Slack.BackoffInitialMs = 300
142+
143+
opts, err := parseAlertingOptions(cfg)
144+
require.NoError(t, err)
145+
146+
assert.True(t, opts.Enabled)
147+
assert.Equal(t, 256, opts.QueueSize)
148+
assert.True(t, opts.Events.Block)
149+
assert.True(t, opts.Events.RateLimit)
150+
assert.False(t, opts.Events.ScanError)
151+
assert.Equal(t, 45*time.Second, opts.ThrottleWindow)
152+
assert.True(t, opts.Webhook.Enabled)
153+
assert.Equal(t, "https://example.com/hook", opts.Webhook.URL)
154+
assert.Equal(t, 5*time.Second, opts.Webhook.Timeout)
155+
assert.Equal(t, 4, opts.Webhook.MaxRetries)
156+
assert.Equal(t, 150*time.Millisecond, opts.Webhook.BackoffInitial)
157+
assert.Equal(t, "token", opts.Webhook.AuthBearerToken)
158+
assert.True(t, opts.Slack.Enabled)
159+
assert.Equal(t, "https://hooks.slack.test/abc", opts.Slack.URL)
160+
assert.Equal(t, 4*time.Second, opts.Slack.Timeout)
161+
assert.Equal(t, 2, opts.Slack.MaxRetries)
162+
assert.Equal(t, 300*time.Millisecond, opts.Slack.BackoffInitial)
163+
}
164+
165+
func TestParseAlertingOptions_InvalidTimeout(t *testing.T) {
166+
cfg := config.Default()
167+
cfg.Alerting.Webhook.Timeout = "bad"
168+
169+
_, err := parseAlertingOptions(cfg)
170+
require.Error(t, err)
171+
assert.Contains(t, err.Error(), "parsing alerting.webhook.timeout")
172+
173+
cfg = config.Default()
174+
cfg.Alerting.Slack.Timeout = "bad"
175+
_, err = parseAlertingOptions(cfg)
176+
require.Error(t, err)
177+
assert.Contains(t, err.Error(), "parsing alerting.slack.timeout")
178+
}
179+
123180
func testContext(t *testing.T) context.Context {
124181
t.Helper()
125182
ctx, cancel := context.WithTimeout(context.Background(), time.Second)

0 commit comments

Comments
 (0)