6 changes: 3 additions & 3 deletions AGENTS.md
@@ -122,7 +122,7 @@ tests/
10. **Route-group auth guard** - Next.js `(dashboard)/layout.tsx` wraps all protected pages
11. **Mode-aware auth** - `none`/`bearer`/`credentials`/`mtls` with role checks on protected endpoints

See `docs/adr/` for all 19 Architecture Decision Records.
See `docs/adr/` for all 22 Architecture Decision Records.

## API Endpoints

@@ -258,8 +258,8 @@ Completed:
- Nginx reverse proxy (single-origin deployment)
- ErrorBoundary wrapping all dashboard pages
- Cross-entity navigation (command palette → detail pages, event↔mitigation linking, audit log → mitigations, clickable stat cards)
- Multi-signal correlation engine with signal groups, Alertmanager/FastNetMon adapters, a generic JSONPath-driven webhook adapter (ADR 020), and corroborating-only signals from coarse telemetry (ADR 021)
- 21 Architecture Decision Records
- Multi-signal correlation engine with signal groups, Alertmanager/FastNetMon adapters, a generic JSONPath-driven webhook adapter (ADR 020), corroborating-only signals from coarse telemetry (ADR 021), and exponential confidence decay over time (ADR 022)
- 22 Architecture Decision Records
- CLI tool (prefixdctl) for all API operations
- OpenAPI spec with utoipa annotations
- 179 backend unit tests + 99 integration + 16 postgres tests (+ 17 ignored requiring GoBGP/Docker)
13 changes: 13 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,19 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

## [0.18.0] - 2026-05-11

### Added

- **Confidence decay for signal groups (ADR 022).** Exponential time decay applied to per-event confidence contributions when computing `derived_confidence`. Each event's effective weight is multiplied by `0.5 ^ (age_seconds / half_life_seconds)`, so older corroborating evidence smoothly loses influence without ever being discarded. Default is disabled (`confidence_decay_half_life_seconds: 0`); zero behavior change for existing deployments.
- **Global config:** `correlation.confidence_decay_half_life_seconds: u32` (default 0). Validated to `0 ≤ H ≤ 10 × window_seconds`.
- **Per-playbook override:** `correlation_override.confidence_decay_half_life_seconds: Option<u32>`. `Some(0)` explicitly disables decay for the playbook; `None` falls through to global.
- **Reconcile loop step.** New `refresh_decayed_confidence` step iterates every open signal group each tick (default 30 s) and recomputes `derived_confidence` from current events with decay applied. No-op when decay is disabled.
- **One-shot `corroboration_met`.** `corroboration_met` is now sticky (`met_now || was_met`): if decay later pulls `derived_confidence` back below the threshold, the flag stays `true`. Once a group has authorized mitigation, decay never revokes that authorization for the group's lifetime.
- **Metric:** `prefixd_signal_group_decay_refreshes_total` counter ticks once per `refresh_decayed_confidence` invocation. Alert on "decay loop not running" by watching for the counter going flat.
- **UI:** Group detail page surfaces "decayed, half-life Ns" next to `derived_confidence` when decay is active for the group's effective playbook.
- 17 new tests (7 engine unit + 7 config unit + 3 integration) covering decay math, override resolution, validation, disabled paths, and one-shot stickiness.

## [0.17.1] - 2026-05-11

### Changed
2 changes: 1 addition & 1 deletion ROADMAP.md
@@ -301,7 +301,7 @@ Example: FastNetMon says UDP flood at 0.6 confidence + router CPU spiking + host
### Confidence Model

- [x] Derived confidence from traffic patterns
- [ ] Confidence decay over time
- [x] Confidence decay over time
- [x] Per-playbook thresholds

---
201 changes: 201 additions & 0 deletions docs/adr/022-confidence-decay.md
@@ -0,0 +1,201 @@
# ADR 022: Confidence Decay for Signal Groups

**Status:** Accepted
**Date:** 2026-05-11
**Extends:** [ADR 018 — Multi-Signal Correlation Engine](018-multi-signal-correlation-engine.md), [ADR 021 — Corroborating Signals](021-corroborating-signals.md)

## Context

The correlation engine (ADR 018) computes a signal group's
`derived_confidence` as the source-weighted average of per-event
confidence values:

```
derived = Σ(confidence_i · weight_i) / Σ(weight_i)
```

Once a group's `derived_confidence` clears the configured
`confidence_threshold` and `min_sources` is satisfied,
`corroboration_met` is flipped to `true` and the group is allowed to
trigger mitigations (or, in the ADR 021 corroborating-only flow,
strengthen open groups).

This works well during an active attack — fresh telemetry keeps arriving
and the weighted average reflects current reality. It is less honest
once an incident winds down:

1. **Long correlation windows hold stale evidence.** Operators routinely
configure `window_seconds: 3600` to absorb burst-and-recover patterns.
A high-confidence event ingested 50 minutes ago still contributes to
the average at full weight, even though everything since has been
benign.
2. **Corroborating sources from ADR 021 amplify the problem.** A
`mode: corroborating` source that fired hours ago continues to inflate
`derived_confidence` long after its operational signal has gone
silent.
3. **Operators cannot express "fresh evidence matters more"** without
abandoning windowed correlation entirely.

The result: groups whose underlying attack has already abated continue
to read as "highly corroborated" for the remainder of the window. Any
ADR-021 corroborator that fires in that window — even on totally
unrelated telemetry — sees a green light from the cached confidence and
nudges the group toward (re-)mitigation.

A naive fix ("drop events older than X seconds from the average") loses
useful history and produces step-function discontinuities in the score.

## Decision

Introduce **exponential confidence decay** on the
weighted-average computation. Each event's contribution is multiplied by
`0.5 ^ (age_seconds / half_life_seconds)` before being summed, so older
events smoothly lose weight without ever being discarded outright:

```
weight_eff_i = weight_i · 0.5 ^ (age_i / H)
derived = Σ(confidence_i · weight_eff_i) / Σ(weight_eff_i)
```

Where:

- `age_i = now - ingested_at_i` (clamped to ≥ 0)
- `H = effective_decay_half_life_seconds` (resolved per-playbook, see below)
- `H = 0` disables decay (default; preserves ADR 018 behavior)
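
As a rough illustration of the formula above, the following Rust sketch computes the decayed average over an in-memory slice. The `EventSample` struct and the `decayed_confidence` function are hypothetical names chosen for this example, not the engine's actual types, and returning `None` when no weight remains is an assumption.

```rust
/// One contributing event, reduced to the fields the formula needs.
/// Hypothetical struct for illustration only.
#[derive(Debug, Clone, Copy)]
pub struct EventSample {
    pub confidence: f64,  // per-event confidence in [0.0, 1.0]
    pub weight: f64,      // source weight
    pub ingested_at: i64, // Unix seconds
}

/// Source-weighted average of confidence with exponential decay.
/// `half_life_seconds == 0` disables decay (every factor is 1.0).
/// Returns `None` for an empty slice or when the total weight is zero.
pub fn decayed_confidence(
    events: &[EventSample],
    now: i64,
    half_life_seconds: u32,
) -> Option<f64> {
    let mut numerator = 0.0;
    let mut denominator = 0.0;
    for e in events {
        // age_i = now - ingested_at_i, clamped to >= 0
        let age = (now - e.ingested_at).max(0) as f64;
        let decay = if half_life_seconds == 0 {
            1.0
        } else {
            0.5_f64.powf(age / f64::from(half_life_seconds))
        };
        let effective_weight = e.weight * decay;
        numerator += e.confidence * effective_weight;
        denominator += effective_weight;
    }
    if denominator > 0.0 {
        Some(numerator / denominator)
    } else {
        None
    }
}
```

With two equally weighted events, 0.9 ingested 300 s ago and 0.3 ingested just now, a 300 s half-life gives (0.9·0.5 + 0.3·1) / 1.5 = 0.5, compared with 0.6 when decay is disabled.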

### Configuration

A new global field on `CorrelationConfig`:

```yaml
correlation:
  enabled: true
  window_seconds: 3600
  min_sources: 2
  confidence_threshold: 0.7
  confidence_decay_half_life_seconds: 300   # 5-minute half-life
```

Per-playbook override on `PlaybookCorrelationOverride`:

```yaml
playbooks:
  - vector: udp_flood
    correlation_override:
      confidence_decay_half_life_seconds: 60    # faster decay for noisy vector
  - vector: dns_amplification
    correlation_override:
      confidence_decay_half_life_seconds: 0     # explicitly disable for this playbook
```

Override resolution (`effective_decay_half_life()`):

- `Some(0)` ⇒ decay explicitly disabled for this playbook
- `Some(n)` ⇒ use `n`
- `None` ⇒ fall through to global `confidence_decay_half_life_seconds`
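
The three cases above collapse into a single fall-through. A minimal sketch, assuming the resolver receives the global value and the playbook's `Option<u32>` directly; the real `effective_decay_half_life()` signature is not shown in this ADR, so the parameters here are illustrative.

```rust
/// Resolve the half-life that applies to one playbook.
/// Hypothetical signature: the ADR names the function but not its arguments.
pub fn effective_decay_half_life(global: u32, playbook_override: Option<u32>) -> u32 {
    // Some(0) yields 0 (decay disabled), Some(n) yields n,
    // and None falls through to the global setting.
    playbook_override.unwrap_or(global)
}
```

Because `Some(0)` is a real value rather than "unset", a playbook can switch decay off even when the global half-life is non-zero.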

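Validation: `0 ≤ H ≤ 10 × window_seconds`. The upper bound catches
configuration mistakes where a half-life far longer than the correlation
window would render decay effectively a no-op: at `H = 10 × window_seconds`,
an event as old as the whole window still keeps about 93% of its weight
(`0.5 ^ 0.1 ≈ 0.93`).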
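
The bound check could look roughly like the sketch below. The function name, error type, and message are assumptions for illustration; only the inequality comes from this ADR.

```rust
/// Check `0 <= H <= 10 * window_seconds`.
/// The lower bound holds by construction because `u32` is unsigned.
/// Hypothetical helper; the project's validator may be shaped differently.
pub fn validate_half_life(half_life_seconds: u32, window_seconds: u32) -> Result<(), String> {
    // Widen to u64 so that 10 * window_seconds cannot overflow.
    let max = u64::from(window_seconds) * 10;
    if u64::from(half_life_seconds) > max {
        return Err(format!(
            "confidence_decay_half_life_seconds ({}) exceeds 10 x window_seconds ({})",
            half_life_seconds, max
        ));
    }
    Ok(())
}
```

With `window_seconds: 3600`, any half-life up to 36000 s passes and 36001 s is rejected.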

### Compute Paths

Two recompute paths use the decayed variant:

1. **`POST /v1/events` ingestion.** Every event that lands in an open
group recomputes `derived_confidence` with decay applied.
2. **Reconcile loop (every tick, default 30 s).** A new
`refresh_decayed_confidence` step iterates every open signal group
(`list_open_signal_groups`) and recomputes `derived_confidence` from
the current event set. This is what actually delivers the decay to
groups that aren't receiving fresh events.

The reconcile step is a no-op when `confidence_decay_half_life_seconds`
is 0 (so users not opting in pay no extra DB cost).
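
The reconcile step can be pictured with the in-memory sketch below. `StoredEvent` and `OpenGroup` are hypothetical stand-ins for the database rows, each group is assumed to carry its already-resolved half-life, and the return value (number of groups refreshed) is an illustrative choice rather than the documented behavior.

```rust
/// Hypothetical in-memory stand-ins for stored rows; field names are assumptions.
pub struct StoredEvent {
    pub confidence: f64,
    pub weight: f64,
    pub ingested_at: i64, // Unix seconds
}

pub struct OpenGroup {
    pub half_life_seconds: u32, // already resolved for the group's playbook
    pub derived_confidence: f64,
    pub events: Vec<StoredEvent>,
}

/// Decayed source-weighted average; `None` if no weight remains.
/// Callers must pass a non-zero half-life.
fn decayed_average(events: &[StoredEvent], now: i64, half_life_seconds: u32) -> Option<f64> {
    let h = f64::from(half_life_seconds);
    let (num, den) = events.iter().fold((0.0, 0.0), |(n, d), e| {
        let age = (now - e.ingested_at).max(0) as f64;
        let w_eff = e.weight * 0.5_f64.powf(age / h);
        (n + e.confidence * w_eff, d + w_eff)
    });
    if den > 0.0 { Some(num / den) } else { None }
}

/// Recompute `derived_confidence` for every open group with decay enabled.
/// Groups whose half-life is 0 are skipped, so they cost nothing.
pub fn refresh_decayed_confidence(groups: &mut [OpenGroup], now: i64) -> usize {
    let mut refreshed = 0;
    for group in groups.iter_mut() {
        if group.half_life_seconds == 0 {
            continue; // decay disabled for this group
        }
        if let Some(c) = decayed_average(&group.events, now, group.half_life_seconds) {
            group.derived_confidence = c;
            refreshed += 1;
        }
    }
    refreshed
}
```

Running this on a timer is what lets a group's score drift down between ingestions: nothing about the events changes, only `now`.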

### One-Shot Corroboration (Sticky `corroboration_met`)

When `derived_confidence` falls below `confidence_threshold` due to
decay, `corroboration_met` **must not** flap back to `false`. The flag
is sticky once set:

```rust
corroboration_met = met_now || was_met
```

This preserves the operational invariant that "once mitigation was
authorized for this group, it stays authorized for the lifetime of the
group" — decay only shapes future authorizations on *other* groups,
never revokes one already granted.
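
A minimal sketch of how the sticky flag might be evaluated, assuming `met_now` means that both the confidence gate and the `min_sources` gate pass, and that the threshold comparison is inclusive; neither detail is spelled out in this ADR. The function name and parameters are illustrative.

```rust
/// Evaluate the sticky corroboration flag for one group.
/// Hypothetical helper: `was_met` is the flag's previously stored value.
pub fn next_corroboration_met(
    was_met: bool,
    derived_confidence: f64,
    source_count: u32,
    confidence_threshold: f64,
    min_sources: u32,
) -> bool {
    // Assumption: "clears the threshold" is treated as >=.
    let met_now = derived_confidence >= confidence_threshold && source_count >= min_sources;
    // Once true, the flag never reverts, whatever decay does to the score.
    met_now || was_met
}
```

A group that reached 0.8 against a 0.7 threshold keeps `corroboration_met == true` after decay drags its score to 0.4, while a group that never cleared the threshold stays `false`.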

### Observability

- **Metric:** `prefixd_signal_group_decay_refreshes_total` counter,
ticks once per `refresh_decayed_confidence` invocation (whether or not
any groups were refreshed). Lets operators alert on "decay loop not
running".
- **UI:** The group detail page surfaces "decayed, half-life Ns" next to
the `derived_confidence` value when decay is active for the group's
effective playbook, so operators can interpret the score correctly.

## Consequences

### Positive

- Stale corroboration evidence loses weight smoothly without
discontinuities.
- ADR-021 corroborating sources from earlier in the window no longer
hold groups at artificially high confidence.
- Per-playbook tuning lets operators dial decay speed per vector (e.g.
faster decay for noisy UDP floods, slower for slow-and-low credential
stuffing).
- Sticky `corroboration_met` prevents flap-back of authorized
mitigations even under aggressive decay configs.
- Defaults to disabled (`H = 0`) — zero behavior change for existing
deployments.

### Negative

- Reconcile loop now does O(open_groups · events_per_group) DB reads
per tick when decay is enabled. For typical deployments (< 100 open
groups, < 10 events each) this is negligible, but the cost grows
linearly with both factors, so deployments with far more open groups
or events per group will see added read load on every tick.
- `derived_confidence` is no longer a pure function of "events on the
group" — it now also depends on wall-clock time. This complicates
reproducing a group's score offline; the trade-off is acceptable
given the operational gain.
- Decay does not change `source_count` (still a raw distinct-source
count). Operators relying on `source_count` for thresholding will not
see decay affect their gate; only `confidence_threshold` benefits.

### Single-Event Math Note

For a group with exactly one event, `derived_confidence` is unaffected
by decay (the decay factor cancels in `Σ(c·w_eff) / Σ(w_eff)`). Decay
only meaningfully shifts the score when a group has events at different
ages. This is mathematically correct and matches operator intuition:
"one piece of evidence is one piece of evidence, regardless of how old".
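
The cancellation can be written out explicitly. Here `c_i`, `w_i`, and `a_i` are shorthand for `confidence_i`, `weight_i`, and `age_i`; the second line is an illustrative two-event case with equal weights (0.9 at age 300 s, 0.3 at age 0 s, `H = 300`), chosen for this note rather than taken from real data.

```latex
% Single event: the decay factor appears in numerator and denominator and cancels.
\text{derived}
  = \frac{c_1 \, w_1 \cdot 0.5^{\,a_1/H}}{w_1 \cdot 0.5^{\,a_1/H}}
  = c_1

% Two events of different ages: the factors differ, so the score shifts.
\text{derived}
  = \frac{0.9 \cdot 0.5^{300/300} + 0.3 \cdot 0.5^{0/300}}{0.5^{300/300} + 0.5^{0/300}}
  = \frac{0.45 + 0.30}{1.5}
  = 0.5
```

Without decay the same two events average to 0.6, so the older high-confidence event has lost some of its pull but not all of it.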

## Alternatives Considered

1. **Hard cutoff (drop events older than X).** Rejected: step-function
discontinuities in score, and loses useful history for slow-attack
detection.
2. **Linear decay.** Considered. Rejected in favor of exponential
because half-life is the unit operators reason about intuitively
("after 5 minutes, evidence is worth half what it was") and matches
industry convention for time-decayed metrics.
3. **Per-source decay rates.** Considered. Deferred: introduces another
tuning knob whose value isn't obvious to operators and overlaps with
per-source `weight`. Global + per-playbook covers the immediate
need.
4. **Decay only on corroborating signals.** Considered. Rejected
because primary detectors also produce stale evidence (a firing
Prometheus alert that has been resolved for 40 minutes shouldn't
keep contributing at full weight either).

## Migration

No migration required. Default `confidence_decay_half_life_seconds: 0`
preserves ADR 018 behavior bit-for-bit. Operators opt in by setting a
non-zero value in `correlation.yaml`.
1 change: 1 addition & 0 deletions docs/adr/README.md
@@ -29,5 +29,6 @@ Format follows [Michael Nygard's template](https://cognitect.com/blog/2011/11/15
| [019](019-signal-adapter-architecture.md) | Signal Adapter Architecture | Accepted | 2026-03-19 |
| [020](020-generic-webhook-adapter.md) | Generic Webhook Adapter | Accepted | 2026-04-18 |
| [021](021-corroborating-signals.md) | Corroborating Signals | Accepted | 2026-04-19 |
| [022](022-confidence-decay.md) | Confidence Decay for Signal Groups | Accepted | 2026-05-11 |

ADRs are numbered sequentially as written. Retroactive ADRs (009-013) were documented on 2026-02-18 but dated to when the decision was originally made.
2 changes: 1 addition & 1 deletion docs/api.md
@@ -546,7 +546,7 @@ GET /v1/signal-groups/{id}
Authorization: Bearer <token>
```

Returns group metadata and all contributing events with source, confidence, and source weight.
Returns group metadata and all contributing events with source, confidence, and source weight. When `correlation.confidence_decay_half_life_seconds` (or the playbook override) is non-zero, `derived_confidence` is computed with exponential time decay applied — older events contribute proportionally less weight. See [ADR 022](adr/022-confidence-decay.md). The `corroboration_met` flag is sticky: once set to `true` it never reverts to `false`, even if decay drives `derived_confidence` back under the threshold.

**Response:**

10 changes: 10 additions & 0 deletions docs/configuration.md
@@ -284,6 +284,13 @@ correlation:
# Global minimum derived confidence threshold (0.0-1.0).
# A signal group must reach this threshold (in addition to min_sources) before triggering.
confidence_threshold: 0.5

# Exponential half-life applied to per-event confidence contributions when
# computing derived_confidence. 0 disables decay (default). When set, an
# event's effective weight is multiplied by 0.5^(age_seconds / H), so
# older corroborating evidence smoothly loses influence. See ADR 022.
# Must satisfy 0 <= H <= 10 * window_seconds.
confidence_decay_half_life_seconds: 0

# Default weight for sources not listed below
default_weight: 1.0
@@ -307,6 +314,7 @@ correlation:
| `window_seconds` | integer | `300` | Time window for grouping signals (seconds) |
| `min_sources` | integer | `1` | Minimum distinct sources to trigger mitigation |
| `confidence_threshold` | float | `0.5` | Minimum derived confidence to trigger |
| `confidence_decay_half_life_seconds` | integer | `0` | Exponential half-life (seconds) for time-decaying per-event confidence contributions; 0 disables decay. Bounded by `10 × window_seconds`. See [ADR 022](adr/022-confidence-decay.md). |
| `default_weight` | float | `1.0` | Weight for unknown/unconfigured sources |
| `sources` | map | `{}` | Per-source weight and type configuration |

@@ -367,6 +375,7 @@ playbooks:
correlation:
min_sources: 2 # Require corroboration for UDP floods
confidence_threshold: 0.7
confidence_decay_half_life_seconds: 60 # Faster decay for noisy vector
steps:
- action: police
rate_bps: 5000000
@@ -379,6 +388,7 @@ When a playbook has no `correlation` override, the global defaults from `prefixd
|----------------|------|-------------|
| `min_sources` | integer | Override global min_sources for this playbook |
| `confidence_threshold` | float | Override global confidence_threshold for this playbook |
| `confidence_decay_half_life_seconds` | integer (optional) | Override global half-life. `0` explicitly disables decay for this playbook even when the global is non-zero; omit to inherit the global value. |

#### Hot Reload

8 changes: 8 additions & 0 deletions frontend/app/(dashboard)/correlation/groups/[id]/page.tsx
@@ -449,6 +449,14 @@ export default function SignalGroupDetailPage({
<div>
<p className="text-xs text-muted-foreground mb-1">
Derived Confidence
{correlationConfig &&
correlationConfig.confidence_decay_half_life_seconds &&
correlationConfig.confidence_decay_half_life_seconds > 0 ? (
<span className="ml-1 text-muted-foreground/70">
(decayed, half-life{" "}
{correlationConfig.confidence_decay_half_life_seconds}s)
</span>
) : null}
</p>
<div className="flex items-center gap-2">
<div className="flex-1 h-2 rounded-full bg-muted overflow-hidden">
1 change: 1 addition & 0 deletions frontend/lib/api.ts
@@ -808,6 +808,7 @@ export interface CorrelationConfig {
min_sources: number
confidence_threshold: number
default_weight: number
confidence_decay_half_life_seconds?: number
sources: Record<string, SourceConfig>
webhook_adapters?: WebhookAdapter[]
}