indexer: materialized incident events #322

Open
snormore wants to merge 18 commits into main from snor/incident-events


@snormore snormore commented Mar 29, 2026

Summary of Changes

  • Incident detection Temporal workflow that scans rollup tables every 30s using a watermark and writes append-only lifecycle events (opened → symptom_added/symptom_resolved → resolved) into new ClickHouse tables (link_incident_events, device_incident_events)
  • v2 incident API endpoints that aggregate events per entity into a single incident with severity, multi-symptom grouping, peak values, and full event history
  • Per-incident detail pages for links and devices with contextual charts scoped to the incident time window
  • Incidents list page migrated to v2 data model with severity badges, multi-symptom display, and split no_latency_data/no_traffic_data symptom types
  • Admin CLI command for backfilling historical incident events from existing rollup data
  • Page cache entries for v2 incident list endpoints

Diff Breakdown

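| Category    | Files | Lines (+/-)   | Net   |
|-------------|-------|---------------|-------|
| Core logic  | 9     | +2469 / -1    | +2468 |
| Scaffolding | 5     | +159 / -0     | +159  |
| Tests       | 3     | +777 / -0     | +777  |
| Frontend    | 4     | +1400 / -1013 | +387  |
| Docs        | 1     | +132 / -0     | +132  |
| Migration   | 1     | +50 / -0      | +50   |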

~50% core backend logic, ~25% frontend, ~15% tests, ~10% scaffolding/docs/migration

Testing Verification

  • Incident detection activities and backfill covered by integration tests against ClickHouse testcontainers
  • All Go tests pass with race detector (go test -race ./indexer/pkg/incidents/...)

@snormore snormore force-pushed the snor/incident-events branch from 2f20e0e to a60acd4 Compare March 29, 2026 23:45
@snormore snormore force-pushed the snor/incident-events branch from a60acd4 to 45ed5e0 Compare March 30, 2026 00:34
github-actions bot commented Mar 30, 2026

🔗 Preview: https://pr-322.data.malbeclabs.com

@snormore snormore force-pushed the snor/incident-events branch 3 times, most recently from d4b7225 to ba54c94 Compare March 31, 2026 02:28
@snormore snormore changed the base branch from main to snor/unified-metrics March 31, 2026 02:29
@snormore snormore force-pushed the snor/incident-events branch from ba54c94 to 97042c5 Compare March 31, 2026 03:16
@snormore snormore force-pushed the snor/unified-metrics branch from 522b112 to d11b76c Compare March 31, 2026 04:20
@snormore snormore force-pushed the snor/incident-events branch from 97042c5 to 7dca8fb Compare March 31, 2026 04:20
@snormore snormore force-pushed the snor/unified-metrics branch 2 times, most recently from 0e8a1c3 to b78d886 Compare April 2, 2026 01:14
Base automatically changed from snor/unified-metrics to main April 2, 2026 04:25
@snormore snormore force-pushed the snor/incident-events branch from 7dca8fb to 84ea48b Compare April 2, 2026 15:32
@snormore snormore changed the title from "api+web+indexer: materialized incident events and unified metrics endpoints" to "indexer: materialized incident events with watermark-based detection" Apr 2, 2026
@snormore snormore changed the title from "indexer: materialized incident events with watermark-based detection" to "indexer: materialized incident events" Apr 2, 2026
@snormore snormore changed the title from "indexer: materialized incident events" to "indexer: materialized incident events" Apr 2, 2026
@snormore snormore force-pushed the snor/incident-events branch 3 times, most recently from 475a436 to 73dca63 Compare April 5, 2026 14:15
@snormore snormore removed the preview label Apr 5, 2026
@snormore snormore force-pushed the snor/incident-events branch 6 times, most recently from 7154cb0 to df9903c Compare April 7, 2026 21:15
snormore added 2 commits April 9, 2026 09:05
…est infra fixes

Add unified /api/link-metrics and /api/device-metrics endpoints that return
composable time-bucketed status, latency, and traffic in a single response.

Add reusable chart components for device and link detail pages (health
timeline, interface issues, traffic, latency, jitter, packet loss).

Fix effectiveBucket/bucketForDuration drift, align start time to bucket
boundaries, increase ClickHouse dial timeouts, and fix test cleanup using
closed admin connections.

The alias must come before FINAL in ClickHouse syntax:
device_interface_rollup_5m r FINAL, not FINAL r.
snormore added 16 commits April 9, 2026 09:05
…etail improvements

- Add GET /api/link-metrics/{pk} and GET /api/device-metrics/{pk} unified endpoints
- Return status, latency, traffic, and status_changes in one response
- Add committed_jitter_us to link metrics response
- Add pps and max_bps/max_pps to traffic response
- Fix auto bucket sizing to use bucketForDuration instead of 24-bucket division
- Fix raw interface rollup bucketing to use requested interval
- Don't classify one-sided probe reporting as unhealthy on links
- Classify devices as unhealthy when not sending latency probes
- Add presetToDuration helper for time range conversion

- link-charts/: LinkHealthTimeline, LinkPacketLossChart, LinkInterfaceIssuesChart,
  LinkLatencyChart, LinkJitterChart, LinkTrafficChart
- device-charts/: DeviceHealthTimeline, DeviceInterfaceIssuesChart, DeviceTrafficChart
- All charts: empty state, consistent x-axis alignment, second-precision timestamps,
  loading indicators, bidirectional Rx/Tx traffic view with toggle, avg/peak and
  bps/pps selectors, committed RTT and jitter reference lines

- Client-side bar merging (max 64, min 48 bars)
- Detailed tooltips with reasons (packet loss severity, latency vs SLA, interface
  issues, ISIS state, missing data indicators)
- Missing latency/traffic data shows as degraded
- Trailing collecting bars handled for rollup lag
- Custom time range labels instead of "custom"/"Now"

- Link and device detail pages migrated to unified metrics endpoint
- Link and device incident detail pages migrated to new chart components
- Collapsible event timeline for incidents with many events
- 404-aware error handling for missing incidents
- Refresh button next to time range selector
- Back buttons use Link instead of history navigation

- Generate intermediate symptom_added/symptom_resolved events (not just
  opened/resolved) matching live detector behavior
- Use raw (pre-coalesce) windows for event generation to show individual spikes
- Hybrid SQL+Go approach: SQL computes symptom windows, Go generates events
- Widen live rollup window from 10 to 30 minutes for resilience
- Cap backfill end time to avoid overlapping with live rollup

- Hide timeline nav item from sidebar
- Rename legend "Max" column to "Peak"

Move Date.now() calls into useState initializers to satisfy the
react-hooks/purity rule that forbids impure functions during render.

Replace the dual live-detection/backfill approach with a single
window-based engine. The detection workflow tracks a watermark via
Temporal ContinueAsNew, checks rollup freshness each cycle, and
processes from watermark to latest rollup data using BackfillLinkChunk
and BackfillDeviceChunk.

This eliminates false no-data incidents during indexer downtime — the
detector never processes beyond actual rollup data. Gaps are
automatically backfilled in 1-hour chunks when the indexer catches up.

Also caps resolved event timestamps to the processing window end to
prevent future-dated events.
…detection

Materialized incident events with watermark-based detection that eliminates
false no-data incidents during indexer downtime. Adds v2 incident API
endpoints, incident detail pages, and incidents list page migration.

Use log_ingestion_runs to track per-pipeline freshness so incident
detection doesn't create false no_latency_data/no_traffic_data incidents
when a data pipeline falls behind.

- Rollup activities now record SourceMinEventTS/SourceMaxEventTS
- CheckRollupFreshness queries ingestion logs instead of rollup tables
- DetectionState tracks LatencyWatermark and TrafficWatermark separately
- Each watermark advances independently to its pipeline's freshness
- no_*_data symptoms are suppressed for windows past the pipeline cutoff
- When a lagging pipeline catches up, the gap is reprocessed with
  Overwrite to regenerate incidents with the full symptom set
Add status_changed as a first-class event type in incident events.
When an entity's status changes during an open incident (e.g., link
drained or undrained), the detector emits a status_changed event with
previous and new status.

- New migration adds status_changed to Enum8, previous_status/new_status columns
- Detect status transitions from dimension history tables during backfill
- Resolve entity status point-in-time from history (not current views)
- API exposes previous_status/new_status on incident events
- Frontend renders status changes with diamond dots in timeline
- Extend status change query window 1h past incident end
The is_drained flag on incident list items now checks for
status_changed events with a drain status, not just the entity's
current status. This means incidents that were drained at any point
during the incident show the drained badge, even if the entity was
later undrained.
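
A small sketch of that rule, assuming the set of drain statuses is passed in by the caller. The `Event` type, the `IsDrained` function, the status names in `main`, and the choice to match a drain status on either side of a status_changed event are all assumptions for illustration.

```go
package main

import "fmt"

// Event is a hypothetical incident event with the status columns.
type Event struct {
	Type           string
	PreviousStatus string
	NewStatus      string
}

// IsDrained reports whether the entity is drained now or was drained at
// any point during the incident, judged from status_changed events.
func IsDrained(currentStatus string, events []Event, drain map[string]bool) bool {
	if drain[currentStatus] {
		return true
	}
	for _, e := range events {
		if e.Type != "status_changed" {
			continue
		}
		// A drain status on either side means the entity was drained
		// at some point while the incident was open.
		if drain[e.PreviousStatus] || drain[e.NewStatus] {
			return true
		}
	}
	return false
}

func main() {
	drain := map[string]bool{"drained": true} // hypothetical status name
	events := []Event{
		{Type: "opened"},
		{Type: "status_changed", PreviousStatus: "activated", NewStatus: "drained"},
		{Type: "status_changed", PreviousStatus: "drained", NewStatus: "activated"},
	}
	fmt.Println(IsDrained("activated", events, drain))
}
```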
The v2 event-based incident system is now the only incident system.
All web UI already uses v2 endpoints exclusively. The timeline page
was hidden and unused.
…etail pages

Wire up hoveredTimeRange and chartHoveredTime state so hovering
timeline bars highlights chart regions and hovering charts highlights
the corresponding timeline bar, matching the link/device detail pages.
@snormore snormore force-pushed the snor/incident-events branch from 6194b13 to 32e4eb0 Compare April 9, 2026 13:11