🔗 Preview: https://pr-322.data.malbeclabs.com
…est infra fixes

Add unified /api/link-metrics and /api/device-metrics endpoints that return composable time-bucketed status, latency, and traffic in a single response. Add reusable chart components for device and link detail pages (health timeline, interface issues, traffic, latency, jitter, packet loss). Fix effectiveBucket/bucketForDuration drift, align start time to bucket boundaries, increase ClickHouse dial timeouts, and fix test cleanup using closed admin connections.
The alias must come before FINAL in ClickHouse syntax: device_interface_rollup_5m r FINAL, not FINAL r.
…etail improvements
- Add GET /api/link-metrics/{pk} and GET /api/device-metrics/{pk} unified endpoints
- Return status, latency, traffic, and status_changes in one response
- Add committed_jitter_us to link metrics response
- Add pps and max_bps/max_pps to traffic response
- Fix auto bucket sizing to use bucketForDuration instead of 24-bucket division
- Fix raw interface rollup bucketing to use requested interval
- Don't classify one-sided probe reporting as unhealthy on links
- Classify devices as unhealthy when not sending latency probes
- Add presetToDuration helper for time range conversion
- link-charts/: LinkHealthTimeline, LinkPacketLossChart, LinkInterfaceIssuesChart,
LinkLatencyChart, LinkJitterChart, LinkTrafficChart
- device-charts/: DeviceHealthTimeline, DeviceInterfaceIssuesChart, DeviceTrafficChart
- All charts: empty state, consistent x-axis alignment, second-precision timestamps,
loading indicators, bidirectional Rx/Tx traffic view with toggle, avg/peak and
bps/pps selectors, committed RTT and jitter reference lines
- Client-side bar merging (max 64, min 48 bars)
- Detailed tooltips with reasons (packet loss severity, latency vs SLA, interface
issues, ISIS state, missing data indicators)
- Missing latency/traffic data shows as degraded
- Trailing collecting bars handled for rollup lag
- Custom time range labels instead of "custom"/"Now"
- Link and device detail pages migrated to unified metrics endpoint
- Link and device incident detail pages migrated to new chart components
- Collapsible event timeline for incidents with many events
- 404-aware error handling for missing incidents
- Refresh button next to time range selector
- Back buttons use Link instead of history navigation
- Generate intermediate symptom_added/symptom_resolved events (not just
opened/resolved) matching live detector behavior
- Use raw (pre-coalesce) windows for event generation to show individual spikes
- Hybrid SQL+Go approach: SQL computes symptom windows, Go generates events
- Widen live rollup window from 10 to 30 minutes for resilience
- Cap backfill end time to avoid overlapping with live rollup
- Hide timeline nav item from sidebar
- Rename legend "Max" column to "Peak"
Move Date.now() calls into useState initializers to satisfy the react-hooks/purity rule that forbids impure functions during render.
Replace the dual live-detection/backfill approach with a single window-based engine. The detection workflow tracks a watermark via Temporal ContinueAsNew, checks rollup freshness each cycle, and processes from watermark to latest rollup data using BackfillLinkChunk and BackfillDeviceChunk. This eliminates false no-data incidents during indexer downtime — the detector never processes beyond actual rollup data. Gaps are automatically backfilled in 1-hour chunks when the indexer catches up. Also caps resolved event timestamps to the processing window end to prevent future-dated events.
…detection

Materialized incident events with watermark-based detection that eliminates false no-data incidents during indexer downtime. Adds v2 incident API endpoints, incident detail pages, and incidents list page migration.
Use log_ingestion_runs to track per-pipeline freshness so incident detection doesn't create false no_latency_data/no_traffic_data incidents when a data pipeline falls behind.

- Rollup activities now record SourceMinEventTS/SourceMaxEventTS
- CheckRollupFreshness queries ingestion logs instead of rollup tables
- DetectionState tracks LatencyWatermark and TrafficWatermark separately
- Each watermark advances independently to its pipeline's freshness
- no_*_data symptoms are suppressed for windows past the pipeline cutoff
- When a lagging pipeline catches up, the gap is reprocessed with Overwrite to regenerate incidents with the full symptom set
Add status_changed as a first-class event type in incident events. When an entity's status changes during an open incident (e.g., link drained or undrained), the detector emits a status_changed event with previous and new status.

- New migration adds status_changed to Enum8, previous_status/new_status columns
- Detect status transitions from dimension history tables during backfill
- Resolve entity status point-in-time from history (not current views)
- API exposes previous_status/new_status on incident events
- Frontend renders status changes with diamond dots in timeline
- Extend status change query window 1h past incident end
The is_drained flag on incident list items now checks for status_changed events with a drain status, not just the entity's current status. This means incidents that were drained at any point during the incident show the drained badge, even if the entity was later undrained.
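In code, the check described above might reduce to something like the following sketch. `isDrained`, the `incidentEvent` shape, and the set of drain status values are assumptions; the PR text does not list the actual status names.

```go
package main

import "fmt"

// incidentEvent is a trimmed-down event with only the fields needed here.
type incidentEvent struct {
	Type      string
	NewStatus string
}

// isDrained (hypothetical) is true when the entity is currently drained or
// when any status_changed event during the incident moved it to a drain status.
func isDrained(currentStatus string, events []incidentEvent, drainStatuses map[string]bool) bool {
	if drainStatuses[currentStatus] {
		return true
	}
	for _, e := range events {
		if e.Type == "status_changed" && drainStatuses[e.NewStatus] {
			return true
		}
	}
	return false
}

func main() {
	// Status names here are placeholders for the example.
	drain := map[string]bool{"drained": true}
	events := []incidentEvent{
		{Type: "opened"},
		{Type: "status_changed", NewStatus: "drained"},
		{Type: "status_changed", NewStatus: "activated"},
	}
	fmt.Println(isDrained("activated", events, drain))
}
```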
The v2 event-based incident system is now the only incident system. All web UI already uses v2 endpoints exclusively. The timeline page was hidden and unused.
…etail pages

Wire up hoveredTimeRange and chartHoveredTime state so hovering timeline bars highlights chart regions and hovering charts highlights the corresponding timeline bar, matching the link/device detail pages.
Summary of Changes
opened→symptom_added/resolved→resolved) into new ClickHouse tables (link_incident_events, device_incident_events)
Diff Breakdown
~50% core backend logic, ~25% frontend, ~15% tests, ~10% scaffolding/docs/migration
Key files
- api/handlers/incidents_v2.go: v2 incident API with aggregated multi-symptom incidents, severity, peak values, event timeline, and detail endpoints
- indexer/pkg/incidents/backfill_test.go: integration tests for historical incident event backfill against ClickHouse testcontainers
- indexer/pkg/incidents/backfill.go: backfill logic to generate incident events from existing rollup tables
- web/src/components/link-incident-detail-page.tsx: link incident detail page with contextual health/latency/traffic charts
- web/src/components/device-incident-detail-page.tsx: device incident detail page with contextual health/traffic charts
- indexer/pkg/incidents/backfill_queries.go: ClickHouse queries for backfill incident detection across link/device rollup tables
- web/src/components/incidents-page.tsx: incidents list rewritten for v2 model with multi-symptom and severity support
- web/src/lib/api.ts: v2 incident types and fetch functions
Testing Verification
go test -race ./indexer/pkg/incidents/...)