Transform the existing SHC skeleton into a fully working, visually demonstrable Self-Healing Cluster system suitable for a final-year academic project presentation.
The system will:
- Continuously monitor simulated cloud node metrics
- Run an ML anomaly detection model (Isolation Forest) trained on realistic node failure scenarios
- Automatically "heal" nodes when a persistent anomaly is confirmed
- Show everything live on a beautiful dark-mode dashboard
- Add more diverse failure scenarios to the dataset generator (CPU spike, OOM, disk I/O flood, network degradation, crash loop, thermal throttle)
- Tune `contamination=0.08` to better reflect real-world failure rates
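The failure scenarios listed above could be generated along the following lines. This is a minimal standard-library sketch, not the project's generator. The feature names come from the `/predict` payload used in the verification steps. All numeric ranges, the `SCENARIOS` table, and the `sample`/`generate` helpers are hypothetical. The 8% failure share is chosen only to match `contamination=0.08`.

```python
import random

# Feature names as used in the /predict payload elsewhere in this plan.
FEATURES = ["cpu_usage", "memory_usage", "request_rate", "latency",
            "pod_restarts", "disk_io", "network_errors", "error_rate"]
INT_FEATURES = {"pod_restarts", "network_errors"}

# Hypothetical healthy ranges (low, high) for each feature.
BASELINE = {
    "cpu_usage": (10, 60), "memory_usage": (20, 65), "request_rate": (50, 200),
    "latency": (20, 200), "pod_restarts": (0, 1), "disk_io": (5, 40),
    "network_errors": (0, 3), "error_rate": (0.0, 0.05),
}

# Hypothetical overrides: each scenario pushes a few features out of range.
SCENARIOS = {
    "cpu_spike": {"cpu_usage": (90, 100), "latency": (500, 2000)},
    "oom": {"memory_usage": (92, 100), "pod_restarts": (3, 8)},
    "disk_io_flood": {"disk_io": (85, 100), "latency": (800, 3000)},
    "network_degradation": {"network_errors": (20, 80),
                            "latency": (1000, 5000),
                            "error_rate": (0.2, 0.6)},
    "crash_loop": {"pod_restarts": (5, 15), "request_rate": (0, 20),
                   "error_rate": (0.5, 1.0)},
    "thermal_throttle": {"cpu_usage": (70, 85), "latency": (400, 1500),
                         "request_rate": (10, 60)},
}


def sample(scenario=None, rng=random):
    """Draw one metrics row; `scenario=None` means a healthy node."""
    ranges = dict(BASELINE)
    if scenario is not None:
        ranges.update(SCENARIOS[scenario])
    row = {}
    for name in FEATURES:
        low, high = ranges[name]
        if name in INT_FEATURES:
            row[name] = rng.randint(low, high)
        else:
            row[name] = round(rng.uniform(low, high), 3)
    return row


def generate(n, failure_rate=0.08, seed=0):
    """Generate n labelled rows, roughly `failure_rate` of them failures."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        failing = rng.random() < failure_rate
        scenario = rng.choice(sorted(SCENARIOS)) if failing else None
        row = sample(scenario, rng)
        row["scenario"] = scenario or "normal"
        rows.append(row)
    return rows
```

The `scenario` label is kept only for inspection. An unsupervised model such as Isolation Forest would be trained on the feature columns alone.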
- Replace the raw dict parameter with a Pydantic `MetricsInput` model (prevents 500 on missing keys)
- Add a `/health` GET endpoint returning `{ "status": "ok" }`
- Add a `/info` GET endpoint that returns model metadata (algorithm, feature names, contamination rate)
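A possible shape for the `MetricsInput` model is sketched below. The field names are taken from the `/predict` payload in the verification steps. The field types and the `parse_metrics` helper are assumptions made for illustration. The sketch uses Pydantic on its own so the validation behaviour can be seen without starting the service.

```python
from pydantic import BaseModel, ValidationError


class MetricsInput(BaseModel):
    # Field names follow the /predict payload; the types are assumed.
    cpu_usage: float
    memory_usage: float
    request_rate: float
    latency: float
    pod_restarts: int
    disk_io: float
    network_errors: int
    error_rate: float


def parse_metrics(payload: dict):
    """Return (model, None) on success or (None, errors) on failure.

    Hypothetical helper: shows that a missing key surfaces as a
    structured validation error instead of a KeyError deep in the handler.
    """
    try:
        return MetricsInput(**payload), None
    except ValidationError as exc:
        return None, exc.errors()
```

When a model like this is used as the request body type of a FastAPI route, a payload with a missing key is rejected with a 422 response before the handler runs, which is what removes the 500.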
- Add a `requirements.txt` and install from it (better practice than inline pip)
- Add resource `requests` and `limits` (memory: 256Mi/512Mi, cpu: 100m/300m)
- Fix `/stress`: move CPU burn into a `worker_thread` so the event loop stays alive
- Guard `/crash` with a `?token=secret` query param check
- `await restartPod()` in the monitor loop
- Replace pure-random metrics with trending simulation (metrics drift into anomalous ranges for demo effect)
- Add in-memory `healingLog[]` (last 50 events); each entry: `{ time, type, metrics, action }`
- Add WebSocket server (`ws` package) that broadcasts live events to the dashboard
- Add REST endpoint `GET /api/events` returning the healing log (for dashboard initial load)
- Add REST endpoint `GET /api/metrics` returning the last collected metrics snapshot
- Serve the `dashboard/` folder as static files at `/`
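The confirm-then-heal logic and the bounded healing log could look roughly like this. Demo-Service itself is Node.js, so this Python sketch illustrates the logic only and is not a port of its code. The `HealingMonitor` class, the `CONFIRM_AFTER` threshold of three consecutive anomalous polls, and the `restart_pod` action string are all hypothetical. The 50-event cap and the entry fields (`time`, `type`, `metrics`, `action`) come from the list above.

```python
from collections import deque
from datetime import datetime, timezone

CONFIRM_AFTER = 3  # assumed: consecutive anomalous polls before healing
MAX_EVENTS = 50    # from the plan: keep only the last 50 events


class HealingMonitor:
    """Tracks anomaly persistence and records healing events."""

    def __init__(self, confirm_after=CONFIRM_AFTER, max_events=MAX_EVENTS):
        self.confirm_after = confirm_after
        self.streak = 0
        # deque(maxlen=...) silently drops the oldest entry when full.
        self.healing_log = deque(maxlen=max_events)

    def observe(self, metrics, is_anomaly, anomaly_type="unknown"):
        """Feed one poll result and return the resulting state name."""
        if not is_anomaly:
            self.streak = 0
            return "NORMAL"
        self.streak += 1
        if self.streak < self.confirm_after:
            return "DETECTING"
        # Persistent anomaly: log the healing event, then start over.
        self.healing_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "type": anomaly_type,
            "metrics": dict(metrics),
            "action": "restart_pod",
        })
        self.streak = 0
        return "ANOMALY CONFIRMED"
```

Requiring several consecutive anomalous polls keeps a single noisy sample from triggering a restart. The state names match the ones the dashboard's anomaly status panel displays.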
- Add `ws` (WebSocket) dependency
- Remove the erroneous `replicas` field inside `spec.template.spec`
Dark-mode, glassmorphism-style single-page dashboard with:
- Header: project name, live clock, cluster status badge
- Metric cards (5 cards): CPU, Memory, Request Rate, Latency, Pod Restarts — each with a sparkline mini-chart
- Main chart: rolling 60-second line chart of all metrics (Chart.js)
- Anomaly status panel: shows NORMAL / DETECTING / ANOMALY CONFIRMED with animated indicator
- Healing Log table: timestamp, anomaly type, metrics at detection, action taken
- Node map: 3 animated node cards showing health status (Healthy / Degraded / Restarting)
- WebSocket client that connects back to Demo-Service for live updates
- Dark theme (`#0d0f14` background), glassmorphism cards
- Gradient accent colors (cyan/purple)
- Smooth CSS animations for state transitions
- Google Fonts (Inter)
- WebSocket client managing reconnects
- Chart.js setup for all charts
- DOM update functions for metrics, anomaly state, healing log
- Add proper entries: `node_modules/`, `*.pkl`, `__pycache__/`, `.env`
- Pin all Python dependencies
- ML model smoke test (already exists as `test_model.py`):

      cd ML-Model
      python test_model.py

  Expected: prints prediction `value_counts()` with some `-1` anomalies present.

- FastAPI `/predict` test, run locally:

      cd ML-Model
      uvicorn ml_service:app --host 0.0.0.0 --port 8000
      # In another terminal:
      curl -X POST http://localhost:8000/predict \
        -H "Content-Type: application/json" \
        -d '{"cpu_usage": 95, "memory_usage": 92, "request_rate": 10, "latency": 5000, "pod_restarts": 8, "disk_io": 95, "network_errors": 50, "error_rate": 0.9}'

  Expected: `{"anomaly": true}` or `{"anomaly": false}`.

- Demo-Service API test, run locally:

      cd Demo-Service
      npm install
      node index.js
      # In another terminal:
      curl http://localhost:3000/api/metrics
      curl http://localhost:3000/api/events

  Expected: JSON metric snapshots and (initially empty) event array.

- Dashboard visual check: open `http://localhost:3000` in a browser and verify that the dark dashboard loads, metric values update every 5 seconds, and charts animate smoothly.
- Anomaly simulation: hit `http://localhost:3000/stress` in a separate tab and observe the dashboard's anomaly indicator change from NORMAL → DETECTING → ANOMALY CONFIRMED within ~50 seconds, and a new row appear in the Healing Log.
- WebSocket live feed: open browser DevTools → Network → WS and verify that the WebSocket connection is active and receives JSON frames every 5 seconds.
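The `/predict` check above can also be scripted rather than run by hand with `curl`. The sketch below uses only the Python standard library. The helper names (`build_predict_request`, `is_valid_predict_response`, `run_check`) are hypothetical. The URL and payload are the ones from the manual step, and the response shape is assumed to be exactly `{"anomaly": <bool>}` as stated in the expected output.

```python
import json
import urllib.request

# Same payload as the manual curl command.
SAMPLE = {
    "cpu_usage": 95, "memory_usage": 92, "request_rate": 10,
    "latency": 5000, "pod_restarts": 8, "disk_io": 95,
    "network_errors": 50, "error_rate": 0.9,
}


def build_predict_request(payload, base_url="http://localhost:8000"):
    """Build (but do not send) the POST request for /predict."""
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/predict",
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def is_valid_predict_response(body):
    """True if body is a JSON object with a boolean 'anomaly' key."""
    try:
        parsed = json.loads(body)
    except ValueError:
        return False
    return isinstance(parsed, dict) and isinstance(parsed.get("anomaly"), bool)


def run_check(base_url="http://localhost:8000"):
    """Send the sample payload; requires the ML service to be running."""
    request = build_predict_request(SAMPLE, base_url)
    with urllib.request.urlopen(request, timeout=5) as response:
        return is_valid_predict_response(response.read().decode("utf-8"))
```

With `uvicorn` running as described, calling `run_check()` should return `True` for either prediction. It raises a `urllib.error.URLError` if the service is not reachable.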