This project simulates the next layer of enterprise data and AI platforms: autonomous operations.
A traditional observability platform asks: "What is broken?"
This project asks: "What is broken, why did it happen, what is the blast radius, what should we do next, and how confident are we that the recovery action is safe?"
This project demonstrates autonomous data platform operations: turning fragmented signals into root-cause analysis, governed remediation decisions, and executive-ready incident intelligence.
Enterprise data platforms are becoming too complex for manual operations alone. Teams manage thousands of pipelines along with data products, AI systems, semantic metrics, and governance policies, and they must respond to SLA misses, model drift, schema drift, policy violations, hallucination alerts, and downstream business incidents.
The issue is no longer just monitoring. The challenge is decision-making.
Build a production-style, local autonomous data platform runtime that:
- ingests synthetic platform signals
- detects and correlates incidents
- estimates blast radius
- predicts probable root causes
- recommends remediation actions
- evaluates governance constraints
- scores recovery confidence
- simulates autonomous operators
- uses historical incident memory
- generates executive and operator briefings
flowchart LR
A["Synthetic Platform Signals"] --> B["Incident Triage"]
C["Synthetic Incidents"] --> B
D["Historical Incident Memory"] --> E["Root-Cause Engine"]
B --> F["Blast Radius Analysis"]
F --> E
E --> G["Remediation Recommender"]
H["Governance Action Policies"] --> G
G --> I["Autonomous Operators"]
I --> J["Recovery Confidence + Action History"]
J --> K["Executive / Operator Briefings"]
K --> L["DuckDB Runtime Warehouse"]
L --> M["FastAPI + Streamlit"]
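The Blast Radius Analysis stage in the diagram above can be pictured as a traversal of a dependency graph. The following is a minimal sketch, not the project's actual implementation: the `DOWNSTREAM` lineage map, the asset names, and the `blast_radius` function are all hypothetical and chosen only for illustration.

```python
from collections import deque

# Hypothetical lineage: each asset maps to the assets that consume it directly.
DOWNSTREAM = {
    "orders_pipeline": ["orders_table"],
    "orders_table": ["revenue_metric", "churn_model"],
    "revenue_metric": ["exec_dashboard"],
    "churn_model": [],
    "exec_dashboard": [],
}


def blast_radius(failed_asset, downstream):
    """Return every asset reachable downstream of failed_asset, with its hop distance."""
    distances = {}
    queue = deque([(failed_asset, 0)])
    while queue:
        asset, hops = queue.popleft()
        for consumer in downstream.get(asset, []):
            # Skip assets already visited so cycles in the lineage cannot loop forever.
            if consumer not in distances and consumer != failed_asset:
                distances[consumer] = hops + 1
                queue.append((consumer, hops + 1))
    return distances


impacted = blast_radius("orders_pipeline", DOWNSTREAM)
print(impacted)
```

Breadth-first search is used here so that each impacted asset is recorded with its shortest hop distance from the failure, which is one plausible input for ranking impact.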
flowchart TD
A["Generate Signals"] --> B["Generate Incidents"]
B --> C["Generate Historical Memory"]
C --> D["Normalize Signals"]
D --> E["Triage Incidents"]
E --> F["Calculate Blast Radius"]
F --> G["Predict Root Cause"]
G --> H["Recommend Remediation"]
H --> I["Enforce Action Policy"]
I --> J["Simulate Operators"]
J --> K["Forecast Stability"]
K --> L["Briefings + Scorecards"]
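The "Enforce Action Policy" step above gates a recommended remediation before any operator acts on it. As a rough sketch under stated assumptions, the gate could compare a recommendation's recovery confidence and blast radius against per-action limits. The `Recommendation` fields, the `ACTION_POLICY` thresholds, the action names, and the three decision labels are hypothetical and do not come from the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recommendation:
    incident_id: str
    action: str
    recovery_confidence: float  # 0.0 to 1.0, assumed to come from an earlier stage
    blast_radius: int           # number of impacted downstream assets


# Hypothetical policy: limits that must hold before an action may run autonomously.
ACTION_POLICY = {
    "rerun_pipeline": {"min_confidence": 0.70, "max_blast_radius": 10},
    "rollback_schema": {"min_confidence": 0.85, "max_blast_radius": 5},
}


def enforce_action_policy(rec, policy=ACTION_POLICY):
    """Return 'auto_execute', 'needs_approval', or 'blocked' for a recommendation."""
    rule = policy.get(rec.action)
    if rule is None:
        return "blocked"  # actions without a policy entry are never run autonomously
    within_limits = (
        rec.recovery_confidence >= rule["min_confidence"]
        and rec.blast_radius <= rule["max_blast_radius"]
    )
    return "auto_execute" if within_limits else "needs_approval"


decisions = [
    enforce_action_policy(Recommendation("INC-001", "rerun_pipeline", 0.82, 4)),
    enforce_action_policy(Recommendation("INC-002", "rollback_schema", 0.80, 3)),
    enforce_action_policy(Recommendation("INC-003", "drop_table", 0.99, 1)),
]
print(decisions)
```

Defaulting unknown actions to `blocked` is a deliberate deny-by-default choice in this sketch: a high confidence score alone never authorizes an action that the policy does not list.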
- blast_radius_analysis.json/csv
- root_cause_prediction_report.json/csv
- remediation_recommendations.csv
- autonomous_operator_actions.csv
- operator_decision_history.json
- platform_stability_forecast.json/csv
- autonomous_runtime_scorecard.json/csv
- platform_recovery_scorecard.json/csv
- executive_incident_briefings.md
- operator_incident_briefings.md
python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -r requirements.txt
python -m src.data_generation.generate_platform_signals
python -m src.data_generation.generate_incidents
python -m src.data_generation.generate_incident_memory
python -m src.pipeline.run_all
python -m pytest
python -m ruff check .
streamlit run src/dashboard/app.py
uvicorn src.api.main:app --reload
Endpoints include /health, /runtime-summary, /incidents, /blast-radius/{incident_id}, /root-cause/{incident_id}, /remediation/{incident_id}, /operator-actions, /platform-stability, /executive-briefings, /scorecards, /simulate-incident, /recommend-remediation, and /simulate-operator-action.
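For the per-incident endpoints, a small helper can fill in the `{incident_id}` path parameter safely. This sketch assumes the API is served at uvicorn's default address, `http://127.0.0.1:8000`. The `incident_url` helper and the `INC-001` identifier format are hypothetical, and the sketch only builds URLs without assuming anything about request methods or response shapes.

```python
from urllib.parse import quote, urljoin

# Assumption: uvicorn is serving on its default host and port.
BASE_URL = "http://127.0.0.1:8000"

# Per-incident path templates copied from the endpoint list above.
INCIDENT_ENDPOINTS = {
    "blast_radius": "/blast-radius/{incident_id}",
    "root_cause": "/root-cause/{incident_id}",
    "remediation": "/remediation/{incident_id}",
}


def incident_url(kind, incident_id, base_url=BASE_URL):
    """Build the URL for one per-incident endpoint, percent-encoding the ID."""
    # safe="" also encodes "/" so an ID can never inject extra path segments.
    path = INCIDENT_ENDPOINTS[kind].format(incident_id=quote(incident_id, safe=""))
    return urljoin(base_url, path)


url = incident_url("root_cause", "INC-001")  # "INC-001" is a made-up ID format
print(url)
```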
- Synthetic signals only
- Deterministic rules instead of live LLM agents
- Local DuckDB instead of an enterprise warehouse
- Simulated integrations instead of real platform APIs
- No cloud deployment
- No authentication
- No live pager/alerting integration
- No OpenLineage, MLflow, Datadog, or PagerDuty integration yet
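To make "deterministic rules instead of live LLM agents" concrete, here is a minimal sketch of rule-based root-cause ranking. It assumes each signal type casts weighted votes for candidate causes. The `RULES` table, the signal and cause names, the weights, and the `predict_root_cause` function are hypothetical and are not taken from the project's rule set.

```python
from collections import Counter

# Hypothetical rule table: each signal type votes for candidate causes with a weight.
RULES = {
    "schema_drift": {"upstream_schema_change": 0.6},
    "sla_miss": {"upstream_schema_change": 0.2, "resource_contention": 0.5},
    "null_rate_spike": {"upstream_schema_change": 0.3, "bad_source_data": 0.5},
}


def predict_root_cause(signal_types, rules=RULES):
    """Return (cause, normalized score) pairs, highest score first."""
    votes = Counter()
    for signal in signal_types:
        for cause, weight in rules.get(signal, {}).items():
            votes[cause] += weight
    total = sum(votes.values())
    if total == 0:
        return []  # no rule matched any of the observed signals
    # Sort by score descending, then by name, so ties resolve deterministically.
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(cause, round(score / total, 3)) for cause, score in ranked]


ranking = predict_root_cause(["schema_drift", "null_rate_spike", "sla_miss"])
print(ranking)
```

Because the same signals always produce the same ranking, output of this kind is reproducible and easy to test, which is the trade-off the limitation above describes.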
- LLM-assisted operator reasoning
- LangGraph/AutoGen/CrewAI operator workflow
- OpenLineage/Marquez integration
- MLflow model registry integration
- Datadog/Prometheus/Grafana ingestion
- PagerDuty/Slack alert routing
- Kafka streaming signal ingestion
- Airflow DAG remediation hooks
- Snowflake/Databricks deployment
- Open Policy Agent (OPA) action policies
Enterprise data and AI platforms generate fragmented alerts across pipelines, data quality, RAG, ML models, semantic metrics, and AI governance.
Build an autonomous runtime that converts signals into root-cause predictions, governed remediation recommendations, recovery confidence scores, and briefings.
Created synthetic platform signals, historical incident memory, failure patterns, incident triage, blast-radius analysis, root-cause prediction, remediation recommendations, governance policies, autonomous operator simulations, API endpoints, dashboards, tests, Docker, and CI/CD.
Produced a reproducible flagship portfolio project demonstrating autonomous data platform operations and systems-level AI infrastructure thinking.
V0.1: Working baseline.