Built for the Elastic AI Agent Hackathon 🏆
OpsGuardian is an autonomous Site Reliability Engineer (SRE) agent built on the Elastic Stack. It bridges the gap between Deterministic Metrics and Unstructured Knowledge.
Unlike generic chatbots that hallucinate numbers, OpsGuardian uses ES|QL for mathematical rigor and relevance-based retrieval to reduce Mean Time To Resolution (MTTR).
+----------------+        +-----------------------------------------------------------------+
|   User / SRE   |        |                    ELASTIC CLOUD SERVERLESS                     |
+-------+--------+        |                                                                 |
        |                 |  +-------------------+                 The Triad of Truth       |
        v                 |  |                   |            +--------------------------+  |
+-------+--------+        |  |                   +--1. Math-->| tool-calc-reliability    |  |
|  Client Layer  |        |  |    OpsGuardian    |            | (ES|QL Aggregations)     |  |
| (Claude/Kibana)+--MCP---+->|       Agent       |            +------------+-------------+  |
+----------------+        |  | (Reasoning Brain) |                         |                |
                          |  |                   |            +------------+-------------+  |
                          |  |                   +--2. Hist-->| tool-find-patterns       |  |
                          |  |                   |            | (Log Pattern Matching)   |  |
                          |  |                   |            +------------+-------------+  |
                          |  |                   |                         |                |
                          |  |                   |            +------------+-------------+  |
                          |  |                   +--3. Know-->| tool-search-sops         |  |
                          |  |                   |            | (Semantic/Vector Search) |  |
                          |  +-------------------+            +------------+-------------+  |
                          |                                                |                |
                          |                                                v                |
                          |                                   +--------------------------+  |
                          |                                   | Data Layer               |  |
                          |                                   | 1. ops-server-logs       |  |
                          |                                   | 2. sre-knowledge-base    |  |
                          |                                   +--------------------------+  |
                          +-----------------------------------------------------------------+
SREs suffer from "Dashboard Fatigue". When an incident occurs, they have to:
- Check Dashboards (Metrics)
- Search Logs (Patterns)
- Read Wikis (Knowledge)
OpsGuardian unifies this into a single cognitive loop.
OpsGuardian is designed with a strict reasoning framework:
- 📊 Mathematical Rigor (The Calculator)
  - Tech: ES|QL, EVAL, STATS
  - Function: Calculates real-time error rates directly in the database. It doesn't guess; it proves.
  - Tool: tool-calc-reliability
- 🔍 Pattern Recognition (The Historian)
  - Tech: ES|QL, match()
  - Function: Instantly correlates current incidents with historical log patterns to find "Patient Zero".
  - Tool: tool-find-patterns
- 📘 Automated Remediation (The Fixer)
  - Tech: ES|QL (Relevance Search)
  - Function: Retrieves the exact Standard Operating Procedure (SOP) to fix the issue.
  - Tool: tool-search-sops-semantic
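To make the Calculator's arithmetic concrete, here is a minimal sketch of the kind of ES|QL query a tool like tool-calc-reliability might run, next to a plain-Python version of the same computation. The index name is taken from the architecture diagram; the numeric `status_code` field, the 5xx threshold, and both function names are assumptions for illustration and may not match the project's actual tool definition.

```python
from typing import Iterable, Mapping


def build_error_rate_query(index: str = "ops-server-logs",
                           status_field: str = "status_code") -> str:
    """Return an ES|QL query that computes the share of 5xx responses.

    `status_field` is a hypothetical numeric field; adjust it to your mapping.
    """
    return (
        f"FROM {index}\n"
        f"| EVAL is_error = CASE({status_field} >= 500, 1, 0)\n"
        "| STATS total = COUNT(*), errors = SUM(is_error)\n"
        "| EVAL error_rate_pct = ROUND(100.0 * errors / total, 2)"
    )


def error_rate_pct(records: Iterable[Mapping[str, int]],
                   status_field: str = "status_code") -> float:
    """Local equivalent of the query above, useful for sanity checks."""
    total = 0
    errors = 0
    for record in records:
        total += 1
        if record[status_field] >= 500:
            errors += 1
    if total == 0:
        return 0.0  # avoid division by zero on an empty window
    return round(100.0 * errors / total, 2)


if __name__ == "__main__":
    # Hypothetical sample: two of four requests returned a 5xx status.
    sample = [{"status_code": 200}, {"status_code": 503},
              {"status_code": 200}, {"status_code": 500}]
    print(build_error_rate_query())
    print(error_rate_pct(sample))
```

Multiplying by `100.0` before dividing keeps the ES|QL expression in floating point, so the percentage is not truncated by integer division.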
You can deploy OpsGuardian in your own Elastic Cloud Serverless environment in 5 minutes.
- Elastic Cloud Serverless Project
- Kibana Sample Data: "Sample Web Logs" (Load this from the Kibana Home page)
Copy the content from data/knowledge_base_bulk.json and run it in Kibana Dev Tools.
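For orientation, the Elasticsearch Bulk API expects newline-delimited JSON: an action line followed by a document line, with a trailing newline at the end. The sketch below builds such a payload in Python. The `title` and `content` fields and the sample SOP are hypothetical and may not match the schema used in data/knowledge_base_bulk.json; the index name comes from the architecture diagram.

```python
import json
from typing import Iterable, Mapping


def build_bulk_body(index: str, docs: Iterable[Mapping[str, object]]) -> str:
    """Serialize docs into the NDJSON body the Bulk API expects."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(doc))                            # source line
    # The Bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sops = [  # hypothetical SOP document, for illustration only
        {"title": "Database connection timeout",
         "content": "Restart the connection pool, then check replica lag."},
    ]
    print(build_bulk_body("sre-knowledge-base", sops), end="")
```

In Dev Tools, a body of this shape would follow a `POST _bulk` line.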
Open src/tools/ and copy the JSON content of each tool. Run them as POST requests to the Agent Builder API in Kibana Dev Tools.
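The same registration could also be scripted outside Dev Tools. The sketch below assumes that the `kbn://` prefix maps to your Kibana base URL, that an API key with suitable privileges is available, and that Kibana expects the usual `kbn-xsrf` header on write requests. `KIBANA_URL`, `API_KEY`, and `build_kibana_post` are illustrative names, and the file path is assumed from the repository layout described here.

```python
import json
import urllib.request


def build_kibana_post(kibana_url: str, api_key: str, path: str,
                      payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request to a Kibana API path."""
    url = kibana_url.rstrip("/") + "/" + path.lstrip("/")
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"ApiKey {api_key}",
            "Content-Type": "application/json",
            "kbn-xsrf": "true",  # Kibana normally requires this on write requests
        },
    )


if __name__ == "__main__":
    # Placeholder values; replace with your own deployment details.
    KIBANA_URL = "https://example.kb.invalid"
    API_KEY = "<your-api-key>"
    # Path assumed from the repo layout (src/tools/ plus the tool file name).
    with open("src/tools/tool_calc_reliability.json", encoding="utf-8") as fh:
        tool = json.load(fh)
    req = build_kibana_post(KIBANA_URL, API_KEY, "api/agent_builder/tools", tool)
    # Sending is left commented out so nothing is created by accident:
    # with urllib.request.urlopen(req) as resp:
    #     print(resp.status, resp.read().decode("utf-8"))
    print(req.get_method(), req.full_url)
```

The same helper would apply to `api/agent_builder/agents` and `api/agent_builder/converse` by changing the path and payload.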
Example (Kibana Dev Tools):
POST kbn://api/agent_builder/tools
// ... paste content of tool_calc_reliability.json ...
Copy the content from src/agent.json and run it:
POST kbn://api/agent_builder/agents
// ... paste content of agent.json ...
Go to the Agent Builder Playground or use the Converse API:
POST kbn://api/agent_builder/converse
{
  "agent_id": "ops-guardian-v3",
  "input": "I see high error rates. Investigate and tell me how to fix."
}
- Upgrade to ELSER: Migrate the knowledge retrieval tool to use text_expansion for true vector search (currently using ES|QL relevance matching for broader compatibility).
- MCP Integration: Expose OpsGuardian as a Model Context Protocol server for IDE integration.
MIT