KubeCure

An AI-native, autonomic self-healing engine for Kubernetes.

What is KubeCure?

KubeCure extends the Kubernetes control plane to autonomously detect, diagnose, and remediate cluster failures using LLM-driven GitOps workflows. Think of it as an AI-powered SRE that never sleeps: it continuously watches your cluster, interprets failures in context, and proposes fixes via Pull Requests.

The Problem

Modern Kubernetes clusters fail in complex, unpredictable ways:

CrashLoopBackOff from misconfigured environment variables
OOMKilled due to insufficient resource limits
ImagePullBackOff from typos in image tags
Application crashes buried in cryptic log traces

Engineers spend hours context-switching between logs, YAML manifests, and cluster events to diagnose issues that often have simple fixes. That diagnosis time inflates Mean Time To Recovery (MTTR), and reducing it is where KubeCure steps in.

The Solution

KubeCure acts as an intelligent intermediary between your failing workloads and your GitOps repository:

                              KubeCure Architecture

   +--------------+          +--------------+          +--------------+
   |  Kubernetes  |  watch   |   KubeCure   |  reason  |  Gemini AI   |
   |   Cluster    | -------> |  Controller  | -------> |    (LLM)     |
   |              |          |              | <------- |              |
   +--------------+          +--------------+   fix    +--------------+
          |                         |
          | events, logs            | PR / Issue
          | manifests               |
          v                         v
   +--------------+          +--------------+
   |   Failing    |          |    GitHub    |
   |     Pod      |          |  Repository  |
   +--------------+          +--------------+

Scope & Design Philosophy

Cluster-Wide Watching, Pod-Scoped Diagnosis

KubeCure watches the entire cluster for pod failures via Kubernetes informers, but diagnoses failures at the single-pod level by aggregating context from that pod's logs, events, and manifests.

The Domino Effect Problem

A key challenge in Kubernetes diagnosis: how does the LLM know if the issue is this container, or a cascading failure from another pod?

Consider this scenario:

1. A Redis pod OOMs and dies
2. An API pod fails readiness probes (can't reach Redis)
3. A frontend pod crashes with connection errors (can't reach API)

With only the frontend pod's context, an LLM might suggest "increase the timeout" and completely miss that Redis is the root cause.

Phased Approach

| Phase | Scope | Focus |
| --- | --- | --- |
| V1 (POC) | Intra-pod | Single-pod failures with clear error signals (`CrashLoopBackOff`, `OOMKilled`, `ImagePullBackOff`, config errors) |
| V2 | Inter-pod | Cluster-aware diagnosis with dependency graphs for cascading failures |

V1 targets failures where all diagnostic information lives within the pod's scope; these are self-contained and demonstrable. V2 will extend to multi-pod correlation, where understanding service dependencies becomes essential.
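To make the V2 idea concrete, here is a minimal sketch of root-cause search over a dependency graph. It is not part of the current codebase: the `FindRootCauses` function, the `dependsOn` map, and the `failing` set are hypothetical names, and the graph is assumed to be supplied by some other component. The rule it applies is simple: a failing workload is a root-cause candidate only if none of its own dependencies are failing.

```go
package main

import "fmt"

// FindRootCauses walks the dependency graph starting at a failing workload
// and returns the failing workloads that have no failing dependencies.
// dependsOn maps a workload to the workloads it calls (hypothetical input).
func FindRootCauses(start string, dependsOn map[string][]string, failing map[string]bool) []string {
	var roots []string
	visited := map[string]bool{}

	var visit func(name string)
	visit = func(name string) {
		// Skip healthy workloads and anything already explored (guards against cycles).
		if visited[name] || !failing[name] {
			return
		}
		visited[name] = true

		failingDeps := 0
		for _, dep := range dependsOn[name] {
			if failing[dep] {
				failingDeps++
				visit(dep)
			}
		}
		// No failing dependency below this workload: it is a root-cause candidate.
		if failingDeps == 0 {
			roots = append(roots, name)
		}
	}

	visit(start)
	return roots
}

func main() {
	// The scenario described above: frontend -> api -> redis, all failing.
	dependsOn := map[string][]string{
		"frontend": {"api"},
		"api":      {"redis"},
	}
	failing := map[string]bool{"frontend": true, "api": true, "redis": true}

	fmt.Println(FindRootCauses("frontend", dependsOn, failing)) // prints [redis]
}
```

With this rule the frontend's connection errors are traced back to Redis instead of being diagnosed in isolation. If the failing workloads form a cycle, the sketch returns no candidates, which a real implementation would need to handle explicitly.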

How It Works

KubeCure operates as a Kubernetes Operator using the standard reconciliation loop pattern:

1. Observe — The Sensor Layer

The controller watches the Kubernetes API for Pod and Event resources, filtering for failure states such as CrashLoopBackOff, ImagePullBackOff, and OOMKilled (the full list of detected types is under Current Status).
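The filtering step can be sketched as a pure function over a pod's status. The types below are simplified stand-ins written for this example, not the project's code; a real controller would read the same information from `corev1.PodStatus` in `k8s.io/api/core/v1`. The function name `DetectFailure` and the priority order (OOM first, then waiting reasons, then a generic `Error` termination) are assumptions, and the `Unknown` type from the status list is left out.

```go
package main

import "fmt"

// ContainerState is a simplified stand-in for corev1.ContainerState.
type ContainerState struct {
	WaitingReason    string // mirrors state.waiting.reason
	TerminatedReason string // mirrors state.terminated.reason
}

// ContainerStatus is a simplified stand-in for corev1.ContainerStatus.
type ContainerStatus struct {
	Name      string
	State     ContainerState
	LastState ContainerState // previous termination, if the container restarted
}

// PodStatus is a simplified stand-in for corev1.PodStatus.
type PodStatus struct {
	Reason            string // pod-level reason, e.g. "Evicted"
	ContainerStatuses []ContainerStatus
}

// waitingFailures lists the waiting reasons this sketch treats as failures.
var waitingFailures = map[string]bool{
	"CrashLoopBackOff":           true,
	"ImagePullBackOff":           true,
	"CreateContainerConfigError": true,
	"RunContainerError":          true,
}

// DetectFailure returns the first failure type found and true,
// or an empty string and false when no failure signal is present.
func DetectFailure(s PodStatus) (string, bool) {
	if s.Reason == "Evicted" {
		return "Evicted", true
	}
	for _, cs := range s.ContainerStatuses {
		// An OOM kill is more specific than the CrashLoopBackOff that often follows it.
		if cs.State.TerminatedReason == "OOMKilled" || cs.LastState.TerminatedReason == "OOMKilled" {
			return "OOMKilled", true
		}
		if waitingFailures[cs.State.WaitingReason] {
			return cs.State.WaitingReason, true
		}
		if cs.State.TerminatedReason == "Error" {
			return "Error", true
		}
	}
	return "", false
}

func main() {
	status := PodStatus{ContainerStatuses: []ContainerStatus{
		{Name: "broken", State: ContainerState{WaitingReason: "ImagePullBackOff"}},
	}}
	if failure, ok := DetectFailure(status); ok {
		fmt.Println("failure detected:", failure)
	}
}
```

Keeping detection free of API calls makes it easy to unit test; the reconciler only has to fetch the pod and pass its status in.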

2. Aggregate — Context Collection

Upon detecting a failure, KubeCure gathers diagnostic context:

| Context Type | Description |
| --- | --- |
| Live Logs | Recent lines of `stdout/stderr` from the failing container |
| Manifests | Current YAML configuration (env vars, resource limits, image tags) |
| Events | Relevant Warning events recorded for the pod (emitted by the kubelet, scheduler, and controllers) |

3. Reason — The AI Brain

The aggregated context is sent to Gemini AI with a structured prompt. The LLM returns a diagnosis that includes a root cause analysis, a suggested fix, and a confidence score.
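If the model is asked to answer in JSON, the response can be decoded and validated before anything acts on it. The sketch below assumes a response format that this README does not specify: the `Diagnosis` type, the JSON keys `root_cause`, `suggested_fix`, and `confidence`, and the 0 to 100 confidence scale (inferred from the 80 threshold used for remediation) are all hypothetical. No Gemini API call is shown.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Diagnosis is a hypothetical shape for the LLM's structured answer.
type Diagnosis struct {
	RootCause    string `json:"root_cause"`
	SuggestedFix string `json:"suggested_fix"`
	Confidence   int    `json:"confidence"` // assumed to be on a 0-100 scale
}

// ParseDiagnosis decodes the raw model output and rejects incomplete
// or out-of-range answers so they never reach the remediation step.
func ParseDiagnosis(raw []byte) (Diagnosis, error) {
	var d Diagnosis
	if err := json.Unmarshal(raw, &d); err != nil {
		return Diagnosis{}, fmt.Errorf("decode diagnosis: %w", err)
	}
	if d.RootCause == "" {
		return Diagnosis{}, errors.New("diagnosis has no root cause")
	}
	if d.Confidence < 0 || d.Confidence > 100 {
		return Diagnosis{}, fmt.Errorf("confidence %d is outside 0-100", d.Confidence)
	}
	return d, nil
}

func main() {
	raw := []byte(`{"root_cause":"image tag does not exist","suggested_fix":"use a published nginx tag","confidence":92}`)
	d, err := ParseDiagnosis(raw)
	if err != nil {
		fmt.Println("rejected:", err)
		return
	}
	fmt.Printf("%s (confidence %d)\n", d.RootCause, d.Confidence)
}
```

Treating the model output as untrusted input means a malformed or overconfident answer fails closed instead of producing a Pull Request.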

4. Remediate — GitOps Integration

Based on the confidence score:

| Confidence | Action |
| --- | --- |
| High (>=80) | Create a Pull Request with the fix to the source repository |
| Low (<80) | Open a GitHub Issue with the diagnostic report for human review |
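The routing rule in the table reduces to a single comparison. A minimal sketch follows; the `Action` type, its constant names, and `ChooseAction` are hypothetical, and only the threshold of 80 comes from the table.

```go
package main

import "fmt"

// Action names the remediation path taken for a diagnosis.
type Action string

const (
	ActionPullRequest Action = "pull_request" // propose the fix as a PR
	ActionIssue       Action = "issue"        // hand the report to a human
)

// confidenceThreshold is the cut-off from the table above.
const confidenceThreshold = 80

// ChooseAction maps a confidence score to a remediation path.
func ChooseAction(confidence int) Action {
	if confidence >= confidenceThreshold {
		return ActionPullRequest
	}
	return ActionIssue
}

func main() {
	for _, c := range []int{95, 80, 79} {
		fmt.Printf("confidence %d -> %s\n", c, ChooseAction(c))
	}
}
```

Either path leaves the final decision with a human: a PR still has to be reviewed and merged, and an Issue only reports.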

5. Observe — Telemetry

All actions are instrumented and exported to Prometheus/Grafana for observability.

Planned Architecture

kubecure/
├── cmd/                    # Application entrypoints
├── internal/               # Private application code
│   ├── controller/         # Reconciliation logic
│   ├── detector/           # Failure detection
│   ├── aggregator/         # Context collection
│   ├── ai/                 # LLM integration
│   └── remediation/        # GitOps handlers
├── pkg/                    # Shared libraries
├── api/                    # CRD definitions
├── config/                 # Kubernetes manifests
├── terraform/              # Infrastructure as Code
└── web/                    # Frontend dashboard

Design Principles

Clean Architecture: Decoupled layers with dependency injection
Interface-Driven AI: Swappable LLM providers (Gemini, GPT, Claude)
Idempotent Reconciliation: Safe to run repeatedly without duplicating side effects (for example, opening the same Pull Request twice)
Observability-First: Structured logging with correlation IDs
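As an illustration of the idempotency principle, the sketch below deduplicates remediation by a fingerprint of the failure. It is an assumption about how this could work, not the project's design: the `Remediator` type, the `fingerprint` function, and the choice of key fields are hypothetical, and the in-memory map stands in for a shared store such as the Redis instance listed in the tech stack.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Remediator remembers which failures have already been acted on.
// The map is an in-memory stand-in for a shared state store.
type Remediator struct {
	seen   map[string]bool
	opened int // number of PRs/Issues this sketch pretends to open
}

// fingerprint derives a stable key for one failure of one pod.
func fingerprint(namespace, pod, failureType string) string {
	sum := sha256.Sum256([]byte(namespace + "/" + pod + "/" + failureType))
	return hex.EncodeToString(sum[:8])
}

// Remediate acts on a failure at most once and reports whether it acted.
func (r *Remediator) Remediate(namespace, pod, failureType string) bool {
	key := fingerprint(namespace, pod, failureType)
	if r.seen[key] {
		return false // already handled: repeated reconciles are no-ops
	}
	r.seen[key] = true
	r.opened++ // a real handler would create the PR or Issue here
	return true
}

func main() {
	r := &Remediator{seen: map[string]bool{}}
	fmt.Println(r.Remediate("default", "broken-pod", "ImagePullBackOff")) // prints true
	fmt.Println(r.Remediate("default", "broken-pod", "ImagePullBackOff")) // prints false
	fmt.Println("opened:", r.opened)                                      // prints opened: 1
}
```

Pods created by a Deployment get a new name on each restart, so a production key would probably use the owning workload instead of the pod name.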

Tech Stack

Layer	Technology
Backend	Go, `operator-sdk`, `controller-runtime`
AI Engine	Google Gemini API
Infrastructure	AWS EKS, Terraform
State	Redis
GitOps	GitHub REST API
Frontend	React, TypeScript, Framer Motion, Tailwind CSS
Observability	Prometheus, Grafana

Current Status

Completed

Environment Setup: Go, kubectl, kind, operator-sdk, Terraform installed
Project Scaffolding: Operator structure generated with operator-sdk init
Failure Detection: Pod controller watching for 8 failure types:
- CrashLoopBackOff, ImagePullBackOff, OOMKilled
- CreateContainerConfigError, RunContainerError, Evicted
- Error, Unknown

In Progress

Context Aggregation (logs, events, manifests)

Upcoming

Gemini AI integration for diagnosis
GitHub PR/Issue creation
Prometheus metrics
React dashboard
Terraform EKS deployment

Getting Started

Prerequisites

Go 1.22+
Docker
kubectl
kind
operator-sdk

Local Development

# Create local Kubernetes cluster
kind create cluster --name kubecure-dev

# Run the operator locally
make run

Testing Failure Detection

# Create a pod with a bad image tag
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: broken-pod
  namespace: default
spec:
  containers:
  - name: broken
    image: nginx:doesnotexist
EOF

# Watch operator logs for failure detection
# You should see: Pod failure detected pod=broken-pod failureType=ImagePullBackOff

Documentation

See docs/LEARNING_GUIDE.md for detailed explanations of all concepts, technologies, and design decisions.
