KubeLLM is an LLM-based multi-agent framework that manages your kubernetes clusters all on its own. KubeLLM takes in ONE formatted prompt and it will automatically diagnose and apply fixes to Kubernetes configuration issues.
- Dr. Palden Lama - palden.lama@utsa.edu - (Current Contributor)
- William Clifford - william.clifford@utsa.edu - (Past Contributor)
- Aaron Perez - aaron.perez@utsa.edu - (Past Contributor)
- Mario De Jesus - mario.dejesus@utsa.edu - (Past Contributor)
(Coming Soon)
-
Navigate to KubeLLM directory and install software dependencies as follows: pip install -r requirements.txt
-
Make sure Kubernetes (MiniKube) is up and running.
-
Start the PgVector database using the following command: docker run -d \
-e POSTGRES_DB=ai \
-e POSTGRES_USER=ai \
-e POSTGRES_PASSWORD=ai \
-e PGDATA=/var/lib/postgresql/data/pgdata \
-v pgvolume:/var/lib/postgresql/data \
-p 5532:5432 \
--name pgvector \
phidata/pgvector:16 -
Start the Knowledge Agent with RAG capability by running the bash start_apiserver.sh.
-
Once you have the Knowledge Agent running in the background or another terminal, change directory to debug_assistant_latest.
-
Optional: if you need to run a single test case only python3 main.py ~/KubeLLM/debug_assistant_latest/troubleshooting/TEST_CASE_NAME/config_step.json.
-
You may need to update config to contain the right paths. (Note : This will be updated in a future update)
-
Finally, just sit back and let KubeLLM do all of the work. Make sure to teardown the environment after each individual test case run. python3 teardownenv.py TEST_CASE_NAME
Simply navigate to the kube_test.py file in debug_assistant_latest folder and run the test.
python3 kube_test.py
Currently our approach uses two agents, one for knowledge and one that takes corrective actions recommended by the knowledge agent. The knowledge agent uses a pgvector database and Retrieval-Augmented Generation (RAG) technique to store and retrieve relevant knowledge, which primarily consists of Kubernetes documentation.
- Our approach is currently based off this graph here Kubernetes Troubleshooting Graph
If you use KubeLLM in your work, please cite the following paper:
@inproceedings{de2025llm,
author = {Mario De Jesus and Perfect Sylvester and William Clifford and Aaron Perez and Palden Lama},
title = {LLM-Based Multi-Agent Framework for Troubleshooting Distributed Systems},
booktitle = {Proceedings of the IEEE Cloud Summit},
year = {2025}
}