Releases: defilantech/LLMKube
v0.7.0
0.7.0 (2026-04-18)
⚠ BREAKING CHANGES
- sharding: sharding.strategy: tensor on a Model now correctly maps to llama.cpp's --split-mode row instead of silently falling back to --split-mode layer. Configs that set strategy: tensor expecting layer behavior may see performance regressions or new failure modes under concurrent load (particularly on consumer PCIe multi-GPU setups with quantized models). Explicitly set strategy: layer to retain the previous behavior. (#291)
- vllm: InferenceService spec.extraArgs is now forwarded to the vLLM runtime. Previously, extraArgs was silently ignored when runtime: vllm. Configs that placed llama.cpp-only flags in extraArgs on a vLLM InferenceService will now fail at pod startup. Audit any vLLM InferenceService that sets extraArgs before upgrading. (#291)
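For configs that relied on the old fallback, pinning the split mode explicitly is the safe migration path. A minimal sketch of a Model manifest follows; only the sharding.strategy field is confirmed by these notes, and the apiVersion, kind, and surrounding structure are assumptions for illustration:

```yaml
# Hypothetical Model manifest illustrating the 0.7.0 migration.
# Only sharding.strategy is confirmed by the release notes above.
apiVersion: llmkube.defilantech.com/v1alpha1  # assumed group/version
kind: Model
metadata:
  name: llama-70b-q4
spec:
  sharding:
    # Pre-0.7.0 behavior: keep layer splitting (--split-mode layer).
    # Switch to "tensor" only after validating --split-mode row under
    # your concurrent load on your PCIe multi-GPU topology.
    strategy: layer
```

Leaving strategy unset, or setting tensor and assuming layer semantics, is exactly the case this release changes underneath you.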
Features
- add hybrid GPU/CPU offloading support for MoE models (#281) (2287f66)
- add tensor overrides and batch size controls for hybrid offloading (#283) (8be4adc)
- expose additional runtime controls for llama.cpp and vllm (#291) (2245718)
- recognize runtime-resolved sources (HF repo IDs) in Model controller (#293) (953e8a7)
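The extraArgs forwarding change is easy to trip over when upgrading an existing vLLM InferenceService. A hedged sketch of the failure mode, where every field name except runtime and extraArgs is an assumption rather than something taken from these notes:

```yaml
# Hypothetical InferenceService; before 0.7.0, extraArgs was silently
# ignored for runtime: vllm, so stale llama.cpp flags went unnoticed.
apiVersion: llmkube.defilantech.com/v1alpha1  # assumed group/version
kind: InferenceService
metadata:
  name: qwen-vllm
spec:
  runtime: vllm
  extraArgs:
    - "--ctx-size=8192"         # llama.cpp-only flag: vLLM rejects this at pod startup
    # - "--max-model-len=8192"  # the vLLM equivalent to use instead
```

Grepping your manifests for extraArgs on runtime: vllm services before upgrading catches this class of failure ahead of time.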
llmkube-0.7.0
A Helm chart for LLMKube - Kubernetes operator for GPU-accelerated LLM inference
v0.6.0
0.6.0 (2026-04-08)
⚠ BREAKING CHANGES
- update default CUDA image to server-cuda13 for Qwen3.5 and Blackwell support (#262)
Features
- add first-class PersonaPlex (Moshi) runtime backend (#272) (2b1c948)
- add Grafana inference metrics dashboard (#269) (be376c6)
- add HPA autoscaling for InferenceService (#260) (2d16502)
- add pluggable runtime backends for non-llama.cpp inference engines (#271) (bb1576c)
- add vLLM and TGI runtime backends with per-runtime HPA metrics (#273) (441c7c7)
- separate image registry from repository in Helm chart (#268) (5c059a4)
- support custom layer splits from GPUShardingSpec (#267) (a37701c)
- update default CUDA image to server-cuda13 for Qwen3.5 and Blackwell support (#262) (cc9a95e)
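The registry/repository split from #268 typically means the image reference is now assembled from two values instead of one. A sketch of what a values override might look like for a private-registry deployment; the key names follow common Helm chart conventions and are assumptions, not taken from the chart itself:

```yaml
# Hypothetical values.yaml fragment after the #268 split.
image:
  registry: registry.example.com   # previously folded into repository
  repository: defilantech/llmkube
  tag: v0.6.0
```

Air-gapped and mirrored-registry setups benefit most here, since only the registry value changes while repository and tag stay aligned with upstream.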
llmkube-0.6.0
A Helm chart for LLMKube - Kubernetes operator for GPU-accelerated LLM inference
v0.5.3
llmkube-0.5.3
A Helm chart for LLMKube - Kubernetes operator for GPU-accelerated LLM inference
v0.5.2
llmkube-0.5.2
A Helm chart for LLMKube - Kubernetes operator for GPU-accelerated LLM inference
v0.5.1
llmkube-0.5.1
A Helm chart for LLMKube - Kubernetes operator for GPU-accelerated LLM inference