Releases: defilantech/LLMKube

v0.7.0

18 Apr 07:38
237a4d8

0.7.0 (2026-04-18)

⚠ BREAKING CHANGES

  • sharding: sharding.strategy: tensor on a Model now correctly maps to llama.cpp's --split-mode row instead of silently falling back to --split-mode layer. Configs that set strategy: tensor expecting layer behavior may see performance regressions or new failure modes under concurrent load (particularly on consumer PCIe multi-GPU setups with quantized models). Explicitly set strategy: layer to retain the previous behavior. (#291)
  • vllm: InferenceService spec.extraArgs is now forwarded to the vLLM runtime. Previously extraArgs was silently ignored when runtime: vllm. Configs that placed llama.cpp-only flags in extraArgs on a vLLM InferenceService will start failing at pod startup. Audit any vLLM InferenceService that sets extraArgs before upgrading. (#291)
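Both changes above affect manifests rather than code. As a sketch (the `apiVersion` and surrounding fields are assumptions, not taken from the release notes; only `sharding.strategy`, `runtime`, and `extraArgs` are named there), pinning the old split behavior and auditing vLLM flags might look like:

```yaml
# Sketch: pin the pre-0.7.0 behavior explicitly (strategy: tensor now
# really means --split-mode row). apiVersion is assumed.
apiVersion: llmkube.dev/v1alpha1
kind: Model
metadata:
  name: llama-70b
spec:
  sharding:
    strategy: layer   # explicit: keeps the old --split-mode layer behavior
---
# Sketch: extraArgs are now forwarded to vLLM, so llama.cpp-only flags
# here will fail at pod startup after upgrading.
apiVersion: llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: qwen-vllm
spec:
  runtime: vllm
  extraArgs:
    - "--max-model-len=8192"   # a vLLM flag: forwarded and accepted
    # - "--ctx-size=8192"      # llama.cpp-only flag: would now fail startup
```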

Features

  • add hybrid GPU/CPU offloading support for MoE models (#281) (2287f66)
  • add tensor overrides and batch size controls for hybrid offloading (#283) (8be4adc)
  • expose additional runtime controls for llama.cpp and vllm (#291) (2245718)
  • recognize runtime-resolved sources (HF repo IDs) in Model controller (#293) (953e8a7)
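The dedicated fields added by #283 for tensor overrides and batch size are not documented in these notes, so as a non-authoritative sketch the same hybrid MoE offload can be expressed through `extraArgs` with llama.cpp flags (all field names besides `extraArgs` are assumptions):

```yaml
# Sketch: keep MoE expert tensors on CPU while the rest runs on GPU,
# using llama.cpp flags via the extraArgs escape hatch. apiVersion assumed.
apiVersion: llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: mixtral-hybrid
spec:
  extraArgs:
    - "--override-tensor=exps=CPU"   # llama.cpp: route expert tensors to CPU
    - "--batch-size=512"             # llama.cpp logical batch size
```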

Bug Fixes

  • inherit runAsUser/runAsGroup from podSecurityContext (#274) (72b9b5c)

Documentation

  • surface breaking behavior changes for 0.7.0 (#294) (e234a40)

llmkube-0.7.0

18 Apr 07:39
237a4d8

A Helm chart for LLMKube, a Kubernetes operator for GPU-accelerated LLM inference

v0.6.0

08 Apr 00:52
02a9242

0.6.0 (2026-04-08)

⚠ BREAKING CHANGES

  • update default CUDA image to server-cuda13 for Qwen3.5 and Blackwell support (#262)

Features

  • add first-class PersonaPlex (Moshi) runtime backend (#272) (2b1c948)
  • add Grafana inference metrics dashboard (#269) (be376c6)
  • add HPA autoscaling for InferenceService (#260) (2d16502)
  • add pluggable runtime backends for non-llama.cpp inference engines (#271) (bb1576c)
  • add vLLM and TGI runtime backends with per-runtime HPA metrics (#273) (441c7c7)
  • separate image registry from repository in Helm chart (#268) (5c059a4)
  • support custom layer splits from GPUShardingSpec (#267) (a37701c)
  • update default CUDA image to server-cuda13 for Qwen3.5 and Blackwell support (#262) (cc9a95e)

llmkube-0.6.0

08 Apr 00:52
02a9242

A Helm chart for LLMKube, a Kubernetes operator for GPU-accelerated LLM inference

v0.5.3

01 Apr 17:10
86f9bbe

0.5.3 (2026-04-01)

Features

  • add KV cache type configuration and extraArgs escape hatch (#256) (7a4b855)
  • add Ollama as runtime backend for Metal agent (#258) (6148b89)
  • add oMLX as alternative runtime backend for Metal agent (#257) (eaf9045)
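The extraArgs escape hatch from #256 passes flags straight to the runtime, which is useful for llama.cpp options not yet modeled in the CRD such as KV cache types. A minimal sketch, assuming the `apiVersion` and everything except `extraArgs` (the `--cache-type-k`/`--cache-type-v` flags are standard llama.cpp server options):

```yaml
# Sketch: quantize the llama.cpp KV cache via the extraArgs escape hatch.
apiVersion: llmkube.dev/v1alpha1   # assumed
kind: InferenceService
metadata:
  name: mistral-gguf
spec:
  extraArgs:
    - "--cache-type-k=q8_0"   # llama.cpp: quantized K cache
    - "--cache-type-v=q8_0"   # llama.cpp: quantized V cache
```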

llmkube-0.5.3

01 Apr 17:10
86f9bbe

A Helm chart for LLMKube, a Kubernetes operator for GPU-accelerated LLM inference

v0.5.2

28 Mar 02:36
eed8274

0.5.2 (2026-03-27)

Features

  • add pod security context defaults and CRD overrides (#239) (904432b)

llmkube-0.5.2

28 Mar 02:36
eed8274

A Helm chart for LLMKube, a Kubernetes operator for GPU-accelerated LLM inference

v0.5.1

16 Mar 06:48
4a22006

0.5.1 (2026-03-16)

Features

  • add memory pressure watchdog with runtime monitoring (#216) (5fa6d54)
  • add pvc:// model source and SHA256 integrity verification (#229) (1b94f5d)
  • auto-detect llama-server from Homebrew paths on macOS (#215) (a1e4302)
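Only the `pvc://` scheme and SHA256 verification come from #229; the exact spec field names are not shown in these notes. A hedged sketch of what a cluster-local model source with integrity checking might look like (all field names, the path, and the digest are illustrative assumptions):

```yaml
# Sketch: load a model from an in-cluster PVC and verify its digest.
apiVersion: llmkube.dev/v1alpha1   # assumed
kind: Model
metadata:
  name: phi-local
spec:
  source: pvc://models-pvc/phi-3-mini.gguf   # hypothetical PVC name and path
  sha256: "0000000000000000000000000000000000000000000000000000000000000000"  # placeholder digest
```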

Bug Fixes

  • controller metrics port declarations and ServiceMonitor consistency (#214) (296ec99)
  • correct CHANGELOG entry from 0.4.21 to 0.5.0 (#212) (f7f703a)
  • quote job-level if expression to fix YAML parsing in helm-chart workflow (8714b9f)

llmkube-0.5.1

16 Mar 06:48
4a22006

A Helm chart for LLMKube, a Kubernetes operator for GPU-accelerated LLM inference