Automation stack that discovers GPU hardware in a Deckhouse cluster, publishes
inventory objects, and prepares nodes for the GPU control plane. The repository
follows the same delivery patterns as the virtualization and ai-playground
modules: controller-runtime based controllers, module-sdk hooks, Helm templates
powered by deckhouse_lib_helm, and a werf build pipeline.
- Inventory controller that watches `Node` + `NodeFeature`, creates `GPUDevice`/`GPUNodeState` resources, and emits Prometheus metrics and Kubernetes events (see the resource sketch after this list).
- Hooks that normalise module settings, deliver a mandatory `NodeFeatureRule`, and expose module status conditions.
- Helm/werf packaging identical to other Deckhouse modules: images are built as artifacts (`controller`, `hooks`), and a bundle image stores rendered templates and OpenAPI schemas. The OpenAPI schemas document a small set of `ModuleConfig` switches (managed node label, device approval policy, scheduling hints, inventory resync interval, HA override).
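For orientation, here is a minimal sketch of the kind of inventory object the controller publishes. The `gpu.deckhouse.io/v1alpha1` group/version and all field names below are illustrative assumptions, not the module's actual CRD schema.

```yaml
# Illustrative only: group/version and fields are assumptions, not the real CRD.
apiVersion: gpu.deckhouse.io/v1alpha1
kind: GPUDevice
metadata:
  name: worker-1-gpu-0          # one object per discovered GPU (assumed naming)
spec:
  nodeName: worker-1            # node that exposes the device via NFD
status:
  model: NVIDIA A100-SXM4-80GB  # example values only
  pciAddress: "0000:3b:00.0"
  approved: false               # would reflect deviceApproval.mode: Manual
```
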
- Deckhouse cluster (>= 1.68) with `ClusterConfiguration` populated.
- Node Feature Discovery module (v0.17 or newer) must be enabled. The `module_status` hook marks the module as `PrerequisiteNotMet` and blocks deployment if `node-feature-discovery` is absent.
- Nodes that should be inventoried must expose GPU topology via NFD (the module ships a `NodeFeatureRule` automatically); an illustrative rule is shown below.
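For reference, NFD consumes rules of the following shape. The rule the module actually ships may differ; the label key below is an assumption used purely for illustration.

```yaml
# Illustrative NodeFeatureRule: the packaged rule and label key may differ.
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: gpu-discovery-example
spec:
  rules:
    - name: nvidia-gpu-present
      labels:
        gpu.deckhouse.io/present: "true"        # assumed label key
      matchFeatures:
        - feature: pci.device
          matchExpressions:
            vendor: {op: In, value: ["10de"]}   # NVIDIA PCI vendor ID
```
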
```bash
# Prepare build-time tools (golangci-lint, module-sdk wrapper, etc.)
make ensure-tools

# Build all module images (hooks, controller, bundle) locally
werf build

# Build and push images into a registry (adds readable tags like <image>-dev)
MODULES_MODULE_SOURCE=127.0.0.1:5001/gpu-control-plane MODULES_MODULE_TAG=dev make werf-build
```

The `werf-giterminism.yaml` file locks down the build context; compliance is checked by `make ensure-tools`.
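As a rough illustration of what such a lockdown looks like (the repository's actual file may allow a different set of exceptions), a werf-giterminism.yaml typically pins the config version and whitelists the environment variables used for tagging:

```yaml
# Sketch only: the real werf-giterminism.yaml in this repository may differ.
giterminismConfigVersion: 1
config:
  goTemplateRendering:
    # allow the build-time variables used by `make werf-build`
    allowEnvVariables:
      - MODULES_MODULE_SOURCE
      - MODULES_MODULE_TAG
```
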
1. Upload images/bundle to the registry used by your Deckhouse edition.

2. Register the module (EE-style) or push the bundle via the Deckhouse modules operator.

3. Enable the module with a `ModuleConfig` resource. Example with defaults:

   ```yaml
   apiVersion: deckhouse.io/v1alpha1
   kind: ModuleConfig
   metadata:
     name: gpu-control-plane
   spec:
     enabled: true
     settings:
       highAvailability: true  # force HA regardless of control-plane topology
       managedNodes:
         labelKey: gpu.deckhouse.io/enabled
         enabledByDefault: true
       deviceApproval:
         mode: Manual  # or Automatic / Selector
       scheduling:
         defaultStrategy: Spread
         topologyKey: topology.kubernetes.io/zone
       inventory:
         resyncPeriod: "30s"
     version: 1
   ```
Defaults (when fields are omitted) match the example above:
`managedNodes.labelKey=gpu.deckhouse.io/enabled`, `managedNodes.enabledByDefault=true`,
`deviceApproval.mode=Manual`, `scheduling.defaultStrategy=Spread`,
`scheduling.topologyKey=topology.kubernetes.io/zone`, `inventory.resyncPeriod=30s`.
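Because those values are the defaults, a minimal configuration achieves the same behaviour (assuming the `settings` block may be omitted entirely):

```yaml
# Minimal configuration: all settings fall back to the documented defaults.
apiVersion: deckhouse.io/v1alpha1
kind: ModuleConfig
metadata:
  name: gpu-control-plane
spec:
  enabled: true
  version: 1
```
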
When the module is enabled, the hook ensures that NFD is present and creates
the `NodeFeatureRule` required for GPU discovery.
4. Verify that the controller is running in the `d8-gpu-control-plane`
namespace and that `GPUDevice`/`GPUNodeState` objects appear for GPU
nodes.
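For example, a populated `GPUNodeState` might look roughly like the following; as with the earlier sketch, the group/version and status fields are assumptions rather than the published schema.

```yaml
# Illustrative only: the actual GPUNodeState schema is defined by the module's CRDs.
apiVersion: gpu.deckhouse.io/v1alpha1   # assumed group/version
kind: GPUNodeState
metadata:
  name: worker-1
status:
  gpuCount: 2                # example aggregate values only
  devices:
    - worker-1-gpu-0
    - worker-1-gpu-1
```
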
> ℹ️ The module currently focuses on inventory/diagnostics. Bootstrap
> controllers (GFD/DCGM launch, MIG configuration) will arrive in later
> stages.
## Development workflow
```bash
# Static analysis + formatting
make lint
# Unit tests (hooks + controller)
make test
# Full CI equivalent (lint + tests)
make verify
# Controller tests only
cd src/controller && go test ./... -cover
```

Additional developer documentation (EN and RU) is located in `docs/README.md`
and `docs/README.ru.md`.