Context
internal/controller/model_controller.go has a checkAcceleratorAvailability() method that always returns true with a TODO comment:
// TODO: implement actual GPU/Metal availability checking
func (r *ModelReconciler) checkAcceleratorAvailability(hardware *HardwareSpec) bool {
	if hardware == nil {
		return true
	}
	return true
}
Problem
status.acceleratorReady is always true regardless of whether the requested accelerator (CUDA, Metal, ROCm) is actually available on the target node. This can mislead users into thinking GPU acceleration is active when it isn't.
Proposed Solution
- For CUDA: Check if the nvidia.com/gpu resource is available on nodes (via node capacity or NVIDIA device plugin)
- For Metal: Check if the Metal agent is reachable
- For ROCm: Check if the amd.com/gpu resource is available
- Set status.acceleratorReady = false with a condition when unavailable
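The checks listed above could be sketched roughly as follows. This is a minimal illustration, not the controller's actual code. The names Accelerator, NodeResources, MetalProbe, checkAccelerator and hasResource are hypothetical and not taken from the repository. A real implementation would presumably read each node's allocatable resources from the Kubernetes API (for example node.Status.Allocatable) rather than from a plain map, and the returned reason string is meant as input for the status condition.

```go
package main

import "fmt"

// Accelerator names the accelerator a Model requests (hypothetical type).
type Accelerator string

const (
	AcceleratorCUDA  Accelerator = "cuda"
	AcceleratorMetal Accelerator = "metal"
	AcceleratorROCm  Accelerator = "rocm"
)

// NodeResources maps a resource name to the allocatable quantity on one node.
// Assumption: the caller fills this from the cluster's node list.
type NodeResources map[string]int64

// MetalProbe reports whether the Metal agent is reachable (hypothetical hook).
type MetalProbe func() bool

// checkAccelerator returns whether the requested accelerator is available,
// plus a human-readable reason that can be copied into a status condition.
func checkAccelerator(acc Accelerator, nodes []NodeResources, probe MetalProbe) (bool, string) {
	switch acc {
	case "":
		// Nothing requested, so there is nothing to be unavailable.
		return true, "no accelerator requested"
	case AcceleratorCUDA:
		return hasResource(nodes, "nvidia.com/gpu")
	case AcceleratorROCm:
		return hasResource(nodes, "amd.com/gpu")
	case AcceleratorMetal:
		if probe != nil && probe() {
			return true, "Metal agent reachable"
		}
		return false, "Metal agent unreachable"
	default:
		return false, fmt.Sprintf("unknown accelerator %q", acc)
	}
}

// hasResource reports whether at least one node advertises a positive
// quantity of the named resource.
func hasResource(nodes []NodeResources, name string) (bool, string) {
	for _, n := range nodes {
		if n[name] > 0 {
			return true, fmt.Sprintf("%s available", name)
		}
	}
	return false, fmt.Sprintf("no node advertises %s", name)
}
```

With a helper of this shape, the reconciler could set status.acceleratorReady from the boolean and use the reason as the condition message when the accelerator is unavailable.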
Location
internal/controller/model_controller.go:450