- If `make deploy` dies after "Initializing...", most likely `VM_TYPE=minikube` was not set.
- If `make deploy` gets stuck at "Installing helm/tiller...", most likely helm is not installed.
- GPU image when only CPU available:
Deploying model with manifest '<manifest>' and model files in '<folder>/'...
FAILED
Error 200: OK
- Manifest not in zip file:
FAILED
Error opening manifest file <folder>/<manifest>: open <folder>/<manifest>: no such file or directory
FAILED
Error reading manifest file.
- FfDL image names differ from DLaaS:
Deploying model with manifest '<manifest>' and model files in '<folder>/'...
FAILED
Error: tensorflow version 1.3-py3 not supported.
FAILED
Error 200: OK
Fix: change the TensorFlow version in the manifest to one of the versions listed at https://github.com/IBM/FfDL/blob/master/docs/user-guide.md#1-supported-deep-learning-frameworks (in this case `1.3.0-py3`).
- `ffdl-lcm`, `ffdl-restapi`, `ffdl-trainer`, `ffdl-trainingdata` and `ffdl-ui` pods show `ImagePullBackOff`: Check whether the Kubernetes secret `regcred` exists via `kubectl get secret | grep regcred`. If it does not (the output is empty), create it with `kubectl create secret docker-registry regcred --docker-server=${DOCKER_REPO} --docker-username=${DOCKER_REPO_USER} --docker-password=${DOCKER_REPO_PASS} --docker-email=unknown@docker.io -n ${NAMESPACE}`.
- If you start a job and the `lhelper` and `jobmonitor` pods reach the `Running` state but the corresponding `learner` remains stuck in `ContainerCreating`, take a look at `kubectl describe pod <learner-pod>`. The storage configuration in your manifest may be invalid; if so, the pod's events should point out the issues.
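
Several of the checks above can be run before `make deploy` rather than after it fails. The following is a minimal sketch, assuming a POSIX shell. The function names (`check_vm_type`, `check_tool`, `check_regcred`, `preflight`) are hypothetical helpers for illustration and are not part of FfDL.

```shell
#!/bin/sh
# Hypothetical pre-flight helpers; none of these functions ship with FfDL.

# Succeeds only if VM_TYPE is set to a non-empty value.
check_vm_type() {
    if [ -z "${VM_TYPE:-}" ]; then
        echo "VM_TYPE is not set (for minikube: export VM_TYPE=minikube)" >&2
        return 1
    fi
}

# Succeeds only if the command named in $1 is found on PATH.
check_tool() {
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "$1 not found on PATH" >&2
        return 1
    fi
}

# Succeeds only if the regcred secret exists in the namespace given as $1.
check_regcred() {
    kubectl get secret regcred -n "$1" >/dev/null 2>&1
}

# Runs the checks that need no cluster access, stopping at the first failure.
preflight() {
    check_vm_type && check_tool helm && check_tool kubectl
}
```

One possible use is `preflight && make deploy`. If pods later show `ImagePullBackOff`, running `check_regcred "${NAMESPACE}"` reports through its exit status whether the secret is missing.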