When running quickrun, openfe calls the function _get_gpu_info in openfe.utils.system_probe, which makes a subprocess call to nvidia-smi to populate a dict with GPU info. It handles the case where nvidia-smi is not found: the program proceeds and the sims run on CPUs as expected. But it doesn't handle the case where nvidia-smi exists but does not find any GPUs. In that case nvidia-smi prints the message "No devices were found" and exits with code 6. Some HPC setups have nvidia-smi available regardless of whether GPUs were requested in the job allocation, and on those systems this causes openfe to crash.
The fix is to add another except clause:
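except subprocess.CalledProcessError as e:
    # Exit code 6: nvidia-smi ran but found no devices, so fall back to CPU.
    if e.returncode == 6:
        logging.debug(
            "Error: no GPU available"
        )
        return {}
    # Any other nvidia-smi failure is unexpected, so let it propagate.
    raise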
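For context, here is a minimal self-contained sketch of how the two failure modes could be handled side by side. This is not the actual openfe code: probe_gpu_info, its run parameter, the NO_DEVICES_RETURN_CODE constant, and the particular nvidia-smi query fields are all hypothetical names chosen for illustration. The exit code 6 is the value observed in this report.

```python
import logging
import subprocess

logger = logging.getLogger(__name__)

# Exit status seen in this report when nvidia-smi runs but finds no devices.
NO_DEVICES_RETURN_CODE = 6


def probe_gpu_info(run=subprocess.check_output):
    """Hypothetical stand-in for _get_gpu_info; returns {} when no GPU is usable.

    `run` is injectable so the error paths can be exercised without a GPU.
    """
    cmd = ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"]
    try:
        output = run(cmd).decode("utf-8")
    except FileNotFoundError:
        # nvidia-smi is not installed: already handled, sims run on CPU.
        logger.debug("nvidia-smi not found; assuming no GPU")
        return {}
    except subprocess.CalledProcessError as e:
        if e.returncode == NO_DEVICES_RETURN_CODE:
            # nvidia-smi is installed but the job was allocated no GPUs.
            logger.debug("nvidia-smi found no devices; assuming no GPU")
            return {}
        # Unknown failure: re-raise rather than hide it.
        raise

    # One CSV line per GPU, e.g. "Tesla V100, 16384 MiB".
    gpus = {}
    for index, line in enumerate(output.strip().splitlines()):
        name, memory = (field.strip() for field in line.split(",", 1))
        gpus[index] = {"name": name, "memory.total": memory}
    return gpus
```

Passing a fake run callable that raises subprocess.CalledProcessError(6, "nvidia-smi") reproduces the HPC case described above without needing a node where nvidia-smi is installed but no GPU is allocated.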