System probe fails if nvidia-smi is available but it doesn't detect GPU #1178

@nffaruk

Description

When running quickrun, the function _get_gpu_info in openfe.utils.system_probe makes a subprocess call to nvidia-smi to populate a dict with GPU info. It handles the case where nvidia-smi is not found: the program proceeds and the simulations run on CPUs as expected. But it does not handle the case where nvidia-smi exists but finds no GPUs. In that case nvidia-smi prints the message "No devices were found" and exits with code 6. Some HPC setups have nvidia-smi available regardless of whether GPUs were requested in the job allocation, and this causes openfe to crash.

A possible fix is to add another except clause to _get_gpu_info:

except subprocess.CalledProcessError as e:
    if e.returncode == 6:
        logging.debug(
            "Error: no GPU available"
        )
    return {}
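
For illustration, here is a minimal self-contained sketch of how both failure modes could be handled in one place. It is not the actual openfe code: the function name probe_gpu_info, the injectable run parameter, the query fields, and the shape of the returned dict are all assumptions made so the example can be exercised without a GPU.

```python
import logging
import subprocess

logger = logging.getLogger(__name__)

# Assumed command for this sketch; the query openfe actually issues may differ.
NVIDIA_SMI_CMD = [
    "nvidia-smi",
    "--query-gpu=name,memory.total",
    "--format=csv,noheader",
]


def probe_gpu_info(run=subprocess.check_output):
    """Hypothetical stand-in for _get_gpu_info.

    Returns a dict keyed by GPU index, or an empty dict when no GPU
    can be detected for any of the reasons handled below.
    """
    try:
        output = run(NVIDIA_SMI_CMD, stderr=subprocess.STDOUT)
    except FileNotFoundError:
        # nvidia-smi is not installed on this machine.
        logger.debug("nvidia-smi not found; continuing without GPU info")
        return {}
    except subprocess.CalledProcessError as e:
        # nvidia-smi ran but exited non-zero, e.g. code 6 with
        # "No devices were found" when the job was allocated no GPUs.
        logger.debug(
            "nvidia-smi exited with code %s; continuing without GPU info",
            e.returncode,
        )
        return {}

    gpus = {}
    for index, line in enumerate(output.decode().strip().splitlines()):
        # Split on the last comma so a comma in the GPU name is preserved.
        name, memory = (field.strip() for field in line.rsplit(",", 1))
        gpus[index] = {"name": name, "memory.total": memory}
    return gpus
```

Passing a fake run callable that raises subprocess.CalledProcessError with return code 6 reproduces the HPC case described above without needing the real binary, which might be a convenient way to add a regression test.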
