Add CUDA toolkit check to rapids doctor #141
jayavenkatesh19 wants to merge 6 commits into rapidsai:main
Signed-off-by: Jaya Venkatesh <jjayabaskar@nvidia.com>
Overall this looks great.
I left a comment about trying to use cuda.core.system instead of pynvml. I'm not sure if it supports enough features for us, but if it does, we should use it.
I also notice the tests have a lot of mocking in them. Perhaps the dependency injection approach @mmccarty was exploring in #137 would help clean these up?
Also it looks like CI is failing because coverage has dropped below 95%.
def _extract_major_from_cuda_path(path: Path) -> int | None:
    """Extract CUDA major version from a path like /usr/local/cuda-12.4 or its version.txt."""
    match = re.search(r"cuda-(\d+)", str(path))
    if match:
        return int(match.group(1))
    version_file = path / "version.txt"
    if version_file.exists():
        match = re.search(r"(\d+)\.", version_file.read_text())
        if match:
            return int(match.group(1))
    return None
There may be situations where multiple CTKs are installed. In this case we need to check which one /usr/local/cuda is symlinked to, as that will be the active one.
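A minimal sketch of that symlink check, assuming the usual layout where /usr/local/cuda is a symlink to a versioned directory such as /usr/local/cuda-12.4. The helper name active_cuda_major is hypothetical and not part of this PR; it only uses pathlib and re from the standard library.

```python
import re
from pathlib import Path
from typing import Optional


def active_cuda_major(cuda_link: Path = Path("/usr/local/cuda")) -> Optional[int]:
    """Return the major version of the CTK that `cuda_link` resolves to.

    Hypothetical helper for illustration; not part of the PR.
    """
    if not cuda_link.exists():
        return None
    # resolve() follows every symlink in the chain (for example
    # cuda -> /etc/alternatives/cuda -> cuda-12.4) to the real directory.
    target = cuda_link.resolve()
    # Assumption: the real directory is named cuda-<major>.<minor>.
    match = re.search(r"cuda-(\d+)", target.name)
    return int(match.group(1)) if match else None
```

With cuda-11.8 and cuda-12.4 installed side by side, this reports the version the symlink points at rather than whichever directory happens to be found first. If the resolved directory name carries no version, the version.txt fallback in _extract_major_from_cuda_path would still be needed.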
pynvml.nvmlInit()
driver_major = pynvml.nvmlSystemGetCudaDriverVersion() // 1000
Could we use cuda.core.system.get_driver_version() here instead? If we can, it would be more future-proof.
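For context on the `// 1000` in the snippet under review: NVML reports the CUDA driver version as a single integer encoded as 1000 * major + 10 * minor, so 12040 means 12.4. The sketch below decodes that integer and applies the major-version comparison this PR describes. Both function names are hypothetical, and nothing here assumes what cuda.core.system.get_driver_version() returns.

```python
from typing import Tuple


def split_cuda_version(encoded: int) -> Tuple[int, int]:
    """Split a CUDA version integer (1000 * major + 10 * minor) into (major, minor)."""
    return encoded // 1000, (encoded % 1000) // 10


def toolkit_newer_than_driver(toolkit_major: int, driver_encoded: int) -> bool:
    """Return True when the toolkit major version exceeds the driver's.

    Hypothetical helper: only major versions are compared, matching the
    check described in this PR.
    """
    driver_major, _ = split_cuda_version(driver_encoded)
    return toolkit_major > driver_major
```

Keeping the decoding separate from the NVML call would also make it easy to swap the version source later, and it can be tested without mocking a GPU.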
Adds a new rapids doctor check that verifies that the CUDA toolkit (referred to as CTK from here on) is findable and version-compatible with the GPU driver.

These are the things the check does:

Library discoverability: Uses cuda-pathfinder to verify that CUDA libraries can be loaded at runtime. The CTK itself has many libraries, some of which are not necessary for every RAPIDS operation. For now, this check verifies that libcudart.so, libnvrtc.so and libnvvm.so can be found. These three were chosen because they are the most commonly used (cudart is required for all CUDA operations, while nvrtc and nvvm are used in JIT compilation). The list can be extended to other CTK libraries of interest later.

Toolkit vs driver version: Detects when the CTK major version is newer than the one the driver supports. The reverse case, an older CTK on a newer driver, passes because drivers are backward compatible. Version detection tries header parsing first (taken from Add CUDA toolkit major version check #140, thanks @jacobtomlinson) and falls back to cudaRuntimeGetVersion (snippet from @ncclementi's comment on the PR above) for conda/pip environments, as they do not ship dev headers.

System installation checks: When the CTK is not installed via conda/pip, it checks the /usr/local/cuda symlink and the CUDA_HOME/CUDA_PATH variables for version mismatches.

I based the order and the checks themselves on the load_nvidia_dynamic_lib documentation page for cuda-pathfinder, where the search order is specified as site-packages (pip) -> conda -> OS defaults -> CUDA_HOME.

One scenario which isn't covered by these tests is described in this comment. This check was originally only meant to test compatibility and discoverability between the CTK and the GPU driver, not whether the Python packages match the CTK. For pip packages, reading the suffixes seems like an easy enough way to do it, but I'm not sure how we would do that for conda packages.
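A rough sketch of the pip suffix idea, assuming the convention RAPIDS wheels follow of ending the distribution name with -cu<major> (for example cudf-cu12). The function names are hypothetical and not part of this PR. Packages that encode the CUDA version differently, such as cupy-cuda12x, are deliberately not matched here.

```python
import re
from importlib.metadata import distributions
from typing import Dict, Iterable, Optional

# Assumption: the CUDA major version is the trailing "-cu<digits>" of the name.
_CU_SUFFIX = re.compile(r"[-_]cu(\d+)$")


def cuda_major_from_dist_name(name: str) -> Optional[int]:
    """Return the CUDA major version encoded in a distribution name, if any."""
    match = _CU_SUFFIX.search(name.lower())
    return int(match.group(1)) if match else None


def installed_cuda_suffixes(names: Optional[Iterable[str]] = None) -> Dict[str, int]:
    """Map distribution names to the CUDA major version in their suffix.

    When `names` is omitted, the installed distributions are enumerated
    with importlib.metadata.
    """
    if names is None:
        names = [dist.metadata.get("Name", "") for dist in distributions()]
    found = {}
    for name in names:
        major = cuda_major_from_dist_name(name) if name else None
        if major is not None:
            found[name] = major
    return found
```

Any package whose suffix differs from the detected CTK major version could then be flagged as a mismatch. This does not address conda packages, where the CUDA version is not part of the package name.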