Add CUDA toolkit check to rapids doctor #141

Open
jayavenkatesh19 wants to merge 6 commits into rapidsai:main from jayavenkatesh19:feat/cuda-toolkit-check

@jayavenkatesh19 jayavenkatesh19 commented Mar 11, 2026

Adds a new rapids doctor check that verifies that the CUDA toolkit (referred to as CTK from here on) is findable and version-compatible with the GPU driver.

The check does the following:

  • Library discoverability: Uses cuda-pathfinder to verify that CUDA libraries can be loaded at runtime. The CTK itself has many libraries, some of which are not necessary for every RAPIDS operation. For now, this check verifies that libcudart.so, libnvrtc.so and libnvvm.so can be found. These three were chosen because they are the most commonly used (cudart is required for all CUDA operations, while nvrtc and nvvm are used in JIT compilation). The list can be extended with other CTK libraries of interest later; to keep the check universal, it currently covers only these three.

  • Toolkit vs driver version: Detects when the CTK major version is newer than the CUDA version the driver supports. An older CTK on a newer driver is fine, since drivers are backward compatible. Version detection tries header parsing first (taken from "Add CUDA toolkit major version check" #140, thanks @jacobtomlinson) and falls back to cudaRuntimeGetVersion (snippet from @ncclementi's comment on that PR) for conda/pip environments, as they do not ship dev headers.

  • System installation checks: When the CTK is not installed via conda/pip, the check inspects the /usr/local/cuda symlink and the CUDA_HOME/CUDA_PATH variables for version mismatches.
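As a rough illustration of the toolkit vs driver comparison described above, here is a minimal sketch. The helper names `cuda_major_from_header` and `toolkit_newer_than_driver` are hypothetical and are not taken from this PR. It assumes the header defines `CUDA_VERSION` as `1000 * major + 10 * minor` (for example `12040` for CUDA 12.4), which is the same encoding that cudaRuntimeGetVersion and pynvml's `nvmlSystemGetCudaDriverVersion` report.

```python
from __future__ import annotations

import re


def cuda_major_from_header(header_text: str) -> int | None:
    """Return the CUDA major version found in cuda.h contents, or None.

    Hypothetical helper: looks for a line such as `#define CUDA_VERSION 12040`.
    """
    match = re.search(
        r"^\s*#\s*define\s+CUDA_VERSION\s+(\d+)", header_text, re.MULTILINE
    )
    if match is None:
        return None
    # CUDA_VERSION is encoded as 1000 * major + 10 * minor.
    return int(match.group(1)) // 1000


def toolkit_newer_than_driver(toolkit_major: int, driver_major: int) -> bool:
    """True when the toolkit major version exceeds what the driver supports."""
    return toolkit_major > driver_major
```

Only the major versions are compared here, matching the scope described in the bullet; minor-version compatibility is deliberately left out of the sketch.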

I based the order and the checks themselves on the load_nvidia_dynamic_lib documentation page for cuda-pathfinder, where the search order is specified as site-packages (pip) -> conda -> OS defaults -> CUDA_HOME.

One scenario that isn't covered by these tests is described in this comment. This check was originally only meant to test compatibility and discoverability between the CTK and the GPU driver, not whether the Python packages match the CTK. For pip packages, reading the suffixes seems like an easy enough way to do it, but I'm not sure how we would do that for conda packages.
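For the pip case, reading the suffix could look something like the sketch below. It is not part of this PR; the helper name `cuda_major_from_pip_name` is hypothetical, and it assumes package names end in a `-cuNN` suffix such as `cudf-cu12`. Packages that use a different naming scheme would need their own handling.

```python
from __future__ import annotations

import re


def cuda_major_from_pip_name(package_name: str) -> int | None:
    """Return the CUDA major version encoded in a `-cuNN` suffix, or None.

    Hypothetical helper: accepts both `-` and `_` separators, since
    distribution names may be normalized either way.
    """
    match = re.search(r"[-_]cu(\d+)$", package_name.lower())
    return int(match.group(1)) if match else None
```

The result could then be compared against the detected CTK major version in the same way as the driver comparison.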

Signed-off-by: Jaya Venkatesh <jjayabaskar@nvidia.com>
@jayavenkatesh19 jayavenkatesh19 self-assigned this Mar 11, 2026
@jayavenkatesh19 jayavenkatesh19 requested review from a team as code owners March 11, 2026 23:52
@jayavenkatesh19 jayavenkatesh19 removed the request for review from msarahan March 11, 2026 23:57

@jacobtomlinson jacobtomlinson left a comment

Overall this looks great.

I left a comment about trying to use cuda.core.system instead of pynvml. I'm not sure if it supports enough features for us, but if we can we should.

I also notice the tests have a lot of mocking in them. Perhaps the dependency injection approach @mmccarty was exploring in #137 would help clean these up?

Also it looks like CI is failing because coverage has dropped below 95%.

Comment on lines +120 to +130
def _extract_major_from_cuda_path(path: Path) -> int | None:
    """Extract CUDA major version from a path like /usr/local/cuda-12.4 or its version.txt."""
    match = re.search(r"cuda-(\d+)", str(path))
    if match:
        return int(match.group(1))
    version_file = path / "version.txt"
    if version_file.exists():
        match = re.search(r"(\d+)\.", version_file.read_text())
        if match:
            return int(match.group(1))
    return None

There may be situations where multiple CTKs are installed. In this case we need to check which one /usr/local/cuda is symlinked to, as that will be the active one.
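One possible way to handle this, sketched under the assumption that the symlink ultimately points at a versioned directory such as /usr/local/cuda-12.4: resolve the link first and read the version from the resolved target. The function name `active_cuda_major` is hypothetical and is not part of the PR.

```python
from __future__ import annotations

import re
from pathlib import Path


def active_cuda_major(cuda_link: Path = Path("/usr/local/cuda")) -> int | None:
    """Return the major version of the toolkit that `cuda_link` resolves to.

    Path.resolve() follows the whole symlink chain, so an indirection such as
    cuda -> /etc/alternatives/cuda -> cuda-12.4 is handled as well.
    """
    target = cuda_link.resolve()
    # Only the final path component is inspected, e.g. "cuda-12.4".
    match = re.search(r"cuda-(\d+)", target.name)
    return int(match.group(1)) if match else None
```

If the resolved directory name carries no version, the function returns None, and a fallback such as the version.txt lookup in the snippet above would still be needed.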

Comment on lines +168 to +169
pynvml.nvmlInit()
driver_major = pynvml.nvmlSystemGetCudaDriverVersion() // 1000

Could we use cuda.core.system.get_driver_version() here instead? If we can, it would be more future-proof.
