Add CUDA toolkit major version check #140

jacobtomlinson wants to merge 2 commits into rapidsai:main
Conversation
```python
get_driver_cuda_major=_get_driver_cuda_major,
get_toolkit_cuda_major=_get_toolkit_cuda_major,
```
I went with a dependency injection approach here after chatting about it with @mmccarty to make testing easier.
I haven't refactored other checks to reuse this to keep this PR simpler, but we could do that in the future.
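A rough sketch of how that injection makes testing trivial (the function and parameter names mirror the keyword arguments in the diff above, but the body here is illustrative, not the PR's exact implementation) — a unit test can inject plain lambdas instead of touching a real driver or toolkit:

```python
def check_cuda_toolkit_major(*, get_driver_cuda_major, get_toolkit_cuda_major):
    """Compare driver and toolkit CUDA majors; both callables return int or None."""
    driver_major = get_driver_cuda_major()
    toolkit_major = get_toolkit_cuda_major()
    if driver_major is None or toolkit_major is None:
        return None  # cannot determine one side, skip the check
    if toolkit_major > driver_major:
        raise ValueError(
            f"CUDA toolkit major version ({toolkit_major}) is newer than what "
            f"the installed driver supports ({driver_major})."
        )
    return toolkit_major

# In a test, no real detection code runs:
check_cuda_toolkit_major(
    get_driver_cuda_major=lambda: 12,
    get_toolkit_cuda_major=lambda: 12,
)
```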
```python
version_file = Path(header_dir) / "cuda_runtime_version.h"
if not version_file.exists():
    return None
match = re.search(r"#define\s+CUDA_VERSION\s+(\d+)", version_file.read_text())
return int(match.group(1)) // 1000 if match else None
```
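Filled out as a self-contained sketch (the body is taken from the diff above; the function name and the `header_dir` parameter are illustrative), the header-parsing approach is easy to exercise against a temporary file:

```python
import re
from pathlib import Path

def get_toolkit_cuda_major(header_dir):
    """Parse CUDA_VERSION from cuda_runtime_version.h, e.g. 12080 -> 12."""
    version_file = Path(header_dir) / "cuda_runtime_version.h"
    if not version_file.exists():
        return None
    match = re.search(r"#define\s+CUDA_VERSION\s+(\d+)", version_file.read_text())
    return int(match.group(1)) // 1000 if match else None
```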
I'm curious if this is the best way to get the CUDA Toolkit version.
I was doing some digging to see if we could pull it from cudart via the Python API, since cudaRuntimeGetVersion exists, but I wasn't able to do something like

```python
from cuda import cudart
cudart.cudaRuntimeGetVersion()
```

But with the help of perplexity, I was able to get the version using ctypes and accessing libcudart. I don't know if it's cleaner though, but it would be something like this:
```python
import ctypes
from ctypes import byref, c_int

libcudart = ctypes.cdll.LoadLibrary("libcudart.so")  # conda cuda-cudart provides this
cudaRuntimeGetVersion = libcudart.cudaRuntimeGetVersion
cudaRuntimeGetVersion.argtypes = [ctypes.POINTER(c_int)]
cudaRuntimeGetVersion.restype = c_int

ver = c_int()
err = cudaRuntimeGetVersion(byref(ver))
if err != 0:
    raise RuntimeError(f"cudaRuntimeGetVersion failed with error code {err}")
ver_int = ver.value
major = ver_int // 1000
minor = (ver_int % 1000) // 10
print("CUDA runtime version:", ver_int, f"({major}.{minor})")
```

```python
f"CUDA toolkit major version ({toolkit_major}) is newer than what the installed driver supports "
f"({driver_major}). Update your NVIDIA driver to one that supports CUDA {toolkit_major} or "
f"downgrade your CUDA toolkit to CUDA {driver_major}."
```
I think we could improve these errors. It would be nice to detect how the CUDA Toolkit has been installed (system, conda, pip, etc.) and provide more nuanced advice for the user.
We can do that via Python. For example, I'm in a conda environment that has cudf and cuml, and you can access that info via:

```python
>>> from cuda import pathfinder
>>> loaded = pathfinder.load_nvidia_dynamic_lib("cudart")
>>> loaded.abs_path
'/raid/myuser/conda/envs/ray-cuml/lib/libcudart.so'
>>> loaded.found_via
'conda'
```

And in a different conda env that only has cuda-python but no cuda-runtime installed, I get this:

```python
>>> from cuda import pathfinder
>>> loaded = pathfinder.load_nvidia_dynamic_lib("cudart")
>>> loaded.abs_path
'/usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.13'
>>> loaded.found_via
'system-search'
```
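Building on that, the `found_via` value could drive the more nuanced advice mentioned above. A minimal sketch, assuming only the two `found_via` values shown in the sessions above (the message table and helper name are hypothetical, not from the PR):

```python
# Hypothetical mapping from cuda.pathfinder's found_via values to advice.
_TOOLKIT_ADVICE = {
    "conda": "update the CUDA toolkit conda packages in this environment",
    "system-search": "update the system CUDA installation (e.g. /usr/local/cuda)",
}

def toolkit_update_advice(found_via):
    """Return install-method-specific advice for upgrading the CUDA toolkit."""
    return _TOOLKIT_ADVICE.get(found_via, "update your CUDA toolkit installation")
```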
@jayavenkatesh19 I just pushed this draft up to share more broadly, but if you want to take this over I'd be more than happy.
```python
if toolkit_major < driver_major:
    raise ValueError(
        f"CUDA toolkit major version ({toolkit_major}) is older than the driver's supported CUDA major version "
        f"({driver_major}). Upgrade your CUDA toolkit to CUDA {driver_major} or "
        f"downgrade your NVIDIA driver to one that supports CUDA {toolkit_major}."
    )
```
This shouldn't necessarily be an error: a newer driver is OK as long as the CTK major matches all the packages. The problem would be when you have a driver supporting CUDA 13 with CTK 12, but a foo-cu13 Python package. E.g. rapidsai/deployment#516
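Following that suggestion, the older-toolkit branch could warn instead of raising, hard-failing only when the toolkit is newer than the driver. A sketch under that assumption (the function name is illustrative, not the PR's):

```python
import warnings

def compare_cuda_majors(toolkit_major, driver_major):
    """Error if the toolkit is ahead of the driver; only warn if it is behind."""
    if toolkit_major > driver_major:
        raise ValueError(
            f"CUDA toolkit major version ({toolkit_major}) is newer than what "
            f"the installed driver supports ({driver_major})."
        )
    if toolkit_major < driver_major:
        # A newer driver is fine as long as installed -cuXX packages match the CTK major.
        warnings.warn(
            f"Driver supports CUDA {driver_major} but the toolkit is CUDA "
            f"{toolkit_major}; ensure installed packages target CUDA {toolkit_major}."
        )
```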
Adds a check that uses `cuda.pathfinder` to find your CUDA Toolkit and then compares the major version with the driver.

xref #139