fix: prefer NVML v2 memory info for inference setup #41
Conversation
Pushed a test-only follow-up for the unittest failure. CI's pynvml build does not expose nvmlDeviceGetMemoryInfo_v2, so the regression test now monkeypatches that symbol with raising=False before exercising the v2-preferred path.

Local validation after the change:
- python3 -m py_compile cosmos_framework/inference/args.py cosmos_framework/inference/args_test.py
- git diff --check

I also tried the targeted pytest locally, but this environment does not have pytest installed.
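For readers unfamiliar with the pattern, here is a minimal sketch of patching in a symbol that the installed module lacks. It is not the PR's test: `fake_pynvml`, `read_total`, and `run_v2_preferred_check` are hypothetical names, and it uses the standard-library `unittest.mock.patch.object(..., create=True)` in place of pytest's `monkeypatch.setattr(..., raising=False)`, which plays the same role of allowing a missing attribute to be set.

```python
from types import SimpleNamespace
from unittest import mock

# Hypothetical stand-in for a pynvml build that only exposes the v1 call.
fake_pynvml = SimpleNamespace(
    nvmlDeviceGetMemoryInfo=lambda handle: SimpleNamespace(total=1),
)


def read_total(nvml, handle):
    """Use the v2 getter when the module exposes it, else the v1 getter."""
    get_v2 = getattr(nvml, "nvmlDeviceGetMemoryInfo_v2", None)
    getter = get_v2 if get_v2 is not None else nvml.nvmlDeviceGetMemoryInfo
    return getter(handle).total


def run_v2_preferred_check():
    def fake_v2(handle):
        return SimpleNamespace(total=2)

    # create=True lets the patch add an attribute that does not exist yet,
    # which is what raising=False permits with pytest's monkeypatch.
    with mock.patch.object(
        fake_pynvml, "nvmlDeviceGetMemoryInfo_v2", fake_v2, create=True
    ):
        patched = read_total(fake_pynvml, handle=None)

    # Leaving the context removes the temporary attribute again.
    restored = read_total(fake_pynvml, handle=None)
    return patched, restored
```

In this sketch `run_v2_preferred_check()` returns `(2, 1)`: the v2 stand-in is used while the patch is active, and the v1 stand-in is used once it is undone.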
Sorry, the approval became stale because I pushed a one-line test-only follow-up to fix the failing unittest on CI. The implementation code is unchanged from the approved version.
@Rohithmatham12 Thanks for the check. Let's wait for the PPL pass, and then we can merge.
def _get_nvml_device_memory_info(handle: Any) -> Any:
    try:
One thing here is a bit confusing: the exception handling is split into two parts, one inside _get_nvml_device_memory_info and the other outside this function. Should we keep the exception handling unified in one place?
OK, this is trivial. This PR looks good to me.
Summary
- Try pynvml.nvmlDeviceGetMemoryInfo_v2() before the legacy v1 memory-info API in inference setup
- Fall back to nvmlDeviceGetMemoryInfo() when v2 is unavailable
- Ensure nvmlShutdown() runs even when NVML probing fails
- Add regression tests for NVMLError_NotSupported, plus the older-pynvml fallback path

Why
DGX Spark / GB10 platforms can report pynvml.NVMLError_NotSupported from the legacy v1 nvmlDeviceGetMemoryInfo() call during Cosmos3 inference setup. The v2 NVML memory-info API is the supported path there, while older environments still need the v1 fallback.

Testing
- python3 -m py_compile cosmos_framework/inference/args.py cosmos_framework/inference/args_test.py
- git diff --check

Not run locally:
- python3 -m pytest cosmos_framework/inference/args_test.py -q because this local environment does not have pytest installed

Related to NVIDIA/cosmos#180
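The behavior listed in the Summary (v2 first, v1 when v2 is unavailable, shutdown always run) can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in cosmos_framework/inference/args.py: `get_memory_info`, `probe_total_memory`, `make_fake_nvml`, and `FakeNVMLError` are hypothetical names, the module is passed in as a parameter so the sketch runs without a GPU, and "unavailable" is assumed to mean that the module does not expose the v2 symbol. Exception handling is kept in one place, in the outer probe.

```python
from types import SimpleNamespace
from typing import Any, Optional


class FakeNVMLError(Exception):
    """Hypothetical stand-in for pynvml.NVMLError, used only by the fake below."""


def get_memory_info(nvml: Any, handle: Any) -> Any:
    # Prefer the v2 call when this pynvml build exposes it.
    get_v2 = getattr(nvml, "nvmlDeviceGetMemoryInfo_v2", None)
    if get_v2 is not None:
        return get_v2(handle)
    # Older builds only offer the legacy v1 call.
    return nvml.nvmlDeviceGetMemoryInfo(handle)


def probe_total_memory(nvml: Any, index: int = 0) -> Optional[int]:
    """Return total device memory in bytes, or None if NVML probing fails."""
    nvml.nvmlInit()
    try:
        handle = nvml.nvmlDeviceGetHandleByIndex(index)
        return get_memory_info(nvml, handle).total
    except nvml.NVMLError:
        # Single place where NVML errors (e.g. NotSupported) are handled.
        return None
    finally:
        # Runs on success and on failure alike.
        nvml.nvmlShutdown()


def make_fake_nvml(with_v2: bool, v1_supported: bool = True) -> SimpleNamespace:
    """Build a fake NVML module for demonstration (assumption, not pynvml)."""
    state = {"shutdowns": 0}

    def v1(handle: Any) -> SimpleNamespace:
        if not v1_supported:
            raise FakeNVMLError("Not Supported")
        return SimpleNamespace(total=1024)

    def shutdown() -> None:
        state["shutdowns"] += 1

    fake = SimpleNamespace(
        NVMLError=FakeNVMLError,
        nvmlInit=lambda: None,
        nvmlShutdown=shutdown,
        nvmlDeviceGetHandleByIndex=lambda index: ("handle", index),
        nvmlDeviceGetMemoryInfo=v1,
        state=state,
    )
    if with_v2:
        fake.nvmlDeviceGetMemoryInfo_v2 = lambda handle: SimpleNamespace(total=2048)
    return fake
```

With `make_fake_nvml(with_v2=True, v1_supported=False)`, which mimics a platform where only v2 works, `probe_total_memory` returns the v2 value; with `with_v2=False` it uses the v1 path, and in every case the fake records exactly one shutdown per probe.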