@Andy-Jost Andy-Jost commented Dec 18, 2025

Summary

Adds defensive error handling in DeviceProperties._get_attribute() to gracefully handle cases where cuDeviceGetAttribute returns CUDA_ERROR_INVALID_VALUE for unsupported attributes.

Changes

  • Modified _get_attribute in _device.pyx to check for CUDA_ERROR_INVALID_VALUE before raising an exception
  • When the driver returns this error, the method now returns 0 as a conservative default
    • For boolean attributes (e.g., vulkan_cig_supported), 0 means False
    • For integer attributes, 0 indicates "not supported" or "disabled"
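
The fallback described above can be sketched in plain Python. This is an illustration only, not the code from `_device.pyx`: `fake_get_attribute`, `get_attribute`, the stand-in `CUDAError` class, and the attribute values in the table are hypothetical, chosen so the example runs without a GPU or driver.

```python
# Illustrative sketch only. The real change is in _device.pyx (Cython) and
# calls the CUDA driver; everything below is a stand-in that runs anywhere.

CUDA_SUCCESS = 0
CUDA_ERROR_INVALID_VALUE = 1  # numeric value of this code in the driver API
CUDA_ERROR_INVALID_DEVICE = 101


class CUDAError(Exception):
    """Stand-in for the exception raised on driver errors."""


def fake_get_attribute(attr, device_id):
    """Hypothetical driver: knows two made-up attributes on device 0 only."""
    if device_id != 0:
        return CUDA_ERROR_INVALID_DEVICE, 0
    known = {1: 1024, 2: 64}  # placeholder attribute IDs and values
    if attr in known:
        return CUDA_SUCCESS, known[attr]
    # Unknown attribute IDs are rejected, as an older driver would do.
    return CUDA_ERROR_INVALID_VALUE, 0


def get_attribute(attr, device_id=0, driver_call=fake_get_attribute):
    """Return the attribute value, or 0 if the driver rejects the attribute."""
    err, value = driver_call(attr, device_id)
    if err == CUDA_ERROR_INVALID_VALUE:
        # Conservative default: reads as False or "not supported".
        return 0
    if err != CUDA_SUCCESS:
        # Every other error still propagates as an exception.
        raise CUDAError(f"driver call failed with error code {err}")
    return value
```

With this sketch, `get_attribute(138)` yields 0 rather than raising, while an invalid device still raises `CUDAError`.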

Rationale

This addresses two scenarios:

  1. Version mismatch: When cuda-core is compiled against a newer CUDA toolkit (e.g., 12.9) but runs on an older driver (e.g., 12.8), newer attribute IDs may not be recognized by the driver
  2. Driver bugs: Cases like nvbug5605010 where the driver advertises CUDA 12.9 capability but incorrectly rejects attribute 138 (CU_DEVICE_ATTRIBUTE_VULKAN_CIG_SUPPORTED) with CUDA_ERROR_INVALID_VALUE

Previously, these scenarios caused DeviceProperties property accessors to raise CUDAError, breaking tests and user code. With this change, unsupported attributes return a default of 0 (False for boolean attributes) instead.

Test Coverage

  • Existing tests should continue to pass
  • Tests that previously failed with CUDA_ERROR_INVALID_VALUE (e.g., test_device.py) will now pass because the affected attributes return 0
  • No new tests added (defensive handling preserves existing behavior for supported attributes)

Related Work

  • Addresses nvbug5605010 (driver 575.66 incorrectly rejecting attribute 138)
  • Handles version mismatches observed on CUDA 12.8 driver machines rejecting attributes 138, 141, 142 from cuda-core built against CUDA 12.9

@Andy-Jost Andy-Jost added this to the cuda.core beta 11 milestone Dec 18, 2025
@Andy-Jost Andy-Jost added the labels bug (Something isn't working), P0 (High priority - Must do!), and cuda.core (Everything related to the cuda.core module) Dec 18, 2025
@Andy-Jost Andy-Jost self-assigned this Dec 18, 2025

copy-pr-bot bot commented Dec 18, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@Andy-Jost

/ok to test def7312

@Andy-Jost

/ok to test 7e05f5a

@Andy-Jost

/ok to test 3f12cdb

@Andy-Jost Andy-Jost merged commit 9e9b4c8 into NVIDIA:main Dec 22, 2025
80 checks passed
@Andy-Jost Andy-Jost deleted the nvbug5605010 branch December 22, 2025 16:13

leofang commented Dec 22, 2025

I think this was an oversight when the new attributes were added. We should not always return 0, because that would mask the actual issue (an insufficient driver version). Our standard treatment is to add version checks in both the codebase and the corresponding test suite, so that cuda-core runs with any CUDA 12.x or 13.x driver. This needs to be revisited before the next release is out.
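
A rough sketch of what such a version check could look like, for illustration only. The helper names and the minimum-version table are hypothetical; the table assumes attributes 138, 141, and 142 require a 12.9 driver, which is inferred from the PR description and not confirmed by the cuda-core source. The only external fact relied on is that the driver reports its version as an integer of the form 1000 * major + 10 * minor (for example, 12080 for 12.8).

```python
# Hypothetical sketch of a version-gated attribute lookup. Nothing here is
# taken from cuda-core; the table is an assumption based on this PR's text.

MIN_DRIVER_FOR_ATTRIBUTE = {138: 12090, 141: 12090, 142: 12090}


def decode_driver_version(version):
    """Split a driver version integer (1000*major + 10*minor) into a pair."""
    return version // 1000, (version % 1000) // 10


def check_attribute_supported(attr, driver_version):
    """Raise a descriptive error if the driver is too old for `attr`."""
    required = MIN_DRIVER_FOR_ATTRIBUTE.get(attr)
    if required is not None and driver_version < required:
        need = "{}.{}".format(*decode_driver_version(required))
        have = "{}.{}".format(*decode_driver_version(driver_version))
        raise RuntimeError(
            f"device attribute {attr} requires CUDA driver {need} or newer, "
            f"but driver {have} is installed"
        )
```

Unlike a blanket default of 0, a check along these lines reports the insufficient driver version to the caller, and a test suite could use the same table to skip attributes the installed driver does not support.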
