[CI] Support dynamic multi-GPU architecture testing (sm80/86/90/100) to fix NoneType extension load failures #191

### Problem Description
Running `test_det_gemm.py` in our CI pipeline currently fails multiple tests with the following error:

```python
    @staticmethod
    def forward(ctx, a, b):
        ctx.save_for_backward(a, b)
>       return _C.det_gemm_fwd(a, b)
               ^^^^^^^^^^^^^^^
E       AttributeError: 'NoneType' object has no attribute 'det_gemm_fwd'

```

### Root Cause
This is caused by a CUDA architecture mismatch. The current CI script (`ci/run_gpu_ci.sh`) hardcodes the compilation target to sm86 (`export TORCH_CUDA_ARCH_LIST=8.6`). When the cloud provider (RunPod) provisions a GPU with a different architecture, such as an H100 (sm90) or B200 (sm100), PyTorch silently fails to load the compiled `.so` C++ extension. Consequently, the `_C` module is left as `None`, and the first call into it (`_C.det_gemm_fwd`) raises the `AttributeError` during test execution.


### Proposed Solution
To resolve this and ensure cross-architecture compatibility (including the latest Blackwell series), we need to implement the following pipeline improvements:

### Dynamic GPU Provisioning
Introduce a `TARGET_SM` environment variable in `ci/run_gpu_ci.sh` that selects which GPU instance to rent for the target architecture:
- TARGET_SM=100 → B200 / B100 (sm100)
- TARGET_SM=90 → H100 (sm90)
- TARGET_SM=80 → A100 (sm80)
- TARGET_SM=86 → RTX A4000 / A40 (sm86, default fallback)
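The mapping above could be captured in a small lookup, sketched here in Python. The function name `gpus_for_target` and the table name are hypothetical, and the GPU names are only the labels from the list; translating them into RunPod's actual instance identifiers is not shown. Rejecting unknown values instead of silently falling back to sm86 is an assumption of this sketch.

```python
import os

# Mapping copied from the list above. The GPU names are labels only and
# would still need translating to the provider's instance identifiers.
SM_TO_GPUS = {
    "100": ["B200", "B100"],
    "90": ["H100"],
    "80": ["A100"],
    "86": ["RTX A4000", "A40"],
}
DEFAULT_SM = "86"  # default used when TARGET_SM is unset or empty


def gpus_for_target(target_sm=None):
    """Return (sm, candidate GPU names) for a TARGET_SM value.

    Falls back to DEFAULT_SM when the value is unset or empty, and raises
    ValueError for a value that is set but not in the table (assumption).
    """
    if target_sm is None:
        target_sm = os.environ.get("TARGET_SM", DEFAULT_SM)
    sm = target_sm.strip() or DEFAULT_SM
    if sm not in SM_TO_GPUS:
        supported = ", ".join(sorted(SM_TO_GPUS, key=int))
        raise ValueError(f"unsupported TARGET_SM={sm!r}; expected one of {supported}")
    return sm, SM_TO_GPUS[sm]
```

Failing loudly on an unrecognized value keeps a typo such as `TARGET_SM=9` from quietly renting sm86 hardware.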

### Dynamic Compilation Flags
Replace the hardcoded `TORCH_CUDA_ARCH_LIST` with a value derived at run time to match the rented hardware (e.g., `10.0`, `9.0`, `8.0`, or `8.6`).
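One way to derive the value is to split the SM number into major and minor parts, as in the minimal sketch below. The helper names `sm_to_cuda_arch` and `build_env` are hypothetical, and the sketch assumes the minor version is always a single digit, which holds for the four targets listed in this issue.

```python
import os


def sm_to_cuda_arch(target_sm):
    """Convert an SM number such as "86" or "100" to "8.6" or "10.0"."""
    sm = str(target_sm).strip()
    if not sm.isdigit() or len(sm) < 2:
        raise ValueError(f"invalid TARGET_SM value: {target_sm!r}")
    # The last digit is the minor version; everything before it is the major.
    return f"{sm[:-1]}.{sm[-1]}"


def build_env(target_sm, base_env=None):
    """Return a copy of the environment with the arch variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["TARGET_SM"] = str(target_sm)
    env["TORCH_CUDA_ARCH_LIST"] = sm_to_cuda_arch(target_sm)
    return env
```

The returned dictionary can be passed as `env=` to `subprocess.run` when invoking the build step, so the compile target always follows `TARGET_SM` rather than a value hardcoded in the script.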

### Fail-Fast Validation
Add a sanity check (`python -c "import torch; import rl_engine"`) immediately after compilation. If the C++ extension fails to load, the script should exit immediately with a clear architecture-mismatch message, rather than running the entire test suite and producing misleading logs.
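Because the load failure is described as silent, a bare `import rl_engine` may succeed even when `_C` is `None`. The sketch below therefore also inspects the extension attribute. It assumes the extension is exposed as `rl_engine._C`, which this issue does not confirm, and the names `extension_error` and `main` are hypothetical.

```python
import importlib
import sys


def extension_error(package, attr="_C"):
    """Return a description of the problem, or None if the extension loaded."""
    try:
        module = importlib.import_module(package)
    except Exception as exc:  # an import can fail with more than ImportError
        return f"import of {package!r} failed: {exc!r}"
    if getattr(module, attr, None) is None:
        return (
            f"{package}.{attr} is missing or None; the compiled extension did "
            "not load (possible CUDA architecture mismatch, check "
            "TORCH_CUDA_ARCH_LIST against the rented GPU)"
        )
    return None


def main(argv):
    """Return a process exit code: 0 if the extension loaded, 1 otherwise."""
    package = argv[0] if argv else "rl_engine"  # package name taken from this issue
    problem = extension_error(package)
    if problem is not None:
        print(f"FATAL: {problem}", file=sys.stderr)
        return 1
    return 0
```

A CI script could end this file with `sys.exit(main(sys.argv[1:]))` and run it right after the build, so a non-zero exit stops the job before pytest starts.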

### GitHub Actions Matrix Strategy
Update the CI workflow YAML to use a matrix strategy that triggers parallel test jobs for sm80, sm86, sm90, and sm100. This covers all supported NVIDIA generations and catches architecture-specific regressions on future PRs.
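A static `matrix:` list in the workflow YAML is the simplest form of this. As a variant, not necessarily what this issue intends, the matrix can be generated by a setup job and consumed with `fromJSON`, which keeps the list of targets in one place. The sketch below builds that JSON; the key names `target_sm` and `cuda_arch` and the function name `build_matrix` are hypothetical.

```python
import json

# Targets taken from this issue.
TARGET_SMS = ["80", "86", "90", "100"]


def build_matrix(sms=TARGET_SMS):
    """Build a GitHub Actions matrix with one entry per target architecture."""
    return {
        "include": [
            # The last digit is the minor version, e.g. "100" -> "10.0".
            {"target_sm": sm, "cuda_arch": f"{sm[:-1]}.{sm[-1]}"}
            for sm in sms
        ]
    }


# A setup job would append this line to the file named by $GITHUB_OUTPUT.
print(f"matrix={json.dumps(build_matrix())}")
```

Each generated entry would become one parallel job, with `matrix.target_sm` passed to the CI script as `TARGET_SM`.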
