# CUDA Setup and Detection

FerrisMind edited this page Sep 10, 2025 · 1 revision

## Update Summary

### Changes Made

  • Added comprehensive documentation on auto device selection logic (CUDA → Metal → CPU)
  • Updated CUDA detection and initialization section with runtime detection details
  • Enhanced troubleshooting section with auto-selection failure scenarios
  • Added new section on device preference types and selection behavior
  • Updated code examples to reflect auto-selection capabilities
  • Added flowchart diagram for auto device selection process

## Table of Contents

  1. Introduction
  2. Prerequisites for CUDA Support
  3. Device Preference and Selection Logic
  4. CUDA Detection and Initialization Logic
  5. Common Installation Pitfalls and Troubleshooting
  6. Verification of CUDA Initialization
  7. Environment Variables and Debugging
  8. Advanced CUDA Configuration

## Introduction

This document provides a comprehensive guide to setting up and detecting CUDA support in the Oxide Lab environment. It covers the prerequisites for enabling CUDA, the detection logic implemented in the codebase, common issues encountered during installation, and methods to verify successful initialization. The goal is to ensure users can effectively leverage GPU acceleration for deep learning workloads using the candle-core framework. This update specifically focuses on the auto device selection feature that prioritizes CUDA, then Metal, then falls back to CPU.

## Prerequisites for CUDA Support

### NVIDIA Driver Requirements

To enable CUDA support, your system must have a compatible NVIDIA GPU with up-to-date drivers installed. The minimum required driver version depends on the CUDA toolkit version being used. For CUDA 12.x, NVIDIA driver version 525 or higher is required.
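The driver version reported by `nvidia-smi --query-gpu=driver_version --format=csv,noheader` can be checked against that minimum programmatically. A minimal sketch (the helper function is ours, for illustration only, not part of any library):

```rust
/// Compare the major component of a driver-version string (e.g. "535.104.05")
/// against a required minimum such as 525 for CUDA 12.x.
fn driver_meets_minimum(version: &str, minimum_major: u32) -> bool {
    version
        .trim()
        .split('.')
        .next()
        .and_then(|major| major.parse::<u32>().ok())
        .map_or(false, |major| major >= minimum_major)
}
```

For example, `driver_meets_minimum("535.104.05", 525)` returns `true`, while a 470-series driver fails the check.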

### CUDA Toolkit Installation

The CUDA toolkit must be installed on the system. This includes:

  • CUDA Runtime: Required for executing CUDA applications
  • cuBLAS: For optimized linear algebra operations
  • cuDNN (optional but recommended): For accelerated deep neural network primitives
  • NVRTC: For runtime compilation of CUDA kernels

Installation can be done via:

  • Official NVIDIA installer (recommended for Windows)
  • Package managers (e.g., apt on Ubuntu/Debian, dnf on Fedora/RHEL)

Note that CUDA is not available on macOS; NVIDIA dropped macOS support after CUDA 10.2, so Macs should use the Metal backend instead.

### Compatible GPU Architectures

CUDA support requires GPUs with compute capability 5.0 or higher. Common compatible architectures include:

  • Maxwell (compute capability 5.0, 5.2)
  • Pascal (compute capability 6.0, 6.1)
  • Volta (compute capability 7.0)
  • Turing (compute capability 7.5)
  • Ampere (compute capability 8.0, 8.6)
  • Ada Lovelace (compute capability 8.9)
  • Hopper (compute capability 9.0)

You can check your GPU's compute capability using the deviceQuery sample from the CUDA toolkit.
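Recent drivers also expose this value via `nvidia-smi --query-gpu=compute_cap --format=csv,noheader`. A small sketch of checking it against the 5.0 minimum (the helper names here are ours, for illustration only):

```rust
/// Parse a compute-capability string such as "8.6" into (major, minor).
fn parse_compute_cap(s: &str) -> Option<(u32, u32)> {
    let (major, minor) = s.trim().split_once('.')?;
    Some((major.parse().ok()?, minor.parse().ok()?))
}

/// CUDA support here requires compute capability 5.0 or higher;
/// tuple comparison orders by major first, then minor.
fn meets_minimum(cap: (u32, u32)) -> bool {
    cap >= (5, 0)
}
```

An RTX 30-series GPU reporting "8.6" passes the check; a Kepler-era "3.5" does not.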

**Section sources**

  • device.rs

## Device Preference and Selection Logic

### Device Preference Types

The system supports multiple device preference options through the DevicePreference enum:

```rust
#[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(tag = "kind", rename_all = "lowercase")]
pub enum DevicePreference {
    Auto,
    Cpu,
    Cuda { index: usize },
    Metal,
}
```
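With these serde attributes, each variant serializes as an internally tagged JSON object: `Auto` becomes `{"kind":"auto"}` and `Cuda { index: 0 }` becomes `{"kind":"cuda","index":0}`. Matching on the enum can be sketched without serde as follows (the enum is re-declared locally and the `describe` helper is hypothetical, shown only to illustrate the variants):

```rust
// Re-declaration of the enum without serde, for illustration only;
// the real definition lives in types.rs.
#[derive(Debug, Clone)]
pub enum DevicePreference {
    Auto,
    Cpu,
    Cuda { index: usize },
    Metal,
}

/// Hypothetical helper: render a preference the way it might appear in logs.
pub fn describe(pref: &DevicePreference) -> String {
    match pref {
        DevicePreference::Auto => "auto (CUDA -> Metal -> CPU)".to_string(),
        DevicePreference::Cpu => "cpu".to_string(),
        DevicePreference::Cuda { index } => format!("cuda:{index}"),
        DevicePreference::Metal => "metal".to_string(),
    }
}
```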

**Section sources**

  • types.rs

### Auto Device Selection Process

The system implements an auto-selection logic that follows a priority order: CUDA → Metal → CPU. This process includes runtime detection and error handling with fallback behavior.

```mermaid
flowchart TD
A["Auto Device Selection"] --> B{"DevicePreference::Auto?"}
B --> |Yes| C["Check CUDA Available at Compile Time"]
C --> D{"cuda_is_available()?"}
D --> |Yes| E["Attempt CUDA Initialization"]
E --> F{"CUDA init successful?"}
F --> |Yes| G["Return CUDA Device"]
F --> |No| H["Log CUDA failure, continue"]
H --> I["Check Metal Available at Compile Time"]
I --> J{"metal_is_available()?"}
J --> |Yes| K["Attempt Metal Initialization"]
K --> L{"Metal init successful?"}
L --> |Yes| M["Return Metal Device"]
L --> |No| N["Log Metal failure, continue"]
N --> O["Return CPU Device"]
D --> |No| P["Skip CUDA, check Metal"]
P --> I
J --> |No| O
B --> |No| Q["Handle specific device preference"]
```


**Diagram sources**
- [device.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/core/device.rs#L1-L65)
- [device.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/api/device.rs#L1-L125)

### Selection Implementation Details
The auto-selection logic is implemented in two key functions:

1. **`select_device`** in `src-tauri/src/core/device.rs` - Core selection logic
2. **`set_device`** in `src-tauri/src/api/device.rs` - API-level device setting with model reloading

The selection process follows these rules:
- Checks feature availability at compile time using `cuda_is_available()` and `metal_is_available()`
- Attempts device initialization in priority order
- Provides detailed error logging when initialization fails
- Falls back to the next available option in the priority chain
- Returns CPU as the final fallback option
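The priority chain described above can be sketched in isolation with stubbed availability checks and initializers. All names below are stand-ins for illustration, not the actual candle or Oxide Lab API; the stubs simulate a machine where neither GPU backend initializes:

```rust
// Stand-ins for candle's compile-time availability checks (assumptions,
// not the real functions); both report "unavailable" in this sketch.
fn cuda_is_available() -> bool { false }
fn metal_is_available() -> bool { false }

#[derive(Debug, PartialEq)]
enum Device { Cuda(usize), Metal, Cpu }

// Hypothetical initializers that can fail at runtime even when the
// backend was compiled in (e.g. missing driver).
fn try_cuda(_index: usize) -> Result<Device, String> { Err("no NVIDIA driver".into()) }
fn try_metal() -> Result<Device, String> { Err("not running on macOS".into()) }

/// Priority chain: CUDA -> Metal -> CPU, logging each failure and
/// falling through to the next option.
fn select_auto() -> Device {
    if cuda_is_available() {
        match try_cuda(0) {
            Ok(d) => return d,
            Err(e) => eprintln!("[device] CUDA init failed: {e}, falling back to next option"),
        }
    }
    if metal_is_available() {
        match try_metal() {
            Ok(d) => return d,
            Err(e) => eprintln!("[device] Metal init failed: {e}, falling back to next option"),
        }
    }
    Device::Cpu
}
```

With both stubs returning `false`, `select_auto()` skips both GPU branches and returns `Device::Cpu`, mirroring the final fallback rule above.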

**Section sources**
- [device.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/core/device.rs#L1-L65)
- [device.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/api/device.rs#L1-L125)

## CUDA Detection and Initialization Logic

### Device Initialization Process
The CUDA backend initialization follows a structured process to detect and configure GPU devices:

```mermaid
flowchart TD
A["Initialize CudaContext"] --> B["Create CudaStream"]
B --> C["Initialize cuBLAS handle"]
C --> D["Initialize cuRAND generator"]
D --> E["Load CUDA kernels (PTX)"]
E --> F["Return CudaDevice instance"]
```

**Diagram sources**

  • device.rs

### Device Structure and Components

The CudaDevice struct manages all GPU resources and provides the interface for CUDA operations:

```mermaid
classDiagram
class CudaDevice {
    +DeviceId id
    +Arc context
    +Arc stream
    +Arc blas
    +Arc<Mutex> curand
    +Arc<RwLock> modules
    +Arc<RwLock<HashMap<String, Arc>>> custom_modules
    +new(ordinal : usize) Result
    +get_or_load_func(fn_name : &str, mdl : &Module) Result
    +alloc(len : usize) Result<CudaSlice>
    +memcpy_htod<Src, Dst>(src : &Src, dst : &mut Dst) Result<()>
}
class CudaContext {
    +new(ordinal : usize) Result
    +disable_event_tracking()
    +is_event_tracking() bool
    +load_module(ptx : &str) Result
}
class CudaStream {
    +new() Result
    +alloc(len : usize) Result<CudaSlice>
    +memcpy_htod<Src, Dst>(src : &Src, dst : &mut Dst) Result<()>
}
class CudaBlas {
    +new(stream : CudaStream) Result
    +gemm_strided_batched_ex(...) Result<()>
}
class CudaRng {
    +new(seed : u64, stream : CudaStream) Result
    +fill_with_uniform(data : &mut CudaSlice) Result<()>
}
CudaDevice --> CudaContext : "owns"
CudaDevice --> CudaStream : "uses"
CudaDevice --> CudaBlas : "uses"
CudaDevice --> CudaRng : "uses"
```


**Diagram sources**
- [device.rs](file://d:/GitHub/Oxide-Lab/example/candle/candle-core/src/cuda_backend/device.rs#L50-L150)

### Kernel Management System
The CUDA backend employs a module store system to manage compiled kernels efficiently:

```mermaid
flowchart LR
A["Request Function: get_or_load_func()"] --> B{"Function in Cache?"}
B --> |Yes| C["Return Cached Function"]
B --> |No| D["Load PTX Module"]
D --> E["Extract Function"]
E --> F["Cache Function"]
F --> G["Return Function"]
```

**Diagram sources**

  • device.rs

The `get_or_load_func` method implements lazy loading of CUDA kernels: PTX modules are loaded only when a function is first requested and are then cached for subsequent calls.
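The caching pattern can be sketched with standard-library types. This is a simplified stand-in, not candle's actual module store — the real implementation deals with PTX modules and driver handles rather than plain structs:

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Stand-in for a compiled kernel handle; the real type wraps a CUDA function.
pub struct CudaFunction {
    pub name: String,
}

/// Simplified module store: read-lock fast path, write-lock slow path.
pub struct ModuleStore {
    cache: RwLock<HashMap<String, Arc<CudaFunction>>>,
}

impl ModuleStore {
    pub fn new() -> Self {
        Self { cache: RwLock::new(HashMap::new()) }
    }

    pub fn get_or_load_func(&self, fn_name: &str) -> Arc<CudaFunction> {
        // Fast path: the function was loaded on an earlier call.
        if let Some(f) = self.cache.read().unwrap().get(fn_name) {
            return Arc::clone(f);
        }
        // Slow path: "load" the PTX module, extract the function, cache it.
        let f = Arc::new(CudaFunction { name: fn_name.to_string() });
        self.cache.write().unwrap().insert(fn_name.to_string(), Arc::clone(&f));
        f
    }
}
```

Two calls with the same function name return the same `Arc`-shared handle; only the first call pays the loading cost.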

**Section sources**

  • device.rs

## Common Installation Pitfalls and Troubleshooting

### Windows-Specific Issues

#### Driver Conflicts

Common issues include:

  • Multiple NVIDIA driver versions: Can cause DLL conflicts
  • Outdated WDDM drivers: May prevent CUDA context creation
  • Insufficient permissions: Administrator rights required for driver installation

Solution: Use DDU (Display Driver Uninstaller) to completely remove existing drivers before installing the latest version.

#### Path Configuration

Ensure CUDA binaries are in the system PATH:

```
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\libnvvp
```

### Linux-Specific Issues

#### Library Dependencies

Common missing dependencies:

  • libcuda.so: Part of NVIDIA driver package
  • libcudart.so: Part of CUDA toolkit
  • libcublas.so: Part of cuBLAS library

Solution: Install via package manager:

```bash
# Ubuntu/Debian
sudo apt install nvidia-cuda-toolkit nvidia-cuda-dev

# RHEL/CentOS
sudo yum install cuda-toolkit-12-3
```

#### Permission Issues

CUDA devices require appropriate permissions:

```bash
# Add user to video group
sudo usermod -aG video $USER

# Or create udev rules
echo 'SUBSYSTEM=="nvidia", MODE="0666"' | sudo tee /etc/udev/rules.d/99-nvidia.rules
```

### Auto-Selection Specific Issues

#### Feature Not Compiled

If CUDA support was not compiled into the application:

  • The cuda_is_available() function will return false
  • Auto-selection will skip CUDA attempts entirely
  • No error will be logged about CUDA initialization failure

Verification:

```rust
use candle_core::utils::cuda_is_available;

println!("CUDA available at compile time: {}", cuda_is_available());
```

#### Initialization Failure with Auto Fallback

When CUDA initialization fails but auto-selection continues:

  • An error message is logged: `[device] CUDA init failed: {error}, falling back to next option`
  • The system will attempt Metal initialization if available
  • The final fallback is the CPU device

Troubleshooting:

  1. Check if CUDA is available at compile time
  2. Verify NVIDIA driver is properly installed (nvidia-smi)
  3. Ensure CUDA toolkit libraries are accessible
  4. Check for conflicting CUDA installations

**Section sources**

  • device.rs
  • device.rs
  • utils.rs

## Verification of CUDA Initialization

### Programmatic Verification

The following code demonstrates how to verify CUDA initialization:

```rust
use candle_core::{Device, Tensor};

fn verify_cuda_setup() -> Result<(), Box<dyn std::error::Error>> {
    // Attempt to create CUDA device
    let device = Device::new_cuda(0)?;

    // Test basic operations
    let tensor = Tensor::ones(&[100, 100], candle_core::DType::F32, &device)?;
    let result = tensor.matmul(&tensor.t()?)?;

    // Verify result
    assert_eq!(result.dims(), &[100, 100]);

    println!("CUDA setup verified successfully!");
    println!("Device: {:?}", device);
    println!("Result shape: {:?}", result.dims());

    Ok(())
}
```

### Auto-Selection Verification

To verify the auto-selection process is working correctly:

```rust
use candle_core::Device;
use candle_core::utils::{cuda_is_available, metal_is_available};

fn verify_auto_selection() {
    println!("CUDA available: {}", cuda_is_available());
    println!("Metal available: {}", metal_is_available());

    // Test auto-selection
    let device = match Device::cuda_if_available(0) {
        Ok(device) => {
            println!("Successfully selected device: {:?}", device);
            device
        }
        Err(e) => {
            println!("Device selection failed: {}", e);
            Device::Cpu
        }
    };

    println!("Final device: {:?}", device);
}
```

### System-Level Verification

Use these commands to verify the CUDA environment:

```bash
# Check GPU availability
nvidia-smi

# Verify CUDA version
nvcc --version

# Check library links
ldconfig -p | grep cuda

# Test with simple CUDA program
cat > test.cu << 'EOF'
#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    int deviceCount;
    cudaGetDeviceCount(&deviceCount);
    printf("Found %d CUDA devices\n", deviceCount);
    for (int i = 0; i < deviceCount; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s\n", i, prop.name);
    }
    return 0;
}
EOF

nvcc test.cu -o test && ./test
```

**Section sources**

  • device.rs
  • device.rs

## Environment Variables and Debugging

### Key Environment Variables

#### CUDA Configuration

  • CUDA_VISIBLE_DEVICES: Controls which GPUs are visible to the application
  • CUDA_CACHE_PATH: Specifies location for compiled kernel cache
  • CUDA_LAUNCH_BLOCKING: Enables synchronous kernel execution for debugging

#### candle-core Specific

  • CUDA_BACKEND_VERBOSE: Enables verbose logging of CUDA operations
  • CUDNN_LOGDEST_DBG: Specifies cuDNN logging destination
  • CUBLAS_LOGDEST: Specifies cuBLAS logging destination

### Enabling Verbose Logging

To enable detailed logging for debugging CUDA issues:

```rust
// Set environment variable before initialization
std::env::set_var("CUDA_BACKEND_VERBOSE", "1");

// Or use RUST_LOG for comprehensive logging
std::env::set_var("RUST_LOG", "candle_core=debug,cuda=trace");

// Initialize logger
env_logger::init();
```

This will provide detailed output about:

  • Kernel compilation and loading
  • Memory allocation and transfers
  • Function calls and execution times
  • Error conditions and recovery attempts

**Section sources**

  • device.rs

## Advanced CUDA Configuration

### Matrix Multiplication Precision Settings

The CUDA backend provides control over matrix multiplication precision:

```rust
// Control reduced precision for different data types
candle_core::cuda_backend::set_gemm_reduced_precision_f32(true);
candle_core::cuda_backend::set_gemm_reduced_precision_f16(true);
candle_core::cuda_backend::set_gemm_reduced_precision_bf16(true);

// Query current settings
let f32_reduced = candle_core::cuda_backend::gemm_reduced_precision_f32();
let f16_reduced = candle_core::cuda_backend::gemm_reduced_precision_f16();
let bf16_reduced = candle_core::cuda_backend::gemm_reduced_precision_bf16();
```

These settings control whether reduced precision arithmetic (like TF32 for F32 operations) is used, balancing performance and numerical accuracy.

### Custom Kernel Integration

The backend supports loading custom CUDA kernels:

```rust
let device = Device::new_cuda(0)?;
let ptx_code = include_str!("my_kernel.ptx");
let func = device.get_or_load_custom_func("my_kernel", "my_module", ptx_code)?;
```

This allows for specialized operations beyond the built-in functionality.

### Reference to candle-book Guides

For detailed information on low-level kernel configuration and optimization, refer to the candle-book CUDA guides:

  • Writing CUDA Kernels: Best practices for writing efficient CUDA kernels
  • Porting to CUDA: Guidelines for porting operations to CUDA
  • CUDA Inference: Optimizations for inference workloads

**Section sources**

  • mod.rs
  • device.rs

## Referenced Files in This Document

  • device.rs - Updated with auto device selection logic in commit b2e27e5
  • device.rs - Updated with device preference handling in commit b2e27e5
  • types.rs - Contains DevicePreference enum definition
  • device.rs - Core CUDA device implementation
  • device.rs - Base device implementation with CUDA initialization
  • utils.rs - Contains cuda_is_available and metal_is_available functions
  • README.md
  • writing.md
  • porting.md
  • inference.md
