# CUDA Setup and Detection

FerrisMind edited this page Sep 10, 2025 · 1 revision

## Update Summary

### Changes Made

  • Added comprehensive documentation on auto device selection logic (CUDA → Metal → CPU)
  • Updated CUDA detection and initialization section with runtime detection details
  • Enhanced troubleshooting section with auto-selection failure scenarios
  • Added new section on device preference types and selection behavior
  • Updated code examples to reflect auto-selection capabilities
  • Added flowchart diagram for auto device selection process

## Table of Contents

  1. Introduction
  2. Prerequisites for CUDA Support
  3. Device Preference and Selection Logic
  4. CUDA Detection and Initialization Logic
  5. Common Installation Pitfalls and Troubleshooting
  6. Verification of CUDA Initialization
  7. Environment Variables and Debugging
  8. Advanced CUDA Configuration

## Introduction

This document provides a comprehensive guide to setting up and detecting CUDA support in the Oxide Lab environment. It covers the prerequisites for enabling CUDA, the detection logic implemented in the codebase, common issues encountered during installation, and methods to verify successful initialization. The goal is to ensure users can effectively leverage GPU acceleration for deep learning workloads using the candle-core framework. This update specifically focuses on the auto device selection feature that prioritizes CUDA, then Metal, then falls back to CPU.

## Prerequisites for CUDA Support

### NVIDIA Driver Requirements

To enable CUDA support, your system must have a compatible NVIDIA GPU with up-to-date drivers installed. The minimum required driver version depends on the CUDA toolkit version being used. For CUDA 12.x, NVIDIA driver version 525 or higher is required.
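The driver version reported by `nvidia-smi --query-gpu=driver_version --format=csv,noheader` can be checked against that minimum programmatically. A minimal sketch (the helper function is ours, for illustration only, not part of any library):

```rust
/// Compare the major component of a driver-version string (e.g. "535.104.05")
/// against a required minimum such as 525 for CUDA 12.x.
fn driver_meets_minimum(version: &str, minimum_major: u32) -> bool {
    version
        .trim()
        .split('.')
        .next()
        .and_then(|major| major.parse::<u32>().ok())
        .map_or(false, |major| major >= minimum_major)
}
```

For example, `driver_meets_minimum("535.104.05", 525)` returns `true`, while a 470-series driver fails the check.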

### CUDA Toolkit Installation

The CUDA toolkit must be installed on the system. This includes:

  • CUDA Runtime: Required for executing CUDA applications
  • cuBLAS: For optimized linear algebra operations
  • cuDNN (optional but recommended): For accelerated deep neural network primitives
  • NVRTC: For runtime compilation of CUDA kernels

Installation can be done via:

  • Official NVIDIA installer (recommended for Windows)
  • Package managers (e.g., apt on Ubuntu/Debian, dnf on Fedora/RHEL)

Note that CUDA is not available on macOS; NVIDIA dropped macOS support after CUDA 10.2, so Macs should use the Metal backend instead.

### Compatible GPU Architectures

CUDA support requires GPUs with compute capability 5.0 or higher. Common compatible architectures include:

  • Maxwell (compute capability 5.0, 5.2)
  • Pascal (compute capability 6.0, 6.1)
  • Volta (compute capability 7.0)
  • Turing (compute capability 7.5)
  • Ampere (compute capability 8.0, 8.6)
  • Ada Lovelace (compute capability 8.9)
  • Hopper (compute capability 9.0)

You can check your GPU's compute capability using the deviceQuery sample from the CUDA toolkit.
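Recent drivers also expose this value via `nvidia-smi --query-gpu=compute_cap --format=csv,noheader`. A small sketch of checking it against the 5.0 minimum (the helper names here are ours, for illustration only):

```rust
/// Parse a compute-capability string such as "8.6" into (major, minor).
fn parse_compute_cap(s: &str) -> Option<(u32, u32)> {
    let (major, minor) = s.trim().split_once('.')?;
    Some((major.parse().ok()?, minor.parse().ok()?))
}

/// CUDA support here requires compute capability 5.0 or higher;
/// tuple comparison orders by major first, then minor.
fn meets_minimum(cap: (u32, u32)) -> bool {
    cap >= (5, 0)
}
```

An RTX 30-series GPU reporting "8.6" passes the check; a Kepler-era "3.5" does not.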

**Section sources**

  • device.rs

## Device Preference and Selection Logic

### Device Preference Types

The system supports multiple device preference options through the DevicePreference enum:

```rust
#[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(tag = "kind", rename_all = "lowercase")]
pub enum DevicePreference {
    Auto,
    Cpu,
    Cuda { index: usize },
    Metal,
}
```
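With these serde attributes, each variant serializes as an internally tagged JSON object: `Auto` becomes `{"kind":"auto"}` and `Cuda { index: 0 }` becomes `{"kind":"cuda","index":0}`. Matching on the enum can be sketched without serde as follows (the enum is re-declared locally and the `describe` helper is hypothetical, shown only to illustrate the variants):

```rust
// Re-declaration of the enum without serde, for illustration only;
// the real definition lives in types.rs.
#[derive(Debug, Clone)]
pub enum DevicePreference {
    Auto,
    Cpu,
    Cuda { index: usize },
    Metal,
}

/// Hypothetical helper: render a preference the way it might appear in logs.
pub fn describe(pref: &DevicePreference) -> String {
    match pref {
        DevicePreference::Auto => "auto (CUDA -> Metal -> CPU)".to_string(),
        DevicePreference::Cpu => "cpu".to_string(),
        DevicePreference::Cuda { index } => format!("cuda:{index}"),
        DevicePreference::Metal => "metal".to_string(),
    }
}
```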

**Section sources**

  • types.rs

### Auto Device Selection Process

The system implements an auto-selection logic that follows a priority order: CUDA → Metal → CPU. This process includes runtime detection and error handling with fallback behavior.

```mermaid
flowchart TD
A["Auto Device Selection"] --> B{"DevicePreference::Auto?"}
B --> |Yes| C["Check CUDA Available at Compile Time"]
C --> D{"cuda_is_available()?"}
D --> |Yes| E["Attempt CUDA Initialization"]
E --> F{"CUDA init successful?"}
F --> |Yes| G["Return CUDA Device"]
F --> |No| H["Log CUDA failure, continue"]
H --> I["Check Metal Available at Compile Time"]
I --> J{"metal_is_available()?"}
J --> |Yes| K["Attempt Metal Initialization"]
K --> L{"Metal init successful?"}
L --> |Yes| M["Return Metal Device"]
L --> |No| N["Log Metal failure, continue"]
N --> O["Return CPU Device"]
D --> |No| P["Skip CUDA, check Metal"]
P --> I
J --> |No| O
B --> |No| Q["Handle specific device preference"]
```


**Diagram sources**
- [device.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/core/device.rs#L1-L65)
- [device.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/api/device.rs#L1-L125)

### Selection Implementation Details
The auto-selection logic is implemented in two key functions:

1. **`select_device`** in `src-tauri/src/core/device.rs` - Core selection logic
2. **`set_device`** in `src-tauri/src/api/device.rs` - API-level device setting with model reloading

The selection process follows these rules:
- Checks feature availability at compile time using `cuda_is_available()` and `metal_is_available()`
- Attempts device initialization in priority order
- Provides detailed error logging when initialization fails
- Falls back to the next available option in the priority chain
- Returns CPU as the final fallback option
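The priority chain described above can be sketched in isolation with stubbed availability checks and initializers. All names below are stand-ins for illustration, not the actual candle or Oxide Lab API; the stubs simulate a machine where neither GPU backend initializes:

```rust
// Stand-ins for candle's compile-time availability checks (assumptions,
// not the real functions); both report "unavailable" in this sketch.
fn cuda_is_available() -> bool { false }
fn metal_is_available() -> bool { false }

#[derive(Debug, PartialEq)]
enum Device { Cuda(usize), Metal, Cpu }

// Hypothetical initializers that can fail at runtime even when the
// backend was compiled in (e.g. missing driver).
fn try_cuda(_index: usize) -> Result<Device, String> { Err("no NVIDIA driver".into()) }
fn try_metal() -> Result<Device, String> { Err("not running on macOS".into()) }

/// Priority chain: CUDA -> Metal -> CPU, logging each failure and
/// falling through to the next option.
fn select_auto() -> Device {
    if cuda_is_available() {
        match try_cuda(0) {
            Ok(d) => return d,
            Err(e) => eprintln!("[device] CUDA init failed: {e}, falling back to next option"),
        }
    }
    if metal_is_available() {
        match try_metal() {
            Ok(d) => return d,
            Err(e) => eprintln!("[device] Metal init failed: {e}, falling back to next option"),
        }
    }
    Device::Cpu
}
```

With both stubs returning `false`, `select_auto()` skips both GPU branches and returns `Device::Cpu`, mirroring the final fallback rule above.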

**Section sources**
- [device.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/core/device.rs#L1-L65)
- [device.rs](file://d:/GitHub/Oxide-Lab/src-tauri/src/api/device.rs#L1-L125)

## CUDA Detection and Initialization Logic

### Device Initialization Process
The CUDA backend initialization follows a structured process to detect and configure GPU devices:

```mermaid
flowchart TD
A["Initialize CudaContext"] --> B["Create CudaStream"]
B --> C["Initialize cuBLAS handle"]
C --> D["Initialize cuRAND generator"]
D --> E["Load CUDA kernels (PTX)"]
E --> F["Return CudaDevice instance"]
```

**Diagram sources**

  • device.rs

### Device Structure and Components

The CudaDevice struct manages all GPU resources and provides the interface for CUDA operations:

```mermaid
classDiagram
class CudaDevice {
    +DeviceId id
    +Arc context
    +Arc stream
    +Arc blas
    +Arc<Mutex> curand
    +Arc<RwLock> modules
    +Arc<RwLock<HashMap<String, Arc>>> custom_modules
    +new(ordinal : usize) Result
    +get_or_load_func(fn_name : &str, mdl : &Module) Result
    +alloc(len : usize) Result<CudaSlice>
    +memcpy_htod<Src, Dst>(src : &Src, dst : &mut Dst) Result<()>
}
class CudaContext {
    +new(ordinal : usize) Result
    +disable_event_tracking()
    +is_event_tracking() bool
    +load_module(ptx : &str) Result
}
class CudaStream {
    +new() Result
    +alloc(len : usize) Result<CudaSlice>
    +memcpy_htod<Src, Dst>(src : &Src, dst : &mut Dst) Result<()>
}
class CudaBlas {
    +new(stream : CudaStream) Result
    +gemm_strided_batched_ex(...) Result<()>
}
class CudaRng {
    +new(seed : u64, stream : CudaStream) Result
    +fill_with_uniform(data : &mut CudaSlice) Result<()>
}
CudaDevice --> CudaContext : "owns"
CudaDevice --> CudaStream : "uses"
CudaDevice --> CudaBlas : "uses"
CudaDevice --> CudaRng : "uses"
```


**Diagram sources**
- [device.rs](file://d:/GitHub/Oxide-Lab/example/candle/candle-core/src/cuda_backend/device.rs#L50-L150)

### Kernel Management System
The CUDA backend employs a module store system to manage compiled kernels efficiently:

```mermaid
flowchart LR
A["Request Function: get_or_load_func()"] --> B{"Function in Cache?"}
B --> |Yes| C["Return Cached Function"]
B --> |No| D["Load PTX Module"]
D --> E["Extract Function"]
E --> F["Cache Function"]
F --> G["Return Function"]
```

**Diagram sources**

  • device.rs

The `get_or_load_func` method implements lazy loading of CUDA kernels: PTX modules are loaded only when a function is first requested and are then cached for subsequent calls.
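The caching pattern can be sketched with standard-library types. This is a simplified stand-in, not candle's actual module store — the real implementation deals with PTX modules and driver handles rather than plain structs:

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Stand-in for a compiled kernel handle; the real type wraps a CUDA function.
pub struct CudaFunction {
    pub name: String,
}

/// Simplified module store: read-lock fast path, write-lock slow path.
pub struct ModuleStore {
    cache: RwLock<HashMap<String, Arc<CudaFunction>>>,
}

impl ModuleStore {
    pub fn new() -> Self {
        Self { cache: RwLock::new(HashMap::new()) }
    }

    pub fn get_or_load_func(&self, fn_name: &str) -> Arc<CudaFunction> {
        // Fast path: the function was loaded on an earlier call.
        if let Some(f) = self.cache.read().unwrap().get(fn_name) {
            return Arc::clone(f);
        }
        // Slow path: "load" the PTX module, extract the function, cache it.
        let f = Arc::new(CudaFunction { name: fn_name.to_string() });
        self.cache.write().unwrap().insert(fn_name.to_string(), Arc::clone(&f));
        f
    }
}
```

Two calls with the same function name return the same `Arc`-shared handle; only the first call pays the loading cost.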

**Section sources**

  • device.rs

## Common Installation Pitfalls and Troubleshooting

### Windows-Specific Issues

#### Driver Conflicts

Common issues include:

  • Multiple NVIDIA driver versions: Can cause DLL conflicts
  • Outdated WDDM drivers: May prevent CUDA context creation
  • Insufficient permissions: Administrator rights required for driver installation

Solution: Use DDU (Display Driver Uninstaller) to completely remove existing drivers before installing the latest version.

#### Path Configuration

Ensure CUDA binaries are in the system PATH:

```
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\libnvvp
```

### Linux-Specific Issues

#### Library Dependencies

Common missing dependencies:

  • libcuda.so: Part of NVIDIA driver package
  • libcudart.so: Part of CUDA toolkit
  • libcublas.so: Part of cuBLAS library

Solution: Install via package manager:

```bash
# Ubuntu/Debian
sudo apt install nvidia-cuda-toolkit nvidia-cuda-dev

# RHEL/CentOS
sudo yum install cuda-toolkit-12-3
```

#### Permission Issues

CUDA devices require appropriate permissions:

```bash
# Add user to video group
sudo usermod -aG video $USER

# Or create udev rules
echo 'SUBSYSTEM=="nvidia", MODE="0666"' | sudo tee /etc/udev/rules.d/99-nvidia.rules
```

### Auto-Selection Specific Issues

#### Feature Not Compiled

If CUDA support was not compiled into the application:

  • The cuda_is_available() function will return false
  • Auto-selection will skip CUDA attempts entirely
  • No error will be logged about CUDA initialization failure

Verification:

```rust
use candle_core::utils::cuda_is_available;

println!("CUDA available at compile time: {}", cuda_is_available());
```

#### Initialization Failure with Auto Fallback

When CUDA initialization fails but auto-selection continues:

  • An error message is logged: `[device] CUDA init failed: {error}, falling back to next option`
  • The system will attempt Metal initialization if available
  • The final fallback is the CPU device

Troubleshooting:

  1. Check if CUDA is available at compile time
  2. Verify NVIDIA driver is properly installed (nvidia-smi)
  3. Ensure CUDA toolkit libraries are accessible
  4. Check for conflicting CUDA installations

**Section sources**

  • device.rs
  • device.rs
  • utils.rs

## Verification of CUDA Initialization

### Programmatic Verification

The following code demonstrates how to verify CUDA initialization:

```rust
use candle_core::{Device, Tensor};

fn verify_cuda_setup() -> Result<(), Box<dyn std::error::Error>> {
    // Attempt to create CUDA device
    let device = Device::new_cuda(0)?;

    // Test basic operations
    let tensor = Tensor::ones(&[100, 100], candle_core::DType::F32, &device)?;
    let result = tensor.matmul(&tensor.t()?)?;

    // Verify result
    assert_eq!(result.dims(), &[100, 100]);

    println!("CUDA setup verified successfully!");
    println!("Device: {:?}", device);
    println!("Result shape: {:?}", result.dims());

    Ok(())
}
```

### Auto-Selection Verification

To verify the auto-selection process is working correctly:

```rust
use candle_core::Device;
use candle_core::utils::{cuda_is_available, metal_is_available};

fn verify_auto_selection() {
    println!("CUDA available: {}", cuda_is_available());
    println!("Metal available: {}", metal_is_available());

    // Test auto-selection
    let device = match Device::cuda_if_available(0) {
        Ok(device) => {
            println!("Successfully selected device: {:?}", device);
            device
        }
        Err(e) => {
            println!("Device selection failed: {}", e);
            Device::Cpu
        }
    };

    println!("Final device: {:?}", device);
}
```

### System-Level Verification

Use these commands to verify the CUDA environment:

```bash
# Check GPU availability
nvidia-smi

# Verify CUDA version
nvcc --version

# Check library links
ldconfig -p | grep cuda

# Test with simple CUDA program
cat > test.cu << 'EOF'
#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    int deviceCount;
    cudaGetDeviceCount(&deviceCount);
    printf("Found %d CUDA devices\n", deviceCount);
    for (int i = 0; i < deviceCount; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s\n", i, prop.name);
    }
    return 0;
}
EOF

nvcc test.cu -o test && ./test
```

**Section sources**

  • device.rs
  • device.rs

## Environment Variables and Debugging

### Key Environment Variables

#### CUDA Configuration

  • CUDA_VISIBLE_DEVICES: Controls which GPUs are visible to the application
  • CUDA_CACHE_PATH: Specifies location for compiled kernel cache
  • CUDA_LAUNCH_BLOCKING: Enables synchronous kernel execution for debugging

#### candle-core Specific

  • CUDA_BACKEND_VERBOSE: Enables verbose logging of CUDA operations
  • CUDNN_LOGDEST_DBG: Specifies cuDNN logging destination
  • CUBLAS_LOGDEST: Specifies cuBLAS logging destination

### Enabling Verbose Logging

To enable detailed logging for debugging CUDA issues:

```rust
// Set environment variable before initialization
std::env::set_var("CUDA_BACKEND_VERBOSE", "1");

// Or use RUST_LOG for comprehensive logging
std::env::set_var("RUST_LOG", "candle_core=debug,cuda=trace");

// Initialize logger
env_logger::init();
```

This will provide detailed output about:

  • Kernel compilation and loading
  • Memory allocation and transfers
  • Function calls and execution times
  • Error conditions and recovery attempts

**Section sources**

  • device.rs

## Advanced CUDA Configuration

### Matrix Multiplication Precision Settings

The CUDA backend provides control over matrix multiplication precision:

```rust
// Control reduced precision for different data types
candle_core::cuda_backend::set_gemm_reduced_precision_f32(true);
candle_core::cuda_backend::set_gemm_reduced_precision_f16(true);
candle_core::cuda_backend::set_gemm_reduced_precision_bf16(true);

// Query current settings
let f32_reduced = candle_core::cuda_backend::gemm_reduced_precision_f32();
let f16_reduced = candle_core::cuda_backend::gemm_reduced_precision_f16();
let bf16_reduced = candle_core::cuda_backend::gemm_reduced_precision_bf16();
```

These settings control whether reduced precision arithmetic (like TF32 for F32 operations) is used, balancing performance and numerical accuracy.

### Custom Kernel Integration

The backend supports loading custom CUDA kernels:

```rust
let device = Device::new_cuda(0)?;
let ptx_code = include_str!("my_kernel.ptx");
let func = device.get_or_load_custom_func("my_kernel", "my_module", ptx_code)?;
```

This allows for specialized operations beyond the built-in functionality.

### Reference to candle-book Guides

For detailed information on low-level kernel configuration and optimization, refer to the candle-book CUDA guides:

  • Writing CUDA Kernels: Best practices for writing efficient CUDA kernels
  • Porting to CUDA: Guidelines for porting operations to CUDA
  • CUDA Inference: Optimizations for inference workloads

**Section sources**

  • mod.rs
  • device.rs

## Referenced Files in This Document

  • device.rs - Updated with auto device selection logic in commit b2e27e5
  • device.rs - Updated with device preference handling in commit b2e27e5
  • types.rs - Contains DevicePreference enum definition
  • device.rs - Core CUDA device implementation
  • device.rs - Base device implementation with CUDA initialization
  • utils.rs - Contains cuda_is_available and metal_is_available functions
  • README.md
  • writing.md
  • porting.md
  • inference.md
