

FerrisMind edited this page Sep 10, 2025 · 1 revision

# API Reference

## Update Summary

### Changes Made

- Added a new section on Precision Policy Configuration to document the new feature
- Updated the Model Loading Failures section to include precision policy considerations
- Enhanced the Error Message Reference with precision-policy-related information
- Updated the Performance Problems and Optimization section to include precision policy impacts
- Added new section sources reflecting recent code changes in the precision policy implementation
- Updated the decision tree to include precision-policy-related branches

## Table of Contents

1. Common Issues and Solutions
2. Error Message Reference
3. Performance Problems and Optimization
4. Platform-Specific Issues
5. Error Handling Patterns
6. Decision Tree for Diagnosing Model Loading Issues
7. Precision Policy Configuration

## Common Issues and Solutions

### Model Loading Failures

Model loading failures can occur due to corrupted files, insufficient RAM/VRAM, or incompatible file formats. The system supports GGUF and SafeTensors formats, which can be loaded from local paths or Hugging Face Hub.

**Corrupted Files** When a model file is corrupted, the application will fail during the loading phase. Ensure the integrity of downloaded files by verifying checksums or re-downloading from trusted sources.

**Insufficient RAM/VRAM** Loading large models requires substantial memory. For example, a 7B parameter model requires roughly 28 GB of RAM in FP32 format (about 14 GB in FP16). Use quantized versions (e.g., Q4_K_M) to reduce the memory footprint.
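The relationship between parameter count, precision, and memory can be sketched as follows. This is a rough back-of-the-envelope estimate (weights only, ignoring KV cache and activations); the bytes-per-parameter figure for Q4_K_M is an approximation.

```rust
// Rough memory-footprint estimate for model weights, assuming the common
// bytes-per-parameter values: F32 = 4.0, F16/BF16 = 2.0, Q4_K_M ≈ 0.56.
fn estimated_memory_gb(params_billions: f64, bytes_per_param: f64) -> f64 {
    params_billions * 1e9 * bytes_per_param / 1e9
}

fn main() {
    // A 7B model in different precisions:
    println!("FP32:   {:.1} GB", estimated_memory_gb(7.0, 4.0));  // ~28 GB
    println!("FP16:   {:.1} GB", estimated_memory_gb(7.0, 2.0));  // ~14 GB
    println!("Q4_K_M: {:.1} GB", estimated_memory_gb(7.0, 0.56)); // ~3.9 GB
}
```

This illustrates why a quantized Q4_K_M model fits on hardware where the FP32 version cannot load.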

```mermaid
flowchart TD
A[Start Model Load] --> B{File Format}
B --> |GGUF| C[Load from Local Path or HF Hub]
B --> |SafeTensors| D[Load from HF Hub]
C --> E{Memory Check}
D --> E
E --> |Sufficient| F[Load Model]
E --> |Insufficient| G[Display Error: Insufficient Memory]
F --> H[Success]
G --> I[Recommend Quantized Model]
```


**Updated** Added validation for repository identifier format and missing file handling in Hugging Face model loading.

**Section sources**
- [actions.ts](file://src/lib/chat/controller/actions.ts#L64-L89)
- [mod.rs](file://src-tauri/src/api/mod.rs#L255-L282)
- [hub_gguf.rs](file://src-tauri/src/api/model_loading/hub_gguf.rs#L24) - *Added repo_id format validation in commit d24451b*
- [hub_safetensors.rs](file://src-tauri/src/api/model_loading/hub_safetensors.rs#L19) - *Added repo_id format validation in commit d24451b*

### CUDA Initialization Errors
CUDA initialization errors typically arise when the GPU driver is outdated, the CUDA toolkit is not installed, or there is a version mismatch between the toolkit and the driver.

To diagnose CUDA issues:
1. Verify GPU compatibility with CUDA compute capability ≥ 5.0.
2. Check that `nvidia-smi` reports a valid driver version.
3. Ensure the CUDA toolkit version matches the driver requirements.

Use the `CUDA_LAUNCH_BLOCKING=1` environment variable to force synchronous kernel execution for better error tracing.
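Step 2 of the checklist above can be automated. The sketch below probes for an NVIDIA driver by invoking `nvidia-smi`; the helper name is illustrative, not part of the project's API.

```rust
use std::process::Command;

// Hypothetical helper: probe for an NVIDIA driver by invoking `nvidia-smi`.
// Returns None when the binary is missing or exits with an error, which is
// the usual symptom of a missing or broken driver installation.
fn detect_nvidia_driver() -> Option<String> {
    let output = Command::new("nvidia-smi")
        .arg("--query-gpu=driver_version")
        .arg("--format=csv,noheader")
        .output()
        .ok()?;
    if !output.status.success() {
        return None;
    }
    Some(String::from_utf8_lossy(&output.stdout).trim().to_string())
}

fn main() {
    match detect_nvidia_driver() {
        Some(version) => println!("driver version: {}", version),
        None => println!("no driver detected; CUDA backend unavailable"),
    }
}
```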

```mermaid
graph TD
A[CUDA Initialization] --> B{GPU Available}
B --> |No| C[Fail: No Compatible GPU]
B --> |Yes| D{Driver Installed}
D --> |No| E[Install NVIDIA Driver]
D --> |Yes| F{CUDA Toolkit Installed}
F --> |No| G[Install CUDA Toolkit]
F --> |Yes| H[Success]
```

**Section sources**
- device.rs
- error_manage.md

### Tokenizer Mismatches

Tokenizer mismatches occur when the tokenizer configuration does not align with the model architecture. This often happens when using custom models or modified tokenizers.

The system attempts to extract tokenizer information from GGUF metadata:

- Look for `tokenizer.json` embedded in the file
- Reconstruct from BPE merge rules if necessary
- Fall back to the default tokenizer if reconstruction fails

Special tokens like `<|im_start|>`, `<|im_end|>`, and `</s>` are automatically marked as special to prevent generation artifacts.
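The extraction order above amounts to a fallback chain. A minimal sketch, where the loader functions and metadata keys are illustrative stand-ins rather than the project's real API:

```rust
// Hypothetical sketch of the tokenizer fallback chain described above.
fn embedded_tokenizer_json(metadata: &[(&str, &str)]) -> Option<String> {
    // First choice: a tokenizer.json blob embedded in the GGUF metadata.
    metadata.iter()
        .find(|(k, _)| *k == "tokenizer.json")
        .map(|(_, v)| v.to_string())
}

fn reconstruct_from_bpe(metadata: &[(&str, &str)]) -> Option<String> {
    // Second choice: rebuild the tokenizer from BPE merge rules.
    metadata.iter()
        .find(|(k, _)| *k == "tokenizer.ggml.merges")
        .map(|_| "reconstructed-bpe".to_string())
}

fn resolve_tokenizer(metadata: &[(&str, &str)]) -> String {
    embedded_tokenizer_json(metadata)
        .or_else(|| reconstruct_from_bpe(metadata))
        .unwrap_or_else(|| "default-tokenizer".to_string()) // last resort
}

fn main() {
    let gguf_meta = [("tokenizer.ggml.merges", "a b")];
    println!("{}", resolve_tokenizer(&gguf_meta)); // reconstructed-bpe
}
```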

**Section sources**
- tokenizer.rs

### Generation Stalls

Generation may stall due to:

- Invalid sampling parameters (e.g., temperature ≤ 0 with min_p enabled)
- Repeat penalty misconfiguration
- EOS token detection failure
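A pre-flight check on sampling parameters can catch the first cause before generation starts. The exact bounds below are illustrative assumptions, not the project's documented limits:

```rust
// Illustrative validation of sampling parameters before starting generation.
fn validate_sampling(temperature: f64, min_p: Option<f64>) -> Result<(), String> {
    // temperature <= 0 combined with min_p is the stall-prone case above.
    if temperature <= 0.0 && min_p.is_some() {
        return Err("temperature must be > 0 when min_p sampling is enabled".into());
    }
    if let Some(p) = min_p {
        if !(0.0..=1.0).contains(&p) {
            return Err(format!("min_p must be in [0, 1], got {}", p));
        }
    }
    Ok(())
}

fn main() {
    assert!(validate_sampling(0.7, Some(0.05)).is_ok());
    assert!(validate_sampling(0.0, Some(0.05)).is_err()); // stall-prone combination
    assert!(validate_sampling(0.7, Some(1.5)).is_err());  // out-of-range min_p
}
```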

The generation loop includes safeguards:

- Cancellation via an atomic flag (`CANCEL_GENERATION`)
- Progress tracking with performance monitoring
- Automatic termination after repeated pad tokens
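The cancellation safeguard can be sketched with a shared atomic flag that the generation loop polls on every iteration. This is a minimal illustration in the spirit of `CANCEL_GENERATION`; the names are illustrative, not the project's real implementation.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

// Minimal sketch of cancellation via an atomic flag.
fn generate_tokens(cancel: &AtomicBool, max_tokens: usize) -> usize {
    let mut produced = 0;
    for _ in 0..max_tokens {
        if cancel.load(Ordering::Relaxed) {
            break; // stop promptly when the UI requests cancellation
        }
        produced += 1; // stand-in for sampling one token
    }
    produced
}

fn main() {
    let cancel = Arc::new(AtomicBool::new(false));
    cancel.store(true, Ordering::Relaxed); // simulate a cancel request
    let n = generate_tokens(&cancel, 1000);
    assert_eq!(n, 0); // loop exits before producing any tokens
}
```

In the real application the flag would be set from another thread (e.g., a Tauri command handler) while the generation loop runs.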

**Section sources**
- stream.rs
- minp.rs

## Error Message Reference

### Backend Error Codes (Rust)

| Error Code | Description | Diagnostic Steps |
|---|---|---|
| `ShapeMismatchBinaryOp` | Tensor dimensions incompatible for operation | Check input shapes; ensure proper reshaping |
| `CudaMemoryAllocation` | Failed to allocate GPU memory | Reduce batch size; close other GPU applications |
| `FileNotFound` | Model or tokenizer file not found | Verify path/URL; check network connectivity for HF Hub |
| `InvalidGgufFile` | Corrupted or unsupported GGUF structure | Re-download file; verify with gguf-inspect |
| `TokenizerDecodeError` | Failed to decode generated tokens | Check tokenizer compatibility; validate special tokens |
| `repo_id должен быть в формате 'owner/repo'` ("repo_id must be in 'owner/repo' format") | Invalid Hugging Face repository identifier format | Ensure repo_id follows the 'owner/repo' format (e.g., 'meta-llama/Llama-3-8B') |
| `В репозитории не найдены веса safetensors (model.safetensors[.index.json])` ("No safetensors weights found in the repository") | No SafeTensors weights found in repository | Verify the repository contains model.safetensors or model.safetensors.index.json |

Use `RUST_BACKTRACE=1` to obtain detailed stack traces for debugging. The backtrace will show the exact location of failure in the codebase.
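The repo_id validation error in the table above corresponds to a simple format check. A sketch of what such a check might look like; the real implementation in hub_gguf.rs may differ:

```rust
// Sketch of an 'owner/repo' format check behind the validation error above.
fn validate_repo_id(repo_id: &str) -> Result<(), String> {
    let parts: Vec<&str> = repo_id.split('/').collect();
    // Exactly two non-empty segments: "owner/repo".
    let ok = parts.len() == 2 && parts.iter().all(|p| !p.is_empty());
    if ok {
        Ok(())
    } else {
        Err(format!("repo_id must be in 'owner/repo' format, got '{}'", repo_id))
    }
}

fn main() {
    assert!(validate_repo_id("meta-llama/Llama-3-8B").is_ok());
    assert!(validate_repo_id("Llama-3-8B").is_err()); // missing owner segment
    assert!(validate_repo_id("a//b").is_err());       // empty middle segment
}
```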

**Updated** Added new error messages for Hugging Face repository validation.

**Section sources**
- error_manage.md
- error.rs
- hub_gguf.rs - *Added repo_id format validation in commit d24451b*
- hub_safetensors.rs - *Added missing weights validation in commit d24451b*

## Performance Problems and Optimization

### Slow Inference

Slow inference can result from:

- CPU fallback due to missing CUDA support
- Suboptimal batch processing
- Inefficient attention implementation

**Optimization Recommendations:**

1. **Enable CUDA**: Use GPU acceleration when available
2. **Batch Processing**: Process multiple sequences simultaneously
3. **Quantization**: Use GGUF quantized models (e.g., Q4_K_M) for faster inference
4. **Memory Mapping**: Load model weights directly from disk to reduce RAM usage

### High Memory Usage

High memory consumption occurs with:

- Full-precision models (FP32/FP16)
- Large context lengths
- Multiple concurrent generations

**Memory Optimization:**

- Use quantized models (INT4, INT8)
- Limit context length to the minimum required
- Implement proper resource cleanup with `unload_model()`
- Monitor memory with system tools (e.g., nvidia-smi, htop)

```mermaid
flowchart LR
A[High Memory Usage] --> B{Model Type}
B --> |Full Precision| C[Use Quantized Version]
B --> |Quantized| D{Context Length}
D --> |Large| E[Reduce Context]
D --> |Optimal| F{Concurrent Tasks}
F --> |Multiple| G[Limit Concurrent Generations]
F --> |Single| H[Monitor System Memory]
```


**Section sources**
- [stream.rs](file://src-tauri/src/generate/stream.rs#L52-L74)
- [minp.rs](file://src-tauri/src/generate/minp.rs#L0-L30)

## Platform-Specific Issues

### Windows
- **CUDA**: Ensure Visual Studio build tools are installed
- **File Paths**: Use forward slashes or escaped backslashes in paths
- **Antivirus**: Exclude model directories from real-time scanning

### macOS
- **Metal Backend**: Preferred over CUDA for Apple Silicon
- **Gatekeeper**: May block execution of downloaded binaries
- **Memory Limits**: System-enforced limits on GPU memory allocation

### Linux
- **CUDA**: Requires proper driver installation via package manager
- **Permissions**: Ensure user has access to `/dev/nvidia*` devices
- **Shared Libraries**: Install `libcudnn8` and dependencies

**Section sources**
- [whisper/README.md](file://example/candle-wasm-examples/whisper/README.md#L40-L68)

## Error Handling Patterns

The codebase uses `anyhow` for error management, providing rich context and backtraces. Key patterns include:

- **Contextual Errors**: Add descriptive context to low-level errors
- **Backtrace Capture**: Automatically capture stack traces in debug builds
- **User-Friendly Messages**: Convert technical errors to understandable messages

Example from `error_manage.md`:

```rust
let z = x.matmul(&y)?; // Fails with shape mismatch
// With RUST_BACKTRACE=1, shows exact location in source code
```


The `bt()` method appends backtrace information when enabled, helping pinpoint failure locations.
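The "contextual errors" pattern can be illustrated with the standard library alone. The project uses `anyhow`'s `.context(...)` for the same effect; the function and path below are hypothetical:

```rust
use std::fs;

// Std-only sketch of the "contextual errors" pattern: wrap a low-level
// error with a human-readable description of what was being attempted.
fn read_model_config(path: &str) -> Result<String, String> {
    fs::read_to_string(path)
        .map_err(|e| format!("failed to read model config at '{}': {}", path, e))
}

fn main() {
    match read_model_config("/nonexistent/config.json") {
        Ok(_) => println!("loaded"),
        // The message names the file being read, not just the raw OS error.
        Err(msg) => println!("{}", msg),
    }
}
```

The benefit is that the user-facing message identifies *what* operation failed, while the wrapped source error preserves the technical detail.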

**Updated** Enhanced with new validation error patterns from recent code changes.

**Section sources**
- [error_manage.md](file://example/candle-book/src/error_manage.md#L0-L51)
- [error.rs](file://example/candle-core/src/error.rs#L218-L266)
- [hub_gguf.rs](file://src-tauri/src/api/model_loading/hub_gguf.rs#L24) - *Added repo_id format validation in commit d24451b*
- [hub_safetensors.rs](file://src-tauri/src/api/model_loading/hub_safetensors.rs#L19) - *Added repo_id format validation in commit d24451b*

## Decision Tree for Diagnosing Model Loading Issues

```mermaid
flowchart TD
A[Model Loading Failed] --> B{Error Type}
B --> |File Not Found| C{Source}
C --> |Local Path| D[Verify Path Exists]
C --> |HF Hub| E[Check Internet Connection]
E --> F[Validate Repo ID and Filename]
F --> G{Repo ID Format}
G --> |Incorrect| H[Use 'owner/repo' format]
G --> |Correct| I[Check Repository Contents]
I --> J[Verify model file exists]
B --> |Corrupted File| K[Verify File Integrity]
K --> L[Re-download Model]
B --> |Memory Error| M{Available Memory}
M --> |Insufficient RAM| N[Use Smaller Model]
M --> |Insufficient VRAM| O[Enable CPU Offload]
B --> |Format Error| P{File Format}
P --> |GGUF| Q[Check GGUF Version Compatibility]
P --> |SafeTensors| R[Validate Tensor Shapes]
R --> S{Weights Found}
S --> |No| T[Check for model.safetensors or index.json]
S --> |Yes| U[Verify Model Architecture]
B --> |CUDA Error| V[Check CUDA Installation]
V --> W[Verify Driver Version]
W --> X[Match CUDA Toolkit]
B --> |Tokenizer Error| Y[Extract from Metadata]
Y --> Z[Try BPE Reconstruction]
Z --> AA[Use Default Tokenizer]
B --> |Precision Policy Error| AB[Check Current Policy]
AB --> AC[Verify Policy Compatibility with Model]
AC --> AD[Adjust Policy Settings]
D --> AE[Success]
H --> F
J --> AE
L --> AE
N --> AE
O --> AE
Q --> AE
U --> AE
X --> AE
AA --> AE
AD --> AE
```

This decision tree covers the most common model loading issues based on file format, quantization level, and hardware compatibility. Follow the branches corresponding to your specific error message to identify the root cause and solution.

**Updated** Added new branches for repository identifier validation, missing SafeTensors weights, and precision-policy-related issues.

**Diagram sources**
- actions.ts
- mod.rs
- tokenizer.rs
- hub_gguf.rs - *Added repo_id format validation in commit d24451b*
- hub_safetensors.rs - *Added missing weights validation in commit d24451b*
- precision.rs - *Added precision policy implementation*

## Precision Policy Configuration

The precision policy feature allows users to control the data type precision used during model loading and inference, affecting both memory consumption and computational performance.

### Available Precision Policies

1. **Default**: CPU=F32, GPU=BF16 (optimal balance)
2. **Memory Efficient**: CPU=F32, GPU=F16 (lower memory usage)
3. **Maximum Precision**: CPU=F32, GPU=F32 (highest accuracy)
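The policy-to-dtype mapping above can be sketched as a small enum and match. This is a hypothetical illustration; the real `PrecisionPolicy` enum lives in precision.rs and may differ in detail:

```rust
// Hypothetical sketch of the policy-to-dtype mapping described above.
#[derive(Clone, Copy, Debug, PartialEq)]
enum PrecisionPolicy {
    Default,
    MemoryEfficient,
    MaximumPrecision,
}

#[derive(Clone, Copy, Debug, PartialEq)]
enum DType { F32, F16, BF16 }

fn gpu_dtype(policy: PrecisionPolicy) -> DType {
    match policy {
        PrecisionPolicy::Default => DType::BF16,          // optimal balance
        PrecisionPolicy::MemoryEfficient => DType::F16,   // lower memory usage
        PrecisionPolicy::MaximumPrecision => DType::F32,  // highest accuracy
    }
}

fn cpu_dtype(_policy: PrecisionPolicy) -> DType {
    DType::F32 // all three policies keep F32 on CPU
}

fn main() {
    assert_eq!(gpu_dtype(PrecisionPolicy::Default), DType::BF16);
    assert_eq!(gpu_dtype(PrecisionPolicy::MemoryEfficient), DType::F16);
    assert_eq!(cpu_dtype(PrecisionPolicy::MaximumPrecision), DType::F32);
}
```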

### Implementation Details

The precision policy is implemented through:

- **Backend (Rust)**: The `PrecisionPolicy` enum in precision.rs defines the three policy options
- **State Management**: The `precision_policy` field in `ModelState` stores the current policy, with the default set to `PrecisionPolicy::Default`
- **Model Loading**: The `build_varbuilder_with_precision` function in weights.rs applies the selected policy when loading models
- **Tauri Commands**: The `get_precision_policy` and `set_precision_policy` commands in mod.rs allow external control of precision settings

### Usage

Users can access precision policy settings through the Settings page in the application. The selected policy is applied to all subsequent model loading operations, affecting:

- Memory consumption during model loading
- Inference performance
- Numerical precision of results

```mermaid
flowchart TD
A[User Interface] --> B[Settings Page]
B --> C{Select Policy}
C --> |Default| D[CPU=F32, GPU=BF16]
C --> |Memory Efficient| E[CPU=F32, GPU=F16]
C --> |Maximum Precision| F[CPU=F32, GPU=F32]
D --> G[Apply Policy]
E --> G
F --> G
G --> H[Store in ModelState]
H --> I[Use in build_varbuilder_with_precision]
I --> J[Load Model with Selected Precision]
```


**Section sources**
- [precision.rs](file://src-tauri/src/core/precision.rs#L10-L194) - *Precision policy implementation*
- [state.rs](file://src-tauri/src/core/state.rs#L22-L40) - *Application state with precision policy*
- [weights.rs](file://src-tauri/src/core/weights.rs#L201-L216) - *Weight loading with precision policy*
- [mod.rs](file://src-tauri/src/api/mod.rs#L133-L144) - *Tauri commands for precision policy*
- [+page.svelte](file://src/routes/settings/+page.svelte#L0-L271) - *Settings UI for precision policy*
- [types.ts](file://src/lib/types.ts#L1-L4) - *Frontend precision policy types*

**Referenced Files in This Document**   
- [error_manage.md](file://example/candle-book/src/error_manage.md) - *Updated error handling patterns*
- [device.rs](file://example/candle-core/src/cuda_backend/device.rs) - *CUDA initialization and device management*
- [actions.ts](file://src/lib/chat/controller/actions.ts) - *Frontend model loading logic*
- [mod.rs](file://src-tauri/src/api/mod.rs) - *API routing for model operations*
- [tokenizer.rs](file://src-tauri/src/core/tokenizer.rs) - *Tokenizer configuration and special token handling*
- [token_output_stream.rs](file://src-tauri/src/core/token_output_stream.rs) - *Token streaming and generation control*
- [stream.rs](file://src-tauri/src/generate/stream.rs) - *Generation loop and cancellation handling*
- [minp.rs](file://src-tauri/src/generate/minp.rs) - *MinP sampling parameter validation*
- [hub_gguf.rs](file://src-tauri/src/api/model_loading/hub_gguf.rs) - *Added repo_id format validation in commit d24451b*
- [hub_safetensors.rs](file://src-tauri/src/api/model_loading/hub_safetensors.rs) - *Added repo_id format validation in commit d24451b*
- [precision.rs](file://src-tauri/src/core/precision.rs) - *Precision policy implementation*
- [state.rs](file://src-tauri/src/core/state.rs) - *Application state with precision policy*
- [weights.rs](file://src-tauri/src/core/weights.rs) - *Weight loading with precision policy*
- [types.ts](file://src/lib/types.ts) - *Frontend precision policy types*
- [+page.svelte](file://src/routes/settings/+page.svelte) - *Settings UI for precision policy*
