

FerrisMind edited this page Sep 10, 2025 · 1 revision

# API Reference

## Update Summary

### Changes Made

- Added a new section on Precision Policy Configuration to document the new feature
- Updated the Model Loading Failures section to include precision policy considerations
- Enhanced the Error Message Reference with precision-policy-related information
- Updated the Performance Problems and Optimization section to include precision policy impacts
- Added new section sources reflecting recent code changes in the precision policy implementation
- Updated the decision tree to include precision-policy-related branches

## Table of Contents

1. Common Issues and Solutions
2. Error Message Reference
3. Performance Problems and Optimization
4. Platform-Specific Issues
5. Error Handling Patterns
6. Decision Tree for Diagnosing Model Loading Issues
7. Precision Policy Configuration

## Common Issues and Solutions

### Model Loading Failures

Model loading failures can occur due to corrupted files, insufficient RAM/VRAM, or incompatible file formats. The system supports GGUF and SafeTensors formats, which can be loaded from local paths or Hugging Face Hub.

**Corrupted Files** When a model file is corrupted, the application will fail during the loading phase. Ensure the integrity of downloaded files by verifying checksums or re-downloading from trusted sources.

**Insufficient RAM/VRAM** Loading large models requires substantial memory. For example, a 7B parameter model requires roughly 28 GB of RAM in FP32 format (about 14 GB in FP16). Use quantized versions (e.g., Q4_K_M) to reduce the memory footprint.
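The relationship between parameter count, precision, and memory can be sketched as follows. This is a rough back-of-the-envelope estimate (weights only, ignoring KV cache and activations); the bytes-per-parameter figure for Q4_K_M is an approximation.

```rust
// Rough memory-footprint estimate for model weights, assuming the common
// bytes-per-parameter values: F32 = 4.0, F16/BF16 = 2.0, Q4_K_M ≈ 0.56.
fn estimated_memory_gb(params_billions: f64, bytes_per_param: f64) -> f64 {
    params_billions * 1e9 * bytes_per_param / 1e9
}

fn main() {
    // A 7B model in different precisions:
    println!("FP32:   {:.1} GB", estimated_memory_gb(7.0, 4.0));  // ~28 GB
    println!("FP16:   {:.1} GB", estimated_memory_gb(7.0, 2.0));  // ~14 GB
    println!("Q4_K_M: {:.1} GB", estimated_memory_gb(7.0, 0.56)); // ~3.9 GB
}
```

This illustrates why a quantized Q4_K_M model fits on hardware where the FP32 version cannot load.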

```mermaid
flowchart TD
A[Start Model Load] --> B{File Format}
B --> |GGUF| C[Load from Local Path or HF Hub]
B --> |SafeTensors| D[Load from HF Hub]
C --> E{Memory Check}
D --> E
E --> |Sufficient| F[Load Model]
E --> |Insufficient| G[Display Error: Insufficient Memory]
F --> H[Success]
G --> I[Recommend Quantized Model]
```


**Updated** Added validation for repository identifier format and missing file handling in Hugging Face model loading.

**Section sources**
- [actions.ts](file://src/lib/chat/controller/actions.ts#L64-L89)
- [mod.rs](file://src-tauri/src/api/mod.rs#L255-L282)
- [hub_gguf.rs](file://src-tauri/src/api/model_loading/hub_gguf.rs#L24) - *Added repo_id format validation in commit d24451b*
- [hub_safetensors.rs](file://src-tauri/src/api/model_loading/hub_safetensors.rs#L19) - *Added repo_id format validation in commit d24451b*

### CUDA Initialization Errors
CUDA initialization errors typically arise when the GPU driver is outdated, the CUDA toolkit is not installed, or there is a version mismatch between the toolkit and the driver.

To diagnose CUDA issues:
1. Verify GPU compatibility with CUDA compute capability ≥ 5.0.
2. Check that `nvidia-smi` reports a valid driver version.
3. Ensure the CUDA toolkit version matches the driver requirements.

Use the `CUDA_LAUNCH_BLOCKING=1` environment variable to force synchronous kernel execution for better error tracing.
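Step 2 of the checklist above can be automated. The sketch below probes for an NVIDIA driver by invoking `nvidia-smi`; the helper name is illustrative, not part of the project's API.

```rust
use std::process::Command;

// Hypothetical helper: probe for an NVIDIA driver by invoking `nvidia-smi`.
// Returns None when the binary is missing or exits with an error, which is
// the usual symptom of a missing or broken driver installation.
fn detect_nvidia_driver() -> Option<String> {
    let output = Command::new("nvidia-smi")
        .arg("--query-gpu=driver_version")
        .arg("--format=csv,noheader")
        .output()
        .ok()?;
    if !output.status.success() {
        return None;
    }
    Some(String::from_utf8_lossy(&output.stdout).trim().to_string())
}

fn main() {
    match detect_nvidia_driver() {
        Some(version) => println!("driver version: {}", version),
        None => println!("no driver detected; CUDA backend unavailable"),
    }
}
```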

```mermaid
graph TD
A[CUDA Initialization] --> B{GPU Available}
B --> |No| C[Fail: No Compatible GPU]
B --> |Yes| D{Driver Installed}
D --> |No| E[Install NVIDIA Driver]
D --> |Yes| F{CUDA Toolkit Installed}
F --> |No| G[Install CUDA Toolkit]
F --> |Yes| H[Success]
```

**Section sources**
- device.rs
- error_manage.md

### Tokenizer Mismatches

Tokenizer mismatches occur when the tokenizer configuration does not align with the model architecture. This often happens when using custom models or modified tokenizers.

The system attempts to extract tokenizer information from GGUF metadata:

- Look for `tokenizer.json` embedded in the file
- Reconstruct from BPE merge rules if necessary
- Fall back to the default tokenizer if reconstruction fails

Special tokens like `<|im_start|>`, `<|im_end|>`, and `</s>` are automatically marked as special to prevent generation artifacts.
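The extraction order above amounts to a fallback chain. A minimal sketch, where the loader functions and metadata keys are illustrative stand-ins rather than the project's real API:

```rust
// Hypothetical sketch of the tokenizer fallback chain described above.
fn embedded_tokenizer_json(metadata: &[(&str, &str)]) -> Option<String> {
    // First choice: a tokenizer.json blob embedded in the GGUF metadata.
    metadata.iter()
        .find(|(k, _)| *k == "tokenizer.json")
        .map(|(_, v)| v.to_string())
}

fn reconstruct_from_bpe(metadata: &[(&str, &str)]) -> Option<String> {
    // Second choice: rebuild the tokenizer from BPE merge rules.
    metadata.iter()
        .find(|(k, _)| *k == "tokenizer.ggml.merges")
        .map(|_| "reconstructed-bpe".to_string())
}

fn resolve_tokenizer(metadata: &[(&str, &str)]) -> String {
    embedded_tokenizer_json(metadata)
        .or_else(|| reconstruct_from_bpe(metadata))
        .unwrap_or_else(|| "default-tokenizer".to_string()) // last resort
}

fn main() {
    let gguf_meta = [("tokenizer.ggml.merges", "a b")];
    println!("{}", resolve_tokenizer(&gguf_meta)); // reconstructed-bpe
}
```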

**Section sources**
- tokenizer.rs

### Generation Stalls

Generation may stall due to:

- Invalid sampling parameters (e.g., temperature ≤ 0 with min_p enabled)
- Repeat penalty misconfiguration
- EOS token detection failure
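A pre-flight check on sampling parameters can catch the first cause before generation starts. The exact bounds below are illustrative assumptions, not the project's documented limits:

```rust
// Illustrative validation of sampling parameters before starting generation.
fn validate_sampling(temperature: f64, min_p: Option<f64>) -> Result<(), String> {
    // temperature <= 0 combined with min_p is the stall-prone case above.
    if temperature <= 0.0 && min_p.is_some() {
        return Err("temperature must be > 0 when min_p sampling is enabled".into());
    }
    if let Some(p) = min_p {
        if !(0.0..=1.0).contains(&p) {
            return Err(format!("min_p must be in [0, 1], got {}", p));
        }
    }
    Ok(())
}

fn main() {
    assert!(validate_sampling(0.7, Some(0.05)).is_ok());
    assert!(validate_sampling(0.0, Some(0.05)).is_err()); // stall-prone combination
    assert!(validate_sampling(0.7, Some(1.5)).is_err());  // out-of-range min_p
}
```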

The generation loop includes safeguards:

- Cancellation via an atomic flag (`CANCEL_GENERATION`)
- Progress tracking with performance monitoring
- Automatic termination after repeated pad tokens
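The cancellation safeguard can be sketched with a shared atomic flag that the generation loop polls on every iteration. This is a minimal illustration in the spirit of `CANCEL_GENERATION`; the names are illustrative, not the project's real implementation.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

// Minimal sketch of cancellation via an atomic flag.
fn generate_tokens(cancel: &AtomicBool, max_tokens: usize) -> usize {
    let mut produced = 0;
    for _ in 0..max_tokens {
        if cancel.load(Ordering::Relaxed) {
            break; // stop promptly when the UI requests cancellation
        }
        produced += 1; // stand-in for sampling one token
    }
    produced
}

fn main() {
    let cancel = Arc::new(AtomicBool::new(false));
    cancel.store(true, Ordering::Relaxed); // simulate a cancel request
    let n = generate_tokens(&cancel, 1000);
    assert_eq!(n, 0); // loop exits before producing any tokens
}
```

In the real application the flag would be set from another thread (e.g., a Tauri command handler) while the generation loop runs.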

**Section sources**
- stream.rs
- minp.rs

## Error Message Reference

### Backend Error Codes (Rust)

| Error Code | Description | Diagnostic Steps |
|---|---|---|
| `ShapeMismatchBinaryOp` | Tensor dimensions incompatible for operation | Check input shapes; ensure proper reshaping |
| `CudaMemoryAllocation` | Failed to allocate GPU memory | Reduce batch size; close other GPU applications |
| `FileNotFound` | Model or tokenizer file not found | Verify path/URL; check network connectivity for HF Hub |
| `InvalidGgufFile` | Corrupted or unsupported GGUF structure | Re-download file; verify with gguf-inspect |
| `TokenizerDecodeError` | Failed to decode generated tokens | Check tokenizer compatibility; validate special tokens |
| `repo_id должен быть в формате 'owner/repo'` ("repo_id must be in 'owner/repo' format") | Invalid Hugging Face repository identifier format | Ensure repo_id follows the 'owner/repo' format (e.g., 'meta-llama/Llama-3-8B') |
| `В репозитории не найдены веса safetensors (model.safetensors[.index.json])` ("No safetensors weights found in the repository") | No SafeTensors weights found in repository | Verify the repository contains model.safetensors or model.safetensors.index.json |

Use `RUST_BACKTRACE=1` to obtain detailed stack traces for debugging. The backtrace will show the exact location of failure in the codebase.
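The repo_id validation error in the table above corresponds to a simple format check. A sketch of what such a check might look like; the real implementation in hub_gguf.rs may differ:

```rust
// Sketch of an 'owner/repo' format check behind the validation error above.
fn validate_repo_id(repo_id: &str) -> Result<(), String> {
    let parts: Vec<&str> = repo_id.split('/').collect();
    // Exactly two non-empty segments: "owner/repo".
    let ok = parts.len() == 2 && parts.iter().all(|p| !p.is_empty());
    if ok {
        Ok(())
    } else {
        Err(format!("repo_id must be in 'owner/repo' format, got '{}'", repo_id))
    }
}

fn main() {
    assert!(validate_repo_id("meta-llama/Llama-3-8B").is_ok());
    assert!(validate_repo_id("Llama-3-8B").is_err()); // missing owner segment
    assert!(validate_repo_id("a//b").is_err());       // empty middle segment
}
```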

**Updated** Added new error messages for Hugging Face repository validation.

**Section sources**
- error_manage.md
- error.rs
- hub_gguf.rs - *Added repo_id format validation in commit d24451b*
- hub_safetensors.rs - *Added missing weights validation in commit d24451b*

## Performance Problems and Optimization

### Slow Inference

Slow inference can result from:

- CPU fallback due to missing CUDA support
- Suboptimal batch processing
- Inefficient attention implementation

**Optimization Recommendations:**

1. **Enable CUDA**: Use GPU acceleration when available
2. **Batch Processing**: Process multiple sequences simultaneously
3. **Quantization**: Use GGUF quantized models (e.g., Q4_K_M) for faster inference
4. **Memory Mapping**: Load model weights directly from disk to reduce RAM usage

### High Memory Usage

High memory consumption occurs with:

- Full-precision models (FP32/FP16)
- Large context lengths
- Multiple concurrent generations

**Memory Optimization:**

- Use quantized models (INT4, INT8)
- Limit context length to the minimum required
- Implement proper resource cleanup with `unload_model()`
- Monitor memory with system tools (e.g., nvidia-smi, htop)

```mermaid
flowchart LR
A[High Memory Usage] --> B{Model Type}
B --> |Full Precision| C[Use Quantized Version]
B --> |Quantized| D{Context Length}
D --> |Large| E[Reduce Context]
D --> |Optimal| F{Concurrent Tasks}
F --> |Multiple| G[Limit Concurrent Generations]
F --> |Single| H[Monitor System Memory]
```


**Section sources**
- [stream.rs](file://src-tauri/src/generate/stream.rs#L52-L74)
- [minp.rs](file://src-tauri/src/generate/minp.rs#L0-L30)

## Platform-Specific Issues

### Windows
- **CUDA**: Ensure Visual Studio build tools are installed
- **File Paths**: Use forward slashes or escaped backslashes in paths
- **Antivirus**: Exclude model directories from real-time scanning

### macOS
- **Metal Backend**: Preferred over CUDA for Apple Silicon
- **Gatekeeper**: May block execution of downloaded binaries
- **Memory Limits**: System-enforced limits on GPU memory allocation

### Linux
- **CUDA**: Requires proper driver installation via package manager
- **Permissions**: Ensure user has access to `/dev/nvidia*` devices
- **Shared Libraries**: Install `libcudnn8` and dependencies

**Section sources**
- [whisper/README.md](file://example/candle-wasm-examples/whisper/README.md#L40-L68)

## Error Handling Patterns

The codebase uses `anyhow` for error management, providing rich context and backtraces. Key patterns include:

- **Contextual Errors**: Add descriptive context to low-level errors
- **Backtrace Capture**: Automatically capture stack traces in debug builds
- **User-Friendly Messages**: Convert technical errors to understandable messages

Example from `error_manage.md`:

```rust
let z = x.matmul(&y)?; // Fails with shape mismatch
// With RUST_BACKTRACE=1, shows exact location in source code
```


The `bt()` method appends backtrace information when enabled, helping pinpoint failure locations.
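The "contextual errors" pattern can be illustrated with the standard library alone. The project uses `anyhow`'s `.context(...)` for the same effect; the function and path below are hypothetical:

```rust
use std::fs;

// Std-only sketch of the "contextual errors" pattern: wrap a low-level
// error with a human-readable description of what was being attempted.
fn read_model_config(path: &str) -> Result<String, String> {
    fs::read_to_string(path)
        .map_err(|e| format!("failed to read model config at '{}': {}", path, e))
}

fn main() {
    match read_model_config("/nonexistent/config.json") {
        Ok(_) => println!("loaded"),
        // The message names the file being read, not just the raw OS error.
        Err(msg) => println!("{}", msg),
    }
}
```

The benefit is that the user-facing message identifies *what* operation failed, while the wrapped source error preserves the technical detail.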

**Updated** Enhanced with new validation error patterns from recent code changes.

**Section sources**
- [error_manage.md](file://example/candle-book/src/error_manage.md#L0-L51)
- [error.rs](file://example/candle-core/src/error.rs#L218-L266)
- [hub_gguf.rs](file://src-tauri/src/api/model_loading/hub_gguf.rs#L24) - *Added repo_id format validation in commit d24451b*
- [hub_safetensors.rs](file://src-tauri/src/api/model_loading/hub_safetensors.rs#L19) - *Added repo_id format validation in commit d24451b*

## Decision Tree for Diagnosing Model Loading Issues

```mermaid
flowchart TD
A[Model Loading Failed] --> B{Error Type}
B --> |File Not Found| C{Source}
C --> |Local Path| D[Verify Path Exists]
C --> |HF Hub| E[Check Internet Connection]
E --> F[Validate Repo ID and Filename]
F --> G{Repo ID Format}
G --> |Incorrect| H[Use 'owner/repo' format]
G --> |Correct| I[Check Repository Contents]
I --> J[Verify model file exists]
B --> |Corrupted File| K[Verify File Integrity]
K --> L[Re-download Model]
B --> |Memory Error| M{Available Memory}
M --> |Insufficient RAM| N[Use Smaller Model]
M --> |Insufficient VRAM| O[Enable CPU Offload]
B --> |Format Error| P{File Format}
P --> |GGUF| Q[Check GGUF Version Compatibility]
P --> |SafeTensors| R[Validate Tensor Shapes]
R --> S{Weights Found}
S --> |No| T[Check for model.safetensors or index.json]
S --> |Yes| U[Verify Model Architecture]
B --> |CUDA Error| V[Check CUDA Installation]
V --> W[Verify Driver Version]
W --> X[Match CUDA Toolkit]
B --> |Tokenizer Error| Y[Extract from Metadata]
Y --> Z[Try BPE Reconstruction]
Z --> AA[Use Default Tokenizer]
B --> |Precision Policy Error| AB[Check Current Policy]
AB --> AC[Verify Policy Compatibility with Model]
AC --> AD[Adjust Policy Settings]
D --> AE[Success]
H --> F
J --> AE
L --> AE
N --> AE
O --> AE
Q --> AE
U --> AE
X --> AE
AA --> AE
AD --> AE
```

This decision tree covers the most common model loading issues based on file format, quantization level, and hardware compatibility. Follow the branches corresponding to your specific error message to identify the root cause and solution.

**Updated** Added new branches for repository identifier validation, missing SafeTensors weights, and precision-policy-related issues.

**Diagram sources**
- actions.ts
- mod.rs
- tokenizer.rs
- hub_gguf.rs - *Added repo_id format validation in commit d24451b*
- hub_safetensors.rs - *Added missing weights validation in commit d24451b*
- precision.rs - *Added precision policy implementation*

## Precision Policy Configuration

The precision policy feature allows users to control the data type precision used during model loading and inference, affecting both memory consumption and computational performance.

### Available Precision Policies

1. **Default**: CPU=F32, GPU=BF16 (optimal balance)
2. **Memory Efficient**: CPU=F32, GPU=F16 (lower memory usage)
3. **Maximum Precision**: CPU=F32, GPU=F32 (highest accuracy)
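The policy-to-dtype mapping above can be sketched as a small enum and match. This is a hypothetical illustration; the real `PrecisionPolicy` enum lives in precision.rs and may differ in detail:

```rust
// Hypothetical sketch of the policy-to-dtype mapping described above.
#[derive(Clone, Copy, Debug, PartialEq)]
enum PrecisionPolicy {
    Default,
    MemoryEfficient,
    MaximumPrecision,
}

#[derive(Clone, Copy, Debug, PartialEq)]
enum DType { F32, F16, BF16 }

fn gpu_dtype(policy: PrecisionPolicy) -> DType {
    match policy {
        PrecisionPolicy::Default => DType::BF16,          // optimal balance
        PrecisionPolicy::MemoryEfficient => DType::F16,   // lower memory usage
        PrecisionPolicy::MaximumPrecision => DType::F32,  // highest accuracy
    }
}

fn cpu_dtype(_policy: PrecisionPolicy) -> DType {
    DType::F32 // all three policies keep F32 on CPU
}

fn main() {
    assert_eq!(gpu_dtype(PrecisionPolicy::Default), DType::BF16);
    assert_eq!(gpu_dtype(PrecisionPolicy::MemoryEfficient), DType::F16);
    assert_eq!(cpu_dtype(PrecisionPolicy::MaximumPrecision), DType::F32);
}
```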

### Implementation Details

The precision policy is implemented through:

- **Backend (Rust)**: The `PrecisionPolicy` enum in precision.rs defines the three policy options
- **State Management**: The `precision_policy` field in `ModelState` stores the current policy, with the default set to `PrecisionPolicy::Default`
- **Model Loading**: The `build_varbuilder_with_precision` function in weights.rs applies the selected policy when loading models
- **Tauri Commands**: The `get_precision_policy` and `set_precision_policy` commands in mod.rs allow external control of precision settings

### Usage

Users can access precision policy settings through the Settings page in the application. The selected policy is applied to all subsequent model loading operations, affecting:

- Memory consumption during model loading
- Inference performance
- Numerical precision of results

```mermaid
flowchart TD
A[User Interface] --> B[Settings Page]
B --> C{Select Policy}
C --> |Default| D[CPU=F32, GPU=BF16]
C --> |Memory Efficient| E[CPU=F32, GPU=F16]
C --> |Maximum Precision| F[CPU=F32, GPU=F32]
D --> G[Apply Policy]
E --> G
F --> G
G --> H[Store in ModelState]
H --> I[Use in build_varbuilder_with_precision]
I --> J[Load Model with Selected Precision]
```


**Section sources**
- [precision.rs](file://src-tauri/src/core/precision.rs#L10-L194) - *Precision policy implementation*
- [state.rs](file://src-tauri/src/core/state.rs#L22-L40) - *Application state with precision policy*
- [weights.rs](file://src-tauri/src/core/weights.rs#L201-L216) - *Weight loading with precision policy*
- [mod.rs](file://src-tauri/src/api/mod.rs#L133-L144) - *Tauri commands for precision policy*
- [+page.svelte](file://src/routes/settings/+page.svelte#L0-L271) - *Settings UI for precision policy*
- [types.ts](file://src/lib/types.ts#L1-L4) - *Frontend precision policy types*

**Referenced Files in This Document**   
- [error_manage.md](file://example/candle-book/src/error_manage.md) - *Updated error handling patterns*
- [device.rs](file://example/candle-core/src/cuda_backend/device.rs) - *CUDA initialization and device management*
- [actions.ts](file://src/lib/chat/controller/actions.ts) - *Frontend model loading logic*
- [mod.rs](file://src-tauri/src/api/mod.rs) - *API routing for model operations*
- [tokenizer.rs](file://src-tauri/src/core/tokenizer.rs) - *Tokenizer configuration and special token handling*
- [token_output_stream.rs](file://src-tauri/src/core/token_output_stream.rs) - *Token streaming and generation control*
- [stream.rs](file://src-tauri/src/generate/stream.rs) - *Generation loop and cancellation handling*
- [minp.rs](file://src-tauri/src/generate/minp.rs) - *MinP sampling parameter validation*
- [hub_gguf.rs](file://src-tauri/src/api/model_loading/hub_gguf.rs) - *Added repo_id format validation in commit d24451b*
- [hub_safetensors.rs](file://src-tauri/src/api/model_loading/hub_safetensors.rs) - *Added repo_id format validation in commit d24451b*
- [precision.rs](file://src-tauri/src/core/precision.rs) - *Precision policy implementation*
- [state.rs](file://src-tauri/src/core/state.rs) - *Application state with precision policy*
- [weights.rs](file://src-tauri/src/core/weights.rs) - *Weight loading with precision policy*
- [types.ts](file://src/lib/types.ts) - *Frontend precision policy types*
- [+page.svelte](file://src/routes/settings/+page.svelte) - *Settings UI for precision policy*
