# 25. API Reference

**Changes Made**
- Added new section on Precision Policy Configuration to document the new feature
- Updated Model Loading Failures section to include precision policy considerations
- Enhanced Error Message Reference with precision policy related information
- Updated Performance Problems and Optimization section to include precision policy impacts
- Added new section sources reflecting recent code changes in precision policy implementation
- Updated decision tree to include precision policy related branches
**Table of Contents**

- Common Issues and Solutions
- Error Message Reference
- Performance Problems and Optimization
- Platform-Specific Issues
- Error Handling Patterns
- Decision Tree for Diagnosing Model Loading Issues
- Precision Policy Configuration
## Common Issues and Solutions

### Model Loading Failures

Model loading failures can occur due to corrupted files, insufficient RAM/VRAM, or incompatible file formats. The system supports GGUF and SafeTensors formats, which can be loaded from local paths or Hugging Face Hub.
**Corrupted Files**: When a model file is corrupted, the application will fail during the loading phase. Ensure the integrity of downloaded files by verifying checksums or re-downloading from trusted sources.
**Insufficient RAM/VRAM**: Loading large models requires substantial memory. For example, a 7B-parameter model requires roughly 28 GB of RAM in FP32 format (4 bytes per parameter), or about 14 GB in FP16. Use quantized versions (e.g., Q4_K_M) to reduce the memory footprint.
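As a rule of thumb, the weight footprint is parameter count × bytes per parameter, so FP32 puts a 7B model near 28 GB and FP16 halves that to ~14 GB. A quick sketch (the ~0.56 bytes/parameter figure for Q4_K_M is an approximation, not a value from this codebase):

```rust
// Rough weight-memory estimate: parameter count × bytes per parameter.
fn estimated_weight_gb(params: u64, bytes_per_param: f64) -> f64 {
    params as f64 * bytes_per_param / 1e9
}

fn main() {
    let params = 7_000_000_000u64; // a 7B model
    // FP32 = 4 bytes, FP16/BF16 = 2 bytes, Q4_K_M ≈ 0.56 bytes (approximate)
    for (name, bpp) in [("FP32", 4.0), ("FP16", 2.0), ("Q4_K_M", 0.56)] {
        println!("{name}: ~{:.1} GB", estimated_weight_gb(params, bpp));
    }
}
```

This ignores KV-cache and activation memory, so treat it as a lower bound when checking whether a model fits.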
```mermaid
flowchart TD
A[Start Model Load] --> B{File Format}
B --> |GGUF| C[Load from Local Path or HF Hub]
B --> |SafeTensors| D[Load from HF Hub]
C --> E{Memory Check}
D --> E
E --> |Sufficient| F[Load Model]
E --> |Insufficient| G[Display Error: Insufficient Memory]
F --> H[Success]
G --> I[Recommend Quantized Model]
```
**Updated** Added validation for repository identifier format and missing file handling in Hugging Face model loading.
**Section sources**
- [actions.ts](file://src/lib/chat/controller/actions.ts#L64-L89)
- [mod.rs](file://src-tauri/src/api/mod.rs#L255-L282)
- [hub_gguf.rs](file://src-tauri/src/api/model_loading/hub_gguf.rs#L24) - *Added repo_id format validation in commit d24451b*
- [hub_safetensors.rs](file://src-tauri/src/api/model_loading/hub_safetensors.rs#L19) - *Added repo_id format validation in commit d24451b*
### CUDA Initialization Errors
CUDA initialization errors typically arise when the GPU driver is outdated, CUDA toolkit is not installed, or there is a version mismatch between the toolkit and driver.
To diagnose CUDA issues:
1. Verify GPU compatibility with CUDA compute capability ≥ 5.0.
2. Check that `nvidia-smi` reports a valid driver version.
3. Ensure the CUDA toolkit version matches the driver requirements.
Use the `CUDA_LAUNCH_BLOCKING=1` environment variable to force synchronous kernel execution for better error tracing.
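Step 1's compute-capability requirement reduces to a lexicographic version comparison; a hypothetical helper (not from this codebase):

```rust
// Minimum CUDA compute capability required (major, minor).
const MIN_COMPUTE: (u32, u32) = (5, 0);

// Tuple comparison is lexicographic: major version first, then minor.
fn meets_min_compute(major: u32, minor: u32) -> bool {
    (major, minor) >= MIN_COMPUTE
}

fn main() {
    assert!(meets_min_compute(8, 6));  // e.g. an Ampere-class GPU
    assert!(!meets_min_compute(3, 7)); // e.g. a Kepler-era GPU
    println!("compute capability checks passed");
}
```

The major/minor pair itself comes from `nvidia-smi --query-gpu=compute_cap` or the device properties reported by the CUDA runtime.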
```mermaid
graph TD
A[CUDA Initialization] --> B{GPU Available}
B --> |No| C[Fail: No Compatible GPU]
B --> |Yes| D{Driver Installed}
D --> |No| E[Install NVIDIA Driver]
D --> |Yes| F{CUDA Toolkit Installed}
F --> |No| G[Install CUDA Toolkit]
F --> |Yes| H[Success]
```
**Section sources**
- [device.rs](file://example/candle-core/src/cuda_backend/device.rs)
- [error_manage.md](file://example/candle-book/src/error_manage.md)
### Tokenizer Mismatches

Tokenizer mismatches occur when the tokenizer configuration does not align with the model architecture. This often happens when using custom models or modified tokenizers.
The system attempts to extract tokenizer information from GGUF metadata:
- Look for `tokenizer.json` embedded in the file
- Reconstruct from BPE merge rules if necessary
- Fall back to the default tokenizer if reconstruction fails
Special tokens like `<|im_start|>`, `<|im_end|>`, and `</s>` are automatically marked as special to prevent generation artifacts.
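Because these markers are flagged as special, they can be filtered from decoded text; a minimal illustrative helper (hypothetical, not the actual tokenizer code):

```rust
// Chat-template markers that should never appear in user-visible output.
const SPECIAL_TOKENS: &[&str] = &["<|im_start|>", "<|im_end|>", "</s>"];

// Remove special-token markers from decoded text.
fn strip_special_tokens(text: &str) -> String {
    let mut out = text.to_string();
    for tok in SPECIAL_TOKENS {
        out = out.replace(tok, "");
    }
    out
}

fn main() {
    let raw = "<|im_start|>Hello!<|im_end|>";
    println!("{}", strip_special_tokens(raw)); // prints "Hello!"
}
```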
**Section sources**
- [tokenizer.rs](file://src-tauri/src/core/tokenizer.rs)
### Generation Stalls

Generation may stall due to:
- Invalid sampling parameters (e.g., temperature ≤ 0 with min_p enabled)
- Repeat penalty misconfiguration
- EOS token detection failure
The generation loop includes safeguards:
- Cancellation via atomic flag (`CANCEL_GENERATION`)
- Progress tracking with performance monitoring
- Automatic termination after repeated pad tokens
**Section sources**
- [stream.rs](file://src-tauri/src/generate/stream.rs)
- [minp.rs](file://src-tauri/src/generate/minp.rs)
## Error Message Reference

| Error Code | Description | Diagnostic Steps |
|---|---|---|
| `ShapeMismatchBinaryOp` | Tensor dimensions incompatible for operation | Check input shapes; ensure proper reshaping |
| `CudaMemoryAllocation` | Failed to allocate GPU memory | Reduce batch size; close other GPU applications |
| `FileNotFound` | Model or tokenizer file not found | Verify path/URL; check network connectivity for HF Hub |
| `InvalidGgufFile` | Corrupted or unsupported GGUF structure | Re-download file; verify with `gguf-inspect` |
| `TokenizerDecodeError` | Failed to decode generated tokens | Check tokenizer compatibility; validate special tokens |
| `repo_id должен быть в формате 'owner/repo'` | Invalid Hugging Face repository identifier format (message: "repo_id must be in 'owner/repo' format") | Ensure `repo_id` follows the `owner/repo` format (e.g., `meta-llama/Llama-3-8B`) |
| `В репозитории не найдены веса safetensors (model.safetensors[.index.json])` | No SafeTensors weights found in repository (message: "no safetensors weights found in the repository") | Verify the repository contains `model.safetensors` or `model.safetensors.index.json` |
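The repo_id format error above reduces to an "owner/repo" shape test; a hypothetical re-creation (the actual validation lives in `hub_gguf.rs`/`hub_safetensors.rs`):

```rust
// A repo_id is valid when it has exactly two non-empty, '/'-separated parts.
fn is_valid_repo_id(repo_id: &str) -> bool {
    let mut parts = repo_id.split('/');
    match (parts.next(), parts.next(), parts.next()) {
        (Some(owner), Some(repo), None) => !owner.is_empty() && !repo.is_empty(),
        _ => false,
    }
}

fn main() {
    assert!(is_valid_repo_id("meta-llama/Llama-3-8B"));
    assert!(!is_valid_repo_id("Llama-3-8B")); // missing owner segment
    assert!(!is_valid_repo_id("a/b/c"));      // too many segments
    println!("repo_id validation checks passed");
}
```

Running this check before contacting the Hub turns a confusing download failure into an immediate, actionable error.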
Use `RUST_BACKTRACE=1` to obtain detailed stack traces for debugging. The backtrace will show the exact location of the failure in the codebase.
**Updated** Added new error messages for Hugging Face repository validation.
**Section sources**
- [error_manage.md](file://example/candle-book/src/error_manage.md)
- [error.rs](file://example/candle-core/src/error.rs)
- [hub_gguf.rs](file://src-tauri/src/api/model_loading/hub_gguf.rs) - *Added repo_id format validation in commit d24451b*
- [hub_safetensors.rs](file://src-tauri/src/api/model_loading/hub_safetensors.rs) - *Added missing weights validation in commit d24451b*
## Performance Problems and Optimization

### Slow Inference

Slow inference can result from:
- CPU fallback due to missing CUDA support
- Suboptimal batch processing
- Inefficient attention implementation
**Optimization Recommendations:**
- **Enable CUDA**: Use GPU acceleration when available
- **Batch Processing**: Process multiple sequences simultaneously
- **Quantization**: Use GGUF quantized models (e.g., Q4_K_M) for faster inference
- **Memory Mapping**: Load model weights directly from disk to reduce RAM usage
### High Memory Usage

High memory consumption occurs with:
- Full-precision models (FP32/FP16)
- Large context lengths
- Multiple concurrent generations
**Memory Optimization:**
- Use quantized models (INT4, INT8)
- Limit context length to the minimum required
- Implement proper resource cleanup with `unload_model()`
- Monitor memory with system tools (e.g., `nvidia-smi`, `htop`)
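In Rust, the cleanup step maps naturally onto ownership: dropping the model value releases its resources. A sketch with a hypothetical `LoadedModel` type (the real `unload_model()` is a Tauri command in this codebase):

```rust
struct LoadedModel {
    name: String, // stand-in for weights and GPU buffers
}

impl Drop for LoadedModel {
    fn drop(&mut self) {
        // In the real backend this is where device memory would be freed.
        println!("released model: {}", self.name);
    }
}

// Setting the slot to None drops the model and frees its resources
// deterministically, without waiting for a garbage collector.
fn unload_model(slot: &mut Option<LoadedModel>) {
    *slot = None;
}

fn main() {
    let mut slot = Some(LoadedModel { name: "llama-7b-q4".into() });
    unload_model(&mut slot);
    assert!(slot.is_none());
}
```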
```mermaid
flowchart LR
A[High Memory Usage] --> B{Model Type}
B --> |Full Precision| C[Use Quantized Version]
B --> |Quantized| D{Context Length}
D --> |Large| E[Reduce Context]
D --> |Optimal| F{Concurrent Tasks}
F --> |Multiple| G[Limit Concurrent Generations]
F --> |Single| H[Monitor System Memory]
```
**Section sources**
- [stream.rs](file://src-tauri/src/generate/stream.rs#L52-L74)
- [minp.rs](file://src-tauri/src/generate/minp.rs#L0-L30)
## Platform-Specific Issues
### Windows
- **CUDA**: Ensure Visual Studio build tools are installed
- **File Paths**: Use forward slashes or escaped backslashes in paths
- **Antivirus**: Exclude model directories from real-time scanning
### macOS
- **Metal Backend**: Preferred over CUDA for Apple Silicon
- **Gatekeeper**: May block execution of downloaded binaries
- **Memory Limits**: System-enforced limits on GPU memory allocation
### Linux
- **CUDA**: Requires proper driver installation via package manager
- **Permissions**: Ensure user has access to `/dev/nvidia*` devices
- **Shared Libraries**: Install `libcudnn8` and dependencies
**Section sources**
- [whisper/README.md](file://example/candle-wasm-examples/whisper/README.md#L40-L68)
## Error Handling Patterns
The codebase uses `anyhow` for error management, providing rich context and backtraces. Key patterns include:
- **Contextual Errors**: Add descriptive context to low-level errors
- **Backtrace Capture**: Automatically capture stack traces in debug builds
- **User-Friendly Messages**: Convert technical errors to understandable messages
Example from `error_manage.md`:
```rust
let z = x.matmul(&y)?; // Fails with shape mismatch
// With RUST_BACKTRACE=1, shows exact location in source code
```
The `bt()` method appends backtrace information when enabled, helping pinpoint failure locations.
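The "contextual errors" pattern can be approximated in plain std Rust; this is a simplified stand-in for `anyhow::Context`, shown without the crate:

```rust
// Wrap an error with a human-readable context string, the way anyhow's
// .context(...) layers messages onto low-level failures.
fn with_context<T, E: std::fmt::Display>(result: Result<T, E>, ctx: &str) -> Result<T, String> {
    result.map_err(|e| format!("{ctx}: {e}"))
}

fn main() {
    let raw: Result<u32, _> = "not-a-number".parse::<u32>();
    let wrapped = with_context(raw, "failed to parse context length");
    match wrapped {
        Ok(_) => unreachable!(),
        // The user sees both the high-level context and the low-level cause.
        Err(msg) => println!("{msg}"),
    }
}
```

With `anyhow` itself, the same idea is `value.context("...")` or `.with_context(|| ...)`, and `RUST_BACKTRACE=1` adds the capture location.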
**Updated** Enhanced with new validation error patterns from recent code changes.
**Section sources**
- [error_manage.md](file://example/candle-book/src/error_manage.md#L0-L51)
- [error.rs](file://example/candle-core/src/error.rs#L218-L266)
- [hub_gguf.rs](file://src-tauri/src/api/model_loading/hub_gguf.rs#L24) - *Added repo_id format validation in commit d24451b*
- [hub_safetensors.rs](file://src-tauri/src/api/model_loading/hub_safetensors.rs#L19) - *Added repo_id format validation in commit d24451b*
## Decision Tree for Diagnosing Model Loading Issues
```mermaid
flowchart TD
A[Model Loading Failed] --> B{Error Type}
B --> |File Not Found| C{Source}
C --> |Local Path| D[Verify Path Exists]
C --> |HF Hub| E[Check Internet Connection]
E --> F[Validate Repo ID and Filename]
F --> G{Repo ID Format}
G --> |Incorrect| H[Use 'owner/repo' format]
G --> |Correct| I[Check Repository Contents]
I --> J[Verify model file exists]
B --> |Corrupted File| K[Verify File Integrity]
K --> L[Re-download Model]
B --> |Memory Error| M{Available Memory}
M --> |Insufficient RAM| N[Use Smaller Model]
M --> |Insufficient VRAM| O[Enable CPU Offload]
B --> |Format Error| P{File Format}
P --> |GGUF| Q[Check GGUF Version Compatibility]
P --> |SafeTensors| R[Validate Tensor Shapes]
R --> S{Weights Found}
S --> |No| T[Check for model.safetensors or index.json]
S --> |Yes| U[Verify Model Architecture]
B --> |CUDA Error| V[Check CUDA Installation]
V --> W[Verify Driver Version]
W --> X[Match CUDA Toolkit]
B --> |Tokenizer Error| Y[Extract from Metadata]
Y --> Z[Try BPE Reconstruction]
Z --> AA[Use Default Tokenizer]
B --> |Precision Policy Error| AB[Check Current Policy]
AB --> AC[Verify Policy Compatibility with Model]
AC --> AD[Adjust Policy Settings]
D --> AE[Success]
H --> F
J --> AE
L --> AE
N --> AE
O --> AE
Q --> AE
U --> AE
X --> AE
AA --> AE
AD --> AE
```
This decision tree covers the most common model loading issues based on file format, quantization level, and hardware compatibility. Follow the branches corresponding to your specific error message to identify the root cause and solution.
**Updated** Added new branches for repository identifier validation, missing SafeTensors weights, and precision policy issues.
**Diagram sources**
- [actions.ts](file://src/lib/chat/controller/actions.ts)
- [mod.rs](file://src-tauri/src/api/mod.rs)
- [tokenizer.rs](file://src-tauri/src/core/tokenizer.rs)
- [hub_gguf.rs](file://src-tauri/src/api/model_loading/hub_gguf.rs) - *Added repo_id format validation in commit d24451b*
- [hub_safetensors.rs](file://src-tauri/src/api/model_loading/hub_safetensors.rs) - *Added missing weights validation in commit d24451b*
- [precision.rs](file://src-tauri/src/core/precision.rs) - *Added precision policy implementation*
## Precision Policy Configuration

The precision policy feature allows users to control the data type precision used during model loading and inference, affecting both memory consumption and computational performance.
- **Default**: CPU=F32, GPU=BF16 (optimal balance)
- **Memory Efficient**: CPU=F32, GPU=F16 (lower memory usage)
- **Maximum Precision**: CPU=F32, GPU=F32 (highest accuracy)
The precision policy is implemented through:
- **Backend (Rust)**: The `PrecisionPolicy` enum in `precision.rs` defines the three policy options
- **State Management**: The `precision_policy` field in `ModelState` stores the current policy, with the default set to `PrecisionPolicy::Default`
- **Model Loading**: The `build_varbuilder_with_precision` function in `weights.rs` applies the selected policy when loading models
- **Tauri Commands**: The `get_precision_policy` and `set_precision_policy` commands in `mod.rs` allow external control of precision settings
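The policy-to-dtype mapping described above can be sketched as follows (hypothetical names; the real definitions in `precision.rs` and `weights.rs` may differ):

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Dtype {
    F32,
    F16,
    Bf16,
}

#[derive(Clone, Copy, Debug, Default)]
enum PrecisionPolicy {
    #[default]
    Default,          // CPU=F32, GPU=BF16
    MemoryEfficient,  // CPU=F32, GPU=F16
    MaximumPrecision, // CPU=F32, GPU=F32
}

// Resolve the dtype used for weight loading on a given device.
fn dtype_for(policy: PrecisionPolicy, on_gpu: bool) -> Dtype {
    match (policy, on_gpu) {
        (_, false) => Dtype::F32, // every policy keeps the CPU at F32
        (PrecisionPolicy::Default, true) => Dtype::Bf16,
        (PrecisionPolicy::MemoryEfficient, true) => Dtype::F16,
        (PrecisionPolicy::MaximumPrecision, true) => Dtype::F32,
    }
}

fn main() {
    assert_eq!(dtype_for(PrecisionPolicy::Default, true), Dtype::Bf16);
    assert_eq!(dtype_for(PrecisionPolicy::MemoryEfficient, false), Dtype::F32);
    println!("precision mapping checks passed");
}
```

Centralizing the mapping in one function keeps the CPU=F32 invariant in a single place, so adding a new policy cannot accidentally change CPU behavior.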
Users can access precision policy settings through the Settings page in the application. The selected policy will be applied to all subsequent model loading operations, affecting:
- Memory consumption during model loading
- Inference performance
- Numerical precision of results
```mermaid
flowchart TD
A[User Interface] --> B[Settings Page]
B --> C{Select Policy}
C --> |Default| D[CPU=F32, GPU=BF16]
C --> |Memory Efficient| E[CPU=F32, GPU=F16]
C --> |Maximum Precision| F[CPU=F32, GPU=F32]
D --> G[Apply Policy]
E --> G
F --> G
G --> H[Store in ModelState]
H --> I[Use in build_varbuilder_with_precision]
I --> J[Load Model with Selected Precision]
```
**Section sources**
- [precision.rs](file://src-tauri/src/core/precision.rs#L10-L194) - *Precision policy implementation*
- [state.rs](file://src-tauri/src/core/state.rs#L22-L40) - *Application state with precision policy*
- [weights.rs](file://src-tauri/src/core/weights.rs#L201-L216) - *Weight loading with precision policy*
- [mod.rs](file://src-tauri/src/api/mod.rs#L133-L144) - *Tauri commands for precision policy*
- [+page.svelte](file://src/routes/settings/+page.svelte#L0-L271) - *Settings UI for precision policy*
- [types.ts](file://src/lib/types.ts#L1-L4) - *Frontend precision policy types*
**Referenced Files in This Document**
- [error_manage.md](file://example/candle-book/src/error_manage.md) - *Updated error handling patterns*
- [device.rs](file://example/candle-core/src/cuda_backend/device.rs) - *CUDA initialization and device management*
- [actions.ts](file://src/lib/chat/controller/actions.ts) - *Frontend model loading logic*
- [mod.rs](file://src-tauri/src/api/mod.rs) - *API routing for model operations*
- [tokenizer.rs](file://src-tauri/src/core/tokenizer.rs) - *Tokenizer configuration and special token handling*
- [token_output_stream.rs](file://src-tauri/src/core/token_output_stream.rs) - *Token streaming and generation control*
- [stream.rs](file://src-tauri/src/generate/stream.rs) - *Generation loop and cancellation handling*
- [minp.rs](file://src-tauri/src/generate/minp.rs) - *MinP sampling parameter validation*
- [hub_gguf.rs](file://src-tauri/src/api/model_loading/hub_gguf.rs) - *Added repo_id format validation in commit d24451b*
- [hub_safetensors.rs](file://src-tauri/src/api/model_loading/hub_safetensors.rs) - *Added repo_id format validation in commit d24451b*
- [precision.rs](file://src-tauri/src/core/precision.rs) - *Precision policy implementation*
- [state.rs](file://src-tauri/src/core/state.rs) - *Application state with precision policy*
- [weights.rs](file://src-tauri/src/core/weights.rs) - *Weight loading with precision policy*
- [types.ts](file://src/lib/types.ts) - *Frontend precision policy types*
- [+page.svelte](file://src/routes/settings/+page.svelte) - *Settings UI for precision policy*