

FerrisMind edited this page Sep 10, 2025 · 1 revision

API Reference

Update Summary

Changes Made

  • Updated loadModel command documentation to include new local_safetensors and hub_safetensors formats
  • Added detailed explanation of safetensors model loading from both local paths and Hugging Face Hub
  • Enhanced error handling section with new safetensors-specific error scenarios
  • Added architecture detection documentation for Qwen3 models from both GGUF and safetensors formats
  • Updated frontend invocation examples to show safetensors loading
  • Added information about unified dtype policy for safetensors models (BF16 for GPU, F32 for CPU)
  • Added new section on universal weights utilities and VarBuilder creation
  • Updated all relevant sections to reflect improved model loading and error reporting
  • Added documentation for ModelFactory singleton and factory-based model creation pattern
  • Updated loadModel command to reflect unified model building approach using ModelFactory

Table of Contents

  1. Introduction
  2. Command Endpoints
  3. Streaming API
  4. Frontend Invocation Examples
  5. Error Handling

Introduction

This document provides comprehensive reference documentation for the Tauri command endpoints exposed by Oxide-Lab, a desktop application for local LLM inference. These endpoints enable communication between the Svelte frontend and Rust backend via Tauri's IPC mechanism. The API supports model loading/unloading, text generation, device management, and system status queries. All commands are asynchronous and return promises when invoked from the frontend.

Section sources

  • lib.rs - Updated in recent commit
  • mod.rs - Updated in recent commit

Command Endpoints

loadModel

Loads a language model from a local file or Hugging Face Hub repository using a unified ModelFactory pattern.

Request Schema (TypeScript)

interface LoadRequest {
  format: "gguf" | "hub_gguf" | "hub_safetensors" | "local_safetensors";
  model_path?: string;
  tokenizer_path?: string;
  repo_id?: string;
  revision?: string;
  filename?: string;
  context_length: number;
  device?: DevicePreference;
}

interface DevicePreference {
  kind: "auto" | "cpu" | "cuda" | "metal";
  index?: number;
}

Request Schema (Rust)

#[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(tag = "kind", rename_all = "lowercase")]
pub enum DevicePreference {
    Auto,
    Cpu,
    Cuda { index: usize },
    Metal,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(tag = "format", rename_all = "snake_case")]
pub enum LoadRequest {
    Gguf {
        model_path: String,
        tokenizer_path: Option<String>,
        context_length: usize,
        device: Option<DevicePreference>,
    },
    HubGguf {
        repo_id: String,
        revision: Option<String>,
        filename: String,
        context_length: usize,
        device: Option<DevicePreference>,
    },
    HubSafetensors {
        repo_id: String,
        revision: Option<String>,
        context_length: usize,
        device: Option<DevicePreference>,
    },
    LocalSafetensors {
        model_path: String,
        context_length: usize,
        device: Option<DevicePreference>,
    },
}

Response

  • Success: Result<(), String> - Returns Ok(()) on successful model loading
  • Failure: Returns Err(String) with error message

Implementation Details

The model loading process uses a unified ModelFactory singleton pattern that provides a consistent interface for building models from different sources. The factory automatically detects the model architecture and routes the loading process to the appropriate builder.

For GGUF models, the factory uses architecture detection from GGUF metadata to determine the model type. For safetensors models, it uses config.json parsing to detect the architecture. Currently, only Qwen3 architecture is supported, but the factory pattern allows for easy addition of new architectures.
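Because each format variant requires a different subset of fields, a frontend can catch malformed requests before they reach the backend. The following sketch is illustrative, not part of the API; the `validateLoadRequest` helper is hypothetical, and the "owner/repo" rule is taken from the error messages documented in the Error Handling section.

```typescript
// Hypothetical client-side validation before invoking load_model.
type LoadFormat = "gguf" | "hub_gguf" | "hub_safetensors" | "local_safetensors";

interface LoadRequestDraft {
  format: LoadFormat;
  model_path?: string;
  repo_id?: string;
  filename?: string;
  context_length: number;
}

function validateLoadRequest(req: LoadRequestDraft): string[] {
  const errors: string[] = [];
  if (req.context_length <= 0) {
    errors.push("context_length must be positive");
  }
  const needsRepo = req.format === "hub_gguf" || req.format === "hub_safetensors";
  const needsPath = req.format === "gguf" || req.format === "local_safetensors";
  // Hub formats need a repo_id in "owner/repo" form.
  if (needsRepo && !/^[^/\s]+\/[^/\s]+$/.test(req.repo_id ?? "")) {
    errors.push("repo_id must be in format 'owner/repo'");
  }
  // Local formats need a filesystem path.
  if (needsPath && !req.model_path) {
    errors.push("model_path is required for local formats");
  }
  // hub_gguf additionally needs the file to download.
  if (req.format === "hub_gguf" && !req.filename) {
    errors.push("filename is required for hub_gguf");
  }
  return errors;
}
```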

Example Usage

// Load model from Hugging Face Hub (safetensors)
await invoke("load_model", {
  req: {
    format: "hub_safetensors",
    repo_id: "Qwen/Qwen3-8B",
    context_length: 32768,
    device: { kind: "auto" }
  }
});

// Load model from local path (safetensors)
await invoke("load_model", {
  req: {
    format: "local_safetensors",
    model_path: "/path/to/local/model/directory",
    context_length: 32768,
    device: { kind: "cuda", index: 0 }
  }
});

Section sources

  • mod.rs - Updated in recent commit
  • types.rs - Updated in recent commit
  • hub_safetensors.rs - Added in recent commit
  • local_safetensors.rs - Added in recent commit
  • gguf.rs - Updated in recent commit
  • registry.rs - Updated in recent commit
  • builder.rs - Updated in recent commit

unloadModel

Unloads the currently loaded model and frees associated resources.

Request Schema (TypeScript)

// No parameters required

Request Schema (Rust)

#[tauri::command]
pub fn unload_model(state: tauri::State<SharedState<Box<dyn ModelBackend + Send>>>) -> Result<(), String>

Response

  • Success: Result<(), String> - Returns Ok(()) when model is successfully unloaded
  • Failure: Returns Err(String) with error message

Example Usage

await invoke("unload_model");

Section sources

  • mod.rs - Updated in recent commit

generateStream

Generates text responses using the loaded model with streaming support.

Request Schema (TypeScript)

interface GenerateRequest {
  prompt: string;
  temperature?: number;
  top_p?: number;
  top_k?: number;
  min_p?: number;
  repeat_penalty?: number;
  repeat_last_n: number;
  use_custom_params?: boolean;
  seed?: number;
}

Request Schema (Rust)

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct GenerateRequest {
    pub prompt: String,
    pub temperature: Option<f64>,
    pub top_p: Option<f64>,
    pub top_k: Option<usize>,
    pub min_p: Option<f64>,
    pub repeat_penalty: Option<f32>,
    pub repeat_last_n: usize,
    #[serde(default)]
    pub use_custom_params: bool,
    #[serde(default)]
    pub seed: Option<u64>,
}

Response

  • Success: Result<(), String> - Returns Ok(()) to indicate streaming has started
  • Failure: Returns Err(String) with error message
  • Streaming: Emits "token" events with response chunks
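Since every sampling field except repeat_last_n is optional, a frontend may want to normalize a partially filled request before sending it. This helper is an illustration only; the default values and clamping ranges below are assumptions, not the backend's built-in behavior.

```typescript
// Illustrative helper (not part of the API): fills defaults for the
// GenerateRequest sampling fields and clamps values to plausible ranges.
interface SamplingParams {
  temperature?: number;
  top_p?: number;
  top_k?: number;
  min_p?: number;
  repeat_penalty?: number;
  repeat_last_n: number;
}

function withSamplingDefaults(p: Partial<SamplingParams>): Required<SamplingParams> {
  const clamp = (v: number, lo: number, hi: number) => Math.min(hi, Math.max(lo, v));
  return {
    temperature: clamp(p.temperature ?? 0.8, 0, 2),
    top_p: clamp(p.top_p ?? 0.95, 0, 1),
    top_k: Math.max(1, Math.floor(p.top_k ?? 50)),
    min_p: clamp(p.min_p ?? 0.05, 0, 1),
    repeat_penalty: clamp(p.repeat_penalty ?? 1.1, 1, 2),
    repeat_last_n: Math.max(0, Math.floor(p.repeat_last_n ?? 64)),
  };
}
```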

Example Usage

await invoke("generate_stream", {
  req: {
    prompt: "Hello, how are you?",
    temperature: 0.8,
    top_p: 0.95,
    top_k: 50,
    min_p: 0.05,
    repeat_penalty: 1.1,
    repeat_last_n: 64,
    use_custom_params: true,
    seed: 42
  }
});

Section sources

  • mod.rs - Updated in recent commit
  • types.rs - Updated in recent commit

cancelGeneration

Cancels an ongoing text generation process.

Request Schema (TypeScript)

// No parameters required

Request Schema (Rust)

#[tauri::command]
pub fn cancel_generation() -> Result<(), String>

Response

  • Success: Result<(), String> - Returns Ok(()) when cancellation is requested
  • Failure: Returns Err(String) with error message

Example Usage

await invoke("cancel_generation");

Section sources

  • mod.rs - Updated in recent commit

setDevice

Sets the computational device for model execution with auto-selection capability.

Auto Device Selection Behavior

When DevicePreference::Auto is specified, the system follows a priority chain with runtime detection:

  1. CUDA: First attempts to initialize CUDA device (index 0) if CUDA support is compiled in and available
  2. Metal: If CUDA is unavailable, attempts to initialize Metal device if Metal support is compiled in and available
  3. CPU: If both CUDA and Metal initialization fail, falls back to CPU

The system automatically reloads the currently loaded model onto the newly selected device if a model is already loaded.
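The same priority chain can be mirrored on the frontend, for example to show the user which device "auto" is expected to pick. This is a sketch under assumptions: the capability flags would come from get_device_info and probe_cuda, and the metalAvailable flag is hypothetical, since the documented DTOs only expose CUDA fields.

```typescript
// Frontend-side mirror of the backend's auto-selection chain.
interface DeviceCaps {
  cudaAvailable: boolean;
  metalAvailable: boolean; // hypothetical flag, not in DeviceInfoDto
}

function predictAutoDevice(caps: DeviceCaps): "cuda" | "metal" | "cpu" {
  if (caps.cudaAvailable) return "cuda";   // 1. CUDA first (index 0)
  if (caps.metalAvailable) return "metal"; // 2. then Metal
  return "cpu";                            // 3. CPU fallback
}
```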

Request Schema (TypeScript)

interface SetDeviceRequest {
  pref: DevicePreference;
}

interface DevicePreference {
  kind: "auto" | "cpu" | "cuda" | "metal";
  index?: number;
}

Request Schema (Rust)

#[tauri::command]
pub fn set_device(
    state: tauri::State<SharedState<Box<dyn ModelBackend + Send>>>,
    pref: DevicePreference
) -> Result<(), String>

Response

  • Success: Result<(), String> - Returns Ok(()) when device is successfully set
  • Failure: Returns Err(String) with error message

Example Usage

// Auto selection (recommended)
await invoke("set_device", { pref: { kind: "auto" } });

// Explicit CUDA selection
await invoke("set_device", { pref: { kind: "cuda", index: 0 } });

// Explicit CPU selection
await invoke("set_device", { pref: { kind: "cpu" } });

Section sources

  • mod.rs - Updated in recent commit
  • types.rs - Updated in recent commit
  • device.rs - Updated in recent commit
  • core/device.rs - Updated in recent commit

isModelLoaded

Checks whether a model is currently loaded in memory.

Request Schema (TypeScript)

// No parameters required

Request Schema (Rust)

#[tauri::command]
pub fn is_model_loaded(
    state: tauri::State<SharedState<Box<dyn ModelBackend + Send>>>
) -> Result<bool, String>

Response

  • Success: Result<boolean, String> - Returns Ok(true) if model is loaded, Ok(false) otherwise
  • Failure: Returns Err(String) with error message

Example Usage

const isLoaded = await invoke<boolean>("is_model_loaded");

Section sources

  • mod.rs - Updated in recent commit

getChatTemplate

Retrieves the chat template used for formatting conversations.

Request Schema (TypeScript)

// No parameters required

Request Schema (Rust)

#[tauri::command]
pub fn get_chat_template(
    state: tauri::State<SharedState<Box<dyn ModelBackend + Send>>>
) -> Result<Option<String>, String>

Response

  • Success: Result<string | null, string> - Returns Ok(template) with template string or Ok(null) if no template exists
  • Failure: Returns Err(String) with error message

Example Usage

const template = await invoke<string | null>("get_chat_template");

Section sources

  • mod.rs - Updated in recent commit

renderPrompt

Renders a formatted prompt using the chat template and message history.

Request Schema (TypeScript)

interface ChatMsgDto {
  role: "user" | "assistant" | "system";
  content: string;
}

interface RenderPromptRequest {
  messages: ChatMsgDto[];
}

Request Schema (Rust)

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ChatMsgDto { 
    pub role: String, 
    pub content: String 
}

#[tauri::command]
pub fn render_prompt(
    state: tauri::State<SharedState<Box<dyn ModelBackend + Send>>>,
    messages: Vec<ChatMsgDto>
) -> Result<String, String>

Response

  • Success: Result<string, string> - Returns Ok(prompt) with rendered prompt string
  • Failure: Returns Err(String) with error message

Example Usage

const prompt = await invoke<string>("render_prompt", {
  messages: [
    { role: "user", content: "Hello" },
    { role: "assistant", content: "Hi, how can I help you?" }
  ]
});
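If get_chat_template returns null, a frontend could fall back to formatting the history itself. The ChatML-style layout below is purely hypothetical, shown only to make the message-to-prompt mapping concrete; the real render_prompt command applies the model's own template and should be preferred whenever one exists.

```typescript
// Hypothetical fallback renderer; NOT the template render_prompt uses.
interface ChatMsgDto {
  role: "user" | "assistant" | "system";
  content: string;
}

function renderChatMl(messages: ChatMsgDto[]): string {
  const body = messages
    .map((m) => `<|im_start|>${m.role}\n${m.content}<|im_end|>`)
    .join("\n");
  // Leave the prompt open for the assistant's reply.
  return `${body}\n<|im_start|>assistant\n`;
}
```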

Section sources

  • mod.rs - Updated in recent commit

getDeviceInfo

Retrieves information about available computational devices.

Request Schema (TypeScript)

// No parameters required

Request Schema (Rust)

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct DeviceInfoDto {
    pub cuda_build: bool,
    pub cuda_available: bool,
    pub current: String,
}

#[tauri::command]
pub fn get_device_info(
    state: tauri::State<SharedState<Box<dyn ModelBackend + Send>>>
) -> Result<DeviceInfoDto, String>

Response

  • Success: Result<DeviceInfoDto, string> - Returns device information object
  • Failure: Returns Err(String) with error message

DeviceInfoDto Properties:

  • cuda_build: Whether the binary was compiled with CUDA support
  • cuda_available: Whether CUDA is available on the system
  • current: Currently active device ("CPU", "CUDA", or "Metal")

Example Usage

const info = await invoke<DeviceInfoDto>("get_device_info");
console.log(`CUDA available: ${info.cuda_available}`);

Section sources

  • mod.rs - Updated in recent commit

probeCuda

Probes the system for CUDA availability and capabilities.

Request Schema (TypeScript)

// No parameters required

Request Schema (Rust)

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ProbeCudaDto {
    pub cuda_build: bool,
    pub ok: bool,
    pub error: Option<String>,
}

#[tauri::command]
pub fn probe_cuda() -> Result<ProbeCudaDto, String>

Response

  • Success: Result<ProbeCudaDto, string> - Returns probe results including build status, availability, and error details
  • Failure: Returns Err(String) with error message

ProbeCudaDto Properties:

  • cuda_build: Whether the binary was compiled with CUDA support
  • ok: Whether CUDA is available on the system
  • error: Error message if CUDA is not available

Example Usage

const probeResult = await invoke<ProbeCudaDto>("probe_cuda");
console.log(`CUDA build: ${probeResult.cuda_build}, Available: ${probeResult.ok}`);

Section sources

  • mod.rs - Updated in recent commit

Streaming API

The Oxide-Lab API uses a callback-based streaming mechanism to deliver response chunks in real-time during text generation.

Architecture

The streaming system uses Tauri's event emission system to send token chunks from the Rust backend to the frontend. The ChunkEmitter struct manages buffering and timing of emissions. It can also filter out content between <think>...</think> tags, enabled based on whether those tags appear in the prompt.

Rust Implementation (ChunkEmitter)

pub struct ChunkEmitter {
    app: tauri::AppHandle,
    buffer: String,
    last_emit_at: Instant,
    emit_interval: Duration,
    max_chunk_len: usize,
    // If true — removes content between <think>...</think> tags in the stream
    strip_think: bool,
    // State flag — whether we are inside an unclosed <think> block
    in_think_block: bool,
}

impl ChunkEmitter {
    pub fn new(app: tauri::AppHandle, strip_think: bool) -> Self {
        Self {
            app,
            buffer: String::new(),
            last_emit_at: Instant::now(),
            emit_interval: Duration::from_millis(16),
            max_chunk_len: 2048,
            strip_think,
            in_think_block: false,
        }
    }

    pub fn push_maybe_emit(&mut self, text: &str) {
        if text.is_empty() { return; }
        self.buffer.push_str(text);
        let elapsed = self.last_emit_at.elapsed();
        if elapsed >= self.emit_interval || self.buffer.len() >= self.max_chunk_len {
            let chunk = std::mem::take(&mut self.buffer);
            let out = if self.strip_think {
                self.filter_think(&chunk)
            } else { chunk };
            if !out.is_empty() {
                let _ = self.app.emit("token", out);
            }
            self.last_emit_at = Instant::now();
        }
    }

    pub fn flush(&mut self) {
        if !self.buffer.is_empty() {
            let chunk = std::mem::take(&mut self.buffer);
            let out = if self.strip_think { self.filter_think(&chunk) } else { chunk };
            if !out.is_empty() {
                let _ = self.app.emit("token", out);
            }
            self.last_emit_at = Instant::now();
        }
    }

    // Removes content between <think> and </think> tags, properly handling boundaries between chunks
    fn filter_think(&mut self, mut s: &str) -> String {
        let mut out = String::new();
        while !s.is_empty() {
            if self.in_think_block {
                if let Some(pos) = find_case_insensitive(s, "</think>") {
                    s = &s[pos + "</think>".len()..];
                    self.in_think_block = false;
                    continue;
                } else {
                    return out;
                }
            } else {
                if let Some(pos) = find_case_insensitive(s, "<think>") {
                    out.push_str(&s[..pos]);
                    s = &s[pos + "<think>".len()..];
                    self.in_think_block = true;
                    continue;
                } else {
                    out.push_str(s);
                    break;
                }
            }
        }
        out
    }
}
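The stateful tag filtering above can be mirrored in TypeScript, which is handy for unit-testing the chunk-boundary behavior without the Rust backend. This port is a sketch: it assumes case-insensitive matching like find_case_insensitive, and, like the original, it does not handle a tag that is itself split across two chunks (e.g. "<thi" + "nk>").

```typescript
// TypeScript port of ChunkEmitter::filter_think for testing purposes.
class ThinkFilter {
  private inThinkBlock = false;

  filter(chunk: string): string {
    let s = chunk;
    let out = "";
    const find = (haystack: string, needle: string) =>
      haystack.toLowerCase().indexOf(needle);
    while (s.length > 0) {
      if (this.inThinkBlock) {
        const pos = find(s, "</think>");
        if (pos < 0) return out; // still inside an unclosed <think> block
        s = s.slice(pos + "</think>".length);
        this.inThinkBlock = false;
      } else {
        const pos = find(s, "<think>");
        if (pos < 0) {
          out += s; // no tag: pass the rest through
          break;
        }
        out += s.slice(0, pos);
        s = s.slice(pos + "<think>".length);
        this.inThinkBlock = true;
      }
    }
    return out;
  }
}
```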

Streaming Flow

sequenceDiagram
participant Frontend
participant Backend
participant Emitter
Frontend->>Backend : invoke("generate_stream", request)
Backend->>Backend : Initialize generation
Backend->>Emitter : Create ChunkEmitter with strip_think flag
loop For each generated token
Backend->>Emitter : push_maybe_emit(token)
Emitter->>Emitter : Buffer token
alt Buffer full or interval reached
Emitter->>Frontend : emit("token", filtered_chunk)
Emitter->>Emitter : Reset buffer and timer
end
end
Backend->>Emitter : flush()
Emitter->>Frontend : emit("token", final_filtered_chunk)
Backend->>Frontend : Return Ok(())

Diagram sources

  • emit.rs - Updated in recent commit
  • stream.rs - Updated in recent commit
  • mod.rs - Updated in recent commit

Section sources

  • emit.rs - Updated in recent commit
  • stream.rs - Updated in recent commit

Frontend Invocation Examples

The following examples demonstrate how to invoke Tauri commands from the Svelte frontend.

Basic Command Invocation

import { invoke } from "@tauri-apps/api/core";

// Load a model from Hugging Face Hub (safetensors)
async function loadModel() {
  try {
    await invoke("load_model", {
      req: {
        format: "hub_safetensors",
        repo_id: "Qwen/Qwen3-8B",
        context_length: 32768,
        device: { kind: "auto" }
      }
    });
    console.log("Model loaded successfully");
  } catch (error) {
    console.error("Failed to load model:", error);
  }
}

Streaming Response Handling

import { invoke } from "@tauri-apps/api/core";
import { listen } from "@tauri-apps/api/event";

let response = "";
let isGenerating = false;

// Set up listener for token events
const unlisten = await listen("token", (event) => {
  const chunk = event.payload as string;
  response += chunk;
  // Update UI with new chunk
  updateResponseDisplay(response);
});

async function generateResponse(prompt: string) {
  isGenerating = true;
  response = "";
  
  try {
    await invoke("generate_stream", {
      req: {
        prompt,
        temperature: 0.8,
        top_p: 0.95,
        top_k: 50,
        min_p: 0.05,
        repeat_penalty: 1.1,
        repeat_last_n: 64,
        use_custom_params: true,
        seed: 42
      }
    });
  } catch (error) {
    console.error("Generation failed:", error);
  } finally {
    isGenerating = false;
  }
}

// Clean up listener when component is destroyed
// unlisten(); // Call this to remove the listener

Device Information Check

async function checkSystemStatus() {
  try {
    // Check if model is loaded
    const isLoaded = await invoke<boolean>("is_model_loaded");
    
    // Get device information
    const deviceInfo = await invoke<DeviceInfoDto>("get_device_info");
    
    console.log({
      modelLoaded: isLoaded,
      cudaBuild: deviceInfo.cuda_build,
      cudaAvailable: deviceInfo.cuda_available,
      currentDevice: deviceInfo.current
    });
    
    return { isLoaded, ...deviceInfo };
  } catch (error) {
    console.error("Failed to get system status:", error);
    return null;
  }
}

Auto Device Selection

async function setupOptimalDevice() {
  try {
    // Use auto-selection to let the system choose the best available device
    await invoke("set_device", { pref: { kind: "auto" } });
    
    // Get updated device information
    const deviceInfo = await invoke<DeviceInfoDto>("get_device_info");
    
    console.log(`Optimal device selected: ${deviceInfo.current}`);
    
    return deviceInfo;
  } catch (error) {
    console.error("Failed to set optimal device:", error);
    return null;
  }
}

Section sources

  • actions.ts - Updated in recent commit
  • modelActions.ts - Updated in recent commit
  • chatActions.ts - Updated in recent commit
  • deviceActions.ts - Updated in recent commit
  • types.ts - Updated in recent commit

Error Handling

The API uses Rust's Result<T, String> pattern for error handling, where errors are returned as descriptive strings.

Error Response Structure

All commands return a Result<T, String> where:

  • Success: Ok(value) with the expected return value
  • Failure: Err(String) with a human-readable error message

Common Error Scenarios

| Error Scenario | Example Message | Recovery Strategy |
| --- | --- | --- |
| Model file not found | "No such file or directory" | Verify file path exists |
| Unsupported architecture | "Unsupported GGUF architecture" | Use compatible model |
| CUDA initialization failed | "CUDA init failed (index=0): CUDA driver version is insufficient for CUDA runtime version" | Update GPU drivers or use CPU |
| Metal initialization failed | "Metal init failed: Metal is not supported on this platform" | Use CPU or ensure Metal compatibility |
| Auto-selection fallback | "CUDA init failed: ..., falling back to next option" | System automatically tries Metal then CPU |
| Invalid device preference | "Invalid device preference format" | Use correct device preference structure |
| Invalid repository ID | "repo_id must be in format 'owner/repo'" | Use correct format |
| Hugging Face download failure | "hf_hub get failed" | Check network connection |
| Tokenizer parsing error | "tokenizer.json parse error" | Use model with valid tokenizer |
| Safetensors file not found | "Safetensors file not found: /path/to/model.safetensors" | Verify file path exists |
| No safetensors files found | "No safetensors files found (model.safetensors[.index.json])" | Check if directory contains model files |
| Failed to create VarBuilder | "Failed to create VarBuilder: ..." | Check file permissions and disk space |
| Qwen3 config parse error | "Failed to parse Qwen3 config: ..." | Use model with valid configuration |

Frontend Error Handling Pattern

async function safeInvoke<T>(command: string, args?: any): Promise<T | null> {
  try {
    const result = await invoke<T>(command, args);
    return result;
  } catch (error) {
    // Errors from Tauri commands are typically strings
    const errorMessage = error instanceof Error ? error.message : String(error);
    
    console.error(`Command '${command}' failed:`, errorMessage);
    
    // Handle specific error cases
    if (errorMessage.includes("CUDA init failed")) {
      showNotification("CUDA initialization failed. The system will try Metal or fall back to CPU.");
    } else if (errorMessage.includes("Metal init failed")) {
      showNotification("Metal initialization failed. Falling back to CPU.");
    } else if (errorMessage.includes("file or directory")) {
      showNotification("Model file not found. Please check the path.");
    } else if (errorMessage.includes("Invalid device preference")) {
      showNotification("Invalid device preference. Please check device settings.");
    } else if (errorMessage.includes("safetensors file not found")) {
      showNotification("Safetensors file not found. Please verify the model path.");
    } else if (errorMessage.includes("No safetensors files found")) {
      showNotification("No safetensors files found in the specified directory. Please check the model location.");
    }
    
    return null;
  }
}

Section sources

  • mod.rs - Updated in recent commit
  • actions.ts - Updated in recent commit
  • modelActions.ts - Updated in recent commit
  • device.rs - Updated in recent commit
  • local_safetensors.rs - Added in recent commit
  • hub_safetensors.rs - Added in recent commit

Referenced Files in This Document

  • lib.rs - Updated in recent commit
  • mod.rs - Updated in recent commit
  • types.rs - Updated in recent commit
  • actions.ts - Updated in recent commit
  • modelActions.ts - Updated in recent commit
  • chatActions.ts - Updated in recent commit
  • deviceActions.ts - Updated in recent commit
  • types.ts - Updated in recent commit
  • emit.rs - Updated in recent commit
  • stream.rs - Updated in recent commit
  • device.rs - Updated in recent commit
  • core/device.rs - Updated in recent commit
  • local_safetensors.rs - Added in recent commit
  • hub_safetensors.rs - Added in recent commit
  • weights.rs - Added in recent commit
  • model.rs - Updated in recent commit
  • candle_llm.rs - Added in recent commit
  • registry.rs - Updated in recent commit
  • qwen3_builder.rs - Updated in recent commit
  • builder.rs - Updated in recent commit
  • gguf.rs - Updated in recent commit
