diff --git a/docs/DEEPSEEKV2.md b/docs/DEEPSEEKV2.md
index e3be6674..20affe3ad 100644
--- a/docs/DEEPSEEKV2.md
+++ b/docs/DEEPSEEKV2.md
@@ -1,6 +1,6 @@
-# DeepSeek V2: [`deepseek-ai/DeepSeek-V2-Lite`](https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite)
+# V2: [`deepseek-ai/DeepSeek-V2-Lite`](https://huggingface.co/deepseek-ai/-Lite)

-The DeepSeek V2 is a mixture of expert (MoE) model featuring ["Multi-head Latent Attention"](https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite#5-model-architecture).
+The V2 is a mixture of expert (MoE) model featuring ["Multi-head Latent Attention"](https://huggingface.co/deepseek-ai/-Lite#5-model-architecture).

 - Context length of **32k tokens** (Lite model), **128k tokens** (full model)
 - 64 routed experts (Lite model), 160 routed experts (full model)
diff --git a/docs/DEEPSEEKV3.md b/docs/DEEPSEEKV3.md
index 08e18b24..abe6ec8b 100644
--- a/docs/DEEPSEEKV3.md
+++ b/docs/DEEPSEEKV3.md
@@ -1,6 +1,6 @@
-# DeepSeek V3: [`deepseek-ai/DeepSeek-V3`](https://huggingface.co/deepseek-ai/DeepSeek-V3), [`deepseek-ai/DeepSeek-R1`](https://huggingface.co/deepseek-ai/DeepSeek-R1)
+# V3: [`deepseek-ai/DeepSeek-V3`](https://huggingface.co/deepseek-ai/), [`deepseek-ai/DeepSeek-R1`](https://huggingface.co/deepseek-ai/-R1)

-The DeepSeek V3 is a mixture of expert (MoE) model.
+The V3 is a mixture of expert (MoE) model.

 ```
 ./mistralrs-server --isq 4 -i plain -m deepseek-ai/DeepSeek-R1
diff --git a/docs/ISQ.md b/docs/ISQ.md
index 588ba0c6..1a0b5900 100644
--- a/docs/ISQ.md
+++ b/docs/ISQ.md
@@ -48,8 +48,8 @@ When using ISQ, it will automatically load ISQ-able weights into CPU memory befo
 For Mixture of Expert models, a method called [MoQE](https://arxiv.org/abs/2310.02410) can be applied to only quantize MoE layers. This is configured via the ISQ "organization" parameter in all APIs.
 The following models support MoQE:
 - [Phi 3.5 MoE](PHI3.5MOE.md)
-- [DeepSeek V2](DEEPSEEKV2.md)
-- [DeepSeek V3 / DeepSeek R1](DEEPSEEKV3.md)
+- [ V2](DEEPSEEKV2.md)
+- [ V3 / R1](DEEPSEEKV3.md)

 ## Accuracy

diff --git a/docs/QWEN2VL.md b/docs/QWEN2VL.md
index dc8d92aa..1e6f024f 100644
--- a/docs/QWEN2VL.md
+++ b/docs/QWEN2VL.md
@@ -1,6 +1,6 @@
-# Qwen 2 Vision Model: [`Qwen2-VL Collection`](https://huggingface.co/collections/Qwen/qwen2-vl-66cee7455501d7126940800d)
+# 2 Vision Model: [`Qwen2-VL Collection`](https://huggingface.co/collections//qwen2-vl-66cee7455501d7126940800d)

-Mistral.rs supports the Qwen2-VL vision model family, with examples in the Rust, Python, and HTTP APIs. ISQ quantization is supported to allow running the model with less memory requirements.
+Mistral.rs supports the -VL vision model family, with examples in the Rust, Python, and HTTP APIs. ISQ quantization is supported to allow running the model with less memory requirements.

 UQFF quantizations are also available.

@@ -14,7 +14,7 @@ The Rust API takes an image from the [image](https://docs.rs/image/latest/image/
 > Note: When using device mapping or model topology, only the text model and its layers will be managed. This is because it contains most of the model parameters. *The text model has 28 layers*.

 ## ToC
-- [Qwen 2 Vision Model: `Qwen2-VL Collection`](#qwen-2-vision-model-qwen2-vl-collection)
+- [ 2 Vision Model: `Qwen2-VL Collection`](#qwen-2-vision-model-qwen2-vl-collection)
 - [ToC](#toc)
 - [Interactive mode](#interactive-mode)
 - [HTTP server](#http-server)
@@ -25,7 +25,7 @@ The Rust API takes an image from the [image](https://docs.rs/image/latest/image/

 Mistral.rs supports interactive mode for vision models! It is an easy way to interact with the model.

-1) Start up interactive mode with the Qwen2-VL model
+1) Start up interactive mode with the -VL model

 > [!NOTE]
 > You should replace `--features ...` with one of the features specified [here](../README.md#supported-accelerators), or remove it for pure CPU inference.
diff --git a/docs/QWEN3.md b/docs/QWEN3.md
index b62f50b0..d462a132 100644
--- a/docs/QWEN3.md
+++ b/docs/QWEN3.md
@@ -1,6 +1,6 @@
-# Qwen 3: [`collection`](https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2e4f653967f)
+# 3: [`collection`](https://huggingface.co/collections//qwen3-67dd247413f0e2e4f653967f)

-The Qwen 3 family is a collection of hybrid reasoning MoE and non-MoE models ranging from 0.6b to 235b parameters.
+The 3 family is a collection of hybrid reasoning MoE and non-MoE models ranging from 0.6b to 235b parameters.

 ```
 ./mistralrs-server --isq 4 -i plain -m Qwen/Qwen3-8B
@@ -12,7 +12,7 @@ The Qwen 3 family is a collection of hybrid reasoning MoE and non-MoE models ran
 > Note: tool calling support is fully implemented for the Qwen 3 models, including agentic web search.

 ## Enabling thinking
-The Qwen 3 models are hybrid reasoning models which can be controlled at inference-time. **By default, reasoning is enabled for these models.** To dynamically control this, it is recommended to either add `/no_think` or `/think` to your prompt. Alternatively, you can specify the `enable_thinking` flag as detailed by the API-specific examples.
+The 3 models are hybrid reasoning models which can be controlled at inference-time. **By default, reasoning is enabled for these models.** To dynamically control this, it is recommended to either add `/no_think` or `/think` to your prompt. Alternatively, you can specify the `enable_thinking` flag as detailed by the API-specific examples.

 ## HTTP API
 You can find a more detailed example demonstrating enabling/disabling thinking [here](../examples/server/qwen3.py).
diff --git a/docs/README.md b/docs/README.md
index 2118e5f8..b48553ca 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -11,15 +11,15 @@
 - [Phi 3.5 MoE](PHI3.5MOE.md)
 - [Phi 3.5 Vision](PHI3V.md)
 - [Llama 3.2 Vision](VLLAMA.md)
-- [Qwen2-VL](QWEN2VL.md)
+- [-VL](QWEN2VL.md)
 - [Idefics 3 and Smol VLM](IDEFICS3.md)
-- [DeepSeek V2](DEEPSEEKV2.md)
-- [DeepSeek V3](DEEPSEEKV3.md)
+- [ V2](DEEPSEEKV2.md)
+- [ V3](DEEPSEEKV3.md)
 - [MiniCPM-O 2.6](MINICPMO_2_6.md)
 - [Gemma 3](GEMMA3.md)
 - [Mistral 3](MISTRAL3.md)
 - [Llama 4](LLAMA4.md)
-- [Qwen 3](QWEN3.md)
+- [ 3](QWEN3.md)

 ## Adapters
 - [Docs](ADAPTER_MODELS.md)
diff --git a/docs/TOOL_CALLING.md b/docs/TOOL_CALLING.md
index 5ec76853..9019b1b4 100644
--- a/docs/TOOL_CALLING.md
+++ b/docs/TOOL_CALLING.md
@@ -20,8 +20,8 @@ We support the following models' tool calling in OpenAI-compatible and parse nat
 - Mistral Nemo
 - Hermes 2 Pro
 - Hermes 3
-- DeepSeek V2/V3/R1
-- Qwen 3
+- V2/V3/R1
+- 3

 All models that support tool calling will respond according to the OpenAI tool calling API.

diff --git a/docs/VISION_MODELS.md b/docs/VISION_MODELS.md
index 67b23ef7..bd8e4287 100644
--- a/docs/VISION_MODELS.md
+++ b/docs/VISION_MODELS.md
@@ -8,7 +8,7 @@ Please see docs for the following model types:
 - Idefics2: [IDEFICS2.md](IDEFICS2.md)
 - LLaVA and LLaVANext: [LLAVA.md](LLaVA.md)
 - Llama 3.2 Vision: [VLLAMA.md](VLLAMA.md)
-- Qwen2-VL: [QWEN2VL.md](QWEN2VL.md)
+- -VL: [QWEN2VL.md](QWEN2VL.md)
 - Idefics 3 and Smol VLM: [IDEFICS3.md](IDEFICS3.md)
 - Phi 4 Multimodal: [PHI4MM.md](PHI4MM.md)

diff --git a/docs/WEB_SEARCH.md b/docs/WEB_SEARCH.md
index a815c391..39eebdee 100644
--- a/docs/WEB_SEARCH.md
+++ b/docs/WEB_SEARCH.md
@@ -7,7 +7,7 @@ This works with all models that support [tool calling](TOOL_CALLING.md). However
 - Hermes 3 3b/8b
 - Mistral 3 24b
 - Llama 4 Scout/Maverick
-- Qwen 3 (⭐ Recommended!)
+- 3 (⭐ Recommended!)

 Web search is supported both in streaming and completion responses!
 This makes it easy to integrate and test out in interactive mode!