diff --git a/openhands/usage/llms/local-llms.mdx b/openhands/usage/llms/local-llms.mdx index 1c12f57c..f7ee28de 100644 --- a/openhands/usage/llms/local-llms.mdx +++ b/openhands/usage/llms/local-llms.mdx @@ -5,34 +5,34 @@ description: When using a Local LLM, OpenHands may have limited functionality. I ## News -- 2025/05/21: We collaborated with Mistral AI and released [Devstral Small](https://mistral.ai/news/devstral) that achieves [46.8% on SWE-Bench Verified](https://github.com/SWE-bench/experiments/pull/228)! -- 2025/03/31: We released an open model OpenHands LM 32B v0.1 that achieves 37.1% on SWE-Bench Verified -([blog](https://openhands.dev/blog/introducing-openhands-lm-32b----a-strong-open-coding-agent-model), [model](https://huggingface.co/all-hands/openhands-lm-32b-v0.1)). +- 2025/12/12: We now recommend two powerful local models for OpenHands: [Qwen3-Coder-30B-A3B-Instruct](https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct) and [Devstral Small 2 (24B)](https://huggingface.co/mistralai/Devstral-Small-2-24B-Instruct-2512). Both models deliver excellent performance on coding tasks and work great with OpenHands! ## Quickstart: Running OpenHands with a Local LLM using LM Studio -This guide explains how to serve a local Devstral LLM using [LM Studio](https://lmstudio.ai/) and have OpenHands connect to it. +This guide explains how to serve a local LLM using [LM Studio](https://lmstudio.ai/) and have OpenHands connect to it. We recommend: - **LM Studio** as the local model server, which handles metadata downloads automatically and offers a simple, user-friendly interface for configuration. -- **Devstral Small 2505** as the LLM for software development, trained on real GitHub issues and optimized for agent-style workflows like OpenHands. +- **Qwen3-Coder-30B-A3B-Instruct** as the LLM for software development. This model is optimized for coding tasks and works excellently with agent-style workflows like OpenHands. ### Hardware Requirements -Running Devstral requires a recent GPU with at least 16GB of VRAM, or a Mac with Apple Silicon (M1, M2, etc.) with at least 32GB of RAM. +Running Qwen3-Coder-30B-A3B-Instruct requires: +- A recent GPU with at least 12GB of VRAM (tested on RTX 3060 with 12GB VRAM + 64GB RAM), or +- A Mac with Apple Silicon with at least 32GB of RAM ### 1. Install LM Studio Download and install the LM Studio desktop app from [lmstudio.ai](https://lmstudio.ai/). -### 2. Download Devstral Small +### 2. Download the Model 1. Make sure to set the User Interface Complexity Level to "Power User", by clicking on the appropriate label at the bottom of the window. 2. Click the "Discover" button (Magnifying Glass icon) on the left navigation bar to open the Models download page. ![image](./screenshots/01_lm_studio_open_model_hub.png) -3. Search for the "Devstral Small 2505" model, confirm it's the official Mistral AI (mistralai) model, then proceed to download. +3. Search for **"Qwen3-Coder-30B-A3B-Instruct"**, confirm you're downloading from the official Qwen publisher, then proceed to download. ![image](./screenshots/02_lm_studio_download_devstral.png) @@ -46,12 +46,12 @@ Download and install the LM Studio desktop app from [lmstudio.ai](https://lmstud ![image](./screenshots/03_lm_studio_open_load_model.png) 3. Enable the "Manually choose model load parameters" switch. -4. Select 'Devstral Small 2505' from the model list. +4. Select **Qwen3-Coder-30B-A3B-Instruct** from the model list. ![image](./screenshots/04_lm_studio_setup_devstral_part_1.png) 5. 
Enable the "Show advanced settings" switch at the bottom of the Model settings flyout to show all the available settings. -6. Set "Context Length" to at least 32768 and enable Flash Attention. +6. Set "Context Length" to at least 22000 (for lower VRAM systems) or 32768 (recommended for better performance) and enable Flash Attention. 7. Click "Load Model" to start loading the model. ![image](./screenshots/05_lm_studio_setup_devstral_part_2.png) @@ -109,7 +109,7 @@ When started for the first time, OpenHands will prompt you to set up the LLM pro 2. Enable the "Advanced" switch at the top of the page to show all the available settings. 3. Set the following values: - - **Custom Model**: `openai/mistralai/devstral-small-2505` (the Model API identifier from LM Studio, prefixed with "openai/") + - **Custom Model**: `openai/qwen/qwen3-coder-30b-a3b-instruct` (the Model API identifier from LM Studio, prefixed with "openai/") - **Base URL**: `http://host.docker.internal:1234/v1` - **API Key**: `local-llm` @@ -128,33 +128,33 @@ This section describes how to run local LLMs with OpenHands using alternative ba ### Create an OpenAI-Compatible Endpoint with Ollama - Install Ollama following [the official documentation](https://ollama.com/download). -- Example launch command for Devstral Small 2505: +- Example launch command for Qwen3-Coder-30B-A3B-Instruct: ```bash # ⚠️ WARNING: OpenHands requires a large context size to work properly. -# When using Ollama, set OLLAMA_CONTEXT_LENGTH to at least 32768. +# When using Ollama, set OLLAMA_CONTEXT_LENGTH to at least 22000. # The default (4096) is way too small — not even the system prompt will fit, and the agent will not behave correctly. OLLAMA_CONTEXT_LENGTH=32768 OLLAMA_HOST=0.0.0.0:11434 OLLAMA_KEEP_ALIVE=-1 nohup ollama serve & -ollama pull devstral:latest +ollama pull qwen3-coder:30b ``` ### Create an OpenAI-Compatible Endpoint with vLLM or SGLang -First, download the model checkpoints. For [Devstral Small 2505](https://huggingface.co/mistralai/Devstral-Small-2505): +First, download the model checkpoint: ```bash -huggingface-cli download mistralai/Devstral-Small-2505 --local-dir mistralai/Devstral-Small-2505 +huggingface-cli download Qwen/Qwen3-Coder-30B-A3B-Instruct --local-dir Qwen/Qwen3-Coder-30B-A3B-Instruct ``` #### Serving the model using SGLang - Install SGLang following [the official documentation](https://docs.sglang.ai/start/install.html). -- Example launch command for Devstral Small 2505 (with at least 2 GPUs): +- Example launch command (with at least 2 GPUs): ```bash SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 python3 -m sglang.launch_server \ - --model mistralai/Devstral-Small-2505 \ - --served-model-name Devstral-Small-2505 \ + --model Qwen/Qwen3-Coder-30B-A3B-Instruct \ + --served-model-name Qwen3-Coder-30B-A3B-Instruct \ --port 8000 \ --tp 2 --dp 1 \ --host 0.0.0.0 \ @@ -164,14 +164,14 @@ SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 python3 -m sglang.launch_server \ #### Serving the model using vLLM - Install vLLM following [the official documentation](https://docs.vllm.ai/en/latest/getting_started/installation.html). 
-- Example launch command for Devstral Small 2505 (with at least 2 GPUs):
+- Example launch command (with at least 2 GPUs):

```bash
-vllm serve mistralai/Devstral-Small-2505 \
+vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct \
    --host 0.0.0.0 --port 8000 \
    --api-key mykey \
    --tensor-parallel-size 2 \
-   --served-model-name Devstral-Small-2505 \
+   --served-model-name Qwen3-Coder-30B-A3B-Instruct \
    --enable-prefix-caching
```

@@ -188,11 +188,11 @@ pip install git+https://github.com/snowflakedb/ArcticInference.git

2. Run the launch command with speculative decoding enabled:

```bash
-vllm serve mistralai/Devstral-Small-2505 \
+vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct \
    --host 0.0.0.0 --port 8000 \
    --api-key mykey \
    --tensor-parallel-size 2 \
-   --served-model-name Devstral-Small-2505 \
+   --served-model-name Qwen3-Coder-30B-A3B-Instruct \
    --speculative-config '{"method": "suffix"}'
```

@@ -216,7 +216,8 @@ Once OpenHands is running, open the Settings page in the UI and go to the `LLM`

2. Enable the **Advanced** toggle at the top of the page.
3. Set the following parameters, if you followed the examples above:
   - **Custom Model**: `openai/<served-model-name>`
-     e.g. `openai/devstral` if you're using Ollama, or `openai/Devstral-Small-2505` for SGLang or vLLM.
+     - For **Ollama**: `openai/qwen3-coder:30b`
+     - For **SGLang/vLLM**: `openai/Qwen3-Coder-30B-A3B-Instruct`
   - **Base URL**: `http://host.docker.internal:<port>/v1`
     Use port `11434` for Ollama, or `8000` for SGLang and vLLM.
   - **API Key**:
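Before pointing OpenHands at the endpoint, it can help to confirm that the server answers OpenAI-compatible requests. The sketch below is a minimal sanity check, assuming the vLLM example above (port `8000`, API key `mykey`, served model name `Qwen3-Coder-30B-A3B-Instruct`); swap in port `11434` and your model tag for Ollama, or port `1234` and key `local-llm` for LM Studio.

```bash
# List the models the server exposes over its OpenAI-compatible API.
# (Assumes the vLLM example above: port 8000, API key "mykey".)
curl http://localhost:8000/v1/models \
  -H "Authorization: Bearer mykey"

# Send a tiny chat completion to confirm the model actually responds.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer mykey" \
  -d '{
        "model": "Qwen3-Coder-30B-A3B-Instruct",
        "messages": [{"role": "user", "content": "Reply with the single word: ready"}]
      }'
```

If both calls succeed and the reported model name matches what you entered as **Custom Model** (minus the `openai/` prefix), OpenHands should be able to connect to the endpoint.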