From bb4bc308546eac7d3ef15d669336bd698927ab44 Mon Sep 17 00:00:00 2001 From: openhands Date: Fri, 12 Dec 2025 19:42:53 +0000 Subject: [PATCH 1/3] Update local LLMs documentation to feature Qwen 3 Coder 30B and Devstral Small 2512 - Updated News section to highlight the two recommended models - Revised Quickstart guide to cover both Qwen 3 Coder 30B and Devstral Small 2512 - Updated hardware requirements based on user feedback (12GB VRAM for Qwen 3 Coder) - Added context window recommendations (22k minimum for Qwen 3 Coder) - Updated all model references throughout the document including Ollama, SGLang, and vLLM sections - Removed references to older models (OpenHands LM 32B v0.1 and Devstral Small 2505) --- openhands/usage/llms/local-llms.mdx | 93 +++++++++++++++++++++-------- 1 file changed, 68 insertions(+), 25 deletions(-) diff --git a/openhands/usage/llms/local-llms.mdx b/openhands/usage/llms/local-llms.mdx index 1cb96ae6..bd8ccebd 100644 --- a/openhands/usage/llms/local-llms.mdx +++ b/openhands/usage/llms/local-llms.mdx @@ -5,34 +5,38 @@ description: When using a Local LLM, OpenHands may have limited functionality. I ## News -- 2025/05/21: We collaborated with Mistral AI and released [Devstral Small](https://mistral.ai/news/devstral) that achieves [46.8% on SWE-Bench Verified](https://github.com/SWE-bench/experiments/pull/228)! -- 2025/03/31: We released an open model OpenHands LM 32B v0.1 that achieves 37.1% on SWE-Bench Verified -([blog](https://openhands.dev/blog/introducing-openhands-lm-32b----a-strong-open-coding-agent-model), [model](https://huggingface.co/all-hands/openhands-lm-32b-v0.1)). +- 2025/12/12: We now recommend two powerful local models for OpenHands: [Qwen 3 Coder 30B](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct) and [Devstral Small 2512](https://mistral.ai/news/devstral). Both models deliver excellent performance on coding tasks and work great with OpenHands! ## Quickstart: Running OpenHands with a Local LLM using LM Studio -This guide explains how to serve a local Devstral LLM using [LM Studio](https://lmstudio.ai/) and have OpenHands connect to it. +This guide explains how to serve a local LLM using [LM Studio](https://lmstudio.ai/) and have OpenHands connect to it. We recommend: - **LM Studio** as the local model server, which handles metadata downloads automatically and offers a simple, user-friendly interface for configuration. -- **Devstral Small 2505** as the LLM for software development, trained on real GitHub issues and optimized for agent-style workflows like OpenHands. +- **Qwen 3 Coder 30B** or **Devstral Small 2512** as the LLM for software development. Both models are optimized for coding tasks and work excellently with agent-style workflows like OpenHands. ### Hardware Requirements -Running Devstral requires a recent GPU with at least 16GB of VRAM, or a Mac with Apple Silicon (M1, M2, etc.) with at least 32GB of RAM. +Running these models requires: +- **Qwen 3 Coder 30B**: A recent GPU with at least 12GB of VRAM (tested on RTX 3060 with 12GB VRAM + 64GB RAM), or a Mac with Apple Silicon with at least 32GB of RAM. +- **Devstral Small 2512**: A recent GPU with at least 16GB of VRAM, or a Mac with Apple Silicon with at least 32GB of RAM. ### 1. Install LM Studio Download and install the LM Studio desktop app from [lmstudio.ai](https://lmstudio.ai/). -### 2. Download Devstral Small +### 2. Download a Model 1. Make sure to set the User Interface Complexity Level to "Power User", by clicking on the appropriate label at the bottom of the window. 
2. Click the "Discover" button (Magnifying Glass icon) on the left navigation bar to open the Models download page. ![image](./screenshots/01_lm_studio_open_model_hub.png) -3. Search for the "Devstral Small 2505" model, confirm it's the official Mistral AI (mistralai) model, then proceed to download. +3. Search for either: + - **"Qwen 3 Coder 30B"** (also listed as Qwen2.5-Coder-32B-Instruct) - Recommended for systems with 12GB+ VRAM + - **"Devstral Small 2512"** - Recommended for systems with 16GB+ VRAM + + Confirm you're downloading from the official publisher, then proceed to download. ![image](./screenshots/02_lm_studio_download_devstral.png) @@ -46,12 +50,12 @@ Download and install the LM Studio desktop app from [lmstudio.ai](https://lmstud ![image](./screenshots/03_lm_studio_open_load_model.png) 3. Enable the "Manually choose model load parameters" switch. -4. Select 'Devstral Small 2505' from the model list. +4. Select your downloaded model from the model list. ![image](./screenshots/04_lm_studio_setup_devstral_part_1.png) 5. Enable the "Show advanced settings" switch at the bottom of the Model settings flyout to show all the available settings. -6. Set "Context Length" to at least 32768 and enable Flash Attention. +6. Set "Context Length" to at least 22000 (for Qwen 3 Coder on lower VRAM systems) or 32768 (recommended for better performance) and enable Flash Attention. 7. Click "Load Model" to start loading the model. ![image](./screenshots/05_lm_studio_setup_devstral_part_2.png) @@ -109,7 +113,9 @@ When started for the first time, OpenHands will prompt you to set up the LLM pro 2. Enable the "Advanced" switch at the top of the page to show all the available settings. 3. Set the following values: - - **Custom Model**: `openai/mistralai/devstral-small-2505` (the Model API identifier from LM Studio, prefixed with "openai/") + - **Custom Model**: Use the Model API identifier from LM Studio, prefixed with "openai/". For example: + - `openai/qwen/qwen-3-coder-30b-a3b` for Qwen 3 Coder 30B + - `openai/mistralai/devstral-small-2512` for Devstral Small 2512 - **Base URL**: `http://host.docker.internal:1234/v1` - **API Key**: `local-llm` @@ -128,33 +134,58 @@ This section describes how to run local LLMs with OpenHands using alternative ba ### Create an OpenAI-Compatible Endpoint with Ollama - Install Ollama following [the official documentation](https://ollama.com/download). -- Example launch command for Devstral Small 2505: +- Example launch commands: +**For Qwen 3 Coder 30B:** ```bash # ⚠️ WARNING: OpenHands requires a large context size to work properly. -# When using Ollama, set OLLAMA_CONTEXT_LENGTH to at least 32768. +# When using Ollama, set OLLAMA_CONTEXT_LENGTH to at least 22000. # The default (4096) is way too small — not even the system prompt will fit, and the agent will not behave correctly. OLLAMA_CONTEXT_LENGTH=32768 OLLAMA_HOST=0.0.0.0:11434 OLLAMA_KEEP_ALIVE=-1 nohup ollama serve & +ollama pull qwen2.5-coder:32b-instruct +``` + +**For Devstral Small 2512:** +```bash +OLLAMA_CONTEXT_LENGTH=32768 OLLAMA_HOST=0.0.0.0:11434 OLLAMA_KEEP_ALIVE=-1 nohup ollama serve & ollama pull devstral:latest ``` ### Create an OpenAI-Compatible Endpoint with vLLM or SGLang -First, download the model checkpoints. 
For [Devstral Small 2505](https://huggingface.co/mistralai/Devstral-Small-2505): +First, download the model checkpoints: + +**For Qwen 3 Coder 30B:** +```bash +huggingface-cli download Qwen/Qwen2.5-Coder-32B-Instruct --local-dir Qwen/Qwen2.5-Coder-32B-Instruct +``` +**For Devstral Small 2512:** ```bash -huggingface-cli download mistralai/Devstral-Small-2505 --local-dir mistralai/Devstral-Small-2505 +huggingface-cli download mistralai/Devstral-Small-2512 --local-dir mistralai/Devstral-Small-2512 ``` #### Serving the model using SGLang - Install SGLang following [the official documentation](https://docs.sglang.ai/start/install.html). -- Example launch command for Devstral Small 2505 (with at least 2 GPUs): +- Example launch commands (with at least 2 GPUs): +**For Qwen 3 Coder 30B:** ```bash SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 python3 -m sglang.launch_server \ - --model mistralai/Devstral-Small-2505 \ - --served-model-name Devstral-Small-2505 \ + --model Qwen/Qwen2.5-Coder-32B-Instruct \ + --served-model-name Qwen2.5-Coder-32B-Instruct \ + --port 8000 \ + --tp 2 --dp 1 \ + --host 0.0.0.0 \ + --api-key mykey --context-length 131072 +``` + +**For Devstral Small 2512:** +```bash +SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 python3 -m sglang.launch_server \ + --model mistralai/Devstral-Small-2512 \ + --served-model-name Devstral-Small-2512 \ --port 8000 \ --tp 2 --dp 1 \ --host 0.0.0.0 \ @@ -164,14 +195,25 @@ SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 python3 -m sglang.launch_server \ #### Serving the model using vLLM - Install vLLM following [the official documentation](https://docs.vllm.ai/en/latest/getting_started/installation.html). -- Example launch command for Devstral Small 2505 (with at least 2 GPUs): +- Example launch commands (with at least 2 GPUs): + +**For Qwen 3 Coder 30B:** +```bash +vllm serve Qwen/Qwen2.5-Coder-32B-Instruct \ + --host 0.0.0.0 --port 8000 \ + --api-key mykey \ + --tensor-parallel-size 2 \ + --served-model-name Qwen2.5-Coder-32B-Instruct \ + --enable-prefix-caching +``` +**For Devstral Small 2512:** ```bash -vllm serve mistralai/Devstral-Small-2505 \ +vllm serve mistralai/Devstral-Small-2512 \ --host 0.0.0.0 --port 8000 \ --api-key mykey \ --tensor-parallel-size 2 \ - --served-model-name Devstral-Small-2505 \ + --served-model-name Devstral-Small-2512 \ --enable-prefix-caching ``` @@ -185,14 +227,14 @@ which can achieve up to 2x speedup in some cases. pip install git+https://github.com/snowflakedb/ArcticInference.git ``` -2. Run the launch command with speculative decoding enabled: +2. Run the launch command with speculative decoding enabled (example for Qwen 3 Coder 30B): ```bash -vllm serve mistralai/Devstral-Small-2505 \ +vllm serve Qwen/Qwen2.5-Coder-32B-Instruct \ --host 0.0.0.0 --port 8000 \ --api-key mykey \ --tensor-parallel-size 2 \ - --served-model-name Devstral-Small-2505 \ + --served-model-name Qwen2.5-Coder-32B-Instruct \ --speculative-config '{"method": "suffix"}' ``` @@ -216,7 +258,8 @@ Once OpenHands is running, open the Settings page in the UI and go to the `LLM` 2. Enable the **Advanced** toggle at the top of the page. 3. Set the following parameters, if you followed the examples above: - **Custom Model**: `openai/` - e.g. `openai/devstral` if you're using Ollama, or `openai/Devstral-Small-2505` for SGLang or vLLM. 
+ - For **Ollama**: `openai/qwen2.5-coder:32b-instruct` or `openai/devstral` + - For **SGLang/vLLM**: `openai/Qwen2.5-Coder-32B-Instruct` or `openai/Devstral-Small-2512` - **Base URL**: `http://host.docker.internal:/v1` Use port `11434` for Ollama, or `8000` for SGLang and vLLM. - **API Key**: From 91faa1ffef7c5ec8a0a4e4a51657dc90f4097963 Mon Sep 17 00:00:00 2001 From: openhands Date: Fri, 12 Dec 2025 19:47:30 +0000 Subject: [PATCH 2/3] Fix model names to use correct official identifiers - Corrected Qwen model name to Qwen3-Coder-30B-A3B-Instruct (not Qwen2.5-Coder-32B-Instruct) - Updated Devstral to full official name: Devstral-Small-2-24B-Instruct-2512 - Fixed Ollama pull commands: qwen3-coder:30b and devstral-small-2 - Updated all HuggingFace repository paths to match official releases - Corrected model identifiers in all examples (LM Studio, Ollama, SGLang, vLLM) --- openhands/usage/llms/local-llms.mdx | 68 ++++++++++++++--------------- 1 file changed, 34 insertions(+), 34 deletions(-) diff --git a/openhands/usage/llms/local-llms.mdx b/openhands/usage/llms/local-llms.mdx index bd8ccebd..390d12b5 100644 --- a/openhands/usage/llms/local-llms.mdx +++ b/openhands/usage/llms/local-llms.mdx @@ -5,7 +5,7 @@ description: When using a Local LLM, OpenHands may have limited functionality. I ## News -- 2025/12/12: We now recommend two powerful local models for OpenHands: [Qwen 3 Coder 30B](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct) and [Devstral Small 2512](https://mistral.ai/news/devstral). Both models deliver excellent performance on coding tasks and work great with OpenHands! +- 2025/12/12: We now recommend two powerful local models for OpenHands: [Qwen3-Coder-30B-A3B-Instruct](https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct) and [Devstral Small 2 (24B)](https://huggingface.co/mistralai/Devstral-Small-2-24B-Instruct-2512). Both models deliver excellent performance on coding tasks and work great with OpenHands! ## Quickstart: Running OpenHands with a Local LLM using LM Studio @@ -13,13 +13,13 @@ This guide explains how to serve a local LLM using [LM Studio](https://lmstudio. We recommend: - **LM Studio** as the local model server, which handles metadata downloads automatically and offers a simple, user-friendly interface for configuration. -- **Qwen 3 Coder 30B** or **Devstral Small 2512** as the LLM for software development. Both models are optimized for coding tasks and work excellently with agent-style workflows like OpenHands. +- **Qwen3-Coder-30B-A3B-Instruct** or **Devstral Small 2 (24B)** as the LLM for software development. Both models are optimized for coding tasks and work excellently with agent-style workflows like OpenHands. ### Hardware Requirements Running these models requires: -- **Qwen 3 Coder 30B**: A recent GPU with at least 12GB of VRAM (tested on RTX 3060 with 12GB VRAM + 64GB RAM), or a Mac with Apple Silicon with at least 32GB of RAM. -- **Devstral Small 2512**: A recent GPU with at least 16GB of VRAM, or a Mac with Apple Silicon with at least 32GB of RAM. +- **Qwen3-Coder-30B-A3B-Instruct**: A recent GPU with at least 12GB of VRAM (tested on RTX 3060 with 12GB VRAM + 64GB RAM), or a Mac with Apple Silicon with at least 32GB of RAM. +- **Devstral Small 2 (24B)**: A recent GPU with at least 16GB of VRAM, or a Mac with Apple Silicon with at least 32GB of RAM. ### 1. Install LM Studio @@ -33,10 +33,10 @@ Download and install the LM Studio desktop app from [lmstudio.ai](https://lmstud ![image](./screenshots/01_lm_studio_open_model_hub.png) 3. 
Search for either: - - **"Qwen 3 Coder 30B"** (also listed as Qwen2.5-Coder-32B-Instruct) - Recommended for systems with 12GB+ VRAM - - **"Devstral Small 2512"** - Recommended for systems with 16GB+ VRAM + - **"Qwen3-Coder-30B-A3B-Instruct"** - Recommended for systems with 12GB+ VRAM + - **"Devstral Small 2"** or **"Devstral-Small-2-24B-Instruct-2512"** - Recommended for systems with 16GB+ VRAM - Confirm you're downloading from the official publisher, then proceed to download. + Confirm you're downloading from the official publisher (Qwen or Mistral AI), then proceed to download. ![image](./screenshots/02_lm_studio_download_devstral.png) @@ -114,8 +114,8 @@ When started for the first time, OpenHands will prompt you to set up the LLM pro 3. Set the following values: - **Custom Model**: Use the Model API identifier from LM Studio, prefixed with "openai/". For example: - - `openai/qwen/qwen-3-coder-30b-a3b` for Qwen 3 Coder 30B - - `openai/mistralai/devstral-small-2512` for Devstral Small 2512 + - `openai/qwen/qwen3-coder-30b-a3b-instruct` for Qwen3-Coder-30B-A3B-Instruct + - `openai/mistralai/devstral-small-2-24b-instruct-2512` for Devstral Small 2 - **Base URL**: `http://host.docker.internal:1234/v1` - **API Key**: `local-llm` @@ -136,33 +136,33 @@ This section describes how to run local LLMs with OpenHands using alternative ba - Install Ollama following [the official documentation](https://ollama.com/download). - Example launch commands: -**For Qwen 3 Coder 30B:** +**For Qwen3-Coder-30B-A3B-Instruct:** ```bash # ⚠️ WARNING: OpenHands requires a large context size to work properly. # When using Ollama, set OLLAMA_CONTEXT_LENGTH to at least 22000. # The default (4096) is way too small — not even the system prompt will fit, and the agent will not behave correctly. OLLAMA_CONTEXT_LENGTH=32768 OLLAMA_HOST=0.0.0.0:11434 OLLAMA_KEEP_ALIVE=-1 nohup ollama serve & -ollama pull qwen2.5-coder:32b-instruct +ollama pull qwen3-coder:30b ``` -**For Devstral Small 2512:** +**For Devstral Small 2:** ```bash OLLAMA_CONTEXT_LENGTH=32768 OLLAMA_HOST=0.0.0.0:11434 OLLAMA_KEEP_ALIVE=-1 nohup ollama serve & -ollama pull devstral:latest +ollama pull devstral-small-2 ``` ### Create an OpenAI-Compatible Endpoint with vLLM or SGLang First, download the model checkpoints: -**For Qwen 3 Coder 30B:** +**For Qwen3-Coder-30B-A3B-Instruct:** ```bash -huggingface-cli download Qwen/Qwen2.5-Coder-32B-Instruct --local-dir Qwen/Qwen2.5-Coder-32B-Instruct +huggingface-cli download Qwen/Qwen3-Coder-30B-A3B-Instruct --local-dir Qwen/Qwen3-Coder-30B-A3B-Instruct ``` -**For Devstral Small 2512:** +**For Devstral Small 2:** ```bash -huggingface-cli download mistralai/Devstral-Small-2512 --local-dir mistralai/Devstral-Small-2512 +huggingface-cli download mistralai/Devstral-Small-2-24B-Instruct-2512 --local-dir mistralai/Devstral-Small-2-24B-Instruct-2512 ``` #### Serving the model using SGLang @@ -170,22 +170,22 @@ huggingface-cli download mistralai/Devstral-Small-2512 --local-dir mistralai/Dev - Install SGLang following [the official documentation](https://docs.sglang.ai/start/install.html). 
- Example launch commands (with at least 2 GPUs): -**For Qwen 3 Coder 30B:** +**For Qwen3-Coder-30B-A3B-Instruct:** ```bash SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 python3 -m sglang.launch_server \ - --model Qwen/Qwen2.5-Coder-32B-Instruct \ - --served-model-name Qwen2.5-Coder-32B-Instruct \ + --model Qwen/Qwen3-Coder-30B-A3B-Instruct \ + --served-model-name Qwen3-Coder-30B-A3B-Instruct \ --port 8000 \ --tp 2 --dp 1 \ --host 0.0.0.0 \ --api-key mykey --context-length 131072 ``` -**For Devstral Small 2512:** +**For Devstral Small 2:** ```bash SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 python3 -m sglang.launch_server \ - --model mistralai/Devstral-Small-2512 \ - --served-model-name Devstral-Small-2512 \ + --model mistralai/Devstral-Small-2-24B-Instruct-2512 \ + --served-model-name Devstral-Small-2-24B-Instruct-2512 \ --port 8000 \ --tp 2 --dp 1 \ --host 0.0.0.0 \ @@ -197,23 +197,23 @@ SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 python3 -m sglang.launch_server \ - Install vLLM following [the official documentation](https://docs.vllm.ai/en/latest/getting_started/installation.html). - Example launch commands (with at least 2 GPUs): -**For Qwen 3 Coder 30B:** +**For Qwen3-Coder-30B-A3B-Instruct:** ```bash -vllm serve Qwen/Qwen2.5-Coder-32B-Instruct \ +vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct \ --host 0.0.0.0 --port 8000 \ --api-key mykey \ --tensor-parallel-size 2 \ - --served-model-name Qwen2.5-Coder-32B-Instruct \ + --served-model-name Qwen3-Coder-30B-A3B-Instruct \ --enable-prefix-caching ``` -**For Devstral Small 2512:** +**For Devstral Small 2:** ```bash -vllm serve mistralai/Devstral-Small-2512 \ +vllm serve mistralai/Devstral-Small-2-24B-Instruct-2512 \ --host 0.0.0.0 --port 8000 \ --api-key mykey \ --tensor-parallel-size 2 \ - --served-model-name Devstral-Small-2512 \ + --served-model-name Devstral-Small-2-24B-Instruct-2512 \ --enable-prefix-caching ``` @@ -227,14 +227,14 @@ which can achieve up to 2x speedup in some cases. pip install git+https://github.com/snowflakedb/ArcticInference.git ``` -2. Run the launch command with speculative decoding enabled (example for Qwen 3 Coder 30B): +2. Run the launch command with speculative decoding enabled (example for Qwen3-Coder-30B-A3B-Instruct): ```bash -vllm serve Qwen/Qwen2.5-Coder-32B-Instruct \ +vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct \ --host 0.0.0.0 --port 8000 \ --api-key mykey \ --tensor-parallel-size 2 \ - --served-model-name Qwen2.5-Coder-32B-Instruct \ + --served-model-name Qwen3-Coder-30B-A3B-Instruct \ --speculative-config '{"method": "suffix"}' ``` @@ -258,8 +258,8 @@ Once OpenHands is running, open the Settings page in the UI and go to the `LLM` 2. Enable the **Advanced** toggle at the top of the page. 3. Set the following parameters, if you followed the examples above: - **Custom Model**: `openai/` - - For **Ollama**: `openai/qwen2.5-coder:32b-instruct` or `openai/devstral` - - For **SGLang/vLLM**: `openai/Qwen2.5-Coder-32B-Instruct` or `openai/Devstral-Small-2512` + - For **Ollama**: `openai/qwen3-coder:30b` or `openai/devstral-small-2` + - For **SGLang/vLLM**: `openai/Qwen3-Coder-30B-A3B-Instruct` or `openai/Devstral-Small-2-24B-Instruct-2512` - **Base URL**: `http://host.docker.internal:/v1` Use port `11434` for Ollama, or `8000` for SGLang and vLLM. 
- **API Key**: From b376294b06e60a43d15896778b4fc8948d34d87d Mon Sep 17 00:00:00 2001 From: openhands Date: Fri, 12 Dec 2025 19:57:20 +0000 Subject: [PATCH 3/3] Simplify instructions to use only Qwen3-Coder-30B-A3B-Instruct in examples - Removed Devstral from all example commands to keep instructions concise - Kept Devstral in News section as an alternative option - Updated all sections: LM Studio, Ollama, SGLang, vLLM - Simplified hardware requirements and configuration examples --- openhands/usage/llms/local-llms.mdx | 74 +++++++---------------------- 1 file changed, 16 insertions(+), 58 deletions(-) diff --git a/openhands/usage/llms/local-llms.mdx b/openhands/usage/llms/local-llms.mdx index 390d12b5..f002a08c 100644 --- a/openhands/usage/llms/local-llms.mdx +++ b/openhands/usage/llms/local-llms.mdx @@ -13,30 +13,26 @@ This guide explains how to serve a local LLM using [LM Studio](https://lmstudio. We recommend: - **LM Studio** as the local model server, which handles metadata downloads automatically and offers a simple, user-friendly interface for configuration. -- **Qwen3-Coder-30B-A3B-Instruct** or **Devstral Small 2 (24B)** as the LLM for software development. Both models are optimized for coding tasks and work excellently with agent-style workflows like OpenHands. +- **Qwen3-Coder-30B-A3B-Instruct** as the LLM for software development. This model is optimized for coding tasks and works excellently with agent-style workflows like OpenHands. ### Hardware Requirements -Running these models requires: -- **Qwen3-Coder-30B-A3B-Instruct**: A recent GPU with at least 12GB of VRAM (tested on RTX 3060 with 12GB VRAM + 64GB RAM), or a Mac with Apple Silicon with at least 32GB of RAM. -- **Devstral Small 2 (24B)**: A recent GPU with at least 16GB of VRAM, or a Mac with Apple Silicon with at least 32GB of RAM. +Running Qwen3-Coder-30B-A3B-Instruct requires: +- A recent GPU with at least 12GB of VRAM (tested on RTX 3060 with 12GB VRAM + 64GB RAM), or +- A Mac with Apple Silicon with at least 32GB of RAM ### 1. Install LM Studio Download and install the LM Studio desktop app from [lmstudio.ai](https://lmstudio.ai/). -### 2. Download a Model +### 2. Download the Model 1. Make sure to set the User Interface Complexity Level to "Power User", by clicking on the appropriate label at the bottom of the window. 2. Click the "Discover" button (Magnifying Glass icon) on the left navigation bar to open the Models download page. ![image](./screenshots/01_lm_studio_open_model_hub.png) -3. Search for either: - - **"Qwen3-Coder-30B-A3B-Instruct"** - Recommended for systems with 12GB+ VRAM - - **"Devstral Small 2"** or **"Devstral-Small-2-24B-Instruct-2512"** - Recommended for systems with 16GB+ VRAM - - Confirm you're downloading from the official publisher (Qwen or Mistral AI), then proceed to download. +3. Search for **"Qwen3-Coder-30B-A3B-Instruct"**, confirm you're downloading from the official Qwen publisher, then proceed to download. ![image](./screenshots/02_lm_studio_download_devstral.png) @@ -50,12 +46,12 @@ Download and install the LM Studio desktop app from [lmstudio.ai](https://lmstud ![image](./screenshots/03_lm_studio_open_load_model.png) 3. Enable the "Manually choose model load parameters" switch. -4. Select your downloaded model from the model list. +4. Select **Qwen3-Coder-30B-A3B-Instruct** from the model list. ![image](./screenshots/04_lm_studio_setup_devstral_part_1.png) 5. 
Enable the "Show advanced settings" switch at the bottom of the Model settings flyout to show all the available settings. -6. Set "Context Length" to at least 22000 (for Qwen 3 Coder on lower VRAM systems) or 32768 (recommended for better performance) and enable Flash Attention. +6. Set "Context Length" to at least 22000 (for lower VRAM systems) or 32768 (recommended for better performance) and enable Flash Attention. 7. Click "Load Model" to start loading the model. ![image](./screenshots/05_lm_studio_setup_devstral_part_2.png) @@ -113,9 +109,7 @@ When started for the first time, OpenHands will prompt you to set up the LLM pro 2. Enable the "Advanced" switch at the top of the page to show all the available settings. 3. Set the following values: - - **Custom Model**: Use the Model API identifier from LM Studio, prefixed with "openai/". For example: - - `openai/qwen/qwen3-coder-30b-a3b-instruct` for Qwen3-Coder-30B-A3B-Instruct - - `openai/mistralai/devstral-small-2-24b-instruct-2512` for Devstral Small 2 + - **Custom Model**: `openai/qwen/qwen3-coder-30b-a3b-instruct` (the Model API identifier from LM Studio, prefixed with "openai/") - **Base URL**: `http://host.docker.internal:1234/v1` - **API Key**: `local-llm` @@ -134,9 +128,8 @@ This section describes how to run local LLMs with OpenHands using alternative ba ### Create an OpenAI-Compatible Endpoint with Ollama - Install Ollama following [the official documentation](https://ollama.com/download). -- Example launch commands: +- Example launch command for Qwen3-Coder-30B-A3B-Instruct: -**For Qwen3-Coder-30B-A3B-Instruct:** ```bash # ⚠️ WARNING: OpenHands requires a large context size to work properly. # When using Ollama, set OLLAMA_CONTEXT_LENGTH to at least 22000. @@ -145,32 +138,19 @@ OLLAMA_CONTEXT_LENGTH=32768 OLLAMA_HOST=0.0.0.0:11434 OLLAMA_KEEP_ALIVE=-1 nohup ollama pull qwen3-coder:30b ``` -**For Devstral Small 2:** -```bash -OLLAMA_CONTEXT_LENGTH=32768 OLLAMA_HOST=0.0.0.0:11434 OLLAMA_KEEP_ALIVE=-1 nohup ollama serve & -ollama pull devstral-small-2 -``` - ### Create an OpenAI-Compatible Endpoint with vLLM or SGLang -First, download the model checkpoints: +First, download the model checkpoint: -**For Qwen3-Coder-30B-A3B-Instruct:** ```bash huggingface-cli download Qwen/Qwen3-Coder-30B-A3B-Instruct --local-dir Qwen/Qwen3-Coder-30B-A3B-Instruct ``` -**For Devstral Small 2:** -```bash -huggingface-cli download mistralai/Devstral-Small-2-24B-Instruct-2512 --local-dir mistralai/Devstral-Small-2-24B-Instruct-2512 -``` - #### Serving the model using SGLang - Install SGLang following [the official documentation](https://docs.sglang.ai/start/install.html). 
-- Example launch commands (with at least 2 GPUs): +- Example launch command (with at least 2 GPUs): -**For Qwen3-Coder-30B-A3B-Instruct:** ```bash SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 python3 -m sglang.launch_server \ --model Qwen/Qwen3-Coder-30B-A3B-Instruct \ @@ -181,23 +161,11 @@ SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 python3 -m sglang.launch_server \ --api-key mykey --context-length 131072 ``` -**For Devstral Small 2:** -```bash -SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 python3 -m sglang.launch_server \ - --model mistralai/Devstral-Small-2-24B-Instruct-2512 \ - --served-model-name Devstral-Small-2-24B-Instruct-2512 \ - --port 8000 \ - --tp 2 --dp 1 \ - --host 0.0.0.0 \ - --api-key mykey --context-length 131072 -``` - #### Serving the model using vLLM - Install vLLM following [the official documentation](https://docs.vllm.ai/en/latest/getting_started/installation.html). -- Example launch commands (with at least 2 GPUs): +- Example launch command (with at least 2 GPUs): -**For Qwen3-Coder-30B-A3B-Instruct:** ```bash vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct \ --host 0.0.0.0 --port 8000 \ @@ -207,16 +175,6 @@ vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct \ --enable-prefix-caching ``` -**For Devstral Small 2:** -```bash -vllm serve mistralai/Devstral-Small-2-24B-Instruct-2512 \ - --host 0.0.0.0 --port 8000 \ - --api-key mykey \ - --tensor-parallel-size 2 \ - --served-model-name Devstral-Small-2-24B-Instruct-2512 \ - --enable-prefix-caching -``` - If you are interested in further improved inference speed, you can also try Snowflake's version of vLLM, [ArcticInference](https://www.snowflake.com/en/engineering-blog/fast-speculative-decoding-vllm-arctic/), which can achieve up to 2x speedup in some cases. @@ -227,7 +185,7 @@ which can achieve up to 2x speedup in some cases. pip install git+https://github.com/snowflakedb/ArcticInference.git ``` -2. Run the launch command with speculative decoding enabled (example for Qwen3-Coder-30B-A3B-Instruct): +2. Run the launch command with speculative decoding enabled: ```bash vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct \ @@ -258,8 +216,8 @@ Once OpenHands is running, open the Settings page in the UI and go to the `LLM` 2. Enable the **Advanced** toggle at the top of the page. 3. Set the following parameters, if you followed the examples above: - **Custom Model**: `openai/` - - For **Ollama**: `openai/qwen3-coder:30b` or `openai/devstral-small-2` - - For **SGLang/vLLM**: `openai/Qwen3-Coder-30B-A3B-Instruct` or `openai/Devstral-Small-2-24B-Instruct-2512` + - For **Ollama**: `openai/qwen3-coder:30b` + - For **SGLang/vLLM**: `openai/Qwen3-Coder-30B-A3B-Instruct` - **Base URL**: `http://host.docker.internal:/v1` Use port `11434` for Ollama, or `8000` for SGLang and vLLM. - **API Key**: