Setup Guide: Preparing Your Backend for Benchmarking

The benchmark scenarios need enough context window to fit the full conversation history. If your context window is too small, the model will truncate input or refuse to generate, producing skewed or empty results.

Set your context window to at least 16K tokens before running the benchmark.

Minimum context per scenario

Scenario	Max context (tokens)	+ output buffer	Minimum needed
prefill-test	~8,500	150	~9,000
doc-summary	~6,000	150	~6,500
ops-agent	~4,500	500	~5,500
creative-writing	~60	2,000	~2,500

The prefill-test scenario is the most demanding — its 4th turn sends ~8,500 tokens of context to test how prefill scales. If you only want to run a single scenario with lower requirements, use --scenario scenarios/creative-writing.json.

Symptoms of insufficient context

Turns producing very short or empty output
Errors mid-benchmark (model refuses to generate)
Wildly inconsistent results between turns
Generation speed looks normal but effective tok/s is near zero

Ollama

Check context size

ollama show <model>

Look for context_length in the model details. Ollama auto-sizes context based on available memory, so it's usually fine out of the box.

Set context size

If you need to increase it, create a Modelfile:

FROM qwen3.5:35b-a3b
PARAMETER num_ctx 16384

Then create and use the custom model:

ollama create qwen3.5-16k -f Modelfile
python3 bench.py --model qwen3.5-16k

Alternatively, Ollama respects num_ctx in API requests. The benchmark doesn't currently pass this, but Ollama's auto-sizing usually picks a large enough value.

LM Studio

Check context size

Open the model settings in the LM Studio UI. The context length is shown in the model's configuration panel.

Set context size

Open LM Studio
Select the loaded model
Go to the model settings (gear icon)
Set Context Length to at least 16384
Restart the server if it was already running

Turning off thinking for Qwen3.5 tests:

Go to the My Models page in LM Studio.
Select the model
Locate Inference on the right-hand sidebar.
Scroll down to find the Prompt Template and enter into template(Jinja ) section.
Add {%- set enable_thinking = false %} to the first line of the template.
Reload your model.

oMLX

Check context size

Open the admin panel (default: http://localhost:8000/admin). Go to Settings -> Generation Defaults. Look for:

Max Context Window — reject prompts exceeding this token limit
Max Tokens — maximum output tokens per request

Set context size

Open the oMLX admin panel
Go to Settings -> Generation Defaults
Set Max Context Window to at least 16384
Set Max Tokens to at least 2000 (the creative-writing scenario needs this)

llama-server (raw llama.cpp)

Check context size

The context size is set at startup via the -c flag. Check how the server was launched.

Set context size

Start (or restart) the server with a sufficient context size:

llama-server -m model.gguf -c 16384 --port 8090

The -c flag sets the maximum context length in tokens.

Disabling thinking for Qwen3.5

Qwen3.5 models default to thinking enabled, which adds significant latency and distorts benchmark results. The /no_think soft switch that worked for Qwen3 has been removed in Qwen3.5.

Ollama

Ollama handles this automatically — the benchmark sends "think": false via the native API.

LM Studio / oMLX

For LM Studio and oMLX, you need to patch the model's chat template. The benchmark has a --no-think flag that does this automatically:

python3 bench.py --backend lmstudio --model mlx-community/qwen3.5-35b-a3b --no-think

This backs up the template, disables thinking, runs the benchmark, and always restores the original — even on errors or Ctrl+C.

There's also a standalone toggle script:

# Check current state
python3 scripts/qwen3.5-35b-a3b-toggle-thinking.py status

# Disable thinking
python3 scripts/qwen3.5-35b-a3b-toggle-thinking.py off

# Verify via API
python3 scripts/qwen3.5-35b-a3b-toggle-thinking.py verify

See scripts/README.md for details on how the template patch works and why /no_think doesn't work for Qwen3.5.

Step-by-step: running your first benchmark

Start your backend (Ollama, LM Studio, oMLX, or llama-server)

Load a model and verify it's running:

# Ollama (default port 11434)
curl http://localhost:11434/api/tags

# LM Studio (default port 1234)
curl http://localhost:1234/v1/models

# oMLX (default port 8000)
curl -H "Authorization: Bearer your-key" http://localhost:8000/v1/models

# llama-server (default port 8090)
curl http://localhost:8090/v1/models

Check your context window is at least 16K (see backend-specific instructions above)

Run the benchmark (pick your backend):

Ollama:

python3 bench.py --model qwen3.5:35b-a3b

LM Studio:

# Add --no-think for Qwen3.5 to disable thinking mode
python3 bench.py --backend lmstudio --model mlx-community/qwen3.5-35b-a3b --no-think

oMLX:

OPENAI_API_KEY=your-key python3 bench.py --backend openai --backend-label omlx \
  --base-url http://localhost:8000 --model "Qwen3.5-35B-A3B-4bit" --label "oMLX MLX" --no-think

llama-server (raw llama.cpp):

python3 bench.py --backend llama-server --base-url http://localhost:8090 \
  --model qwen3.5:35b-a3b --label "llama-server"

MiniMax (cloud):

export MINIMAX_API_KEY=your-key-here
python3 bench.py --backend minimax --model MiniMax-M2.5

Check the results — if you see empty turns or errors, check context size and thinking mode

Compare against other backends or hardware:

python3 compare.py results/<model>/<scenario>/*.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Setup Guide: Preparing Your Backend for Benchmarking

Minimum context per scenario

Symptoms of insufficient context

Ollama

Check context size

Set context size

LM Studio

Check context size

Set context size

Turning off thinking for Qwen3.5 tests:

oMLX

Check context size

Set context size

llama-server (raw llama.cpp)

Check context size

Set context size

Disabling thinking for Qwen3.5

Ollama

LM Studio / oMLX

Step-by-step: running your first benchmark

FilesExpand file tree

setup-guide.md

Latest commit

History

setup-guide.md

File metadata and controls

Setup Guide: Preparing Your Backend for Benchmarking

Minimum context per scenario

Symptoms of insufficient context

Ollama

Check context size

Set context size

LM Studio

Check context size

Set context size

Turning off thinking for Qwen3.5 tests:

oMLX

Check context size

Set context size

llama-server (raw llama.cpp)

Check context size

Set context size

Disabling thinking for Qwen3.5

Ollama

LM Studio / oMLX

Step-by-step: running your first benchmark