Summary
Stacks generated by mlx-stack enable continuous_batching: true by default. With vllm-mlx v0.2.6 + mlx 0.31.1, this makes every inference request fail in the engine loop, while the server still accepts connections (so requests just hang).
Error (server log, repeats per request)
ERROR:vllm_mlx.engine_core:Engine loop error: ArraysCache.__init__() missing 1 required positional argument: 'size'
Isolation (scratch port, single flag at a time)
| flag | result |
| --- | --- |
| --continuous-batching | ❌ inference fails, 198 ArraysCache errors logged |
| --use-paged-cache | ✅ inference OK, 0 errors |
So continuous_batching is the trigger; use_paged_cache is fine in isolation.
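For anyone repeating the isolation run, the error count can be pulled from a captured server log with a few lines of Python. This is only a sketch: the function name is made up for illustration, and it assumes the log has been saved as text and that each failure produces one line containing the message shown under Error.

```python
# Hypothetical helper, not part of mlx-stack or vllm-mlx.
# Marker text is copied from the error line quoted in this report.
MARKER = (
    "Engine loop error: ArraysCache.__init__() "
    "missing 1 required positional argument: 'size'"
)


def count_arrays_cache_errors(log_text: str) -> int:
    """Count log lines that contain the ArraysCache engine-loop error."""
    return sum(1 for line in log_text.splitlines() if MARKER in line)
```

Running it over the log from each single-flag run gives the numbers to compare in the table above.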
Repro
- Stack tier with vllm_flags: { continuous_batching: true }
- mlx-stack up
- curl localhost:8000/v1/chat/completions -d '{"model":"...","messages":[{"role":"user","content":"hi"}],"max_tokens":8}' → hangs; log fills with the error above.
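Because the curl call hangs rather than failing, a probe with an explicit timeout is easier to script. The sketch below sends the same request body as the curl command using only the Python standard library. The function names and the 30-second default are assumptions for illustration, and the model name is left for the caller to supply.

```python
import json
import socket
import urllib.error
import urllib.request


def build_payload(model: str) -> bytes:
    """Build the same minimal chat request used in the curl repro."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": "hi"}],
        "max_tokens": 8,
    }
    return json.dumps(body).encode("utf-8")


def probe(base_url: str, model: str, timeout_s: float = 30.0) -> str:
    """Return 'ok', 'timeout', 'http <code>', or 'unreachable: <reason>'."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=build_payload(model),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            return "ok" if response.status == 200 else f"http {response.status}"
    except urllib.error.HTTPError as exc:
        # Must be caught before URLError, which is its parent class.
        return f"http {exc.code}"
    except (TimeoutError, socket.timeout):
        # Connection accepted but no response arrived: the hang in this report.
        return "timeout"
    except urllib.error.URLError as exc:
        if isinstance(exc.reason, (TimeoutError, socket.timeout)):
            return "timeout"
        return f"unreachable: {exc.reason}"
```

With the failing configuration, `probe("http://localhost:8000", model_name)` would be expected to return "timeout" instead of blocking indefinitely, which makes the repro usable in a script.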
Notes / suggested fix
The root ArraysCache defect is almost certainly in vllm-mlx, but mlx-stack enabling continuous_batching by default means every generated stack is broken on this (current) vllm-mlx/mlx combo. Suggest one of: make it opt-in, gate the default behind a vllm-mlx/mlx version check, or validate it during up (see companion issue on health-check false positives). Removing continuous_batching from the tier flags fully resolves it.
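To make the version-gate option concrete, here is a minimal sketch of how the default could be chosen. It is not taken from the mlx-stack codebase: the function names are hypothetical, the deny-list contains only the one combination confirmed in this report, and the distribution names passed to `installed_version` would need to be checked against what pip actually installs.

```python
from importlib import metadata
from typing import Optional

# (vllm-mlx, mlx) pairs known to break continuous batching.
# Only the combination from this report is listed; other versions are untested.
KNOWN_BAD_COMBOS = {("0.2.6", "0.31.1")}


def normalize(version: str) -> str:
    """Strip whitespace and a leading 'v' so 'v0.2.6' matches '0.2.6'."""
    return version.strip().lstrip("v")


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def default_continuous_batching(vllm_mlx_version: str, mlx_version: str) -> bool:
    """Return False when the version pair is on the known-bad list."""
    pair = (normalize(vllm_mlx_version), normalize(mlx_version))
    return pair not in KNOWN_BAD_COMBOS
```

A deny-list keeps the current default for every other combination, so it is the least disruptive of the three options. Making the flag opt-in would be safer if more broken combinations turn up.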
Environment
- mlx-stack 0.3.8
- vllm-mlx v0.2.6
- mlx 0.31.1
- macOS 26.2 (arm64), Apple M4 Pro, 64 GB
- model: mlx-community/Qwen3.5-9B-4bit