Default continuous_batching breaks inference: ArraysCache.__init__() missing 'size' (vllm-mlx 0.2.6) #51

Description

@weklund

Summary

Stacks generated by mlx-stack enable continuous_batching: true by default. With vllm-mlx v0.2.6 + mlx 0.31.1, every inference request then fails in the engine loop. The server still accepts connections, so clients see requests hang instead of getting an error back.

Error (server log, repeats per request)

ERROR:vllm_mlx.engine_core:Engine loop error: ArraysCache.__init__() missing 1 required positional argument: 'size'
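The message has the shape of Python's standard TypeError for a constructor that is called without a required argument. Below is a minimal sketch of that failure mode. ArraysCacheLike and make_cache_without_size are hypothetical stand-ins, not the real ArraysCache or the real vllm-mlx call site, which this report does not pin down.

```python
class ArraysCacheLike:
    """Hypothetical stand-in for a cache class whose constructor requires `size`."""

    def __init__(self, size):
        self.cache = [None] * size


def make_cache_without_size():
    # Hypothetical caller that constructs the cache with no arguments,
    # e.g. code written against a constructor that did not require `size`.
    return ArraysCacheLike()


try:
    make_cache_without_size()
except TypeError as exc:
    # Same "missing 1 required positional argument: 'size'" shape as the server log.
    message = str(exc)
    print(message)
```

If the real cause is a signature mismatch like this, the fix belongs wherever the cache is constructed on the continuous-batching path.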

Isolation (scratch port, single flag at a time)

flag                    result
--continuous-batching   ❌ inference fails, 198 ArraysCache errors logged
--use-paged-cache       ✅ inference OK, 0 errors

So continuous_batching is the trigger; use_paged_cache is fine in isolation.
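For anyone repeating the isolation, the per-flag error counts can be tallied from a captured server log. This is a rough sketch: the function name is made up for illustration, and it assumes the log text has already been read into a string (the log file location depends on your setup and is not specified here).

```python
import re

# Matches the engine-loop error line quoted in this issue.
ERROR_RE = re.compile(r"Engine loop error: ArraysCache\.__init__\(\) missing")


def count_arrayscache_errors(log_text: str) -> int:
    """Count log lines that contain the ArraysCache engine-loop error."""
    return sum(1 for line in log_text.splitlines() if ERROR_RE.search(line))
```

A count of 0 after a few requests suggests the flag under test is not the trigger.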

Repro

  1. Stack tier with vllm_flags: { continuous_batching: true }
  2. mlx-stack up
  3. curl localhost:8000/v1/chat/completions -d '{"model":"...","messages":[{"role":"user","content":"hi"}],"max_tokens":8}' → hangs; log fills with the error above.
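Because the request in step 3 hangs rather than failing, a scripted repro is easier to work with when it has a client-side timeout. The sketch below sends the same payload using only the Python standard library. The function names, the 30-second default, and the explicit JSON Content-Type header are assumptions for illustration; the model name is left as a parameter because it is elided above.

```python
import json
import urllib.error
import urllib.request


def build_payload(model: str) -> bytes:
    """Build the same chat-completions body as the curl command in step 3."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": "hi"}],
        "max_tokens": 8,
    }
    return json.dumps(body).encode("utf-8")


def probe(base_url: str, model: str, timeout_s: float = 30.0) -> str:
    """Return 'ok', 'timeout', or a short failure description."""
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=build_payload(model),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout_s) as resp:
            return "ok" if resp.status == 200 else f"http {resp.status}"
    except urllib.error.HTTPError as exc:
        return f"http {exc.code}"
    except urllib.error.URLError as exc:
        # Connect-phase timeouts arrive wrapped in URLError.
        if isinstance(exc.reason, TimeoutError):
            return "timeout"
        return f"unreachable: {exc.reason}"
    except TimeoutError:
        # Read-phase timeouts: connection accepted, but no response arrived.
        return "timeout"
```

On an affected stack, calling probe("http://localhost:8000", model) would be expected to return "timeout", since the server accepts the connection but never answers.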

Notes / suggested fix

The root ArraysCache defect is almost certainly in vllm-mlx, but because mlx-stack enables continuous_batching by default, every generated stack is broken on this (current) vllm-mlx/mlx combo. Suggest one of: make it opt-in, gate the default behind a vllm-mlx/mlx version check, or validate it during mlx-stack up (see companion issue on health-check false positives). Removing continuous_batching from the tier flags fully resolves it.
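The version-gate option could look roughly like this. It is a sketch only: parse_version, KNOWN_BAD, and default_continuous_batching are hypothetical names, not existing mlx-stack code. The only pair listed as bad is the one reported in this issue, since which other versions are affected is not known here.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Parse a plain dotted version such as 'v0.2.6' or '0.31.1' into integers."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


# (vllm-mlx version, mlx version) pairs known to break continuous batching.
# Only the combination reported in this issue is listed; this is an assumption-free
# minimum, not a complete compatibility table.
KNOWN_BAD = {
    ((0, 2, 6), (0, 31, 1)),
}


def default_continuous_batching(vllm_mlx_version: str, mlx_version: str) -> bool:
    """Return False when the installed pair is known to break continuous batching."""
    pair = (parse_version(vllm_mlx_version), parse_version(mlx_version))
    return pair not in KNOWN_BAD
```

A denylist is the conservative choice when only one bad combination is confirmed. An allowlist of tested pairs would be safer, at the cost of disabling the feature on untested versions.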

Environment

  • mlx-stack 0.3.8
  • vllm-mlx v0.2.6
  • mlx 0.31.1
  • macOS 26.2 (arm64), Apple M4 Pro, 64 GB
  • model: mlx-community/Qwen3.5-9B-4bit
