
Batch inference requests (MLX) #1474

Closed
hechibing wants to merge 2 commits into exo-explore:main from hechibing:feat/inference-batching-1020

Conversation

@hechibing

Fixes #1020
Refs #1019

Bounty: $200

What changed:

  • Add MLX batch generation helper (uses mlx-lm BatchGenerator)
  • Runner now opportunistically batches compatible non-stream chat requests that arrive close together, then multiplexes per-token chunks back to each request

Batching rules (safe subset):

  • stream=false
  • no tools / tool calling
  • no logprobs
  • no stop sequences
  • enable_thinking=false
  • identical temperature/top_p/top_k/seed/model across the batch
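
A minimal sketch of what these rules could look like in code. The task shape and field names below are hypothetical stand-ins, not exo's actual TextGeneration API:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical task shape for illustration; exo's real task type differs.
@dataclass
class ChatTask:
    model: str
    stream: bool = False
    tools: Optional[list] = None
    logprobs: bool = False
    stop: Optional[list] = None
    enable_thinking: bool = False
    temperature: float = 0.0
    top_p: float = 1.0
    top_k: int = 0
    seed: Optional[int] = None

def is_batchable(task: ChatTask) -> bool:
    # Only the conservative subset above is eligible for batching.
    return not (task.stream or task.tools or task.logprobs
                or task.stop or task.enable_thinking)

def same_batch_settings(a: ChatTask, b: ChatTask) -> bool:
    # Sampling settings and model must match across the whole batch.
    return (a.model, a.temperature, a.top_p, a.top_k, a.seed) == \
           (b.model, b.temperature, b.top_p, b.top_k, b.seed)
```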

Tuning:

  • EXO_BATCH_MAX_SIZE (default: 4)
  • EXO_BATCH_MAX_WAIT_S (default: 0.005)
  • EXO_BATCH_COMPLETION_SIZE (default: 8)
  • EXO_BATCH_PREFILL_SIZE (default: 8)

Verification:

  • python -m py_compile src/exo/worker/runner/runner.py src/exo/worker/engines/mlx/generator/generate.py

@AlexCheema
Contributor

Code Review — PR #1474: Batch inference requests (MLX)

CI: No checks (fork PR — CI doesn't run automatically)

Overview

+338/-102 across 2 files. Adds opportunistic batching for non-streaming text generation using mlx_lm's BatchGenerator. When multiple compatible requests arrive within a short window (default 5ms), they're batched together. Batchability checks correctly exclude streaming, tools, logprobs, stop sequences, and thinking mode.

Critical issues

1. Blocker: PR will silently delete the cancellation system

Branch is based on pre-cancellation main. The cancellation system (cancel_receiver, cancelled_tasks, check_for_cancel_every, CANCEL_CURRENT_TASK) was merged in PR #1276 (commit 2759e92) after this PR was authored. Current main has ~15 lines of cancellation logic in runner.py — none of which exist in this PR's branch. Merging will silently remove all cancellation support.

Verified: current main has cancel_receiver referenced 7+ times in runner.py; this PR has zero references.

2. Exception handler only notifies first task in batch

except Exception as e:
    event_sender.send(ChunkGenerated(command_id=command_id, ...))  # only first task!
    raise

command_id comes from the case TextGeneration(command_id=command_id) match — the first task. Other batch members get no error notification and hang forever. The raise then kills the runner process, affecting everything.
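
One way to fix this is to fan the error out to every batched request before re-raising. A sketch with a stand-in sender — `ListSender` and the event-dict shape are illustrative, not exo's real event types:

```python
class ListSender:
    """Minimal stand-in for the runner's event sender."""
    def __init__(self):
        self.sent = []
    def send(self, event):
        self.sent.append(event)

def notify_batch_failure(sender, command_ids, exc):
    # Every batched request gets an error event, so no client hangs
    # waiting for a chunk that will never arrive.
    for cid in command_ids:
        sender.send({"command_id": cid, "error": repr(exc)})

sender = ListSender()
notify_batch_failure(sender, ["req-1", "req-2", "req-3"], RuntimeError("oom"))
# All three requests are notified, not just the first.
```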

3. Model-specific processing skipped in batch path

The batch path bypasses filter_kimi_tokens, patch_kimi_tokenizer, patch_glm_tokenizer, and parse_gpt_oss. These are NOT gated behind enable_thinking or tools, so a Kimi model with enable_thinking=false and no tools passes _is_batchable_text_task but gets raw unfiltered output.

Fix: exclude these models from batching, or apply post-processing to batch outputs.

Performance concerns

4. KV prefix cache completely bypassed
mlx_batch_generate() creates a fresh BatchGenerator each time, ignoring kv_prefix_cache. This is a performance regression for the common case of repeated conversations with shared system prompts.

5. No cancellation support in batch path
Even after rebase, the batch generation loop has no mechanism to check for cancellation. A long batch generation can't be interrupted.

Minor

  • _env_int uses __import__("os") when os is a standard import — unnecessary indirection
  • Single-token decode (tokenizer.decode([token_id])) may produce garbled text for multi-byte characters; stream_generate handles this properly
  • Same seed for all batch members — identical prompts in the same batch produce identical output (correct per semantics but may surprise users)
  • pending.pop(0) is O(n) — use collections.deque (minor, batch is small)
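
Two of these points can be sketched quickly: `deque.popleft()` is O(1) where `list.pop(0)` is O(n), and a multi-byte UTF-8 character can span several tokens, so per-token decoding needs byte buffering. This is an illustrative approach, not how stream_generate actually implements it:

```python
from collections import deque

# O(1) pops from the front, unlike list.pop(0).
pending = deque(["task-1", "task-2"])
first = pending.popleft()

# Buffer raw bytes and emit only once they decode cleanly. Here the
# character "你" (three UTF-8 bytes) is split across three pretend tokens.
buf, out = b"", []
for token_bytes in [b"\xe4", b"\xbd", b"\xa0", b"ok"]:
    buf += token_bytes
    try:
        out.append(buf.decode("utf-8"))
        buf = b""
    except UnicodeDecodeError:
        continue  # incomplete character: wait for more bytes
```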

What's good

  • Batchability checks are conservative and correct
  • Non-matching tasks properly queued in pending — no task loss
  • running_notified set prevents duplicate status events
  • BatchGenerator integration is clean
  • Configurable via environment variables with sensible defaults

Verdict

Do not merge. Must rebase onto current main (cancellation system), fix batch error handling (notify all tasks on failure), and address model-specific processing gaps. The batching concept is sound and the BatchGenerator integration is well-implemented — this needs a second pass, not a redesign.

@AlexCheema
Contributor

Code Review: PR #1474 — Batch inference requests (MLX)

Summary

Adds opportunistic batching for compatible non-streaming text generation requests using mlx-lm's BatchGenerator. Requests that arrive within a configurable window are batched together and results are multiplexed back.

Review

Batching criteria (safe subset):

  • stream=false ✅ (streaming requires per-request event flow)
  • No tools/tool calling ✅
  • No logprobs ✅
  • No stop sequences ✅
  • enable_thinking=false ✅
  • Same temperature/top_p/top_k/seed/model ✅

These constraints are conservative and correct — only batch requests that produce equivalent sampling behavior.

mlx_batch_generate function:

  • Uses mlx-lm's BatchGenerator for efficient batched inference ✅
  • Returns (index, GenerationResponse) tuples to multiplex results ✅
  • Env-configurable batch sizes (EXO_BATCH_COMPLETION_SIZE, EXO_BATCH_PREFILL_SIZE) ✅
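
The multiplexing idea can be sketched independently of mlx-lm: a batched decode loop yields (batch_index, token) pairs, and the caller routes each token back to its originating request. `fake_batch_generate` below is a mock, not BatchGenerator's real API:

```python
def fake_batch_generate(prompts):
    # Mock of a batched decode loop: one "token" per live sequence per step.
    token_lists = [p.split() for p in prompts]
    for step in range(max(len(toks) for toks in token_lists)):
        for i, toks in enumerate(token_lists):
            if step < len(toks):
                yield i, toks[step]

def multiplex(prompts):
    # Route each (index, token) pair back to the request it belongs to.
    results = [[] for _ in prompts]
    for i, tok in fake_batch_generate(prompts):
        results[i].append(tok)
    return [" ".join(r) for r in results]
```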

Runner batching logic:

  • Waits up to EXO_BATCH_MAX_WAIT_S (5ms default) for more requests ✅
  • Incompatible requests go to a pending list for sequential processing ✅
  • batch_max_size capped at 4 by default ✅

Issues

1. WouldBlock import from channels

from exo.utils.channels import MpReceiver, MpSender, WouldBlock

Is WouldBlock a new exception class? It's not in the diff. If it doesn't exist in the current codebase, this will be an import error.

2. tasks.receive_nowait() — is this a valid API?

The code calls tasks.receive_nowait() on what appears to be an MpReceiver iterator. The channel abstraction may not expose receive_nowait() through the iterator protocol. Need to verify this method exists.

3. Busy-wait loop in batching

while len(batch) < batch_max_size:
    if time.perf_counter() - start >= batch_max_wait_s:
        break
    try:
        nxt = tasks.receive_nowait()
    except WouldBlock:
        time.sleep(0.0005)
        continue

This is a busy-wait with 0.5ms sleeps. For 5ms max wait, that's ~10 iterations. Fine for the default config, but if EXO_BATCH_MAX_WAIT_S is set higher, this burns CPU. Consider using a proper async wait with timeout instead.
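
A blocking receive with a shrinking timeout avoids the polling entirely. A sketch using `queue.Queue` as a stand-in for the runner's channel (the real MpReceiver API may differ):

```python
import queue
import time

def collect_batch(q: "queue.Queue", max_size: int, max_wait_s: float) -> list:
    # Block for the first task, then wait only as long as the remaining
    # share of the batching window for each additional one.
    batch = [q.get()]
    deadline = time.perf_counter() + max_wait_s
    while len(batch) < max_size:
        remaining = deadline - time.perf_counter()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```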

4. Batch path skips token-by-token features
The batch path doesn't apply:

  • Thinking model detection/parsing
  • Kimi/GLM tokenizer patching
  • Tool call parsing
  • KV prefix cache

The batching criteria (enable_thinking=false, no tools, no stop sequences) correctly exclude cases that need these features, so this is safe. But worth noting that the batch path is a simplified code path.

5. completion_tokens counter removed from batch path
The single-request path maintained a completion_tokens counter with no visible use in the diff. If it is used elsewhere, the batch path should maintain it too.

6. No tests
No automated tests for batch generation. At minimum, test:

  • _is_batchable_text_task with various task params
  • _same_batch_settings comparison
  • mlx_batch_generate with mocked model

Verdict

Good performance improvement for high-throughput non-streaming use cases. The batching criteria are conservative and correct. Main concerns are the missing WouldBlock/receive_nowait implementation and the lack of tests. The busy-wait loop is acceptable at default settings but could be improved.

LGTM with the caveats above.

Fixes exo-explore#1020

Refs exo-explore#1019

# Conflicts:
#	src/exo/worker/runner/runner.py
@hechibing force-pushed the feat/inference-batching-1020 branch from b93d28f to b251ce7 on February 18, 2026 at 05:41
@hechibing
Author

Thanks for the detailed review. I rebased this branch on current main and pushed an update.

Addressed in this update:

  • Preserved cancellation-system integration from main (cancel_receiver, CANCEL_CURRENT_TASK, periodic cancel checks).
  • Fixed batch failure handling so all batched command_ids receive ErrorChunk (not only the first).
  • Hardened batching eligibility to avoid single-path post-processing regressions (exclude Kimi/GLM/GPT-OSS and the tool-parser path from batching).
  • Kept non-batch path behavior aligned with main and retained existing parsing/cancellation flow.
  • Cleaned _env_int to use standard os.environ.

Please re-review the latest commit range on PR #1474.

@exo-explore exo-explore deleted a comment from AlexCheema Feb 19, 2026
@rltakashige
Collaborator

rltakashige commented Feb 19, 2026

Hi @hechibing -- sorry for the review spam. There was a particular issue that caused these to get placed everywhere.

While I will review this tomorrow, I do believe how we will handle batching may get more complicated (especially considering prefix caching and making better use of pipeline parallelism).

(I can also resolve the merge conflicts while I'm at it if necessary)

Aside from the technical details, I don't believe we are running a bounty system anymore, unless you're linking to an old issue that used to have a bounty. Perhaps @AlexCheema can clarify this further.

@rltakashige
Collaborator

Going to close this in favour of the merged #1642



Development

Successfully merging this pull request may close these issues.

[FEATURE] Concurrent inference / continuous batching.
