
Batch inference requests (MLX) #1474

Closed
hechibing wants to merge 2 commits into exo-explore:main from hechibing:feat/inference-batching-1020

Conversation

@hechibing

Fixes #1020
Refs #1019

Bounty: $200

What changed:

  • Add MLX batch generation helper (uses mlx-lm BatchGenerator)
  • Runner now opportunistically batches compatible non-stream chat requests that arrive close together, then multiplexes per-token chunks back to each request

Batching rules (safe subset):

  • stream=false
  • no tools / tool calling
  • no logprobs
  • no stop sequences
  • enable_thinking=false
  • identical temperature/top_p/top_k/seed/model across the batch
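
A minimal sketch of what these rules could look like in code. The task shape and field names below are hypothetical stand-ins, not exo's actual TextGeneration API:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical task shape for illustration; exo's real task type differs.
@dataclass
class ChatTask:
    model: str
    stream: bool = False
    tools: Optional[list] = None
    logprobs: bool = False
    stop: Optional[list] = None
    enable_thinking: bool = False
    temperature: float = 0.0
    top_p: float = 1.0
    top_k: int = 0
    seed: Optional[int] = None

def is_batchable(task: ChatTask) -> bool:
    # Only the conservative subset above is eligible for batching.
    return not (task.stream or task.tools or task.logprobs
                or task.stop or task.enable_thinking)

def same_batch_settings(a: ChatTask, b: ChatTask) -> bool:
    # Sampling settings and model must match across the whole batch.
    return (a.model, a.temperature, a.top_p, a.top_k, a.seed) == \
           (b.model, b.temperature, b.top_p, b.top_k, b.seed)
```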

Tuning:

  • EXO_BATCH_MAX_SIZE (default: 4)
  • EXO_BATCH_MAX_WAIT_S (default: 0.005)
  • EXO_BATCH_COMPLETION_SIZE (default: 8)
  • EXO_BATCH_PREFILL_SIZE (default: 8)

Verification:

  • python -m py_compile src/exo/worker/runner/runner.py src/exo/worker/engines/mlx/generator/generate.py

@AlexCheema
Contributor

Code Review — PR #1474: Batch inference requests (MLX)

CI: No checks (fork PR — CI doesn't run automatically)

Overview

+338/-102 across 2 files. Adds opportunistic batching for non-streaming text generation using mlx_lm's BatchGenerator. When multiple compatible requests arrive within a short window (default 5ms), they're batched together. Batchability checks correctly exclude streaming, tools, logprobs, stop sequences, and thinking mode.

Critical issues

1. Blocker: PR will silently delete the cancellation system

Branch is based on pre-cancellation main. The cancellation system (cancel_receiver, cancelled_tasks, check_for_cancel_every, CANCEL_CURRENT_TASK) was merged in PR #1276 (commit 2759e92) after this PR was authored. Current main has ~15 lines of cancellation logic in runner.py — none of which exist in this PR's branch. Merging will silently remove all cancellation support.

Verified: current main has cancel_receiver referenced 7+ times in runner.py; this PR has zero references.

2. Exception handler only notifies first task in batch

except Exception as e:
    event_sender.send(ChunkGenerated(command_id=command_id, ...))  # only first task!
    raise

command_id comes from the case TextGeneration(command_id=command_id) match — the first task. Other batch members get no error notification and hang forever. The raise then kills the runner process, affecting everything.
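
One way to fix this is to fan the error out to every batched request before re-raising. A sketch with a stand-in sender — `ListSender` and the event-dict shape are illustrative, not exo's real event types:

```python
class ListSender:
    """Minimal stand-in for the runner's event sender."""
    def __init__(self):
        self.sent = []
    def send(self, event):
        self.sent.append(event)

def notify_batch_failure(sender, command_ids, exc):
    # Every batched request gets an error event, so no client hangs
    # waiting for a chunk that will never arrive.
    for cid in command_ids:
        sender.send({"command_id": cid, "error": repr(exc)})

sender = ListSender()
notify_batch_failure(sender, ["req-1", "req-2", "req-3"], RuntimeError("oom"))
# All three requests are notified, not just the first.
```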

3. Model-specific processing skipped in batch path

The batch path bypasses filter_kimi_tokens, patch_kimi_tokenizer, patch_glm_tokenizer, and parse_gpt_oss. These are NOT gated behind enable_thinking or tools, so a Kimi model with enable_thinking=false and no tools passes _is_batchable_text_task but gets raw unfiltered output.

Fix: exclude these models from batching, or apply post-processing to batch outputs.

Performance concerns

4. KV prefix cache completely bypassed
mlx_batch_generate() creates a fresh BatchGenerator each time, ignoring kv_prefix_cache. This is a performance regression for the common case of repeated conversations with shared system prompts.

5. No cancellation support in batch path
Even after rebase, the batch generation loop has no mechanism to check for cancellation. A long batch generation can't be interrupted.

Minor

  • _env_int uses __import__("os") when os is a standard import — unnecessary indirection
  • Single-token decode (tokenizer.decode([token_id])) may produce garbled text for multi-byte characters; stream_generate handles this properly
  • Same seed for all batch members — identical prompts in the same batch produce identical output (correct per semantics but may surprise users)
  • pending.pop(0) is O(n) — use collections.deque (minor, batch is small)
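
Two of these points can be sketched quickly: `deque.popleft()` is O(1) where `list.pop(0)` is O(n), and a multi-byte UTF-8 character can span several tokens, so per-token decoding needs byte buffering. This is an illustrative approach, not how stream_generate actually implements it:

```python
from collections import deque

# O(1) pops from the front, unlike list.pop(0).
pending = deque(["task-1", "task-2"])
first = pending.popleft()

# Buffer raw bytes and emit only once they decode cleanly. Here the
# character "你" (three UTF-8 bytes) is split across three pretend tokens.
buf, out = b"", []
for token_bytes in [b"\xe4", b"\xbd", b"\xa0", b"ok"]:
    buf += token_bytes
    try:
        out.append(buf.decode("utf-8"))
        buf = b""
    except UnicodeDecodeError:
        continue  # incomplete character: wait for more bytes
```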

What's good

  • Batchability checks are conservative and correct
  • Non-matching tasks properly queued in pending — no task loss
  • running_notified set prevents duplicate status events
  • BatchGenerator integration is clean
  • Configurable via environment variables with sensible defaults

Verdict

Do not merge. Must rebase onto current main (cancellation system), fix batch error handling (notify all tasks on failure), and address model-specific processing gaps. The batching concept is sound and the BatchGenerator integration is well-implemented — this needs a second pass, not a redesign.

@AlexCheema
Contributor

Code Review: PR #1474 — Batch inference requests (MLX)

Summary

Adds opportunistic batching for compatible non-streaming text generation requests using mlx-lm's BatchGenerator. Requests that arrive within a configurable window are batched together and results are multiplexed back.

Review

Batching criteria (safe subset):

  • stream=false ✅ (streaming requires per-request event flow)
  • No tools/tool calling ✅
  • No logprobs ✅
  • No stop sequences ✅
  • enable_thinking=false ✅
  • Same temperature/top_p/top_k/seed/model ✅

These constraints are conservative and correct — only batch requests that produce equivalent sampling behavior.

mlx_batch_generate function:

  • Uses mlx-lm's BatchGenerator for efficient batched inference ✅
  • Returns (index, GenerationResponse) tuples to multiplex results ✅
  • Env-configurable batch sizes (EXO_BATCH_COMPLETION_SIZE, EXO_BATCH_PREFILL_SIZE) ✅
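
The multiplexing idea can be sketched independently of mlx-lm: a batched decode loop yields (batch_index, token) pairs, and the caller routes each token back to its originating request. `fake_batch_generate` below is a mock, not BatchGenerator's real API:

```python
def fake_batch_generate(prompts):
    # Mock of a batched decode loop: one "token" per live sequence per step.
    token_lists = [p.split() for p in prompts]
    for step in range(max(len(toks) for toks in token_lists)):
        for i, toks in enumerate(token_lists):
            if step < len(toks):
                yield i, toks[step]

def multiplex(prompts):
    # Route each (index, token) pair back to the request it belongs to.
    results = [[] for _ in prompts]
    for i, tok in fake_batch_generate(prompts):
        results[i].append(tok)
    return [" ".join(r) for r in results]
```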

Runner batching logic:

  • Waits up to EXO_BATCH_MAX_WAIT_S (5ms default) for more requests ✅
  • Incompatible requests go to a pending list for sequential processing ✅
  • batch_max_size capped at 4 by default ✅

Issues

1. WouldBlock import from channels

from exo.utils.channels import MpReceiver, MpSender, WouldBlock

Is WouldBlock a new exception class? It's not in the diff. If it doesn't exist in the current codebase, this will be an import error.

2. tasks.receive_nowait() — is this a valid API?

The code calls tasks.receive_nowait() on what appears to be an MpReceiver iterator. The channel abstraction may not expose receive_nowait() through the iterator protocol. Need to verify this method exists.

3. Busy-wait loop in batching

while len(batch) < batch_max_size:
    if time.perf_counter() - start >= batch_max_wait_s:
        break
    try:
        nxt = tasks.receive_nowait()
    except WouldBlock:
        time.sleep(0.0005)
        continue

This is a busy-wait with 0.5ms sleeps. For 5ms max wait, that's ~10 iterations. Fine for the default config, but if EXO_BATCH_MAX_WAIT_S is set higher, this burns CPU. Consider using a proper async wait with timeout instead.
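
A blocking receive with a shrinking timeout avoids the polling entirely. A sketch using `queue.Queue` as a stand-in for the runner's channel (the real MpReceiver API may differ):

```python
import queue
import time

def collect_batch(q: "queue.Queue", max_size: int, max_wait_s: float) -> list:
    # Block for the first task, then wait only as long as the remaining
    # share of the batching window for each additional one.
    batch = [q.get()]
    deadline = time.perf_counter() + max_wait_s
    while len(batch) < max_size:
        remaining = deadline - time.perf_counter()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```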

4. Batch path skips token-by-token features
The batch path doesn't apply:

  • Thinking model detection/parsing
  • Kimi/GLM tokenizer patching
  • Tool call parsing
  • KV prefix cache

The batching criteria (enable_thinking=false, no tools, no stop sequences) correctly exclude cases that need these features, so this is safe. But worth noting that the batch path is a simplified code path.

5. completion_tokens counter removed from batch path
The single-request path maintained a completion_tokens counter with no visible use in the diff. If it is used elsewhere, the batch path should maintain it too.

6. No tests
No automated tests for batch generation. At minimum, test:

  • _is_batchable_text_task with various task params
  • _same_batch_settings comparison
  • mlx_batch_generate with mocked model

Verdict

Good performance improvement for high-throughput non-streaming use cases. The batching criteria are conservative and correct. Main concerns are the missing WouldBlock/receive_nowait implementation and the lack of tests. The busy-wait loop is acceptable at default settings but could be improved.

LGTM with the caveats above.

Fixes exo-explore#1020

Refs exo-explore#1019

# Conflicts:
#	src/exo/worker/runner/runner.py
@hechibing force-pushed the feat/inference-batching-1020 branch from b93d28f to b251ce7 on February 18, 2026 at 05:41
@hechibing
Author

Thanks for the detailed review. I rebased this branch on current main and pushed an update.

Addressed in this update:

  • Preserved cancellation-system integration from main (cancel_receiver, CANCEL_CURRENT_TASK, periodic cancel checks).
  • Fixed batch failure handling so all batched command_ids receive ErrorChunk (not only the first).
  • Hardened batching eligibility to avoid single-path post-processing regressions (exclude Kimi/GLM/GPT-OSS and the tool-parser path from batching).
  • Kept non-batch path behavior aligned with main and retained existing parsing/cancellation flow.
  • Cleaned _env_int to use standard os.environ.

Please re-review the latest commit range on PR #1474.

@exo-explore exo-explore deleted a comment from AlexCheema Feb 19, 2026
@rltakashige
Collaborator

rltakashige commented Feb 19, 2026

Hi @hechibing -- sorry for the review spam. There was a particular issue that caused these to get placed everywhere.

While I will review this tomorrow, I do believe how we will handle batching may get more complicated (especially considering prefix caching and making better use of pipeline parallelism).

(I can also resolve the merge conflicts while I'm at it if necessary)

Aside from the technical details, I don't believe we are running a bounty system anymore, unless you're linking to an old issue that used to have a bounty. Perhaps @AlexCheema can clarify this further.

@rltakashige
Collaborator

Going to close this in favour of the merged #1642



Development

Successfully merging this pull request may close these issues.

[FEATURE] Concurrent inference / continuous batching.
