# Batch inference requests (MLX) #1474
Code Review — PR #1474: Batch inference requests (MLX)

CI: No checks (fork PR — CI doesn't run automatically)

### Overview

+338/-102 across 2 files. Adds opportunistic batching for non-streaming text generation using …

### Critical issues

**1. Blocker: PR will silently delete the cancellation system**

Branch is based on pre-cancellation main. The cancellation system (… Verified: current main has …

**2. Exception handler only notifies first task in batch**

```python
except Exception as e:
    event_sender.send(ChunkGenerated(command_id=command_id, ...))  # only first task!
    raise
```
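A minimal sketch of the suggested fix, notifying every task in the batch on failure. The helper and event names here are illustrative assumptions, not the PR's actual API:

```python
def run_batch(batch, event_sender, generate_fn):
    """Run one batched generation; on failure, notify every task.

    `batch` is a list of (command_id, request) pairs. All names here
    are illustrative sketches, not the PR's actual API.
    """
    try:
        return generate_fn([req for _, req in batch])
    except Exception as e:
        # Notify *every* waiting task, not just the first, so no
        # caller hangs forever after a failed batch.
        for command_id, _ in batch:
            event_sender.send(("error", command_id, str(e)))
        raise
```

The key point is the loop over the whole batch before re-raising: a batch-level failure is an event for every request that was folded into that batch.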
**3. Model-specific processing skipped in batch path**

The batch path bypasses … Fix: exclude these models from batching, or apply post-processing to batch outputs.

### Performance concerns

**4. KV prefix cache completely bypassed** — …

**5. No cancellation support in batch path** — Even after rebase, the batch generation loop has no mechanism to check for cancellation. A long batch generation can't be interrupted.

### Minor
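The missing cancellation support (issue 5 above) could be addressed with a per-step check inside the batched decode loop. A sketch under assumed names — `step_fn` and `cancel_events` are hypothetical, not the PR's real API:

```python
import threading

def batched_decode(step_fn, max_tokens, cancel_events):
    """Hypothetical batched decode loop that polls cancellation between steps.

    `step_fn()` produces one decode step's output; `cancel_events` maps
    command_id -> threading.Event. Names are illustrative, not the PR's API.
    """
    outputs = []
    for _ in range(max_tokens):
        # Checking between steps lets a long batch generation be
        # interrupted instead of always running to completion.
        if any(ev.is_set() for ev in cancel_events.values()):
            break
        outputs.append(step_fn())
    return outputs
```

A finer-grained variant would drop only the cancelled sequence from the batch rather than stopping the whole batch, at the cost of reshaping the KV cache mid-generation.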
### What's good
### Verdict

Do not merge. Must rebase onto current main (cancellation system), fix batch error handling (notify all tasks on failure), and address model-specific processing gaps. The batching concept is sound and the …
---

Code Review: PR #1474 — Batch inference requests (MLX)

### Summary

Adds opportunistic batching for compatible non-streaming text generation requests using …

### Review

**Batching criteria (safe subset):**
These constraints are conservative and correct — only batch requests that produce equivalent sampling behavior.
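The criteria list itself didn't survive extraction above, but a conservative compatibility check along these lines would batch only requests with equivalent sampling behavior. The field names below are assumptions, not the PR's actual request schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GenRequest:
    # Hypothetical request fields; the PR's real criteria may differ.
    model: str
    temperature: float
    top_p: float
    max_tokens: int
    stream: bool

def batch_key(req: GenRequest):
    """Two requests may share a batch only if their keys are equal:
    same model and sampling params, and neither is streaming."""
    if req.stream:
        return None  # streaming requests are never batched
    return (req.model, req.temperature, req.top_p, req.max_tokens)
```

Grouping by an explicit key like this keeps the "safe subset" auditable: any new request field that affects sampling must be added to the key before its requests can be co-batched.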
**Runner batching logic:**
### Issues

**1.** `from exo.utils.channels import MpReceiver, MpSender, WouldBlock`

Is …

**2.** …

**3. Busy-wait loop in batching**

```python
while len(batch) < batch_max_size:
    if time.perf_counter() - start >= batch_max_wait_s:
        break
    try:
        nxt = tasks.receive_nowait()
    except WouldBlock:
        time.sleep(0.0005)
        continue
```

This is a busy-wait with 0.5 ms sleeps. For a 5 ms max wait, that's ~10 iterations. Fine for the default config, but if …

**4. Batch path skips token-by-token features**
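The polling loop flagged in issue 3 could instead block with a deadline. A sketch using a plain `queue.Queue` as a stand-in for the PR's `MpReceiver` channel (which may not expose a timed receive — this is an assumption):

```python
import queue
import time

def collect_batch(tasks, batch_max_size, batch_max_wait_s):
    """Blocking alternative to the 0.5 ms polling loop.

    Each receive blocks until an item arrives or the remaining deadline
    expires, so no CPU is burned spinning. `queue.Queue` stands in for
    the PR's channel type; all names are illustrative.
    """
    batch = [tasks.get()]  # block for the first item; a batch needs at least one
    deadline = time.perf_counter() + batch_max_wait_s
    while len(batch) < batch_max_size:
        remaining = deadline - time.perf_counter()
        if remaining <= 0:
            break
        try:
            batch.append(tasks.get(timeout=remaining))
        except queue.Empty:
            break  # deadline hit while waiting
    return batch
```

With a blocking receive, raising `batch_max_wait_s` no longer multiplies wakeups; the worker sleeps in the kernel until work or the deadline arrives.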
The batching criteria (…

**5.** …

**6. No tests**
### Verdict

Good performance improvement for high-throughput non-streaming use cases. The batching criteria are conservative and correct. Main concerns are the missing …

LGTM with the caveats above.
---

force-pushed the branch from b93d28f to b251ce7
Thanks for the detailed review. I rebased this branch on current main and pushed an update. Addressed in this update:
Please re-review the latest commit range on PR #1474.
Hi @hechibing -- sorry for the review spam. There was a particular issue that caused these to get placed everywhere. While I will review this tomorrow, I do believe how we handle batching may get more complicated (especially considering prefix caching and making better use of pipeline parallelism). (I can also resolve the merge conflicts while I'm at it if necessary.) Aside from the technical details, I don't believe we are running a bounty system anymore, unless you're linking to an old issue that used to have a bounty. Perhaps @AlexCheema can clarify this further.
Going to close this in favour of the merged #1642 |
Fixes #1020
Refs #1019
Bounty: $200
**What changed:**

**Batching rules (safe subset):**

**Tuning:**

**Verification:**