Problem
When send_to_queue hits a STREAM_TIMEOUT, it puts a TimeoutError on the queue and returns early. For the HuggingFace backend, this leaves the model.generate() worker thread running until natural completion, because the _cancel_event that drives the stopping criterion is never set.
The consumer path determines whether the leak occurs:
- stream_with_chunking (the high-level chunking orchestrator): calls await mot.cancel_generation(error=exc) when it catches the TimeoutError, which fires _cancel_hook → the thread stops. ✅
- avalue() / astream() (direct access): the TimeoutError is raised without triggering cancel_generation(). The HF worker thread continues generating into an orphaned AsyncTextIteratorStreamer until it finishes naturally. ❌
This wastes GPU/CPU and holds the thread for the remainder of the generation.
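The leak can be illustrated with a small stdlib-only simulation. This is a sketch, not the backend code: generate_worker and run_worker are hypothetical names, the loop stands in for model.generate(), and the threading.Event plays the role of _cancel_event. The worker stops early only when something sets the event after the consumer gives up.

```python
import threading
import time


def generate_worker(cancel_event: threading.Event, max_steps: int, produced: list) -> None:
    """Hypothetical stand-in for model.generate(): one 'token' per step."""
    for step in range(max_steps):
        # Plays the role of the stopping criterion that checks _cancel_event.
        if cancel_event.is_set():
            return
        produced.append(step)
        time.sleep(0.01)


def run_worker(cancel_after_timeout: bool, max_steps: int = 50) -> int:
    """Start the worker, simulate a consumer timeout, and report how many steps ran."""
    cancel_event = threading.Event()
    produced: list = []
    worker = threading.Thread(
        target=generate_worker, args=(cancel_event, max_steps, produced)
    )
    worker.start()
    time.sleep(0.05)  # the consumer "times out" here and stops reading
    if cancel_after_timeout:
        cancel_event.set()  # what cancel_generation() achieves via _cancel_hook
    worker.join()
    return len(produced)


if __name__ == "__main__":
    print("steps without cancel:", run_worker(cancel_after_timeout=False))
    print("steps with cancel:   ", run_worker(cancel_after_timeout=True))
```

Without the cancel call the worker always runs all max_steps, which corresponds to the direct avalue()/astream() path described above.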
Root cause
send_to_queue has no reference to the ModelOutputThunk or its _cancel_hook, so it cannot trigger cancellation. The hook is wired in the backend (output._cancel_hook = _cancel_event.set) but is only reachable via mot.cancel_generation().
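A minimal sketch of this wiring, under stated assumptions: ThunkSketch is a hypothetical stand-in for ModelOutputThunk that models only the cancel hook, and consume_with_cancel is a hypothetical helper that mirrors what stream_with_chunking is described as doing when it catches the TimeoutError. The real classes may differ in signature and behavior.

```python
import asyncio
import threading
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


class ThunkSketch:
    """Hypothetical stand-in for ModelOutputThunk; only the cancel wiring is modelled."""

    def __init__(self) -> None:
        self._cancel_hook: Optional[Callable[[], None]] = None

    async def cancel_generation(self, error: Optional[BaseException] = None) -> None:
        # In this sketch, this is the only path that reaches the hook.
        if self._cancel_hook is not None:
            self._cancel_hook()


async def consume_with_cancel(mot: ThunkSketch, awaitable: Awaitable[T]) -> T:
    """Await a result; if a TimeoutError surfaces, cancel generation, then re-raise."""
    try:
        return await awaitable
    except TimeoutError as exc:
        await mot.cancel_generation(error=exc)
        raise


async def _raise_timeout() -> str:
    # Simulates the TimeoutError that the consumer pulls off the queue.
    raise TimeoutError("stream timed out")


def demo() -> tuple:
    cancel_event = threading.Event()
    mot = ThunkSketch()
    mot._cancel_hook = cancel_event.set  # same shape as the backend wiring

    async def main() -> bool:
        try:
            await consume_with_cancel(mot, _raise_timeout())
        except TimeoutError:
            return True
        return False

    raised = asyncio.run(main())
    return raised, cancel_event.is_set()
```

In this sketch the consumer still sees the TimeoutError, and the event is set only because the consumer-side wrapper called cancel_generation(). A producer that has no handle on the thunk has no way to do the same.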
Impact
This is not a correctness bug for the consumer: the TimeoutError propagates correctly. It is a resource leak: after a timeout on a direct avalue()/astream() call, the worker thread and its GPU computation keep running for the remainder of the generation.
Possible approaches
- Thread a cancel callback into send_to_queue so it can fire on timeout.
- Ensure all timeout-raising paths call cancel_generation() before returning to the consumer.
- Route the avalue()/astream() paths through stream_with_chunking so the existing mitigation covers them.
Related
Identified during review of #1236 (inter-chunk stream timeout). The aclose() cleanup path in send_to_queue was also ineffective for this reason (fixed in #1236).