Background
When a model has a `timeout` configured, it currently acts as a request timeout — i.e., it only covers the time until response headers are received. Once the SSE stream is established, individual chunk delivery is unbounded.
A separate, configurable per-chunk (or inter-token) timeout is needed to detect stalled upstream streams after the response headers have already been received.
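The idle-timeout idea can be sketched with std channels standing in for the SSE stream. This is a minimal illustration, not code from the repo: the names (`drain_with_timeouts`, `ChunkError`) are hypothetical, and the separate, more generous first-chunk allowance reflects the consideration below about providers that queue computation after returning headers.

```rust
use std::sync::mpsc;
use std::time::Duration;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum ChunkError {
    IdleTimeout,
}

/// Drain chunks with a per-chunk idle timeout. The first chunk gets its own
/// (typically larger) budget, since some providers return headers immediately
/// but queue the actual computation.
fn drain_with_timeouts(
    rx: mpsc::Receiver<String>,
    first_chunk: Duration, // generous: time until the first token arrives
    idle: Duration,        // tighter: maximum gap between subsequent chunks
) -> Result<Vec<String>, ChunkError> {
    let mut chunks = Vec::new();
    let mut budget = first_chunk;
    loop {
        match rx.recv_timeout(budget) {
            Ok(chunk) => {
                chunks.push(chunk);
                // After the first chunk, switch to the inter-chunk budget.
                budget = idle;
            }
            // Sender dropped: upstream closed the stream normally.
            Err(mpsc::RecvTimeoutError::Disconnected) => return Ok(chunks),
            // No chunk within the budget: the upstream stream has stalled.
            Err(mpsc::RecvTimeoutError::Timeout) => return Err(ChunkError::IdleTimeout),
        }
    }
}
```

In the real handlers this would presumably wrap each `next()` on the SSE stream (e.g. via `tokio::time::timeout`) rather than a blocking channel, but the deadline-reset-per-chunk logic is the same.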
Considerations
- Some providers under high load may return response headers immediately but queue the actual LLM computation, meaning the first token could arrive much later. An overly aggressive chunk timeout could hurt user experience and waste upstream cost.
- The timeout should be configurable independently from the request timeout, and should likely apply between chunks (idle timeout) rather than as an absolute deadline from stream open.
- On chunk timeout expiry, the handler should emit a structured Anthropic-style `error` SSE event and terminate the stream cleanly, ensuring `usage_rx` is closed and `request_ctx`/`span_ctx` are finalized.
- Both `src/proxy/handlers/chat_completions/mod.rs` and `src/proxy/handlers/messages/mod.rs` are affected.
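For reference, the terminal frame on timeout could look like the sketch below. The `event: error` / `data: {"type":"error","error":{...}}` shape follows Anthropic's documented streaming error event; the specific error type string and the function name are assumptions, not the actual implementation:

```rust
/// Hypothetical helper: format the Anthropic-style error SSE frame emitted
/// when the per-chunk timeout fires. The "api_error" type is an assumption;
/// the handler may pick a different error type.
fn chunk_timeout_sse_event(idle_secs: u64) -> String {
    format!(
        "event: error\n\
         data: {{\"type\":\"error\",\"error\":{{\"type\":\"api_error\",\
         \"message\":\"upstream stream stalled: no chunk received for {}s\"}}}}\n\n",
        idle_secs
    )
}
```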
References
/cc @bzp2010