
Commit 781d1ed

Project Team committed
Fix streaming timeout: use httpx.Timeout to separate connect from read
llama3.2-vision encodes the image before emitting any tokens, so first-token latency on a T4 can be 30-90s under VRAM pressure. Passing a plain integer to ollama.Client applied that value as the httpx read timeout on every individual chunk, which fired during the image-encoding phase (before the first token) even though Ollama was working correctly. Use httpx.Timeout(timeout=<configured>, connect=10) so the read timeout covers the full inference window, while the connect timeout still fails fast if Ollama is unreachable.
1 parent: d4a71f6 · commit: 781d1ed
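
The difference between the two timeout shapes is easiest to see directly. Below is a minimal sketch of httpx's documented `Timeout` semantics; the 120-second value is an illustrative placeholder, not this project's configured timeout:

```python
import httpx

# A bare number fans out to all four httpx timeout classes, so connect,
# read, write, and pool each get the same value.
before = httpx.Timeout(120.0)
assert before.connect == before.read == 120.0

# Splitting them lets connect fail fast while read still covers the
# slow first token of a streaming response.
after = httpx.Timeout(timeout=120.0, connect=10.0)
assert after.connect == 10.0 and after.read == after.write == 120.0
```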

1 file changed: app/ocr_backends.py (17 additions, 4 deletions)
```diff
--- a/app/ocr_backends.py
+++ b/app/ocr_backends.py
@@ -83,13 +83,26 @@ def __init__(self, model: str = "llama3.2-vision", host: str = "http://localhost
         self._is_available = False
         self._availability_error = None
 
-        # Import ollama library and create a client with the configured timeout.
-        # The module-level ollama.chat() has no timeout parameter; the Client
-        # constructor forwards **kwargs to httpx.Client, which does.
+        # Build an httpx.Timeout that separates concerns:
+        #
+        #   connect=10   — fail fast if Ollama isn't reachable at all
+        #   read=timeout — how long to wait for the *first* streaming token.
+        #
+        # llama3.2-vision does substantial image-encoding work before it emits
+        # any tokens, so first-token latency on a T4 can be 30-90s depending on
+        # VRAM pressure. Using a plain integer timeout applies the same value to
+        # every chunk read, which fires prematurely on that initial encoding
+        # phase even though the model is working fine. By setting read= to the
+        # full configured timeout we preserve the ability to catch a genuinely
+        # hung Ollama while not cutting off a legitimately slow first token.
         try:
+            import httpx
             import ollama
             self.ollama = ollama
-            self._client = ollama.Client(host=host, timeout=timeout)
+            self._client = ollama.Client(
+                host=host,
+                timeout=httpx.Timeout(timeout=float(timeout), connect=10.0),
+            )
         except ImportError:
             self._is_available = False
             self._availability_error = "ollama Python library not installed. Install with: pip install ollama"
```
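
For context, here is a hedged sketch of how the streaming path interacts with the two timeouts. The `chat()` call shape follows the ollama library's streaming API, but the model name, prompt, image path, and the assumption that httpx's timeout exceptions propagate through `ollama.Client` are illustrative, not taken from this repository:

```python
import httpx
import ollama

client = ollama.Client(
    host="http://localhost:11434",  # Ollama's standard default host
    timeout=httpx.Timeout(timeout=120.0, connect=10.0),
)
try:
    stream = client.chat(
        model="llama3.2-vision",
        messages=[{
            "role": "user",
            "content": "Transcribe the text in this image.",
            "images": ["page.png"],  # placeholder image path
        }],
        stream=True,
    )
    # Each chunk is one streamed token batch; the read timeout applies to
    # the wait for each chunk, including the long image-encoding wait
    # before the first one.
    text = "".join(chunk["message"]["content"] for chunk in stream)
except httpx.ConnectTimeout:
    ...  # Ollama unreachable: fails within ~10s instead of the full window
except httpx.ReadTimeout:
    ...  # no chunk arrived within 120s: treat as a genuinely hung server
```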
