Three related places where mellea has no bound on a resource. Prior reports: #650 (streaming hang confirmed in the wild), #591 (missing token cap caused flaky tests). Related: #1164 (structured cancellation for Ollama).
Streams can hang forever
send_to_queue in mellea/helpers/async_helpers.py feeds all backend streaming responses into an asyncio.Queue:
    async for item in aresponse:  # no timeout between chunks
        await aqueue.put(item)
If the backend stalls — server stops sending without closing the connection — this loop blocks forever. The sentinel never reaches the queue, so every downstream consumer blocks with it (core/base.py:623, stdlib/streaming.py:294,361, stdlib/sampling/base.py:336). All four are the same root cause. This has been seen in practice (#650).
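One possible shape for a fix is to bound the wait for each chunk rather than the stream as a whole. The sketch below is not mellea's code: the function name, the `chunk_timeout` parameter, and the `END_OF_STREAM` sentinel value are all assumptions made for illustration. It wraps each `__anext__()` call in `asyncio.wait_for` and, on a stall, puts the timeout exception on the queue so consumers wake up instead of blocking.

```python
import asyncio
from collections.abc import AsyncIterable
from typing import Any

# Hypothetical end-of-stream marker; the sentinel the real code uses may differ.
END_OF_STREAM = None


async def send_to_queue_with_timeout(
    aresponse: AsyncIterable[Any],
    aqueue: asyncio.Queue,
    chunk_timeout: float,
) -> None:
    """Forward chunks to the queue, giving up if one chunk takes too long.

    chunk_timeout bounds the gap between consecutive chunks, not the
    total duration of the stream.
    """
    iterator = aresponse.__aiter__()
    while True:
        try:
            item = await asyncio.wait_for(iterator.__anext__(), timeout=chunk_timeout)
        except StopAsyncIteration:
            # Normal end of stream: signal consumers that no more items follow.
            await aqueue.put(END_OF_STREAM)
            return
        except asyncio.TimeoutError as exc:
            # Backend stalled: hand the error to consumers so they can raise it.
            await aqueue.put(exc)
            return
        await aqueue.put(item)
```

Under this sketch, a consumer would need to check each dequeued item for an exception instance before treating it as a chunk. Whether a stall should surface as an exception or as a truncated-but-complete result is a design choice this sketch does not settle.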
Unbounded download in retriever fixture
mellea/formatters/granite/retrievers/util.py downloads numbered parquet files until it receives a 404, with no guard on how many parts it will fetch. The corpus allowlist (4 names, hardcoded repo) limits exposure, and the largest corpus currently has 20 parts, but nothing catches a malformed or unexpectedly grown dataset.
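A count guard for this loop could look roughly like the following. This is a sketch under stated assumptions, not the code in util.py: `fetch_part` is a hypothetical callable that returns `None` where the real code would see a 404, parts are assumed to be numbered from 0, and the `MAX_PARTS` value is a placeholder chosen only to sit well above the current 20-part maximum.

```python
from collections.abc import Callable
from itertools import count
from typing import Optional

# Hypothetical cap; the right value is a judgment call for the maintainers.
MAX_PARTS = 100


def download_parts(
    fetch_part: Callable[[int], Optional[bytes]],
    max_parts: int = MAX_PARTS,
) -> list[bytes]:
    """Collect numbered parts until fetch_part returns None (standing in for a 404).

    Raises RuntimeError if more than max_parts parts exist, so a malformed
    or unexpectedly large dataset fails loudly instead of downloading forever.
    """
    parts: list[bytes] = []
    for index in count():
        payload = fetch_part(index)
        if payload is None:
            return parts
        if len(parts) >= max_parts:
            raise RuntimeError(
                f"corpus has more than {max_parts} parts; refusing to continue"
            )
        parts.append(payload)
    raise AssertionError("unreachable: count() never ends")
```

A corpus with exactly `max_parts` parts still succeeds here, because the cap is only checked once a part beyond the limit has actually been found.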
MAX_NEW_TOKENS backend defaults vary widely
Backend defaults for MAX_NEW_TOKENS vary widely and some are very low — vLLM defaults to 16 tokens, which silently truncates most real responses (#591). Callers need to set ModelOption.MAX_NEW_TOKENS explicitly but there is nothing in the library to make that expectation clear.
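One way to make the expectation visible would be to warn whenever a request reaches a backend without an explicit cap and to fill in a library-level fallback. The sketch below only illustrates that idea: `with_token_cap`, the plain string key standing in for `ModelOption.MAX_NEW_TOKENS`, and the fallback value of 1024 are all assumptions, not existing mellea names or agreed defaults.

```python
import warnings
from typing import Any, Optional

# Stand-in for ModelOption.MAX_NEW_TOKENS so the sketch stays self-contained.
MAX_NEW_TOKENS_KEY = "max_new_tokens"
# Hypothetical library-wide fallback; the actual value would need discussion.
FALLBACK_MAX_NEW_TOKENS = 1024


def with_token_cap(
    model_options: Optional[dict[str, Any]],
    backend_name: str,
) -> dict[str, Any]:
    """Return a copy of model_options that always carries a token cap.

    Warns when the caller did not set one, since the backend's own default
    may be far lower than expected.
    """
    options = dict(model_options or {})
    if MAX_NEW_TOKENS_KEY not in options:
        warnings.warn(
            f"{MAX_NEW_TOKENS_KEY} not set for backend {backend_name!r}; "
            f"using fallback of {FALLBACK_MAX_NEW_TOKENS}",
            stacklevel=2,
        )
        options[MAX_NEW_TOKENS_KEY] = FALLBACK_MAX_NEW_TOKENS
    return options
```

An explicit caller value always wins, and the input dictionary is never mutated. A warning alone, without the fallback, would be a less invasive variant of the same idea.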
What is deliberately unbounded (not a bug)
_chunk_queue and _event_queue in stdlib/streaming.py are unbounded by design — the inline comment notes that consumption is opt-in. Not proposing a change there.