When a minion finishes executing a job and attempts to return the result to the master via _return_pub, the return can fail silently or cause the worker thread to hang until its full timeout expires.
Root cause
The minion uses a two-process architecture for master communication: worker threads fire events to the minion's main process, which forwards them to the master over the request channel and fires a return event back. Three problems in this path:
- Wrong exception type. _send_req_sync raised the bare Python builtin TimeoutError when no return event arrived, but _return_pub only catches SaltReqTimeoutError. The TimeoutError propagated uncaught through _thread_return, crashing the worker thread with an unhandled exception instead of logging a warning and returning gracefully.
- Worker has no way to know the main process already gave up. When req_channel.send() exhausted its retries (return_retry_tries, each with a random return_retry_timer delay), the main process silently returned without notifying the worker thread. The worker then had to wait the full timeout (default 60 seconds) before it could give up, even though no return event would ever arrive.
- Timeout mismatch. The worker's wait timeout (_return_pub default: 60s) was set independently of the main process's total retry budget (return_retry_timer_max × return_retry_tries). With non-default configuration these can diverge, causing either premature worker timeouts or unnecessary waiting.
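The exception mismatch in the first item can be reproduced in isolation. The sketch below does not import Salt: the SaltReqTimeoutError class is a local stand-in, and send_req_sync_buggy, return_pub, and run_worker are hypothetical simplifications of _send_req_sync, _return_pub, and _thread_return, written only to show why the handler never fires.

```python
class SaltReqTimeoutError(Exception):
    """Local stand-in for Salt's SaltReqTimeoutError (assumption: it is
    not a subclass of the builtin TimeoutError)."""


def send_req_sync_buggy():
    # Old behaviour: the builtin TimeoutError is raised on timeout.
    raise TimeoutError("Request timed out")


def return_pub(send):
    # Only the Salt-specific exception is handled, so a builtin
    # TimeoutError passes straight through this try/except.
    try:
        send()
    except SaltReqTimeoutError:
        return "warning logged"
    return "ok"


def run_worker(send):
    # Simplified worker entry point; in the real minion nothing catches
    # the exception here, so the thread dies. It is caught in this
    # sketch only so the outcome can be inspected.
    try:
        return return_pub(send)
    except TimeoutError as exc:
        return f"unhandled in worker: {exc}"


outcome = run_worker(send_req_sync_buggy)
print(outcome)
```

Because the two exception types are unrelated in this sketch, the except clause in return_pub is skipped and the error surfaces one level up.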
Observed symptom
  File "salt/minion.py", line N, in _send_req_sync
    raise TimeoutError("Request timed out")
TimeoutError: Request timed out
Worker thread exits with an unhandled exception; the job return is never delivered to the master.
Fix
• Fire an error sentinel event ({"ret": None, "error": "timeout"}) from the main process when req_channel.send() times out, so the worker wakes up immediately rather than waiting out its full timeout.
• Replace the bare TimeoutError raise in _send_req_sync with SaltReqTimeoutError so _return_pub's existing handler catches it correctly.
• Compute effective_timeout = max(timeout, return_retry_timer_max × return_retry_tries) in _return_pub so the worker always waits at least as long as the main process may spend retrying.
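The three changes above can be sketched together as follows. This is an illustrative model rather than the patch itself: a queue.Queue stands in for the minion event bus, SaltReqTimeoutError is a local stand-in class, channel_send is a hypothetical callable playing the role of req_channel.send(), and the numeric values are examples rather than Salt defaults.

```python
import queue
import threading
import time


class SaltReqTimeoutError(Exception):
    """Local stand-in for Salt's SaltReqTimeoutError."""


def effective_timeout(timeout, return_retry_timer_max, return_retry_tries):
    # Wait at least as long as the main process may spend retrying.
    return max(timeout, return_retry_timer_max * return_retry_tries)


def main_process_send(events, channel_send):
    # On timeout, fire an error sentinel instead of returning silently.
    try:
        ret = channel_send()
    except SaltReqTimeoutError:
        events.put({"ret": None, "error": "timeout"})
    else:
        events.put({"ret": ret})


def send_req_sync(events, timeout):
    try:
        event = events.get(timeout=timeout)
    except queue.Empty:
        # Raise the Salt-specific type so the caller's handler matches.
        raise SaltReqTimeoutError("Request timed out") from None
    if event.get("error") == "timeout":
        raise SaltReqTimeoutError("Main process exhausted its retries")
    return event["ret"]


def return_pub(events, timeout, return_retry_timer_max, return_retry_tries):
    wait = effective_timeout(timeout, return_retry_timer_max, return_retry_tries)
    try:
        return send_req_sync(events, wait)
    except SaltReqTimeoutError as exc:
        return f"warning: {exc}"


def failing_send():
    # Simulates req_channel.send() after all retries have failed.
    raise SaltReqTimeoutError("retries exhausted")


events = queue.Queue()
threading.Thread(target=main_process_send, args=(events, failing_send)).start()
started = time.monotonic()
result = return_pub(events, timeout=60, return_retry_timer_max=30, return_retry_tries=3)
elapsed = time.monotonic() - started
print(result, f"(after {elapsed:.2f}s)")
```

With these example values the worker would be willing to wait 90 seconds (30 × 3 exceeds the 60-second timeout), but the sentinel wakes it almost immediately and the failure is reported as a warning instead of an unhandled exception.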
Affected versions: 3006.x, 3008.x