[Bug] Timeout traceback from _send_req_sync

When a minion finishes executing a job and attempts to return the result to the master via _return_pub, the return can fail silently or cause the worker thread to hang until its full timeout expires.

# Root cause

  The minion uses a two-process architecture for master communication: worker threads fire events to the minion's main process, which forwards them to the master over the request channel and fires a return event back. There are three bugs in this path:

  1. Wrong exception type. _send_req_sync raised the bare Python builtin TimeoutError when no return event arrived, but _return_pub only catches SaltReqTimeoutError. The TimeoutError propagated uncaught through
     _thread_return, crashing the worker thread with an unhandled exception instead of logging a warning and returning gracefully.

  2. Worker has no way to know the main process already gave up. When req_channel.send() exhausted its retries (return_retry_tries, each with a random return_retry_timer delay), the main process silently returned without
     notifying the worker thread. The worker then had to wait out its full timeout (default 60 seconds) before it could give up, even though no return event would ever arrive.

  3. Timeout mismatch. The worker's wait timeout (_return_pub default: 60s) was set independently of the main process's total retry budget (return_retry_timer_max × return_retry_tries). With non-default configuration these
     can diverge, causing either premature worker timeouts or unnecessary waiting.
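
  To make the mismatch in item 3 concrete, here is a minimal sketch of the arithmetic. The helper names (retry_budget, describe_mismatch) are hypothetical and the numbers are illustrative only; they are not taken from Salt's code or asserted to be its defaults.

```python
def retry_budget(return_retry_timer_max: float, return_retry_tries: int) -> float:
    """Upper bound (seconds) on retry delays in the main process, per the issue's formula."""
    return return_retry_timer_max * return_retry_tries


def describe_mismatch(worker_timeout: float, timer_max: float, tries: int) -> str:
    """Classify how the worker's wait relates to the main process's retry budget."""
    budget = retry_budget(timer_max, tries)
    if worker_timeout < budget:
        # Worker can give up while the main process is still retrying.
        return "premature worker timeout"
    if worker_timeout > budget:
        # Worker keeps waiting after the main process has already given up.
        return "unnecessary waiting"
    return "aligned"


# Illustrative values only (hypothetical configuration).
print(describe_mismatch(60, 10, 3))  # budget 30s, worker waits 60s
print(describe_mismatch(60, 30, 5))  # budget 150s, worker waits 60s
```

  With the first configuration the worker outlasts the retry budget by 30 seconds; with the second it can time out while the main process is still retrying.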

# Observed symptom

  File "salt/minion.py", line N, in _send_req_sync
      raise TimeoutError("Request timed out")
  TimeoutError: Request timed out

  Worker thread exits with an unhandled exception; the job return is never delivered to the master.
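
  The traceback follows from the exception types being unrelated. A minimal sketch, assuming a locally defined stand-in for SaltReqTimeoutError so it runs without Salt installed; return_pub, run_worker and the two send functions are simplified, hypothetical stand-ins, not the real minion code:

```python
class SaltReqTimeoutError(Exception):
    """Local stand-in for Salt's SaltReqTimeoutError (assumption: not imported from Salt)."""


def send_req_sync_buggy():
    # Old behaviour: raises the Python builtin, which the handler below does not catch.
    raise TimeoutError("Request timed out")


def send_req_sync_fixed():
    # New behaviour: raises the type the existing handler expects.
    raise SaltReqTimeoutError("Request timed out")


def return_pub(send):
    """Simplified handler: only SaltReqTimeoutError is caught."""
    try:
        send()
    except SaltReqTimeoutError:
        return "warning logged"
    return "ok"


def run_worker(send):
    """Stand-in for the worker thread body; reports what escapes return_pub."""
    try:
        return return_pub(send)
    except TimeoutError as exc:
        return f"unhandled: {type(exc).__name__}"


print(run_worker(send_req_sync_buggy))
print(run_worker(send_req_sync_fixed))
```

  The builtin TimeoutError derives from OSError, so an except clause naming only SaltReqTimeoutError never matches it.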

# Fix

  • Fire an error sentinel event ({"ret": None, "error": "timeout"}) from the main process when req_channel.send() times out, so the worker wakes up immediately rather than waiting out its full timeout.
  • Replace the bare TimeoutError raise in _send_req_sync with SaltReqTimeoutError so _return_pub's existing handler catches it correctly.
  • Compute effective_timeout = max(timeout, return_retry_timer_max × return_retry_tries) in _return_pub so the worker always waits at least as long as the main process may spend retrying.
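
  The three changes can be sketched together as follows. This is an illustration under stated assumptions, not the actual patch: a queue.Queue stands in for the minion's event bus, SaltReqTimeoutError is a local stand-in class, and main_process, wait_for_return and failing_send are hypothetical names.

```python
import queue
import threading
import time


class SaltReqTimeoutError(Exception):
    """Local stand-in for Salt's exception (assumption: not imported from Salt)."""


def effective_timeout(timeout, return_retry_timer_max, return_retry_tries):
    # Wait at least as long as the main process may spend retrying.
    return max(timeout, return_retry_timer_max * return_retry_tries)


def main_process(events, send):
    """Forward the return; on timeout, fire the error sentinel instead of going silent."""
    try:
        ret = send()
    except SaltReqTimeoutError:
        events.put({"ret": None, "error": "timeout"})
    else:
        events.put({"ret": ret})


def wait_for_return(events, timeout):
    """Worker side: wake on any event and map the sentinel to SaltReqTimeoutError."""
    try:
        event = events.get(timeout=timeout)
    except queue.Empty:
        raise SaltReqTimeoutError("no return event arrived") from None
    if event.get("error") == "timeout":
        raise SaltReqTimeoutError("main process exhausted its retries")
    return event["ret"]


def failing_send():
    # Hypothetical send that has already used up all its retries.
    raise SaltReqTimeoutError("retries exhausted")


events = queue.Queue()
threading.Thread(target=main_process, args=(events, failing_send)).start()

start = time.monotonic()
try:
    wait_for_return(events, effective_timeout(60, 10, 3))
except SaltReqTimeoutError as exc:
    elapsed = time.monotonic() - start
    print(f"worker gave up after {elapsed:.2f}s: {exc}")
```

  In this sketch the worker returns as soon as the sentinel arrives rather than after the full 60 seconds, and the exception it raises is one the caller already handles.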

  Affected versions: 3006.x, 3008.x

  Issue: #69416
