Skip to content

rec: add watchdog to detect tlog_rec_transfer spin loop after child exit#378

Open
uncaney wants to merge 1 commit into
Scribery:mainfrom
uncaney:fix/rec-spin-loop-watchdog
Open

rec: add watchdog to detect tlog_rec_transfer spin loop after child exit#378
uncaney wants to merge 1 commit into
Scribery:mainfrom
uncaney:fix/rec-spin-loop-watchdog

Conversation

@uncaney
Copy link
Copy Markdown

@uncaney uncaney commented May 6, 2026

Summary

When the wrapped child has exited but buffered packets remain in the position state machine, tlog_rec_transfer (lib/tlog/rec.c) can spin indefinitely in tlog_pkt_pos_cmp / tlog_sink_write without making any syscalls.

This patch adds a 30-second watchdog: if tlog_rec_child_exited is true AND no successful read/write has happened for 30s, the loop breaks with ETIMEDOUT and a clear error message. 21 lines added, 0 removed.

Reproduction (production)

Observed on Ubuntu 24.04 (tlog 14, package libtlog0):

  • A tlog-recorded interactive shell ran a pipeline like find / -path "*forgejo*" 2>/dev/null | head -5
  • After head exits early and find gets SIGPIPE, the wrapped tlog-rec-session was left burning 99.6% CPU forever.
  • strace -c -p $PID over 5 seconds: zero syscalls. Pure user-space compute loop.
  • perf top -p $PID: 99% in tlog_rectlog_sink_writetlog_json_chunk_writetlog_pkt_pos_cmp.
  • /proc/$PID/stat: state=R, utime/etime ≈ 1.0 (one full core, sustained).
  • Multiple instances accumulated 22h+ of CPU waste before being killed by an external watchdog.

Root cause analysis

In the existing main loop, the only break paths after the child exits dont fire when:

  • The child has exited
  • BUT tlog_pkt_is_eof(&pkt) is false (buffered data remains)
  • AND grc == TLOG_RC_OK (writes succeed but dont advance tty_pos past log_pos)

Result: infinite loop between two near-done states.

This watchdog prevents the symptom without behavior change for healthy sessions.

Testing

  • Built cleanly on tlog 14 / Ubuntu 24.04.
  • Watchdog string verified in compiled lib.
  • Smoke test passes on healthy sessions.
  • Reproducing the spin in a controlled test rig is non-trivial (requires specific TTY + buffer pattern). External tlog-orphan-killer systemd timer catches runaways in 5 min as defense-in-depth meanwhile.

30-second threshold is conservative; happy to make configurable.

When the wrapped child has exited but buffered packets remain in the
position state machine, the main loop in tlog_rec_transfer can spin
indefinitely in tlog_pkt_pos_cmp / tlog_sink_write without making any
syscalls. The kernel sees pure compute and cannot intervene.

Reproduced in production on Ubuntu 24.04 (tlog 14): a tlog-recorded
shell running `find / | head -5` left orphan tlog-rec-session processes
burning 99.6% CPU forever (verified via strace -c showing 0 syscalls
over 5 seconds, perf top showing tlog_pkt_pos_cmp on top).

This patch adds a 30-second watchdog inside the loop:
- watchdog_last_progress is updated after each successful sink_write
  or source_read.
- If tlog_rec_child_exited && time(NULL) - watchdog_last_progress > 30,
  the loop breaks with ETIMEDOUT and an error message.

No behavior change for healthy sessions — the watchdog only fires when
the bug condition is hit. Tested via build + smoke runs on a real
session.

The deeper bug (why tlog_pkt_pos_cmp never reaches a terminal state
when there are buffered packets after EOF on input) is separate and
would deserve its own investigation. This patch is a defense-in-depth
measure that prevents the visible symptom (CPU burn forever).

Signed-off-by: Frédéric Chauvat <frederic@chauvat.com>
Signed-off-by: Frédéric Chauvat (uncaney) <frederic@chauvat.com>
@justin-stephenson
Copy link
Copy Markdown
Collaborator

Thank you for the PR @uncaney - Can you please rebase this PR on top of master? I just now pushed a commit to fix the rpmbuild failure in Integration tests / fedora-build PRCI

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants