rec: add watchdog to detect tlog_rec_transfer spin loop after child exit #378
Open
uncaney wants to merge 1 commit into
When the wrapped child has exited but buffered packets remain in the position state machine, the main loop in tlog_rec_transfer can spin indefinitely in tlog_pkt_pos_cmp / tlog_sink_write without making any syscalls. The kernel sees pure compute and cannot intervene.

Reproduced in production on Ubuntu 24.04 (tlog 14): a tlog-recorded shell running `find / | head -5` left orphan tlog-rec-session processes burning 99.6% CPU forever (verified via strace -c showing 0 syscalls over 5 seconds, perf top showing tlog_pkt_pos_cmp on top).

This patch adds a 30-second watchdog inside the loop:
- watchdog_last_progress is updated after each successful sink_write or source_read.
- If tlog_rec_child_exited && time(NULL) - watchdog_last_progress > 30, the loop breaks with ETIMEDOUT and an error message.

No behavior change for healthy sessions: the watchdog only fires when the bug condition is hit.

Tested via build + smoke runs on a real session.

The deeper bug (why tlog_pkt_pos_cmp never reaches a terminal state when there are buffered packets after EOF on input) is separate and deserves its own investigation. This patch is a defense-in-depth measure that prevents the visible symptom (CPU burn forever).

Signed-off-by: Frédéric Chauvat <frederic@chauvat.com>
Signed-off-by: Frédéric Chauvat (uncaney) <frederic@chauvat.com>
Collaborator
Thank you for the PR, @uncaney. Can you please rebase this PR on top of master? I just now pushed a commit to fix the rpmbuild failure in
Summary
When the wrapped child has exited but buffered packets remain in the position state machine, tlog_rec_transfer (lib/tlog/rec.c) can spin indefinitely in tlog_pkt_pos_cmp / tlog_sink_write without making any syscalls.

This patch adds a 30-second watchdog: if tlog_rec_child_exited is true AND no successful read/write has happened for 30s, the loop breaks with ETIMEDOUT and a clear error message. 21 lines added, 0 removed.

Reproduction (production)
Observed on Ubuntu 24.04 (tlog 14, package libtlog0):

    find / -path "*forgejo*" 2>/dev/null | head -5

After head exits early and find gets SIGPIPE, the wrapped tlog-rec-session was left burning 99.6% CPU forever.
- strace -c -p $PID over 5 seconds: zero syscalls. Pure user-space compute loop.
- perf top -p $PID: 99% in tlog_rec → tlog_sink_write → tlog_json_chunk_write → tlog_pkt_pos_cmp.
- /proc/$PID/stat: state=R, utime/etime ≈ 1.0 (one full core, sustained).

Root cause analysis
In the existing main loop, the only break paths after the child exits don't fire when:
- tlog_pkt_is_eof(&pkt) is false (buffered data remains)
- grc == TLOG_RC_OK (writes succeed but don't advance tty_pos past log_pos)

Result: infinite loop between two near-done states.

This watchdog prevents the symptom without behavior change for healthy sessions.
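For readers who want the watchdog condition in isolation, here is a minimal C sketch. It is not the code from this patch and does not use tlog's real structures: struct watchdog, watchdog_touch, watchdog_expired, toy_transfer and WATCHDOG_TIMEOUT_SEC are hypothetical names introduced only for illustration. The clock is simulated at one second per loop iteration so the behavior is deterministic, and a "stalled" iteration simply stands for one with no recorded read/write progress.

```c
#include <errno.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical constant mirroring the 30-second threshold in the PR. */
#define WATCHDOG_TIMEOUT_SEC 30

/* Illustrative state only; not a tlog structure. */
struct watchdog {
    time_t last_progress;   /* time of the last recorded progress */
};

/* Record progress; meant to be called after a successful read or write. */
static void watchdog_touch(struct watchdog *wd, time_t now)
{
    wd->last_progress = now;
}

/* Fires only if the child has exited AND progress is older than the limit. */
static bool watchdog_expired(const struct watchdog *wd,
                             bool child_exited, time_t now)
{
    return child_exited &&
           (now - wd->last_progress) > WATCHDOG_TIMEOUT_SEC;
}

/*
 * Toy stand-in for a transfer loop. The first progress_iters iterations
 * record progress; every later iteration is stalled. Returns ETIMEDOUT
 * if the watchdog breaks the loop, or 0 if max_iters is reached first.
 * *iters_run receives the number of completed iterations.
 */
static int toy_transfer(bool child_exited, int progress_iters,
                        int max_iters, int *iters_run)
{
    struct watchdog wd;
    time_t now = 1000;      /* simulated clock with an arbitrary start */
    int i;

    watchdog_touch(&wd, now);
    for (i = 0; i < max_iters; i++) {
        if (i < progress_iters) {
            watchdog_touch(&wd, now);
        }
        if (watchdog_expired(&wd, child_exited, now)) {
            *iters_run = i;
            return ETIMEDOUT;
        }
        now += 1;           /* one simulated second per iteration */
    }
    *iters_run = i;
    return 0;
}
```

In this sketch, toy_transfer(true, 5, 1000, &n) returns ETIMEDOUT once more than 30 simulated seconds have passed since the last recorded progress, while toy_transfer(false, 5, 1000, &n) runs all 1000 iterations and returns 0. That matches the intent stated above: a session whose child is still alive is never interrupted.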
Testing
30-second threshold is conservative; happy to make configurable.