Fix bg fill teardown crash in update_size_and_time_stats#13114
Conversation
There was a problem hiding this comment.
Pull request overview
This PR addresses a production crash in the HTTP transaction teardown path by ensuring background-fill accounting is left in a terminal state (and no longer triggers an ink_assert(0)) when the state machine is destroyed mid-fill.
Changes:
- Treat
BackgroundFill_t::STARTEDas anABORTED-equivalent inHttpTransact::update_size_and_time_stats()to avoid asserting and to record aborted bytes instead. - Normalize
HttpSM::background_fillfromSTARTEDtoABORTEDduringHttpSM::update_stats()teardown, and decrementproxy.process.http.background_fill_current_countthere to prevent gauge leakage on early teardown.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
src/proxy/http/HttpTransact.cc |
Makes stats update resilient to STARTED by folding it into the aborted accounting path instead of asserting. |
src/proxy/http/HttpSM.cc |
Normalizes leftover STARTED background-fill state during teardown and decrements the active-fill gauge to keep metrics balanced. |
e700ff2 to
82a1557
Compare
|
Just to follow up: this has been stable on docs for the last few weeks, so it addresses the crash. But @masaori335 reached out offline and suggested that it would be better to maybe keep the HttpSM open for the bg fill. That is a good point and something we should look into. |
82a1557 to
129aa81
Compare
Production crash logs show ink_assert(0) firing in HttpTransact::update_size_and_time_stats when an HTTP/2 stream close re-enters HttpSM after the user-agent side was detached for background fill. The server-to-cache tunnel should continue, but the stale user-agent event misses the VC table and falls through to the default tunnel handler, which tears down the SM with background_fill still STARTED. This ignores stale user-agent VIO and close events while a background fill tunnel is active, so tunnel_handler_server remains responsible for driving the fill to COMPLETED or ABORTED. This also keeps an unconditional kill_this cleanup as a last-resort guard to balance background_fill_current_count if any unexpected path still tears down mid-fill.
129aa81 to
15d0fbb
Compare
Hi @masaori335 : when you have a moment, can you check the patch again? I have this latest patch running on docs now and it has been stable since this morning. |
This fixes a crash seen on docs.
Production crash logs show ink_assert(0) firing in
HttpTransact::update_size_and_time_stats from HttpSM::update_stats via
HttpSM::kill_this, triggered by an Http2Stream event re-entering the
state machine while background_fill was still in the STARTED state.
The background_fill state is normally driven to a terminal value
(COMPLETED or ABORTED) by tunnel_handler_server, but when the SM is
torn down before that handler runs the state stays STARTED and the
unreachable default branch in update_size_and_time_stats asserts. The
same path also leaks proxy.process.http.background_fill_current_count
because tunnel_handler_server is the only site that decrements the
gauge after tunnel_handler_ua incremented it.
This normalizes STARTED to ABORTED inside HttpSM::kill_this, before
the enable_http_stats gate that guards update_stats, and decrements
the gauge there. Doing the normalization in kill_this rather than in
update_stats keeps the gauge accounting balanced regardless of whether
http stats are enabled, and ensures the downstream stats helper sees a
terminal state. As a defensive backstop this also folds the STARTED
case into the ABORTED branch of update_size_and_time_stats so a future
caller that reaches this point mid-fill records the bytes against
background_fill_bytes_aborted instead of crashing the server.