Skip to content

Fix bg fill teardown crash in update_size_and_time_stats#13114

Open
bneradt wants to merge 1 commit into
apache:masterfrom
bneradt:fix_update_size_and_time_stats
Open

Fix bg fill teardown crash in update_size_and_time_stats#13114
bneradt wants to merge 1 commit into
apache:masterfrom
bneradt:fix_update_size_and_time_stats

Conversation

@bneradt
Copy link
Copy Markdown
Contributor

@bneradt bneradt commented Apr 21, 2026

This fixes a crash seen on docs.


Production crash logs show ink_assert(0) firing in
HttpTransact::update_size_and_time_stats from HttpSM::update_stats via
HttpSM::kill_this, triggered by an Http2Stream event re-entering the
state machine while background_fill was still in the STARTED state.
The background_fill state is normally driven to a terminal value
(COMPLETED or ABORTED) by tunnel_handler_server, but when the SM is
torn down before that handler runs the state stays STARTED and the
unreachable default branch in update_size_and_time_stats asserts. The
same path also leaks proxy.process.http.background_fill_current_count
because tunnel_handler_server is the only site that decrements the
gauge after tunnel_handler_ua incremented it.

This normalizes STARTED to ABORTED inside HttpSM::kill_this, before
the enable_http_stats gate that guards update_stats, and decrements
the gauge there. Doing the normalization in kill_this rather than in
update_stats keeps the gauge accounting balanced regardless of whether
http stats are enabled, and ensures the downstream stats helper sees a
terminal state. As a defensive backstop this also folds the STARTED
case into the ABORTED branch of update_size_and_time_stats so a future
caller that reaches this point mid-fill records the bytes against
background_fill_bytes_aborted instead of crashing the server.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a production crash in the HTTP transaction teardown path by ensuring background-fill accounting is left in a terminal state (and no longer triggers an ink_assert(0)) when the state machine is destroyed mid-fill.

Changes:

  • Treat BackgroundFill_t::STARTED as an ABORTED-equivalent in HttpTransact::update_size_and_time_stats() to avoid asserting and to record aborted bytes instead.
  • Normalize HttpSM::background_fill from STARTED to ABORTED during HttpSM::update_stats() teardown, and decrement proxy.process.http.background_fill_current_count there to prevent gauge leakage on early teardown.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
src/proxy/http/HttpTransact.cc Makes stats update resilient to STARTED by folding it into the aborted accounting path instead of asserting.
src/proxy/http/HttpSM.cc Normalizes leftover STARTED background-fill state during teardown and decrements the active-fill gauge to keep metrics balanced.

Comment thread src/proxy/http/HttpSM.cc Outdated
@bneradt bneradt force-pushed the fix_update_size_and_time_stats branch from e700ff2 to 82a1557 Compare April 21, 2026 23:43
@masaori335 masaori335 self-requested a review April 23, 2026 22:32
@bneradt
Copy link
Copy Markdown
Contributor Author

bneradt commented May 6, 2026

Just to follow up: this has been stable on docs for the last few weeks, so it addresses the crash.

But @masaori335 reached out offline and suggested that it would be better to maybe keep the HttpSM open for the bg fill. That is a good point and something we should look into.

@bneradt bneradt force-pushed the fix_update_size_and_time_stats branch from 82a1557 to 129aa81 Compare May 11, 2026 13:56
Production crash logs show ink_assert(0) firing in
HttpTransact::update_size_and_time_stats when an HTTP/2 stream close
re-enters HttpSM after the user-agent side was detached for background
fill. The server-to-cache tunnel should continue, but the stale
user-agent event misses the VC table and falls through to the default
tunnel handler, which tears down the SM with background_fill still
STARTED.

This ignores stale user-agent VIO and close events while a background
fill tunnel is active, so tunnel_handler_server remains responsible for
driving the fill to COMPLETED or ABORTED. This also keeps an
unconditional kill_this cleanup as a last-resort guard to balance
background_fill_current_count if any unexpected path still tears down
mid-fill.
@bneradt bneradt force-pushed the fix_update_size_and_time_stats branch from 129aa81 to 15d0fbb Compare May 11, 2026 14:44
@bneradt
Copy link
Copy Markdown
Contributor Author

bneradt commented May 11, 2026

Just to follow up: this has been stable on docs for the last few weeks, so it addresses the crash.

But @masaori335 reached out offline and suggested that it would be better to maybe keep the HttpSM open for the bg fill. That is a good point and something we should look into.

Hi @masaori335 : when you have a moment, can you check the patch again? I have this latest patch running on docs now and it has been stable since this morning.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

2 participants