Conversation

@fhanau
Contributor

@fhanau fhanau commented Dec 3, 2025

This avoids excessive memory overhead in a few edge cases. Usually the tail worker can keep up with reporting events, but we still need to put a limit on the queue size.
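
As a rough illustration of the load-shedding idea described above, here is a minimal sketch of a byte-bounded queue that drops events once the limit is reached and remembers how many were dropped. The names `BoundedEventQueue`, `QueuedEvent`, `tryPush`, `tryPop`, and `takeDroppedCount` are hypothetical and are not taken from the workerd sources; the real `TailStreamWriterState` is more involved.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <string>
#include <utility>

// Hypothetical stand-in for a tail event; the real TailEvent carries much more.
struct QueuedEvent {
  std::string payload;
};

// Sketch of a queue bounded by total payload bytes (an assumption, not workerd code).
class BoundedEventQueue {
 public:
  explicit BoundedEventQueue(std::size_t maxBytes): maxBytes_(maxBytes) {}

  // Enqueues the event, or drops it and counts the drop if it would exceed the limit.
  bool tryPush(QueuedEvent event) {
    const std::size_t size = event.payload.size();
    // queuedBytes_ never exceeds maxBytes_, so this subtraction cannot underflow.
    if (size > maxBytes_ - queuedBytes_) {
      ++droppedEvents_;
      return false;
    }
    queuedBytes_ += size;
    queue_.push_back(std::move(event));
    return true;
  }

  // Removes the oldest event, freeing its bytes for later pushes.
  bool tryPop(QueuedEvent& out) {
    if (queue_.empty()) return false;
    out = std::move(queue_.front());
    queue_.pop_front();
    queuedBytes_ -= out.payload.size();
    return true;
  }

  // Returns the number of drops since the last call and resets the counter,
  // e.g. so it can be reported once just before the outcome event.
  std::uint64_t takeDroppedCount() { return std::exchange(droppedEvents_, 0); }

  std::size_t queuedBytes() const { return queuedBytes_; }

 private:
  std::size_t maxBytes_;
  std::size_t queuedBytes_ = 0;
  std::uint64_t droppedEvents_ = 0;
  std::deque<QueuedEvent> queue_;
};
```

Counting drops instead of failing lets the producer keep running while still leaving a signal that the trace is incomplete.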

@fhanau fhanau force-pushed the felix/112625-stw-load-shed branch from 3a09407 to 7692d3e Compare December 3, 2025 22:01
tracing::SpanOpen(span.spanId, span.operationName.clone()), span.startTime, spanNameSize);
// If a span manages to exceed the size limit, truncate it by not providing span attributes.
- if (span.tags.size() && messageSize <= MAX_TRACE_BYTES) {
+ if (span.tags.size() && spanTagsSize <= MAX_TRACE_BYTES) {
Contributor Author

As the watchful reviewer will notice, this is a minor functional change: We're no longer including the size of the span name in this check since it is included in a different tail event, so we're slightly relaxing the size limit for emitting span tags here.
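
The relaxed check described above can be sketched as follows. This is only an illustration under stated assumptions: `Tag`, `spanTagsSize`, `shouldEmitTags`, and the value of `kMaxTraceBytes` are hypothetical, and the actual value of `MAX_TRACE_BYTES` is not shown in this thread.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical limit; the real MAX_TRACE_BYTES value is not given in this conversation.
constexpr std::size_t kMaxTraceBytes = 256 * 1024;

// Hypothetical, simplified span attribute.
struct Tag {
  std::string key;
  std::string value;
};

// Size of the attributes alone. The span name is deliberately not counted,
// since it travels in a different tail event (the span-open event).
std::size_t spanTagsSize(const std::vector<Tag>& tags) {
  std::size_t total = 0;
  for (const auto& tag : tags) {
    total += tag.key.size() + tag.value.size();
  }
  return total;
}

// True when an attributes event should be emitted; oversized attribute sets
// are skipped, which truncates the span instead of exceeding the limit.
bool shouldEmitTags(const std::vector<Tag>& tags, std::size_t limit = kMaxTraceBytes) {
  return !tags.empty() && spanTagsSize(tags) <= limit;
}
```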

// The TailStreamWriterState holds the current client-side state for a collection
// of streaming tail workers that a worker is reporting events to.
struct TailStreamWriterState {
// The maximum size of the queue, in bytes.
Contributor Author

@fhanau fhanau Dec 3, 2025

Open to different names for these constants/defining them somewhere else.

Note that tests verifying that these checks take effect will be provided in a downstream PR (to follow). That PR will also include a discussion of why the limits are sufficient for almost all use cases.

@fhanau fhanau marked this pull request as ready for review December 3, 2025 22:08
@fhanau fhanau requested review from a team as code owners December 3, 2025 22:08
@fhanau fhanau requested a review from mar-cf December 3, 2025 22:08
@fhanau
Contributor Author

fhanau commented Dec 3, 2025

This is now feature-complete. As noted in a PR comment, tests and rationale for this will be provided in a downstream PR, but I think we can already discuss the merits of the code changes here.

@mar-cf
Contributor

mar-cf commented Dec 10, 2025

Add tests

@fhanau fhanau force-pushed the felix/112625-stw-load-shed branch 2 times, most recently from 70cfa7f to 51d612b Compare December 16, 2025 02:25
Contributor

@mar-cf mar-cf left a comment

My biggest concern is whether it's possible to produce inconsistent traces due to dropped events.

@mar-cf
Contributor

mar-cf commented Jan 2, 2026

The team discussed some concerns offline. I'm going to assume this isn't blocked on me until I see those updates. Feel free to ping me if I'm needed.

@fhanau
Contributor Author

fhanau commented Jan 8, 2026

> Add tests

Tests for this are available downstream, in #11842 and #12029.

@fhanau fhanau changed the title from "[o11y] Drop tail stream events when reaching excessive queue size" to "EW-9735 EW-9736 [o11y] Drop tail stream events when reaching excessive queue size" Jan 9, 2026
@fhanau fhanau force-pushed the felix/112625-stw-load-shed branch 2 times, most recently from a0cc9a9 to db917db Compare January 9, 2026 17:02
@codspeed-hq

This comment was marked as outdated.

@fhanau fhanau force-pushed the felix/112625-stw-load-shed branch 3 times, most recently from 1aecc65 to 0074510 Compare January 13, 2026 19:50
auto log = kj::str(
"[\"Dropped ", active->droppedEvents, " tail events due to excessive queueing\"]");
TailEvent droppedEventsLog(event.spanContext.clone(), event.invocationId, event.timestamp,
event.sequence, tracing::Log(event.timestamp, LogLevel::WARN, kj::mv(log)));
Contributor

If we're injecting a synthetic log just before the outcome, why not just report it as a field in the outcome event? There's no chance for the handler to act on a drop signal at this point anyway.

Contributor Author

I discussed this with @zebp at the last or second-to-last meeting. He suggested just having a log for now. The rationale is that we'll likely want something more complex (perhaps a new event type for stream events), but that is not something we need at this stage, and a temporary log is better than changing the Outcome event API.

Contributor

Matching against this exact string and parsing out the number of dropped events in the log handler could be done, but it doesn't seem very robust. The string could also change and silently break consumers, and it could be crafted artificially.

I'm going to assume @zebp confirmed, but we should reconsider that decision.

Member

A log is fine for now, but we should extract this out into a separate event before STWs are shipped publicly. As @fhanau mentioned I'd rather not complicate the Outcome event by adding a variant that is specific to STWs and would cause STW outcomes to not match regular outcomes.

@mar-cf does make a good point that this is a bit hard to parse. Can we instead emit a JSON structured log (with a special value to prevent collisions with customer logs) so that we don't have to rely on brittle string parsing?

{"$":"cloudflare-streaming-tail-workers-internal","type":"dropped","count":0} would be fine.
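
A minimal sketch of how a log in the shape suggested above could be produced and recognized, using only the standard library. The helper names `droppedEventsLog` and `looksLikeInternalDiagnostic` are hypothetical; the actual PR builds its message with `kj::str`, and a real consumer would use a proper JSON parser instead of a prefix check.

```cpp
#include <cstdint>
#include <string>

// Marker value taken from the suggestion above; it is meant to keep internal
// diagnostics from colliding with customer logs.
const std::string kInternalMarker =
    "{\"$\":\"cloudflare-streaming-tail-workers-internal\",";

// Builds the structured "dropped events" log message (hypothetical helper).
std::string droppedEventsLog(std::uint64_t count) {
  return kInternalMarker + "\"type\":\"dropped\",\"count\":" + std::to_string(count) + "}";
}

// Cheap check for the marker prefix (hypothetical helper). This assumes the
// producer always emits the marker key first, as droppedEventsLog does.
bool looksLikeInternalDiagnostic(const std::string& message) {
  return message.compare(0, kInternalMarker.size(), kInternalMarker) == 0;
}
```

Because the count is a JSON number in a named field, a consumer can read it without matching on the wording of a human-readable sentence.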

Collaborator

Wouldn't it take a day to just add an event type? Why are we hacking?

Contributor Author

@mar-cf, the PR has been updated based on Zeb's suggestion: it now produces a structured log, and the downstream test has been updated to expect that log.

Contributor

@mar-cf mar-cf left a comment

My objections are noted. I won't block on them since this gets us functionally there. Maybe down the line, I'll get around to them.

@fhanau fhanau force-pushed the felix/112625-stw-load-shed branch 2 times, most recently from 3f788be to 9e67080 Compare January 21, 2026 04:52
@fhanau fhanau requested a review from a team as a code owner January 21, 2026 04:52
@github-actions

github-actions bot commented Jan 21, 2026

The generated output of @cloudflare/workers-types matches the snapshot in types/generated-snapshot 🎉

@fhanau fhanau force-pushed the felix/112625-stw-load-shed branch from 9e67080 to 61386e4 Compare January 21, 2026 19:46
@fhanau fhanau force-pushed the felix/112625-stw-load-shed branch from 61386e4 to 922fcde Compare January 22, 2026 22:17
@codecov-commenter

codecov-commenter commented Jan 22, 2026

Codecov Report

❌ Patch coverage is 52.88462% with 49 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.65%. Comparing base (2752663) to head (a5ae1f8).
⚠️ Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/workerd/io/trace-stream.c++ | 29.41% | 21 Missing and 3 partials ⚠️ |
| src/workerd/io/trace.c++ | 48.48% | 11 Missing and 6 partials ⚠️ |
| src/workerd/io/tracer.c++ | 76.92% | 5 Missing and 1 partial ⚠️ |
| src/workerd/io/trace-test.c++ | 81.81% | 0 Missing and 2 partials ⚠️ |
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #5637       +/-   ##
===========================================
+ Coverage   68.10%   69.65%    +1.54%     
===========================================
  Files          29      397      +368     
  Lines        2593   106075   +103482     
  Branches       15    17978    +17963     
===========================================
+ Hits         1766    73885    +72119     
- Misses        827    21386    +20559     
- Partials        0    10804    +10804     


@fhanau fhanau force-pushed the felix/112625-stw-load-shed branch from 922fcde to ccf98d3 Compare January 22, 2026 23:26
# events diagnostic is supported. We do not need to include the diagnostic type here (it is
# always the same for a given diagnostic) so droppedEvents can be represented using just the
# count variable.
diagnostic :union {
Contributor Author

capnp does not allow a single-member union. Is there a better way to make this type easily extensible than having an unused "undefined" value? In practice, this should be fine as undefined should only take up one bit in the serialized message.
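
For readers less familiar with this pattern, a rough C++ analogue of a union with an unused placeholder member is a `std::variant` whose first alternative is `std::monostate`. This is only an analogy, assuming C++17; `DroppedEvents`, `Diagnostic`, and `droppedCount` are hypothetical names, and the sketch says nothing about how Cap'n Proto encodes the union on the wire.

```cpp
#include <cstdint>
#include <variant>

// Hypothetical payload for the only diagnostic discussed in this PR.
struct DroppedEvents {
  std::uint32_t count;
};

// std::monostate plays the role of the unused "undefined" member: it carries no
// data, and further diagnostic types can be appended to the variant later.
using Diagnostic = std::variant<std::monostate, DroppedEvents>;

// Returns the dropped-event count, or 0 when no diagnostic is set.
std::uint32_t droppedCount(const Diagnostic& diagnostic) {
  if (const auto* dropped = std::get_if<DroppedEvents>(&diagnostic)) {
    return dropped->count;
  }
  return 0;
}
```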

@fhanau fhanau force-pushed the felix/112625-stw-load-shed branch from ccf98d3 to a5ae1f8 Compare January 23, 2026 15:31
5 participants