-
Notifications
You must be signed in to change notification settings - Fork 517
EW-9735 EW-9736 [o11y] Drop tail stream events when reaching excessive queue size #5637
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
3a09407 to
7692d3e
Compare
| tracing::SpanOpen(span.spanId, span.operationName.clone()), span.startTime, spanNameSize); | ||
| // If a span manages to exceed the size limit, truncate it by not providing span attributes. | ||
| if (span.tags.size() && messageSize <= MAX_TRACE_BYTES) { | ||
| if (span.tags.size() && spanTagsSize <= MAX_TRACE_BYTES) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As the watchful reviewer will notice, this is a minor functional change: We're no longer including the size of the span name in this check since it is included in a different tail event, so we're slightly relaxing the size limit for emitting span tags here.
src/workerd/io/trace-stream.c++
Outdated
| // The TailStreamWriterState holds the current client-side state for a collection | ||
| // of streaming tail workers that a worker is reporting events to. | ||
| struct TailStreamWriterState { | ||
| // The maximum size of the queue, in bytes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Open to different names for these constants/defining them somewhere else.
Note that testing that these checks take effect at some point will be provided in a downstream PR (to follow). This will also include a discussion of why they are sufficient for almost all use cases.
|
This is now feature-complete. As noted in a PR comment – tests for this and rationale will be provided in a downstream PR, but I think we can already discuss the merits of the code changes here. |
|
Add tests |
70cfa7f to
51d612b
Compare
mar-cf
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My biggest concern is if it's possible to produce inconsistent traces due to dropping.
|
Offline the team discussed some concerns. I'm going to assume this isn't blocked on me until I see those updates. Feel free to ping if I'm needed. |
Tests for this are available downstream, in #11842 and #12029. |
a0cc9a9 to
db917db
Compare
This comment was marked as outdated.
This comment was marked as outdated.
1aecc65 to
0074510
Compare
src/workerd/io/trace-stream.c++
Outdated
| auto log = kj::str( | ||
| "[\"Dropped ", active->droppedEvents, " tail events due to excessive queueing\"]"); | ||
| TailEvent droppedEventsLog(event.spanContext.clone(), event.invocationId, event.timestamp, | ||
| event.sequence, tracing::Log(event.timestamp, LogLevel::WARN, kj::mv(log))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're injecting a synthetic log just before the outcome, why not just report it as a field in the outcome event? There's no chance for the handler to act on a drop signal at this point anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I discussed this with @zebp at the last or second last meeting – he suggested to just have a log for now, rationale is that we'll likely want something more complex (perhaps a new event type for stream events) but this is not something we need at this stage, having a temporary log is better than changing the Outcome event API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Matching against this exact string and parsing out the number dropped in the log handler, could be done but doesn't seem very robust. Also this string could change and silently break, and it could be crafted artificially.
I'm going to assume @zebp confirmed, but we should reconsider that decision.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A log is fine for now, but we should extract this out into a separate event before STWs are shipped publicly. As @fhanau mentioned I'd rather not complicate the Outcome event by adding a variant that is specific to STWs and would cause STW outcomes to not match regular outcomes.
@mar-cf does make a good point that this does make it a bit hard to parse, can we instead emit a JSON structured log (with a special value to prevent colisions with customer logs) so we can not rely on brittle string parsing?
{"$":"cloudflare-streaming-tail-workers-internal","type":"dropped","count":0} would be fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it take a day to just add an event type? Why are we hacking?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mar-cf updated the PR based on Zeb's suggestion – this produces a structured log now, the downstream test has been updated to expect that log.
mar-cf
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My objections are noted. I won't block on them since this gets us functionally there. Maybe down the line, I'll get around to them.
3f788be to
9e67080
Compare
|
The generated output of |
9e67080 to
61386e4
Compare
This will avoid excessive memory overhead in a few edge cases.
61386e4 to
922fcde
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #5637 +/- ##
===========================================
+ Coverage 68.10% 69.65% +1.54%
===========================================
Files 29 397 +368
Lines 2593 106075 +103482
Branches 15 17978 +17963
===========================================
+ Hits 1766 73885 +72119
- Misses 827 21386 +20559
- Partials 0 10804 +10804 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
922fcde to
ccf98d3
Compare
| # events diagnostic is supported. We do not need to include the diagnostic type here (it is | ||
| # always the same for a given diagnostic) so droppedEvents can be represented using just the | ||
| # count variable. | ||
| diagnostic :union { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
capnp does not allow a single member union – is there a better way to make this type easily extensible than having an unused "undefined" value? In practice, this should be fine as undefined should only take up one bit in the serialized message.
ccf98d3 to
a5ae1f8
Compare
This will avoid excessive memory overhead in a few edge cases – usually the tail worker will be able to keep up with reporting events but we still need to put a limit to the queue size.