Add PerfdatawriterConnection to handle network requests for Perfdata Writers #10668
julianbrost merged 9 commits into master
Unfortunately, despite everything having been ready and tested, I noticed a problem with the previous iteration where a
I haven't retested every writer with these changes, just a quick test with OpenTSDB, because no writer code has changed since the previous iteration, only the connection code, which should be covered by the unit tests. Edit: And it immediately fails on amazonlinux:2. Shocker...
yhabteab
left a comment
Looks good to me now! Unless @julianbrost wants to have a look at it, feel free to merge it once the GHAs pass.
julianbrost
left a comment
I selected "request changes" mostly so that this isn't accidentally merged, given the already existing approval, until the discussion on OpenSSL context initialization errors is resolved.
The things I requested were addressed, so that review is obsolete. I probably won't do a full review of this PR myself though.
There's a set of two tests for each perfdatawriter, just to make sure they can connect and send data that looks reasonably correct, and to make sure pausing actually works while the connection is stuck. Then there's a more in-depth suite of tests for PerfdataWriterConnection itself, to verify that connection handling works well in all types of scenarios. Co-authored-by: Yonas Habteab <yonas.habteab@icinga.com>
Description
This unifies the connection handling for all perfdata writers into a single class PerfdataWriterConnection that provides a blocking interface (using promises) to the underlying asynchronous operations.
All in all this is a huge code reduction and deduplication (as long as you don't count the added unit tests) and should fix the issues with the work queues being stuck on shutdown.
Fixes #10159, possibly fixes #10629
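To illustrate what "a blocking interface (using promises)" over an asynchronous operation can look like, here is a minimal standard-library sketch. It is not the code from this PR: AsyncSend, BlockingSend and the Completion alias are hypothetical names, and a plain std::thread stands in for the real asynchronous I/O.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <functional>
#include <future>
#include <stdexcept>
#include <string>
#include <thread>
#include <utility>

// Hypothetical completion callback: an error (or nullptr) plus the bytes sent.
using Completion = std::function<void(std::exception_ptr, std::size_t)>;

// Hypothetical stand-in for an asynchronous send. The work runs on another
// thread and the outcome is reported through the completion callback.
std::thread AsyncSend(std::string data, Completion done)
{
    assert(done != nullptr);
    return std::thread([data = std::move(data), done = std::move(done)] {
        if (data.empty()) {
            done(std::make_exception_ptr(std::runtime_error("nothing to send")), 0);
        } else {
            done(nullptr, data.size());
        }
    });
}

// Blocking wrapper: the callback fulfils a promise and the caller waits on
// the matching future, so errors surface as ordinary exceptions.
std::size_t BlockingSend(const std::string& data)
{
    std::promise<std::size_t> promise;
    std::future<std::size_t> result = promise.get_future();

    std::thread worker = AsyncSend(data, [&promise](std::exception_ptr error, std::size_t sent) {
        if (error) {
            promise.set_exception(error);
        } else {
            promise.set_value(sent);
        }
    });

    // Join the worker on every exit path, including when get() throws,
    // so the promise outlives the thread that references it.
    struct Joiner {
        std::thread& thread;
        ~Joiner() { if (thread.joinable()) thread.join(); }
    } joiner{worker};

    return result.get();
}
```

With this shape the caller only sees a normal function call, for example BlockingSend("metric 1 2"), while the actual work still happens elsewhere.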
Connection handling
CancelWithTimeout() function that waits on a future for a specified timeout. When either that timeout expires or the future is ready, the connection is stopped by canceling outstanding operations and disconnecting the stream/socket.
OTLPMetricsWriter #10685. The writers obviously still need to handle the HTTP status codes from the response, which the connection class doesn't touch in any way.
Rationale
A simpler solution to the disconnect problem would have been possible. Because a cancelled send or handshake doesn't allow for a graceful shutdown of the TLS connection anyway, especially when the server is unresponsive, a simple close on the stream's socket would be enough to cancel all outstanding operations. However, many writers only keep temporary stream objects in the functions where the messages are sent and currently don't track the state of the connection, so this would also need some serious refactoring, and a different one for each writer.
Instead of doing the same thing over and over for each writer, I chose to reduce code duplication and abstract the connection handling out of the individual writers and only fix it in one place. Using async operations and an asio strand was convenient, because now every yield leaves the connection object in a defined state, without needing any atomic variables or mutexes, which makes the disconnect handling much simpler.
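The disconnect behavior described under "Connection handling" (wait on a future for a limited time, then stop the connection whether or not the future became ready) can be sketched roughly as follows. This is only an illustration under assumptions: FakeConnection and WaitThenCancel are hypothetical names, the real CancelWithTimeout() signature is not shown here, and Cancel() merely sets a flag instead of canceling operations and closing a socket.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <future>

// Hypothetical connection stand-in. Cancel() plays the role of canceling
// outstanding operations and disconnecting the stream/socket.
class FakeConnection
{
public:
    void Cancel() { m_Cancelled.store(true); }
    bool IsCancelled() const { return m_Cancelled.load(); }

private:
    std::atomic<bool> m_Cancelled{false};
};

// Waits until `done` is ready or `timeout` has expired, then stops the
// connection in both cases. Returns true if the future was ready in time.
// Note: this assumes a promise-backed future; a deferred std::async future
// would report future_status::deferred instead of ready.
template<class Rep, class Period>
bool WaitThenCancel(FakeConnection& conn, const std::future<void>& done,
    std::chrono::duration<Rep, Period> timeout)
{
    assert(done.valid());
    bool finished = done.wait_for(timeout) == std::future_status::ready;
    conn.Cancel();
    return finished;
}
```

The return value lets a caller distinguish "pending work finished before shutdown" from "gave up after the timeout", while the connection ends up stopped either way.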
Other changes
In addition to the changes to connection handling some other minor refactoring has been done:
CheckResultHandler which meant that if a server was slow or unresponsive it could have blocked check-result processing and slowed down the whole process/cluster.
ElasticsearchWriter locked a mutex on each Flush() so it could be called from both outside and inside the work queue. This was changed to always queuing the Flush() onto the work queue instead. This makes the behavior more similar to what InfluxDbCommonWriter does.
ElasticsearchWriter and InfluxDbCommonWriter's flush timer has been improved by setting an atomic boolean the first time the flush is queued and then skipping further queue entries until the flush has been processed.
The two/three HTTP based writers could benefit from further refactoring to use the new HTTP message classes, but I didn't do this here, because it isn't necessary to solve the problem at hand and ElasticsearchWriter is going to be deprecated anyway (see #10734). A refactor for the InfluxDB writers can be done at the point where we have a good reason to do it.
Testing
Aside from the unit tests, we've manually tested with all the current backends:
Status
Done and ready to be merged.