Add PerfdatawriterConnection to handle network requests for Perfdata Writers #10668

Merged
julianbrost merged 9 commits into master from perfdata-writers-connection-handling
Mar 17, 2026

@jschmidt-icinga jschmidt-icinga commented Dec 10, 2025

Description

This unifies the connection handling for all perfdata writers into a single class PerfdataWriterConnection that provides a blocking interface (using promises) to the underlying asynchronous operations.

All in all this is a huge code reduction and deduplication (as long as you don't count the added unit-tests) and should fix the issues with the work-queues being stuck on shutdown.

Fixes #10159, possibly fixes #10629

Connection handling

  • Connections are now established lazily whenever a message is sent to the server. Some writers already worked that way, while others connected at startup and kept their connections around for as long as they needed them or until the server disconnected.
  • HTTP-based writers will disconnect after sending a message and receiving the response unless the keep-alive flag is set by the server. Currently we do not request keep-alive on our side, but that could easily be done on the side of the writers if we want to.
  • The writers can use a (slightly awkwardly named) CancelWithTimeout() function that waits on a future for a specified timeout. When either that timeout expires or the future is ready, the connection is stopped by canceling outstanding operations and disconnecting the stream/socket.
  • All system errors are handled by the connection class internally and lead to a retry after an exponentially increasing timeout, similar to the backoff strategy implemented in #10685 (Add OTLPMetricsWriter). The writers still need to handle the HTTP status codes from the response, which the connection class doesn't touch in any way.
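
To illustrate the CancelWithTimeout() behavior described above, here is a minimal sketch using only the standard library. `SketchConnection`, `Cancel()` and `IsStopped()` are hypothetical stand-ins and not the actual PerfdataWriterConnection interface; the real class cancels outstanding asio operations and disconnects the stream where this sketch only sets a flag.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <future>

// Hypothetical stand-in for the connection class; not the real implementation.
class SketchConnection {
public:
    // Stand-in for canceling outstanding operations and closing the socket.
    void Cancel() { m_Stopped.store(true); }

    bool IsStopped() const { return m_Stopped.load(); }

    // Waits until `done` is ready or `timeout` expires, then stops the
    // connection in either case. Returns true if the future became ready
    // before the timeout.
    bool CancelWithTimeout(const std::future<void>& done, std::chrono::milliseconds timeout) {
        assert(done.valid());
        bool ready = done.wait_for(timeout) == std::future_status::ready;
        Cancel();
        return ready;
    }

private:
    std::atomic<bool> m_Stopped{false};
};
```

A writer could fulfil the matching promise once its work queue has drained, so that a responsive server gets the remaining data while an unresponsive one can delay shutdown by at most the timeout.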

Rationale

A simpler solution to the disconnect problem would have been possible. Because a cancelled send or handshake doesn't allow for a graceful shutdown of the TLS connection anyway, especially when the server is unresponsive, a simple close on the stream's socket would be enough to cancel all outstanding operations. However, many writers only keep temporary stream objects in the functions where the messages are sent and currently don't track the state of the connection, so this would also have required serious refactoring, and a different one for each writer.

Instead of doing the same thing over and over for each writer, I chose to reduce code duplication and abstract the connection handling out of the individual writers and only fix it in one place. Using async operations and an asio strand was convenient, because now every yield leaves the connection object in a defined state, without needing any atomic variables or mutexes, which makes the disconnect handling much simpler.

Other changes

In addition to the changes to connection handling some other minor refactoring has been done:

  • OpenTsdbWriter now also uses a work queue like all the other writers. Previously this writer would send its data directly in its CheckResultHandler which meant that if a server was slow or unresponsive it could have blocked check-result processing and slowed down the whole process/cluster.
  • ElasticsearchWriter locked a mutex on each Flush() so it could be called from both outside and inside the work queue. This was changed to always queue the Flush() onto the work queue instead, which makes the behavior more similar to what InfluxDbCommonWriter does.
  • The flush timers of both ElasticsearchWriter and InfluxDbCommonWriter have been improved by setting an atomic boolean the first time the flush is queued and then skipping further queue entries until the flush has been processed.
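
The flush-timer change can be sketched roughly as follows. This is an illustration only: `SketchWriter` and all of its members are hypothetical names, the plain `std::queue` stands in for the real thread-safe work queue, and resetting the flag at the end of the flush is an assumption about the exact ordering.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>

// Hypothetical sketch; not the actual ElasticsearchWriter/InfluxDbCommonWriter code.
class SketchWriter {
public:
    // Called by the flush timer. Enqueues at most one pending flush.
    void OnFlushTimer() {
        // exchange() returns the previous value: true means a flush is already queued.
        if (m_FlushQueued.exchange(true)) {
            return;
        }
        m_Queue.push([this] { Flush(); });
    }

    // Stand-in for the work queue thread: runs all queued tasks in order.
    void RunQueue() {
        while (!m_Queue.empty()) {
            auto task = std::move(m_Queue.front());
            m_Queue.pop();
            task();
        }
    }

    std::size_t QueuedTasks() const { return m_Queue.size(); }
    int FlushCount() const { return m_Flushes; }

private:
    void Flush() {
        assert(m_FlushQueued.load());
        ++m_Flushes;                // stand-in for sending the buffered data
        m_FlushQueued.store(false); // allow the timer to queue the next flush
    }

    std::atomic<bool> m_FlushQueued{false};
    std::queue<std::function<void()>> m_Queue;
    int m_Flushes = 0;
};
```

With this pattern, a timer that fires several times while the queue is blocked on a slow server adds only a single flush entry instead of one per tick.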

The two/three HTTP based writers could benefit from further refactoring to use the new HTTP message classes, but I didn't do this here, because it isn't necessary to solve the problem at hand and ElasticsearchWriter is going to be deprecated anyway (see #10734). A refactor for the InfluxDB writers can be done at the point where we have a good reason to do it.

Testing

Aside from the unit tests, we've manually tested with all the current backends.

Status

Done and ready to be merged.

@cla-bot cla-bot bot added the cla/signed label Dec 10, 2025
@jschmidt-icinga jschmidt-icinga force-pushed the perfdata-writers-connection-handling branch 9 times, most recently from cb64ef1 to 21c2575 Compare December 12, 2025 07:45
@jschmidt-icinga jschmidt-icinga force-pushed the perfdata-writers-connection-handling branch from 21c2575 to 29f91c9 Compare December 15, 2025 11:08
@jschmidt-icinga jschmidt-icinga force-pushed the perfdata-writers-connection-handling branch 2 times, most recently from d018cae to a66f9ed Compare December 17, 2025 14:15
@jschmidt-icinga jschmidt-icinga force-pushed the perfdata-writers-connection-handling branch 4 times, most recently from 6391859 to 8ddd29d Compare January 28, 2026 11:35
@jschmidt-icinga jschmidt-icinga added this to the 2.16.0 milestone Jan 29, 2026
@jschmidt-icinga jschmidt-icinga force-pushed the perfdata-writers-connection-handling branch 8 times, most recently from 40c5481 to 4acf0b3 Compare February 4, 2026 07:50
@jschmidt-icinga jschmidt-icinga force-pushed the perfdata-writers-connection-handling branch 4 times, most recently from 12ff9d1 to bf462e5 Compare March 2, 2026 13:10
@jschmidt-icinga jschmidt-icinga force-pushed the perfdata-writers-connection-handling branch from bf462e5 to 9afa40a Compare March 4, 2026 07:06
jschmidt-icinga commented Mar 4, 2026

Unfortunately, despite everything having been ready and tested, I noticed a problem with the previous iteration: a Send() operation would be retried infinitely without any wait in between if the connection was established successfully but the send failed immediately. This could happen, for example, when the client was configured to use bare TCP and the server expected a TLS connection. This made another medium-sized update necessary.

  • Moved the wait between retry attempts from the connection step to the send step, so now when a send fails it will also wait an increasing amount of time before retrying, up to 32s. This doesn't magically solve the misconfiguration error, but it prevents the log from being spammed with errors and avoids the possible increase in CPU load this would cause.
  • Added a test case written by @yhabteab to ensure the HTTP messages are not mutated during repeated Send()s. This shouldn't normally happen, especially since I also made Send() take them by const reference now, but older Boost versions don't have const overloads for their async_write() functions, so it doesn't hurt to make sure this is true across all supported platforms.
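
For illustration, the capped retry delay could look roughly like the sketch below. Only the 32s upper bound and the exponential growth come from this PR; the 1s starting value, the doubling factor and all names are assumptions and not taken from the actual code.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

// Assumed starting delay; the PR only states the upper bound.
constexpr std::chrono::seconds kInitialRetryDelay{1};
// Upper bound mentioned in the PR.
constexpr std::chrono::seconds kMaxRetryDelay{32};

// Returns the delay to wait before the next attempt, doubling the current
// delay and capping it at kMaxRetryDelay.
std::chrono::seconds NextRetryDelay(std::chrono::seconds current) {
    assert(current.count() > 0);
    return std::min(current * 2, kMaxRetryDelay);
}
```

Starting from `kInitialRetryDelay`, repeated failures would wait 1s, 2s, 4s, 8s, 16s and then stay at 32s until a send succeeds, at which point the caller would reset the delay to the initial value.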

I've not retested every writer with these changes, just a quick test with OpenTSDB, because no writer code was changed since the previous iteration, only the connection code, which should be covered by the unit-tests.

Edit: And it immediately fails on amazonlinux:2. Shocker...

@jschmidt-icinga jschmidt-icinga force-pushed the perfdata-writers-connection-handling branch from 9afa40a to 17c1f2e Compare March 4, 2026 07:31
@jschmidt-icinga jschmidt-icinga force-pushed the perfdata-writers-connection-handling branch from 17c1f2e to da2a396 Compare March 4, 2026 09:18
yhabteab previously approved these changes Mar 4, 2026
@yhabteab yhabteab left a comment

Looks good to me now! Unless @julianbrost wants to have a look at it, feel free to merge it once the GHAs pass.

@yhabteab yhabteab added area/graphite Metrics to Graphite area/opentsdb Metrics to OpenTSDB area/influxdb Metrics to InfluxDB labels Mar 4, 2026
@julianbrost julianbrost left a comment

I selected "request changes" mostly so that this isn't accidentally merged, given the already existing approval, until the discussion on OpenSSL context initialization errors is resolved.

@jschmidt-icinga jschmidt-icinga force-pushed the perfdata-writers-connection-handling branch from f872cd0 to b6aecaf Compare March 9, 2026 07:43
@jschmidt-icinga jschmidt-icinga force-pushed the perfdata-writers-connection-handling branch from b6aecaf to a1f61aa Compare March 17, 2026 10:53
@julianbrost julianbrost dismissed their stale review March 17, 2026 10:55

The things I requested were addressed, so that review is obsolete. I probably won't do a full review of this PR myself though.

jschmidt-icinga and others added 3 commits March 17, 2026 12:11
There's a set of two tests for each perfdatawriter, just
to make sure they can connect and send data that looks reasonably
correct, and to make sure pausing actually works while the connection
is stuck.

Then there's a more in-depth suite of tests for PerfdataWriterConnection
itself, to verify that connection handling works well in all types
of scenarios.

Co-authored-by: Yonas Habteab <yonas.habteab@icinga.com>
@jschmidt-icinga jschmidt-icinga force-pushed the perfdata-writers-connection-handling branch from a1f61aa to 75b2ec6 Compare March 17, 2026 11:11
@julianbrost julianbrost enabled auto-merge March 17, 2026 12:34
@julianbrost julianbrost merged commit 6592eae into master Mar 17, 2026
28 checks passed
Labels

area/graphite Metrics to Graphite area/influxdb Metrics to InfluxDB area/opentsdb Metrics to OpenTSDB cla/signed

Development

Successfully merging this pull request may close these issues.

"systemctl reload icinga2" hangs
InfluxDB2 Writer (and quite possibly all other Data Outputs) may inhibit core functionality

3 participants