Add JSON-RPC utilization metrics and troubleshooting documentation #10553
julianbrost merged 4 commits into master
In other words, this PR closes #10266?
My motivation for creating this PR was that I find it more useful to provide a metric together with an explanation of what it says and how it can be interpreted, instead of just dumping a bunch of undocumented metrics where you need to read the source to understand them. So I wrote that documentation for the metric that was the reason for this task in the first place. If you want to provide something similar for the other metrics, feel free to do so. Note that I found writing that documentation useful, as it made me think more about the details, especially about what can already be derived from the TCP send/receive queue sizes. For example, your PR has a "time spent reading" metric: what can I actually learn from it? It would allow me to figure out whether the connection is the bottleneck, as in the link between the machines not providing enough throughput. But this would already show in the TCP buffers, as the send buffer would be full while the read buffer would be empty.
Co-authored-by: Alexander A. Klimov <alexander.klimov@icinga.com>
e6ecc02 to 2a7fb13
This PR is based on #10266, reducing it to the part of it I find most relevant, allowing for a simpler implementation, and adding some troubleshooting documentation from which users can learn how to actually use it.
First of all, the troubleshooting documentation explains that a full TCP receive queue on a JSON-RPC connection socket can be a sign of the connection being overloaded and Icinga 2 processing the messages slower than they are coming in.
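As a rough illustration of inspecting these queues outside of Icinga 2, the following sketch parses the Linux `/proc/net/tcp` table and reports the send and receive queue sizes of established connections on a given port. It is not part of this PR: the helper names are hypothetical, it assumes Linux and IPv4 (`/proc/net/tcp6` would be needed for IPv6), and it assumes the default Icinga 2 port 5665. The same numbers are shown by `ss -tn` as Send-Q and Recv-Q.

```python
from typing import NamedTuple

TCP_ESTABLISHED = "01"  # state code used in /proc/net/tcp


class QueueSizes(NamedTuple):
    local_port: int
    remote_port: int
    send_queue: int  # bytes waiting in the kernel send queue
    recv_queue: int  # bytes received but not yet read by the application


def parse_proc_net_tcp(text: str, port: int = 5665) -> list[QueueSizes]:
    """Return queue sizes of established connections involving `port`.

    Hypothetical helper for illustration; `text` is the content of
    /proc/net/tcp, `port` defaults to the Icinga 2 default port.
    """
    result = []
    for line in text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 5 or fields[3] != TCP_ESTABLISHED:
            continue
        # Addresses look like "0100007F:1621"; the port is hex after the colon.
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        remote_port = int(fields[2].rsplit(":", 1)[1], 16)
        if port not in (local_port, remote_port):
            continue
        # Field 4 is "tx_queue:rx_queue", both hexadecimal byte counts.
        tx_hex, rx_hex = fields[4].split(":")
        result.append(
            QueueSizes(local_port, remote_port, int(tx_hex, 16), int(rx_hex, 16))
        )
    return result


if __name__ == "__main__":
    with open("/proc/net/tcp") as f:
        for conn in parse_proc_net_tcp(f.read()):
            print(conn)
```

A receive queue that stays large over several samples would match the "processing slower than messages come in" situation described above; a single non-zero reading on its own says little.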
Second, it adds the `seconds_processing_messages` attribute to `Endpoint` objects. Observing the rate at which this changes allows estimating how close the connection is to being saturated, which is also explained in the troubleshooting documentation.
This PR just uses the already measured `total` duration, which currently includes the time taken by `CpuBoundWork`. This differs from #10266, which added multiple metrics. I opted for the single metric for two reasons: First, it's enough to derive how busy the connection is. Second, I have ideas to remove `CpuBoundWork` from here soon (as in this or maybe next week) anyway, so for the moment I don't want to spend time adding and documenting metrics here that will probably become obsolete soon.
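To illustrate the rate-based estimate mentioned above, here is a minimal sketch of the arithmetic. It assumes two readings of `seconds_processing_messages` for the same endpoint taken a known number of seconds apart; the function name is hypothetical, how the readings are fetched is left out, and treating a decreasing value as a counter reset is an assumption rather than documented behavior.

```python
def utilization(prev_seconds: float, curr_seconds: float, interval_seconds: float) -> float:
    """Estimate the fraction of wall-clock time spent processing messages.

    prev_seconds / curr_seconds: two readings of seconds_processing_messages.
    interval_seconds: wall-clock time between the two readings.
    Values close to 1.0 suggest the connection is close to saturation.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = curr_seconds - prev_seconds
    if delta < 0:
        # Assumption: a decreasing value means the counter was reset
        # (for example by a restart), so no rate can be derived.
        raise ValueError("counter decreased between samples")
    return delta / interval_seconds


if __name__ == "__main__":
    # 27 seconds of processing within a 30 second window
    print(f"{utilization(120.0, 147.0, 30.0):.0%}")  # prints "90%"
```

Averaging over a longer interval smooths out short bursts, at the cost of reacting more slowly to sustained overload.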