Conversation
|
Very nice work! I'll have a look at it. This should also resolve #9900 |
|
Hi, I spent some time testing the new Writer. Here's some first feedback:
Maybe the name of the writer could be OTLPMetricWriter instead of OTelWriter. Just an idea. OTelWriter is also fine.
When I set
Maybe the resource attribute service.namespace should not be hard coded to "icinga". From the OpenTelemetry Conventions:
I think the namespace is more akin to a "Kubernetes Namespace". I'm not 100% sure if this is something that should be set via the actual Icinga Service Objects or
Currently the Icinga Host/Service information is set as an attribute, maybe these should be resource attributes. From the OpenTelemetry Conventions:
There are Host and Service conventions for this:
I think it's ok if they are "namespaced" like this "icinga2.host.name" and "icinga2.service.name".
Maybe we want a more generic metric name than "icinga2_perfdata".
For example, there are proposals for this. See also open-telemetry/semantic-conventions#1106
When
This makes it hard to work with them for example when plotting them. They should be encoded as metrics with the same attributes as the perfdata metric. For example:
When I send the data to the Prometheus OTLP writer, the Icinga2 daemon logs some warnings, I think due to the response headers. Here is my test setup: OTel Collector, prometheus.yml
|
Also tested it with Grafana Mimir. Works great.
|
Another small note, I think
|
|
Thanks for the feedback and compose file you provided!
I will bring this into our weekly meetings next week for discussion.
Aye, that was an oversight on my part. Though, I've dropped it completely now, so there shouldn't be any daemon crashes now :).
We must look into this from the Icinga 2 side and not from an OTel perspective. There is no way we're going to introduce
Ack! Will change that.
That makes sense. I'll update the metric names accordingly, especially since there's a proposal for this.
Good point! I'm going to transform all thresholds into separate metric streams then, i.e.,
I didn't know about this, so thanks for testing it out! I'll fix it. |
|
I've addressed all your feedback apart from the
|
Sorry. I had to fix one issue introduced with my last push (thanks @martialblog for testing!). Apparently, namespace is a reserved keyword in the Icinga 2 DSL, so it can't be used as an attribute name.
|
Hold my beer. 😉 |
|
Thanks! I'm aware that DSL users can escape keywords but there's no point in using a reserved word as an attribute for a built-in config object. |
|
I encountered a strange issue with the new code. The daemon did tell me it was flushing data, but then never did.
@yhabteab and I did some debugging and isolated this area. When replacing the ASSERT with VERIFY it seems to work again. Yonas has the details.
Thanks for your help! Apparently, this is what the code

```cpp
AsioProtobufOutStream outputS{*m_Stream, m_ConnInfo, yc};
ASSERT(m_Request->SerializeToZeroCopyStream(&outputS));
```

looks like after the C++ preprocessor has run (in release builds, where `ASSERT` expands to nothing):

```cpp
AsioProtobufOutStream outputS{*m_Stream, m_ConnInfo, yc};
((void)0);
```

On the other hand, when using `VERIFY`, the expression is always evaluated:

```cpp
AsioProtobufOutStream outputS{*m_Stream, m_ConnInfo, yc};
((m_Request->SerializeToZeroCopyStream(&outputS)) ? void(0) : icinga_assert_fail("m_Request->SerializeToZeroCopyStream(&outputS)", "otel.cpp", 340));
```

I don't know what I was thinking when I used `ASSERT` here (lines 8 to 9 in 422f116). I've fixed it now and it should behave normally.
|
Did some testing with OpenSearch Data-Prepper; I did manage to send data successfully to OpenSearch like this. However, I did see some "critical" errors in the Icinga2 logs.

Compose with OpenSearch Data-Prepper:

```yaml
# cat metric_pipeline.yaml
metric-pipeline:
  source:
    otlp:
      unframed_requests: true
      health_check_service: true
      authentication:
        unauthenticated:
      ssl: false
  sink:
    - stdout:
    - opensearch:
        hosts: [ "https://opensearch:9200" ]
        insecure: true
        username: admin
        password: Developer@123
        index: otel_metrics
```

```yaml
# cat data-prepper-config.yaml
ssl: false
```
I don't know how OpenSearch behaves and whether their OTLP receiver fully conforms to the OTLP specs, but it looks like OpenSearch is closing the connection after some time (no persistent HTTP connection support?). I'll try to go through their docs and see if I can find something about that.
I can't find anything about persistent connections in the Data Prepper docs so far, but the OTel spec clearly says:
However, OpenSearch doesn't seem to honor that, so it closes the connection after each request; I guess that's because the sentence is phrased as SHOULD and not MUST. Nonetheless, I will try to detect such cases and degrade from a critical to some other log severity instead.

```
$ netstat -ant | grep 21893
tcp4  0  0  127.0.0.1.21893  127.0.0.1.51712  FIN_WAIT_2  # OpenSearch closed the connection and is waiting for the remote peer to close it.
tcp4  0  0  127.0.0.1.51712  127.0.0.1.21893  CLOSE_WAIT  # OpenSearch has closed the conn but Icinga 2 didn't close it.
tcp4  0  0  *.21893          *.*              LISTEN
```
|
Yeah, that makes sense. If the client only SHOULD keep the connection alive, then a less severe log level is alright.
|
I've fixed the critical logs shown in #10685 (comment) by degrading that specific |
|
The newly pushed commits include the following changes:
|
|
Since I had to rebase this, force push was unavoidable, so while force-pushing anyway, I've cleaned up the commits a bit. |
jschmidt-icinga left a comment
Looks good to me.
Also gave it a quick test in my docker environment targeting Elasticsearch 9.3.1 directly and it worked as expected. 👍 Great feature.
|
Just rebased and squashed some commits and it seems I've changed nothing in the code, otherwise, GitHub would've dismissed the approval. |
This module is copied from CMake's official module repository[^1] and
contains only minor changes as outlined below.
```diff
--- a/third-party/cmake/protobuf/FindProtobuf.cmake
+++ b/third-party/cmake/protobuf/FindProtobuf.cmake
@@ -218,9 +218,6 @@ Example:
GENERATE_EXTENSIONS .grpc.pb.h .grpc.pb.cc)
#]=======================================================================]
-cmake_policy(PUSH)
-cmake_policy(SET CMP0159 NEW) # file(STRINGS) with REGEX updates CMAKE_MATCH_<n>
-
function(protobuf_generate)
set(_options APPEND_PATH DESCRIPTORS)
set(_singleargs LANGUAGE OUT_VAR EXPORT_MACRO PROTOC_OUT_DIR PLUGIN PLUGIN_OPTIONS DEPENDENCIES)
@@ -503,7 +500,7 @@ if( Protobuf_USE_STATIC_LIBS )
endif()
endif()
-include(${CMAKE_CURRENT_LIST_DIR}/SelectLibraryConfigurations.cmake)
+include(SelectLibraryConfigurations)
# Internal function: search for normal library as well as a debug one
# if the debug one is specified also include debug/optimized keywords
@@ -768,7 +765,7 @@ if(Protobuf_INCLUDE_DIR)
endif()
endif()
-include(${CMAKE_CURRENT_LIST_DIR}/FindPackageHandleStandardArgs.cmake)
+include(FindPackageHandleStandardArgs)
FIND_PACKAGE_HANDLE_STANDARD_ARGS(Protobuf
REQUIRED_VARS Protobuf_LIBRARIES Protobuf_INCLUDE_DIR
VERSION_VAR Protobuf_VERSION
@@ -805,5 +802,3 @@ foreach(Camel
string(TOUPPER ${Camel} UPPER)
set(${UPPER} ${${Camel}})
endforeach()
-
-cmake_policy(POP)
```
[^1]: https://github.com/Kitware/CMake/blob/v3.31.0/Modules/FindProtobuf.cmake
|
Made two minor changes:
|
|
Oh, and I don't know why GitHub is rendering the GHAs like that, but they're definitely not cancelled; they're running in the background: https://github.com/Icinga/icinga2/pull/10685/checks.
julianbrost left a comment
I have a bit of trouble mapping Icinga 2 concepts to OpenTelemetry concepts. I'm not yet sure if that's just confusion from some parts of the implementation or if that might actually be useful to have in the documentation, that is, how Icinga 2 Host and Service objects map to OpenTelemetry {Resource, Entity, Service, ...}.
In particular, my confusion originates from how service.* attributes are set, some (service.name and service.version) are set as if OpenTelemetry Service refers to Icinga 2 (either the node writing, or even the cluster as a whole) whereas others (service.instance.id) are set as if OpenTelemetry Service refers to an Icinga 2 checkable (so Icinga 2 Host or Service object):
Lines 191 to 198 in 5c6a70d
icinga2/lib/perfdata/otlpmetricswriter.cpp
Lines 310 to 318 in 5c6a70d
Unfortunately, I didn't find the OpenTelemetry documentation particularly helpful here. I've read several parts of their documentation and went through the Protobuf definitions, and I still don't have a picture clear enough that I could say which Icinga 2 concept should map to what. @martialblog Could you perhaps help shed some light on the matter?
Apart from that, see below for some other small findings in the code.
```
# You can enable TLS encryption by uncommenting and configuring the following options.
# By default, the OTel writer uses unencrypted connections (plain HTTP requests).
// enable_tls = false
// tl_insecure_noverify = false
```
Suggested change:

```diff
-// tl_insecure_noverify = false
+// tls_insecure_noverify = false
```
```cpp
for (auto it{attrs.begin()}; it != attrs.end(); /* NOPE */) {
	auto* attr = dataPoint->add_attributes();
	auto node = attrs.extract(it++);
	SetAttribute(*attr, node.key(), node.mapped());
}
```
The complexity of this loop suggests that you wanted to move here (otherwise you could have used a simple `for (auto& x : attrs)` loop), but I don't think you're actually moving here.
Suggested change:

```diff
 for (auto it{attrs.begin()}; it != attrs.end(); /* NOPE */) {
 	auto* attr = dataPoint->add_attributes();
 	auto node = attrs.extract(it++);
-	SetAttribute(*attr, node.key(), node.mapped());
+	SetAttribute(*attr, std::move(node.key()), std::move(node.mapped()));
 }
```
Also, "NOPE" is not really more useful than no comment at all. Maybe say that the iterator is advanced inside the loop as the loop body invalidates the iterator. Alternatively, I think this would be more readable when written like this (with no changes to the runtime complexity):
```diff
-for (auto it{attrs.begin()}; it != attrs.end(); /* NOPE */) {
-	auto* attr = dataPoint->add_attributes();
-	auto node = attrs.extract(it++);
-	SetAttribute(*attr, node.key(), node.mapped());
-}
+while (!attrs.empty()) {
+	auto* attr = dataPoint->add_attributes();
+	auto node = attrs.extract(attrs.begin());
+	SetAttribute(*attr, std::move(node.key()), std::move(node.mapped()));
+}
```
The complexity of this loop suggests that you wanted to move here (otherwise you could have used a simple `for (auto& x : attrs)` loop), but I don't think you're actually moving here.
My initial implementation of SetAttribute did move internally, but after the discussions we had lately about when to move and when not to, the implementation was changed to only move when the caller explicitly provides an rvalue reference. I then forgot to update this loop to reflect that change, because I do, of course, want to move the attributes here.
Also, "NOPE" is not really more useful than no comment at all. Maybe say that the iterator is advanced inside the loop as the loop body invalidates the iterator. Alternatively, I think this would be more readable when written like this
The body contains literally 3 trivial lines of code, what in those lines is not readable? I'm not defending anything here to keep the for loop as it is, I just wonder what exactly is not readable about it. That comment was more like "no, I don't increment the iterator here, see two lines below instead". Or do you expect me to write a full sentence explaining why it's not incremented there?
The body contains literally 3 trivial lines of code, what in those lines is not readable? I'm not defending anything here to keep the `for` loop as it is, I just wonder what exactly is not readable about it.
To understand that code, you have to jump back and forth more often. `for (auto it{attrs.begin()}; it != attrs.end(); /* NOPE */)` immediately raises the question why there's no `++it`, and you have to look at the loop body to understand it. Whereas `while (!attrs.empty())` more or less directly says that a loop is following that will clear/consume `attrs` (or is an endless loop if it's buggy).
or do you expect me to write a full sentence explaining why it's not incremented there?
I think my suggested while loop would need no further comment. For the for loop, I was thinking of something like "loop body increments the iterator as it invalidates the previous one".
```cpp
template<typename T>
std::size_t OTLPMetricsWriter::Record(
	const Checkable::Ptr& checkable,
	const CheckResult::Ptr& cr,
	std::string_view metric,
	T value,
	double startTime,
	double endTime,
	OTel::AttrsMap attrs
)
```
Am I missing something or is this only used with T = double?
Yes, it is for now, but we might add more metrics, like the other writers do, that don't depend on the actual perfdata but on the checkable state. Do you see anything wrong with templating it? It's a private method anyway.
|
Hi @julianbrost. My current understanding is: this follows the OTel service semantic conventions: https://opentelemetry.io/docs/specs/semconv/resource/service/

This is to distinguish different Icinga2 instances that produce data. The Icinga2 checkables (Host or Service objects) then use

Now that I see it again, that might not be correct:

Maybe this maps more to an "Icinga Service" that might have two master "instances"?
There's the (almost philosophical) question of what produces the metric. Is it Icinga 2, because that's where the metric first becomes an OTel metric? Or is it the actual check command being executed? This probably boils down to the question whether Icinga 2 should present itself as a single service with lots of metrics (where the
I'm not sure how well that would play together with HA in Icinga 2. When
|
Al2Klimov left a comment
According to @yhabteab, we're currently writing one metric with values and one with thresholds.
@martialblog Is this separation good and if yes, what for?
I'm just asking, and I'm afraid that, depending on the backend, it may be hard/impossible to join separate metrics together again for arithmetic ops. See also #7060.
@martialblog Wouldn't it be clever and smart to instead write just one metric with everything (that the user wants)?
|
The metrics are currently inspired by this proposal here:
Since there is no complete final semantic convention for the kind of data we have here, it's an approximation. I guess semantically the separation makes sense, since you have two different things 1) the actual value and 2) the thresholds. Having a single metric would conflate the meaning of both the value and the threshold. https://opentelemetry.io/docs/specs/semconv/system/system-metrics/#memory-metrics We had some discussion around performance data for Nagios-compatible monitoring here: open-telemetry/semantic-conventions#3148 I don't think "one metric with everything" maps well onto the OTel data model for metrics. But I could be wrong. I don't have definitive answer to what is the best solution. Bit of a stretch, BUT one could probably encode a "one entity with everything" in a log event with different fields: https://opentelemetry.io/docs/specs/otel/logs/data-model/#events Then you could encode a "performance data point" into one entity AND you could output the thresholds "properly" with ranges instead of a number (as is currently the case for other Icinga2 Perfdata Writers). But this all feels a bit awkward, since most use cases for performance data I encountered so far have to do with timeseries metrics. |
I think this distinction isn't even that important. The main point is that Icinga 2 should be able to export metrics in a way that should be easy to distinguish between the individual checkables, since we can't directly map all the checkables to OTel concepts anyway. So as long as we have a clear (and of course non-standardised) way to distinguish our checkables, it shouldn't matter what we populate all the standard
Why would that be a problem? Wouldn't the
Is this an Icinga 2 problem? Doesn't this apply to all the other writers too and even the IDO? If you let multiple |
I don't know if this would be a problem. This is basically based on my understanding of Prometheus and how these attributes are mapped to Prometheus. If I understand that correctly,
I'm just saying that this will make the difference between submitting identical metrics twice and duplicating every metric and submitting it as two different metrics. Back to my initial question: Would the following description work out in the OTel world and if my wording is correct? Icinga 2 (as in a whole cluster) presents itself as a service to OTel receivers. Icinga 2 checkables (hosts and services) are mapped to OTel resources. Each resource has multiple metrics that represent the individual perfdata values reported by the check plugin. That includes the assumption that a resource can have multiple metrics. One of the parts I still didn't fully grasp is what In the current implementation, icinga2/lib/perfdata/otlpmetricswriter.cpp Lines 330 to 333 in 5c6a70d As they can have multiple metrics, my description/suggestion from above should work out. |
```
[config] bool enable_ha {
	default {{{ return false; }}}
};
```
Quoting from #10685 (comment):
If you let multiple Icinga 2 instance write to the same backend in parallel, then it's not going to end well either way, so I don't see how this is a problem of the OTel writer in particular.
Wouldn't that make `true` the sane default?
I don't see any issues with that, but mind you that I'm just as much of a newbie as you are when it comes to OTel, so take my answer with a grain of salt.
Sorry, but if the descriptions given in this document or here don't make it clear, I don't think I can explain it better. As far as I understand it, |
|
It's just that the definitions are pretty vague, but that might also be intentional to allow for some flexibility. Thus, to me, my suggestion sounds plausible, but I can't really point to specific parts of the specification to base that argument on. If we decided to follow my suggestion, what would this imply?
Can anyone who was involved in this PR so far reason about which of the options (current state of the PR, my suggestions, possibly other suggestions) is the most OTel way?
|
I think the separation of "The Icinga Instance/Application/Service" and "The Icinga Checkables" makes sense. "Checkables" can then have their attributes in the
And for general attributes:
As for that, using the Endpoint name could be an idea. The DSL already makes sure they are unique, right? That would also solve the "MUST be unique for each instance of the same service.namespace, service.name pair" constraint.
This PR introduces the long-awaited
`OTelWriter`, a new Icinga 2 component that enables seamless integration with OpenTelemetry. I'm a newbie to OpenTelemetry, so bear with me if you spot any obvious mistakes ;). I would therefore highly appreciate any feedback from OpenTelemetry experts (cc @martialblog and all the other users who reacted to the referenced issue).

First and foremost, this might surprise some of you, but this PR does not make use of the existing OpenTelemetry C++ SDK.
The reason for this is twofold:
Instead, I implemented a tiny OTLP HTTP client based on Boost.Beast that only supports OpenTelemetry metrics. That's right, no traces or logs, just metrics. Of course, it still uses Protocol Buffers for serialization as required by the OTLP specification, but without pulling in the entire OpenTelemetry C++ SDK. Also, since Icinga 2 just transforms the collected performance data into OpenTelemetry metrics (which means there's no way to know ahead of time which metrics with which names/units will be sent), the implementation doesn't provide any advanced aggregation features like the OpenTelemetry SDK does. Instead, it simply creates a single metric stream without any units or aggregation temporality, then appends each produced performance data value, transformed into an OTel Gauge metric data point, to that stream. Here's what the OpenTelemetry collector debug printout looks like when sending some sample performance data to a local OpenTelemetry collector instance:
Expand Me
```json
{ "resourceMetrics": [ { "resource": { "attributes": [ { "key": "service.name", "value": { "stringValue": "Icinga 2" } }, { "key": "service.instance.id", "value": { "stringValue": "547bc214-5b76-484e-833d-2de90da1bb74" } }, { "key": "service.version", "value": { "stringValue": "v2.15.0-235-gb35b335f2" } }, { "key": "telemetry.sdk.language", "value": { "stringValue": "cpp" } }, { "key": "telemetry.sdk.name", "value": { "stringValue": "Icinga 2 OTel Integration" } }, { "key": "telemetry.sdk.version", "value": { "stringValue": "v2.15.0-235-gb35b335f2" } }, { "key": "service.namespace", "value": { "stringValue": "icinga" } }, { "key": "icinga2.host.name", "value": { "stringValue": "something" } }, { "key": "icinga2.command.name", "value": { "stringValue": "icinga" } } ], "entityRefs": [ { "type": "host", "idKeys": [ "icinga2.host.name" ] } ] }, "scopeMetrics": [ { "scope": { "name": "icinga2", "version": "v2.15.0-235-gb35b335f2" }, "metrics": [ { "name": "state_check.perfdata", "gauge": { "dataPoints": [ { "attributes": [ { "key": "label", "value": { "stringValue": "api_num_conn_endpoints" } } ], "startTimeUnixNano": "1768762502097898752", "timeUnixNano": "1768762502101475072", "asDouble": 0 }, { "attributes": [ { "key": "label", "value": { "stringValue": "api_num_endpoints" } } ], "startTimeUnixNano": "1768762502097898752", "timeUnixNano": "1768762502101475072", "asDouble": 0 } ] } } ], "schemaUrl": "https://opentelemetry.io/schemas/1.39.0" } ], "schemaUrl": "https://opentelemetry.io/schemas/1.39.0" }, { "resource": { "attributes": [ { "key": "service.name", "value": { "stringValue": "Icinga 2" } }, { "key": "service.instance.id", "value": { "stringValue": "2d6c27cd-484d-436d-9542-b70abdaf2f76" } }, { "key": "service.version", "value": { "stringValue": "v2.15.0-235-gb35b335f2" } }, { "key": "telemetry.sdk.language", "value": { "stringValue": "cpp" } }, { "key": "telemetry.sdk.name", "value": { "stringValue": "Icinga 2 OTel Integration" } }, { "key": "telemetry.sdk.version", "value": { "stringValue": "v2.15.0-235-gb35b335f2" } }, { "key": "service.namespace", "value": { "stringValue": "icinga" } }, { "key": "icinga2.host.name", "value": { "stringValue": "something" } }, { "key": "icinga2.service.name", "value": { "stringValue": "something-service" } }, { "key": "icinga2.command.name", "value": { "stringValue": "icinga" } } ], "entityRefs": [ { "type": "service", "idKeys": [ "icinga2.host.name", "icinga2.service.name" ] } ] }, "scopeMetrics": [ { "scope": { "name": "icinga2", "version": "v2.15.0-235-gb35b335f2" }, "metrics": [ { "name": "state_check.perfdata", "gauge": { "dataPoints": [ { "attributes": [ { "key": "label", "value": { "stringValue": "api_num_conn_endpoints" } } ], "startTimeUnixNano": "1768762509990163200", "timeUnixNano": "1768762510002787072", "asDouble": 0 }, { "attributes": [ { "key": "label", "value": { "stringValue": "api_num_endpoints" } } ], "startTimeUnixNano": "1768762509990163200", "timeUnixNano": "1768762510002787072", "asDouble": 0 } ] } } ], "schemaUrl": "https://opentelemetry.io/schemas/1.39.0" } ], "schemaUrl": "https://opentelemetry.io/schemas/1.39.0" } ] }
```

As already mentioned, this is a first implementation and everything is open for discussion, but primarily about the
following aspects:
- … `enable_send_metadata` option is set and include attributes like `icinga2.check.state`, `icinga2.check.latency`, etc.). EDIT: There are now no such attributes anymore and the `enable_send_metadata` option is gone as well.
- … `icinga2.` to avoid collisions) and the overall metric naming (currently just `icinga2.perfdata` for all performance data points). EDIT: The metrics have been renamed to `state_check.perfdata`, `state_check.threshold.warning` etc. as suggested in #10685 (comment).

The high-level class overview is as highlighted in the following mermaid UML diagram:
```mermaid
---
title: OTel Integration
---
classDiagram
    %% The two classes below are just type alias definitions for better readability.
    note for OTelAttrVal "OTelAttrVal is implemented as a type alias, not a class."
    note for OTelAttrsMap "OTelAttrsSet is implemented as a type alias, not a class."
    class OTelAttrVal {
        <<type alias>>
        +std::variant~bool, int64_t, double, String~
    }
    class OTelAttrsMap {
        <<type alias>>
        +set~pair~String-AttrValue~~
    }
    OTelAttrsMap --o OTelAttrVal : manages
    class Gauge {
        -std::unique_ptr~proto::Gauge~ ProtoGauge
        +Transform(metric: proto::Metric*) void
        +IsEmpty() bool
        +Record(value: double|int64_t, start_time: double, end_time: double, attributes: OTelAttrsMap) std::size_t
    }
    OTelAttrsMap <.. Gauge : uses
    class OTel {
        -proto::ExportMetricsServiceRequest Request
        -std::optional~StreamType~ Stream
        -asio::io_context::strand Strand
        -std::atomic_bool Exporting, Stopped
        +Start() void
        +Stop() void
        +Export(MetricsRequest& request) void
        +IsExporting() bool
        +Stopped() bool
        +ValidateName(name: string_view) bool$
        +IsRetryableExportError(status: beast::http::status) bool$
        +PopulateResourceAttrs(rm: const std::unique_ptr~opentelemetry::proto::metrics::v1::ResourceMetrics~&) void$
        -Connect(yc: boost::asio::yield_context&) void
        -ExportLoop(yc: boost::asio::yield_context&) void
        -Export(yc: boost::asio::yield_context&) void
        -ExportingSet(exporting: bool, notifyAll: bool) void
    }
    AsioProtobufOutputStream <.. OTel : serializes via
    RetryableExportError <.. OTel : uses
    Backoff <.. OTel : uses
    `google::protobuf::io::ZeroCopyOutputStream` <|.. AsioProtobufOutputStream : implements
    class AsioProtobufOutputStream {
        -int64_t Pos
        -int64_t Buffered
        -HttpRequestWriter Writer
        -asio::yield_context& Yield
        +AsioProtobufOutputStream(stream: const StreamType&, info: const OTelConnInfo&, yield: asio::yield_context&)
        +Next(data: void**, size: int**) bool
        +BackUp(count: int) void
        +ByteCount() std::size_t
        -Flush(final: bool) bool
    }
    class RetryableExportError {
        -uint64_t Throttle
        +Throttle() uint64_t
        +what() const char*
    }
    class Backoff {
        +std::chrono::milliseconds MaxBackoff$
        +std::chrono::milliseconds MinBackoff$
        +operator()() std::chrono::milliseconds
    }
    class OTLPMetricsWriter {
        -unordered_set~shared_ptr~Metric~~ Metrics
        -OTel m_Exporter
    }
    Gauge "0...*" o-- "" OTLPMetricsWriter : produces
    OTelAttrsMap <.. OTLPMetricsWriter : creates
    OTel "1" o-- "" OTLPMetricsWriter : exports via
```

The `OTelWriter` by itself is pretty straightforward and doesn't contain any complex logic. The main OTel-related logic is encapsulated in a new library called `otel`, which provides an HTTP client that conforms to the OTLP HTTP protocol specification. The `OTel` class is the one used by the `OTelWriter` to export metrics to the OpenTelemetry collector. The `OTel` class internally uses several helper classes to build the required Protocol Buffers messages as per the OpenTelemetry specification. Unlike the existing metric writers, this client doesn't create separate HTTP connections for each metric export. Instead, it maintains a persistent connection to the OpenTelemetry collector and reuses it for subsequent exports until the connection is closed by either side. The Protobuf message is serialized directly into the HTTP connection without any intermediate buffering of the serialized message. This is possible only because the OpenTelemetry Collector supports HTTP/1.1 chunked transfer encoding, which allows sending the message in chunks without knowing the entire message size beforehand.
Known Issues
Well, since the OpenTelemetry proto files require the `proto3` language syntax, it turned out that not all our supported distros provide a recent enough version of `protoc` that supports `proto3`. Those distros are:

Also, due to the `FindProtobuf` module shipped with CMake versions < `3.31.0` being completely broken, I ended up having to import that very same module from CMake `3.31.0` into our CMake third-party modules directory. This is obviously not ideal, but I didn't find any other way around this issue. Once we bump our minimum required CMake version to `3.31.0`, we can remove this workaround again. The PR is so huge partly because of this workaround.

Testing
Testing this PR is a non-trivial task as it requires some knowledge about OpenTelemetry/Prometheus and setting up a local
collector instance. Here's a brief guide for anyone interested in testing this PR:
First, you need to set up an OpenTelemetry Collector instance. You can use the official Docker image for this purpose. I've included two exporters in the configuration: a standard output exporter for debugging and a Prometheus exporter to scrape the metrics via Prometheus (choose whatever you're comfortable with).
otel-collector-config.yaml
And then start the collector using the following command:
```shell
docker network create otel
docker run --network otel -p 4318:4318 --rm -v $(pwd)/otel-collector-config.yml:/etc/otelcol/config.yaml otel/opentelemetry-collector
```

If you chose the debug exporter instead of Prometheus, you will see the received metrics printed in the container logs once Icinga 2 starts sending them. Otherwise, you have to add a Prometheus instance to scrape the metrics from the collector. For this, you can just use the following config and start Prometheus via Docker as well:
And start Prometheus in the background:
```shell
docker run --network otel --name prometheus -d -p 9090:9090 -v $(pwd)/config.yml:/etc/prometheus/prometheus.yml prom/prometheus
```

Now, the only thing left to do is to build Icinga 2 with this PR applied and configure the
`OTelWriter`. If you're an experienced Icinga 2 user who knows how to manually build Icinga 2 Docker images, you can do your own thing and skip the following steps. For everyone else, here's a quick guide:

```shell
git clone git@github.com:Icinga/icinga2.git
cd icinga2
git checkout otel
```

Then build the image using the `Containerfile` provided in the just cloned repository:

```shell
docker build --tag icinga/icinga2:otel --file Containerfile .
```

Afterwards, you can use this image (`icinga/icinga2:otel`) to start an Icinga 2 container with the `OTelWriter` configured, just like any other Icinga 2 component.

Having done all of the above (especially if you chose the Prometheus exporter), how do you verify that everything works as expected? Well, you have to do a few more things again :). Here's what I used to render some beautiful graphs in Icinga Web 2.
First, the `icingaweb2-module-perfdatagraphs-prometheus` module developed by @oxzi. Though, in order to make it work with the data sent by the `OTelWriter`, you need to perform some monkey patching.

Monkey Patch
Next, well, you need to install and configure yet another module, the `icingaweb2-module-perfdatagraphs` module, in your Icinga Web 2 instance. Follow the instructions in the repository to get it set up, and use the above cloned module as a backend for this module.

If everything is set up correctly, you should start seeing performance data metrics in Prometheus as well as beautiful
graphs in Icinga Web 2.
Update 19.02: Since the data format emitted by this writer has evolved over time, the above-mentioned module from @oxzi is no longer compatible with the current version of the writer. However, @martialblog has implemented a new module that is compatible with the current version of the writer, and it can be used to render even more nice-looking graphs.
TODO
Missing documentation. (Already added!)

resolves #10439
resolves #9900
resolves #10576