
Add OTLPMetricsWriter #10685

Open
yhabteab wants to merge 9 commits into master from otel

Conversation

@yhabteab
Member

@yhabteab yhabteab commented Jan 15, 2026

This PR introduces the long-awaited OTelWriter, a new Icinga 2 component that enables seamless integration with OpenTelemetry. I'm new to OpenTelemetry, so bear with me if you spot any obvious mistakes ;). I would highly appreciate feedback from OpenTelemetry experts (cc @martialblog and all the other users who reacted to the referenced issue).

First and foremost, this might surprise some of you, but this PR does not make use of the existing OpenTelemetry C++ SDK.
The reason for this is twofold:

  1. The OpenTelemetry C++ SDK is huge and complex, and none of the Icinga 2 developers (including myself) have experience with it. Nonetheless, I gave it a try, but gave up after a week of struggling to even get a simple example working. Also, when it comes to debugging Icinga 2 issues related to OpenTelemetry, it would be extremely hard, if not impossible, to help our users if they run into problems with our OpenTelemetry integration. Furthermore, the SDK ABI version on my Mac (installed via Homebrew) is 1, which lacks many newer features and improvements that are only available in version 2. I also haven't verified whether the SDK is even available on all platforms we support, but from my experience with this PR, I doubt that it is.
  2. The default HTTP OpenTelemetry protocol (OTLP) implementation is based on curl, and I found it annoying to be greeted by mysterious crashes that I had never encountered before. After some research, I found out that this is due to curl's multi-threading behavior clashing with Icinga 2's own multi-threading model. While this can surely be worked around, it's questionable whether the default OTLP client implementation using curl fits our requirements. The SDK does provide a way to inject a custom client implementation, but honestly, I simply failed to get anything done that aligns with Icinga 2's architecture, so I abandoned all hope of using the OpenTelemetry C++ SDK.

Instead, I implemented a tiny OTLP HTTP client based on Boost.Beast that only supports OpenTelemetry metrics. That's right: no traces or logs, just metrics. Of course, it still uses Protocol Buffers for serialization as required by the OTLP specification, but without pulling in the entire OpenTelemetry C++ SDK. Also, since Icinga 2 just transforms the collected performance data into OpenTelemetry metrics (which means there's no way to know ahead of time which metrics with which names/units will be sent), the implementation doesn't provide any advanced aggregation features like the OpenTelemetry SDK does. Instead, it simply creates a single metric stream without any units or aggregation temporality, then appends each produced performance data value, transformed into an OTel Gauge metric data point, to that stream. Here's what the OpenTelemetry Collector debug printout looks like when sending some sample performance data to a local collector instance:

{
  "resourceMetrics": [
    {
      "resource": {
        "attributes": [
          {
            "key": "service.name",
            "value": {
              "stringValue": "Icinga 2"
            }
          },
          {
            "key": "service.instance.id",
            "value": {
              "stringValue": "547bc214-5b76-484e-833d-2de90da1bb74"
            }
          },
          {
            "key": "service.version",
            "value": {
              "stringValue": "v2.15.0-235-gb35b335f2"
            }
          },
          {
            "key": "telemetry.sdk.language",
            "value": {
              "stringValue": "cpp"
            }
          },
          {
            "key": "telemetry.sdk.name",
            "value": {
              "stringValue": "Icinga 2 OTel Integration"
            }
          },
          {
            "key": "telemetry.sdk.version",
            "value": {
              "stringValue": "v2.15.0-235-gb35b335f2"
            }
          },
          {
            "key": "service.namespace",
            "value": {
              "stringValue": "icinga"
            }
          },
          {
            "key": "icinga2.host.name",
            "value": {
              "stringValue": "something"
            }
          },
          {
            "key": "icinga2.command.name",
            "value": {
              "stringValue": "icinga"
            }
          }
        ],
        "entityRefs": [
          {
            "type": "host",
            "idKeys": [
              "icinga2.host.name"
            ]
          }
        ]
      },
      "scopeMetrics": [
        {
          "scope": {
            "name": "icinga2",
            "version": "v2.15.0-235-gb35b335f2"
          },
          "metrics": [
            {
              "name": "state_check.perfdata",
              "gauge": {
                "dataPoints": [
                  {
                    "attributes": [
                      {
                        "key": "label",
                        "value": {
                          "stringValue": "api_num_conn_endpoints"
                        }
                      }
                    ],
                    "startTimeUnixNano": "1768762502097898752",
                    "timeUnixNano": "1768762502101475072",
                    "asDouble": 0
                  },
                  {
                    "attributes": [
                      {
                        "key": "label",
                        "value": {
                          "stringValue": "api_num_endpoints"
                        }
                      }
                    ],
                    "startTimeUnixNano": "1768762502097898752",
                    "timeUnixNano": "1768762502101475072",
                    "asDouble": 0
                  }
                ]
              }
            }
          ],
          "schemaUrl": "https://opentelemetry.io/schemas/1.39.0"
        }
      ],
      "schemaUrl": "https://opentelemetry.io/schemas/1.39.0"
    },
    {
      "resource": {
        "attributes": [
          {
            "key": "service.name",
            "value": {
              "stringValue": "Icinga 2"
            }
          },
          {
            "key": "service.instance.id",
            "value": {
              "stringValue": "2d6c27cd-484d-436d-9542-b70abdaf2f76"
            }
          },
          {
            "key": "service.version",
            "value": {
              "stringValue": "v2.15.0-235-gb35b335f2"
            }
          },
          {
            "key": "telemetry.sdk.language",
            "value": {
              "stringValue": "cpp"
            }
          },
          {
            "key": "telemetry.sdk.name",
            "value": {
              "stringValue": "Icinga 2 OTel Integration"
            }
          },
          {
            "key": "telemetry.sdk.version",
            "value": {
              "stringValue": "v2.15.0-235-gb35b335f2"
            }
          },
          {
            "key": "service.namespace",
            "value": {
              "stringValue": "icinga"
            }
          },
          {
            "key": "icinga2.host.name",
            "value": {
              "stringValue": "something"
            }
          },
          {
            "key": "icinga2.service.name",
            "value": {
              "stringValue": "something-service"
            }
          },
          {
            "key": "icinga2.command.name",
            "value": {
              "stringValue": "icinga"
            }
          }
        ],
        "entityRefs": [
          {
            "type": "service",
            "idKeys": [
              "icinga2.host.name",
              "icinga2.service.name"
            ]
          }
        ]
      },
      "scopeMetrics": [
        {
          "scope": {
            "name": "icinga2",
            "version": "v2.15.0-235-gb35b335f2"
          },
          "metrics": [
            {
              "name": "state_check.perfdata",
              "gauge": {
                "dataPoints": [
                  {
                    "attributes": [
                      {
                        "key": "label",
                        "value": {
                          "stringValue": "api_num_conn_endpoints"
                        }
                      }
                    ],
                    "startTimeUnixNano": "1768762509990163200",
                    "timeUnixNano": "1768762510002787072",
                    "asDouble": 0
                  },
                  {
                    "attributes": [
                      {
                        "key": "label",
                        "value": {
                          "stringValue": "api_num_endpoints"
                        }
                      }
                    ],
                    "startTimeUnixNano": "1768762509990163200",
                    "timeUnixNano": "1768762510002787072",
                    "asDouble": 0
                  }
                ]
              }
            }
          ],
          "schemaUrl": "https://opentelemetry.io/schemas/1.39.0"
        }
      ],
      "schemaUrl": "https://opentelemetry.io/schemas/1.39.0"
    }
  ]
}
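To illustrate that transformation, here's a hypothetical, self-contained C++ sketch (plain structs standing in for the real Protocol Buffers messages, so this is not the PR's actual code) of how each performance data value could be appended as a gauge data point, mirroring the Record() signature from the class diagram further below:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Stand-in for opentelemetry::proto::metrics::v1::NumberDataPoint.
struct DataPoint {
    std::map<std::string, std::string> attributes; // e.g. {"label": "api_num_endpoints"}
    uint64_t startTimeUnixNano = 0;
    uint64_t timeUnixNano = 0;
    double asDouble = 0.0;
};

// Single metric stream without units or aggregation temporality:
// every perfdata value just becomes one more gauge data point.
class Gauge {
public:
    // Appends one data point; returns the number of points recorded so far.
    std::size_t Record(double value, uint64_t start, uint64_t end,
                       std::map<std::string, std::string> attrs) {
        m_Points.push_back({std::move(attrs), start, end, value});
        return m_Points.size();
    }

    bool IsEmpty() const { return m_Points.empty(); }

    const std::vector<DataPoint>& Points() const { return m_Points; }

private:
    std::vector<DataPoint> m_Points;
};
```

The real implementation writes into protobuf messages instead of a vector, but the append-only, no-aggregation shape is the same.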

As already mentioned, this is a first implementation and everything is open for discussion, primarily the following aspects:

  • Eliminating all trivial attributes that don't add any value (these attributes are added only when the enable_send_metadata option is set and include attributes like icinga2.check.state, icinga2.check.latency, etc.). EDIT: There are no such attributes anymore, and the enable_send_metadata option is gone as well.
  • Choosing better attribute names (currently prefixed with icinga2. to avoid collisions) and the overall metric naming (currently just icinga2.perfdata for all performance data points). EDIT: The metrics have been renamed to state_check.perfdata, state_check.threshold.warning etc. as suggested in Add OTLPMetricsWriter #10685 (comment).

The high-level class overview is highlighted in the following Mermaid UML diagram:

---
title: OTel Integration
---
classDiagram
    %% The two entries below are just type alias definitions for better readability.
    note for OTelAttrVal "OTelAttrVal is implemented as a type alias, not a class."
    note for OTelAttrsMap "OTelAttrsMap is implemented as a type alias, not a class."

    class OTelAttrVal {
        <<type alias>>
        +std::variant~bool, int64_t, double, String~
    }

    class OTelAttrsMap {
        <<type alias>>
        +set~pair~String-AttrValue~~
    }
    OTelAttrsMap --o OTelAttrVal : manages

    class Gauge {
        -std::unique_ptr~proto::Gauge~ ProtoGauge

        +Transform(metric: proto::Metric*) void
        +IsEmpty() bool
        +Record(value: double|int64_t, start_time: double, end_time: double, attributes: OTelAttrsMap) std::size_t
    }
    OTelAttrsMap <.. Gauge : uses

    class OTel {
        -proto::ExportMetricsServiceRequest Request
        -std::optional~StreamType~ Stream
        -asio::io_context::strand Strand
        -std::atomic_bool Exporting, Stopped

        +Start() void
        +Stop() void
        +Export(MetricsRequest& request) void
        +IsExporting() bool
        +Stopped() bool

        +ValidateName(name: string_view) bool$
        +IsRetryableExportError(status: beast::http::status) bool$
        +PopulateResourceAttrs(rm: const std::unique_ptr~opentelemetry::proto::metrics::v1::ResourceMetrics~&) void$

        -Connect(yc: boost::asio::yield_context&) void
        -ExportLoop(yc: boost::asio::yield_context&) void
        -Export(yc: boost::asio::yield_context&) void
        -ExportingSet(exporting: bool, notifyAll: bool) void
    }
    AsioProtobufOutputStream <.. OTel : serializes via
    RetryableExportError <.. OTel : uses
    Backoff <.. OTel : uses

    `google::protobuf::io::ZeroCopyOutputStream` <|.. AsioProtobufOutputStream : implements
    class AsioProtobufOutputStream {
        -int64_t Pos
        -int64_t Buffered
        -HttpRequestWriter Writer
        -asio::yield_context& Yield

        +AsioProtobufOutputStream(stream: const StreamType&, info: const OTelConnInfo&, yield: asio::yield_context&)
        +Next(data: void**, size: int**) bool
        +BackUp(count: int) void
        +ByteCount() std::size_t
        -Flush(final: bool) bool
    }

    class RetryableExportError {
        -uint64_t Throttle
        +Throttle() uint64_t
        +what() const char*
    }

    class Backoff {
        +std::chrono::milliseconds MaxBackoff$
        +std::chrono::milliseconds MinBackoff$
        +operator()() std::chrono::milliseconds
    }
    class OTLPMetricsWriter {
        -unordered_set~shared_ptr~Metric~~ Metrics
        -OTel m_Exporter
    }
    Gauge "0..*" o-- "1" OTLPMetricsWriter : produces
    OTelAttrsMap <.. OTLPMetricsWriter : creates
    OTel "1" o-- "1" OTLPMetricsWriter : exports via
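The Backoff helper in the diagram can be pictured as a capped exponential delay generator for retrying failed exports. A hypothetical C++ sketch (not the PR's actual code; the real class may add jitter or other refinements):

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

// Each call returns the next retry delay, doubling from MinBackoff
// until it saturates at MaxBackoff. Matches the diagram's interface:
// static Min/Max bounds plus operator()().
class Backoff {
public:
    static constexpr std::chrono::milliseconds MinBackoff{100};
    static constexpr std::chrono::milliseconds MaxBackoff{30000};

    std::chrono::milliseconds operator()() {
        auto delay = m_Next;
        m_Next = std::min(m_Next * 2, MaxBackoff); // cap the doubling
        return delay;
    }

private:
    std::chrono::milliseconds m_Next = MinBackoff;
};
```

The concrete bounds (100 ms / 30 s) are illustrative assumptions, not values taken from the PR.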

The OTelWriter itself is pretty straightforward and doesn't contain any complex logic. The main OTel-related logic is encapsulated in a new library called otel, which provides an HTTP client that conforms to the OTLP HTTP protocol specification. The OTel class is the one the OTelWriter uses to export metrics to the OpenTelemetry Collector; internally, it uses several helper classes to build the required Protocol Buffers messages as per the OpenTelemetry specification. Unlike the existing metric writers, this client doesn't create a separate HTTP connection for each metric export. Instead, it maintains a persistent connection to the OpenTelemetry Collector and reuses it for subsequent exports until the connection is closed by either side. The Protobuf message is serialized directly into the HTTP connection without any intermediate buffering of the serialized message. This is possible only because the OpenTelemetry Collector supports HTTP/1.1 chunked transfer encoding, which allows sending the message in chunks without knowing the entire message size beforehand.
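The streaming serialization follows protobuf's ZeroCopyOutputStream contract (Next() hands out a writable chunk, BackUp() returns unused bytes). Here's a hypothetical, self-contained C++ sketch of that contract; a std::string stands in for the Boost.Beast HTTP stream, whereas the real AsioProtobufOutputStream flushes each filled chunk as an HTTP/1.1 chunk instead:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

class ChunkedOutputStream {
public:
    explicit ChunkedOutputStream(std::string& sink, std::size_t chunkSize = 8)
        : m_Sink(sink), m_Buffer(chunkSize) {}

    // Hands the serializer a writable buffer; it fills as much as it needs.
    bool Next(void** data, int* size) {
        Flush(); // send the previously filled chunk before reusing the buffer
        *data = m_Buffer.data();
        *size = static_cast<int>(m_Buffer.size());
        m_Filled = m_Buffer.size();
        return true;
    }

    // The serializer didn't use the last `count` bytes of the buffer it got.
    void BackUp(int count) { m_Filled -= static_cast<std::size_t>(count); }

    std::size_t ByteCount() const { return m_Sink.size() + m_Filled; }

    // In the real implementation this writes one HTTP/1.1 chunk to the socket.
    void Flush() {
        m_Sink.append(m_Buffer.data(), m_Filled);
        m_Filled = 0;
    }

private:
    std::string& m_Sink;          // stand-in for the HTTP connection
    std::vector<char> m_Buffer;   // one chunk's worth of scratch space
    std::size_t m_Filled = 0;     // bytes of m_Buffer currently in use
};
```

Since the message is never materialized in one piece, memory usage stays bounded by the chunk size regardless of how many data points an export carries.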

That's it. Overall, this implementation is quite minimalistic and only implements the bare minimum required to send metrics to an OpenTelemetry collector.

Known Issues

Well, since the OpenTelemetry proto files use the proto3 language syntax, it turned out that not all of our supported distros provide a recent enough version of protoc that supports proto3. These distros are:

  • Amazon Linux 2 (will be EOL soon, so not a big deal)
  • Debian 11 (Bullseye) - will be EOL soon as well, but Ubuntu 22.04 LTS has the same issue, so I don't know yet how to deal with this one.
  • And finally, the big one: RHEL 8 and 9 are also affected by this issue, so we will probably end up having to provide our own Protobuf packages for these distros, but that's a topic for another day.

Also, due to the FindProtobuf module shipped with CMake versions < 3.31.0 being completely broken, I ended up importing that very same module from CMake 3.31.0 into our CMake third-party modules directory. This is obviously not ideal, but I didn't find any other way around this issue. Once we bump our minimum required CMake version to 3.31.0, we can remove this workaround again. This workaround is partly why the PR is so huge.

Testing

Testing this PR is a non-trivial task as it requires some knowledge about OpenTelemetry/Prometheus and setting up a local
collector instance. Here's a brief guide for anyone interested in testing this PR:

First, you need to set up an OpenTelemetry Collector instance. You can use the official Docker image for this purpose. I've included two exporters in the configuration: a standard output exporter for debugging and a Prometheus exporter to scrape the metrics via Prometheus (choose whatever you're comfortable with).

otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
#exporters:
#  debug:
#    verbosity: detailed
#service:
#  pipelines:
#    metrics:
#      receivers: [otlp]
#      processors: []
#      exporters: [debug]
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
    resource_to_telemetry_conversion:
      enabled: true
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: []
      exporters: [prometheus]

And then start the collector using the following command:

docker network create otel
docker run --network otel -p 4318:4318 --rm -v $(pwd)/otel-collector-config.yaml:/etc/otelcol/config.yaml otel/opentelemetry-collector

If you chose the debug exporter instead of Prometheus, you will see the received metrics printed in the container logs once Icinga 2 starts sending them. Otherwise, you have to add a Prometheus instance to scrape the metrics from the collector. For this, you can just use the following config and start Prometheus via Docker as well:

global:
  scrape_interval: 1m
scrape_configs:
  - job_name: "icinga2-metrics-scraper"
    static_configs:
      - targets: ["host.docker.internal:8889"] # You might need to adjust this address depending on your OS.
        labels:
          app: "icinga2-metrics-scraper"
    metrics_path: /metrics

And start Prometheus in the background:

docker run --network otel --name prometheus -d -p 9090:9090 -v $(pwd)/config.yml:/etc/prometheus/prometheus.yml prom/prometheus

Now, the only thing left to do is to build Icinga 2 with this PR applied and configure the OTelWriter. If you're an experienced Icinga 2 user who knows how to manually build Icinga 2 Docker images, you can do your own thing and skip the following steps. For everyone else, here's a quick guide:


  • First, clone the Icinga 2 repository and checkout this PR branch:
git clone git@github.com:Icinga/icinga2.git
cd icinga2
git checkout otel
  • Next, build a local Docker image using the Containerfile provided in the just cloned repository:
docker build --tag icinga/icinga2:otel --file Containerfile .

Afterwards, you can use this image (icinga/icinga2:otel) to start an Icinga 2 container with the OTelWriter configured, just like any other Icinga 2 component.
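For reference, a minimal OTelWriter object along the lines of the configuration examples shown later in this thread (the host value is just an example here; adjust it to wherever your collector is reachable):

object OTelWriter "otel" {
  host = "otel-collector"
  port = 4318
}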

Having done all of the above (especially if you chose the Prometheus exporter), how do you verify that everything works as expected? Well, you have to do a few more things :). Here's what I used to render some beautiful graphs in Icinga Web 2.


First, the icingaweb2-module-perfdatagraphs-prometheus module developed by @oxzi. However, in order to make it work with the data sent by the OTelWriter, you need to perform some monkey patching.

Monkey Patch
diff --git a/library/Perfdatagraphsprometheus/ProvidedHook/Perfdatagraphs/PerfdataSource.php b/library/Perfdatagraphsprometheus/ProvidedHook/Perfdatagraphs/PerfdataSource.php
index 0208f86..ea60e2e 100644
--- a/library/Perfdatagraphsprometheus/ProvidedHook/Perfdatagraphs/PerfdataSource.php
+++ b/library/Perfdatagraphsprometheus/ProvidedHook/Perfdatagraphs/PerfdataSource.php
@@ -26,13 +26,13 @@ class PerfdataSource extends PerfdataSourceHook
     {
         // TODO: honor PerfdataRequest's includeMetrics, excludeMetrics
         $promQuery = '{';
-        $promQuery .= '__name__=~"icinga_check_result_perf.*"';
-        $promQuery .= ', host="' . $req->getHostname(). '"';
+        $promQuery .= '__name__="icinga2_perfdata"';
+        $promQuery .= ', icinga2_host="' . $req->getHostname(). '"';
         if ($req->isHostCheck()) {
-            $promQuery .= ', object_type="host"';
+            $promQuery .= ', icinga2_service=""';
         } else {
-            $promQuery .= ', object_type="service"';
-            $promQuery .= ', service="' . $req->getServicename() . '"';
+            //$promQuery .= ', object_type="service"';
+            $promQuery .= ', icinga2_service="' . $req->getServicename() . '"';
         }
         $promQuery .= '}';

@@ -42,7 +42,7 @@ class PerfdataSource extends PerfdataSourceHook
         $client = new Client();
         $promResponse = $client->request(
             'POST',
-            'http://localhost:9090/api/v1/query_range', // TODO: configurable
+            'http://host.docker.internal:9090/api/v1/query_range', // TODO: configurable
             [
                 'form_params' => [
                     'query' => $promQuery,
@@ -75,20 +75,31 @@ class PerfdataSource extends PerfdataSourceHook
         // ]
         $datasets = array();
         foreach ($promResponse['data']['result'] as $result) {
-            $label = $result['metric']['label'];
-            $name = $result['metric']['__name__'];
-
-            if (! array_key_exists($name, $rename)) {
-                throw new Exception('unexpected __name__ ' . $name);
-            }
-            $name = $rename[$name];
+            $metric = $result['metric'];
+            $label = $metric['icinga2_perfdata_label'];
+            $name = 'value';

             if (! array_key_exists($label, $datasets)) {
                 $datasets[$label] = [
-                    'unit' => array_key_exists('unit', $result['metric']) ? $result['metric']['unit'] : '',
+                    'unit' => '',
                     'times' => array(),
                     'vals' => array(),
                 ];
+                if (isset($metric['icinga2_perfdata_unit'])) {
+                    $datasets[$label]['unit'] = $result['metric']['icinga2_perfdata_unit'];
+                }
+                if (isset($metric['icinga2_perfdata_crit'])) {
+                    $datasets[$label]['critical'] = $metric['icinga2_perfdata_crit'];
+                }
+                if (isset($metric['icinga2_perfdata_warn'])) {
+                    $datasets[$label]['warning'] = $metric['icinga2_perfdata_warn'];
+                }
+                if (isset($metric['icinga2_perfdata_min'])) {
+                    $datasets[$label]['min'] = $metric['icinga2_perfdata_min'];
+                }
+                if (isset($metric['icinga2_perfdata_max'])) {
+                    $datasets[$label]['max'] = $metric['icinga2_perfdata_max'];
+                }

                 foreach ($result['values'] as $valuePair) {
                     $datasets[$label]['times'][] = $valuePair[0];

Next, you need to install and configure yet another module, the icingaweb2-module-perfdatagraphs module, in your Icinga Web 2 instance. Follow the instructions in the repository to get it set up and use the above patched module as its backend.

If everything is set up correctly, you should start seeing performance data metrics in Prometheus as well as beautiful
graphs in Icinga Web 2.

(Screenshot, 2026-01-15: performance data graphs rendered in Icinga Web 2)

Update 19.02: Since the data format emitted by this writer has evolved over time, the above-mentioned module from @oxzi is no longer compatible with the current version of the writer. However, @martialblog has implemented a new module that is compatible with the current version, and it can be used to render even nicer-looking graphs.

(Screenshot, 2026-02-19: graphs rendered by the new compatible module)

TODO

  • Missing documentation. Already added!

resolves #10439
resolves #9900
resolves #10576

@cla-bot cla-bot bot added the cla/signed label Jan 15, 2026
@martialblog
Member

Very nice work! I'll have a look at it.

This should also resolve #9900

@martialblog
Member

martialblog commented Jan 16, 2026

Hi,

I spent some time testing the new Writer. Here's some first feedback:

  1. OTLPMetricsWriter

Maybe the name of the writer could be OTLPMetricWriter instead of OTelWriter.
It more clearly describes what it does. Also that gives you room to maybe have a
OTLPLogsWriter in the future.

Just an idea. OTelWriter is also fine.

  2. enable_send_metadata = true

When I set enable_send_metadata = true, the daemon crashes when it wants to send data.
As we discussed, maybe enable_send_metadata is not yet required and we can remove it for the first release.

  3. Resource service.namespace

Maybe the resource attribute service.namespace should not be hard coded to "icinga".

From the OpenTelemetry Conventions:

type: service.namespace
Description: Groups related services that compose a system or application under a common namespace
A string value having a meaning that helps to distinguish a group of services, for example the team name that owns a group of services.

I think the namespace is more akin to a "Kubernetes Namespace".
Meaning users maybe want to set something like "icinga-production" or "icinga staging".

I'm not 100% sure if this is something that should be set via the actual Icinga Service Objects or
once for the Icinga OTLP Writer.

  4. Host and Service resource attributes

Currently the Icinga Host/Service information is set as an attribute, maybe these should be resource attributes.

From the OpenTelemetry Conventions:

A Resource is a representation of the entity producing telemetry as Attributes. For example, You could have a process producing telemetry that is running in a container on Kubernetes, which is associated to a Pod running on a Node that is a VM but also is in a namespace and possibly is part of a Deployment. Resource could have attributes to denote information about the Container, the Pod, the Node, the VM or the Deployment.

There are Host and Service semantic conventions for this.

I think it's ok if they are "namespaced" like this "icinga2.host.name" and "icinga2.service.name".

  5. icinga2_perfdata Metric Name

Maybe we want a more generic metric name than "icinga2_perfdata".
Since the source (Icinga) can be determined from the (resource) attributes.

icinga2_perfdata{icinga2_check_command="procs", icinga2_host="674c37a9881b", icinga2_perfdata_label="procs", icinga2_service="procs", instance="44a9cbd1-cbf1-4728-a352-b32727382928", job="icinga/icinga2", service_instance_id="44a9cbd1-cbf1-4728-a352-b32727382928", service_name="icinga2", service_namespace="icinga", service_version="v2.15.0-235-gb35b335f2"}

For example there are proposals for a health_check.status and a health_check.threshold metric. Personally I think "state_check" is a good namespace to start with. Then we can have "state_check.perfdata", "state_check.threshold", "state_check.min", and so on.

See also open-telemetry/semantic-conventions#1106

  6. Thresholds should be metrics

When enable_send_thresholds = true is set the thresholds are added as attributes.

icinga2_perfdata{icinga2_check_command="load", icinga2_host="674c37a9881b", icinga2_perfdata_crit="6", icinga2_perfdata_label="load5", icinga2_perfdata_min="0", icinga2_perfdata_warn="4", icinga2_service="load", instance="f3028916-6b63-4357-97c3-2e281e1e4b2f", job="icinga/icinga2", service_instance_id="f3028916-6b63-4357-97c3-2e281e1e4b2f", service_name="icinga2", service_namespace="icinga", service_version="v2.15.0-235-gb35b335f2"}

This makes it hard to work with them for example when plotting them. They should be encoded as metrics with the same attributes as the perfdata metric. For example:

state_check.perfdata{
  service.name=icinga
  icinga2.host.name=node1
  icinga2.check.name=checkload
  icinga2.service.name=load} 3

state_check.threshold{
  service.name=icinga
  icinga2.threshold=warning
  icinga2.host.name=node1
  icinga2.check.name=checkload
  icinga2.service.name=load} 10

state_check.threshold{
  service.name=icinga
  icinga2.threshold=critical
  icinga2.host.name=node1
  icinga2.check.name=checkload
  icinga2.service.name=load} 20
  7. Prometheus OTLP

When I send the data to the Prometheus OTLP receiver, the Icinga 2 daemon logs some warnings, I think due to the response headers:

[2026-01-16 10:27:19 +0000] information/OTelExporter: Connecting to OpenTelemetry collector on host 'otel-collector:4318'.
[2026-01-16 10:27:19 +0000] information/OTelWriter: 'prometheus' resumed.
[2026-01-16 10:27:19 +0000] information/OTelExporter: Connecting to OpenTelemetry collector on host 'prometheus:9090'.
[2026-01-16 10:27:19 +0000] information/OTelExporter: Successfully connected to OpenTelemetry collector.
[2026-01-16 10:27:19 +0000] information/OTelExporter: Successfully connected to OpenTelemetry collector.
[2026-01-16 10:27:19 +0000] information/CheckerComponent: 'checker' started.
[2026-01-16 10:27:19 +0000] information/ConfigItem: Activated all objects.
[2026-01-16 10:27:29 +0000] information/WorkQueue: #6 (OTelWriter, otel) items: 0, rate: 0.116667/s (7/min 7/5min 7/15min);
[2026-01-16 10:27:29 +0000] information/WorkQueue: #7 (OTelWriter, prometheus) items: 0, rate: 0.116667/s (7/min 7/5min 7/15min);
[2026-01-16 10:27:34 +0000] warning/OTelExporter: Unexpected Content-Type from OpenTelemetry collector:  (OK).
[2026-01-16 10:27:49 +0000] warning/OTelExporter: Unexpected Content-Type from OpenTelemetry collector:  (OK).
[2026-01-16 10:28:04 +0000] warning/OTelExporter: Unexpected Content-Type from OpenTelemetry collector:  (OK).
[2026-01-16 10:28:19 +0000] warning/OTelExporter: Unexpected Content-Type from OpenTelemetry collector:  (OK).
Here is my test setup
---
services:
  icinga:
    image: localhost/icinga/icinga2:otel
    entrypoint: sleep infinity
    user: root
  prometheus:
    image: docker.io/prom/prometheus
    privileged: true
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--web.enable-otlp-receiver'
    ports:
      - '9090:9090'
  otel-collector:
    image: docker.io/otel/opentelemetry-collector
    volumes:
      - ./otel.yml:/etc/otelcol/config.yaml
      - ./metrics.json:/app/metrics.json
    ports:
      - 4318:4318
  grafana:
    image: docker.io/grafana/grafana
    ports:
      - '3000:3000'
OTel Collector
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
exporters:
  debug:
    verbosity: detailed
  file:
    path: /app/metrics.json
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: []
      exporters: [debug,file]
prometheus.yml
global:
  scrape_interval: 5s
otlp:
  promote_resource_attributes:
    - service.instance.id
    - service.name
    - service.namespace
    - service.version

@martialblog
Member

martialblog commented Jan 16, 2026

Also tested it with Grafana Mimir. Works great

Grafana Mimir
object OTelWriter "prometheus" {
  host = "prometheus"
  port = 9090
  metrics_endpoint = "/api/v1/otlp/v1/metrics"
}

object OTelWriter "mimir" {
  host = "mimir"
  port = 8080
  metrics_endpoint = "/otlp/v1/metrics"
}
services:
  mimir:
    image: docker.io/grafana/mimir:latest
    command: ["-config.file=/etc/mimir.yaml"]
    ports:
      - '8080:8080'
    volumes:
      - ./mimir.yaml:/etc/mimir.yaml

cat mimir.yaml 
---
multitenancy_enabled: false

blocks_storage:
  backend: filesystem
  bucket_store:
    sync_dir: /tmp/mimir/tsdb-sync
  filesystem:
    dir: /tmp/mimir/data/tsdb
  tsdb:
    dir: /tmp/mimir/tsdb

compactor:
  data_dir: /tmp/mimir/compactor
  sharding_ring:
    kvstore:
      store: memberlist

distributor:
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: memberlist

ingester:
  ring:
    instance_addr: 0.0.0.0
    kvstore:
      store: memberlist
    replication_factor: 1

ruler_storage:
  backend: filesystem
  filesystem:
    dir: /tmp/mimir/rules

server:
  http_listen_port: 8080
  log_level: error

store_gateway:
  sharding_ring:
    replication_factor: 1

@martialblog
Member

Another small note, I think icinga2_perfdata_unit can simply be "unit":

icinga2_perfdata{icinga2_check_command="disk", icinga2_host="938b0ea145df", icinga2_perfdata_crit="903362170060", icinga2_perfdata_label="/data", icinga2_perfdata_max="1003735744512", icinga2_perfdata_min="0", icinga2_perfdata_unit="bytes", icinga2_perfdata_warn="802988595609", icinga2_service="disk", instance="28e0fd2c-ddbc-48e3-8e6f-d158943bdc1d", job="icinga/icinga2", service_instance_id="28e0fd2c-ddbc-48e3-8e6f-d158943bdc1d", service_name="icinga2", service_namespace="icinga", service_version="v2.15.0-235-gb35b335f2"}

@yhabteab
Copy link
Member Author

Thanks for the feedback and compose file you provided!

  1. OTLPMetricsWriter

I will bring this into our weekly meetings next week for discussion.

  1. enable_send_metadata = true

Aye, that was an oversight on my part. Though, I've dropped it completely, so there shouldn't be any daemon crashes anymore :).

  1. Resource service.namespace

Maybe the resource attribute service.namespace should not be hard coded to "icinga".
I'm not 100% sure if this is something that should be set via the actual Icinga Service Objects or once for the Icinga OTLP Writer.

We must look into this from the Icinga 2 side and not from an OTel perspective. There is no way we're going to introduce
a new attribute to all host and service objects just for this purpose. So, the alternative is to at least make it configurable
so that users can set it on a per-instance basis. That way, each OTelWriter instance can have its own namespace.

  1. Host and Service resource attributes

Currently the Icinga Host/Service information is set as an attribute, maybe these should be resource attributes.

Ack! Will change that.

  1. icinga2_perfdata Metric Name

For example there are proposals for a health_check.status and a health_check.threshold metric. Personally I think "state_check" is a good namespace to start with. Then we can have "state_check.perfdata", "state_check.threshold", "state_check.min", and so on.

That makes sense. I'll update the metric names accordingly, especially since there's a proposal for this.

  1. Thresholds should be metrics

This makes it hard to work with them for example when plotting them. They should be encoded as metrics with the same attributes as the perfdata metric.

Good point! I'm going to transform all thresholds into separate metric streams then, i.e. state_check.threshold.crit, state_check.threshold.warn, etc.
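That plan can be sketched with plain structs (the state_check.* names are taken from the proposal above; this is not the OTLP protobuf API, just an illustration of the fan-out):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hedged sketch: each threshold becomes its own metric stream, e.g.
// state_check.threshold.warn, carrying the same identifying attributes
// as the perfdata value itself, so they stay joinable in the backend.
struct DataPoint {
	std::map<std::string, std::string> attrs;
	double value;
};

struct MetricStream {
	std::string name; // e.g. "state_check.perfdata" or "state_check.threshold.crit"
	std::vector<DataPoint> points;
};

// Splits one perfdata entry (value + warn/crit) into separate streams.
std::vector<MetricStream> SplitPerfdata(
	const std::map<std::string, std::string>& attrs,
	double value, double warn, double crit)
{
	return {
		{"state_check.perfdata", {{attrs, value}}},
		{"state_check.threshold.warn", {{attrs, warn}}},
		{"state_check.threshold.crit", {{attrs, crit}}},
	};
}
```

Because every stream repeats the same attribute set, plotting a value against its thresholds only requires matching on those shared attributes.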

  1. Prometheus OTLP

I didn't know about this, so thanks for testing it out! I'll fix it.

@yhabteab
Copy link
Member Author

I've addressed all your feedback apart from the OTelWriter naming thing. I've also updated the PR description and included a JSON example that showcases how the metrics would look in an OTel collector. Please have another look when you get the chance. Thanks!

@yhabteab
Copy link
Member Author

Sorry. I had to fix one issue introduced with my last push (thanks @martialblog for testing!). Apparently, namespace is a reserved keyword in the Icinga 2 DSL, so it can't be used as an attribute name.

@Al2Klimov
Copy link
Member

Hold my beer. 😉

Icinga 2 (version: v2.15.0-232-g6701edf6e)
Type $help to view available commands.
<1> => {namespace = 1}
                  ^
syntax error, unexpected = (T_SET)
<2> => {"namespace" = 1}
{
	@namespace = 1.000000
}
<3> => {@namespace = 1}
{
	@namespace = 1.000000
}
<4> => { {{{namespace}}} = 1 }
{
	@namespace = 1.000000
}
<5> =>

@yhabteab
Copy link
Member Author

Thanks! I'm aware that DSL users can escape keywords but there's no point in using a reserved word as an attribute for a built-in config object.

@martialblog
Copy link
Member

I encountered a strange issue with the new code 71028d3a297844ce855052b72672b618d5179669.

The daemon did tell me it was flushing data, but then never did.

[2026-01-19 14:10:54 +0000] information/OTelWriter: Flushing OTel metrics to OpenTelemetry collector (timer expired).

@yhabteab and I did some debugging and isolated this area. When replacing the ASSERT with VERIFY it seems to work again. Yonas has the details.

 void OTel::Export(boost::asio::yield_context& yc)
 {
        AsioProtobufOutStream outputS{*m_Stream, m_ConnInfo, yc};
-       ASSERT(m_Request->SerializeToZeroCopyStream(&outputS));
+      VERIFY(m_Request->SerializeToZeroCopyStream(&outputS));

@yhabteab
Copy link
Member Author

yhabteab commented Jan 19, 2026

@yhabteab and I did some debugging and isolated this area. When replacing the ASSERT with VERIFY it seems to work again. Yonas has the details.

Thanks for your help! Apparently, the expression passed to assert() must not have side effects, since it's compiled out entirely in release builds; I was not aware of that (thanks @jschmidt-icinga for confirming this). I was always building my local images in debug mode, so the side effect of SerializeToZeroCopyStream() was always executed, but @martialblog was using release builds, where the ASSERT() with the function call in it was optimized away, leading to very confusing behavior.

	AsioProtobufOutStream outputS{*m_Stream, m_ConnInfo, yc};
	ASSERT(m_Request->SerializeToZeroCopyStream(&outputS));

C++ code after the preprocessor has run (clang++ -I=... -E otel.cpp > otel.tmp.cpp):

 AsioProtobufOutStream outputS{*m_Stream, m_ConnInfo, yc};

 ((void)0);

On the other hand, when using VERIFY nothing is optimized away:

 AsioProtobufOutStream outputS{*m_Stream, m_ConnInfo, yc};

 ((m_Request->SerializeToZeroCopyStream(&outputS)) ? void(0) : icinga_assert_fail("m_Request->SerializeToZeroCopyStream(&outputS)", "otel.cpp", 340));

I don't know what I was thinking when I used ASSERT this way 🤦🏻‍♂️!

#ifndef I2_DEBUG
# define ASSERT(expr) ((void)0)

I've fixed it now and it should behave normally.
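As a minimal, self-contained illustration of the pitfall (the release-mode ASSERT below mirrors the macro quoted above; Serialize() is just a stand-in for SerializeToZeroCopyStream()):

```cpp
#include <cassert>
#include <cstdlib>

// Mimics Icinga 2's ASSERT, which expands to ((void)0) when I2_DEBUG is not
// defined, silently discarding whatever expression was passed to it.
#ifndef I2_DEBUG
#	define ASSERT(expr) ((void)0)
#else
#	define ASSERT(expr) ((expr) ? void(0) : std::abort())
#endif

static int callCount = 0;

// Stand-in for SerializeToZeroCopyStream(): the call itself is the side effect.
bool Serialize()
{
	++callCount;
	return true;
}

void Export()
{
	// In a release build (no I2_DEBUG) this whole line compiles to ((void)0),
	// so Serialize() is never called and nothing is ever written.
	ASSERT(Serialize());
}
```

Compiled without I2_DEBUG, Export() leaves callCount at 0, which is exactly the "flushing but never sending" symptom observed above.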

@martialblog
Copy link
Member

Did some testing with OpenSearch Data-Prepper, I did manage to send data successfully to OpenSearch like this:

object OTelWriter "data-prepper" {
  host = "data-prepper"
  port = 21893
  metrics_endpoint = "/opentelemetry.proto.collector.metrics.v1.MetricsService/Export"
}

However, I did see some "critical" errors in the Icinga2 logs:

[2026-01-19 14:59:19 +0000] information/OTelExporter: Connecting to OpenTelemetry collector on host 'data-prepper:21893'.
[2026-01-19 14:59:19 +0000] information/OTelExporter: Successfully connected to OpenTelemetry collector.
[2026-01-19 15:00:19 +0000] information/WorkQueue: #6 (OTelWriter, data-prepper) items: 1, rate: 0.283333/s (17/min 82/5min 164/15min);
[2026-01-19 15:00:49 +0000] information/ConfigObject: Dumping program state to file '/data/var/lib/icinga2/icinga2.state'
[2026-01-19 15:00:49 +0000] critical/OTelExporter: Error: Error: end of stream [beast.http:1 at /usr/include/boost/beast/http/impl/read.hpp:231 in function 'operator()']
[2026-01-19 15:00:49 +0000] information/OTelExporter: Connecting to OpenTelemetry collector on host 'data-prepper:21893'.
[2026-01-19 15:00:49 +0000] information/OTelExporter: Successfully connected to OpenTelemetry collector.
[2026-01-19 15:00:59 +0000] information/WorkQueue: #6 (OTelWriter, data-prepper) items: 0, rate: 0.266667/s (16/min 80/5min 169/15min);
[2026-01-19 15:01:49 +0000] information/WorkQueue: #6 (OTelWriter, data-prepper) items: 1, rate: 0.25/s (15/min 79/5min 184/15min); empty in 9 seconds
[2026-01-19 15:01:49 +0000] critical/OTelExporter: Error: Error: end of stream [beast.http:1 at /usr/include/boost/beast/http/impl/read.hpp:231 in function 'operator()']
Compose with OpenSearch Data-Prepper
---
version: '3'
services:
  icinga:
    image: localhost/icinga/icinga2
    entrypoint: sleep infinity
    user: root

  data-prepper:
    image: docker.io/opensearchproject/data-prepper
    container_name: data-prepper
    volumes:
      - ./metric_pipeline.yaml:/usr/share/data-prepper/pipelines/metric_pipeline.yaml
      - ./data-prepper-config.yaml:/usr/share/data-prepper/config/data-prepper-config.yaml
    ports:
      - 2021:2021
      - 21891:21891
      - 21893:21893
      - 4900:4900

  opensearch:
    container_name: opensearch
    image: docker.io/opensearchproject/opensearch:3.4.0
    environment:
      - discovery.type=single-node
      - bootstrap.memory_lock=true
      - "OPENSEARCH_JAVA_OPTS=-Xms1024m -Xmx1024m"
      - "OPENSEARCH_INITIAL_ADMIN_PASSWORD=Developer@123"
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    ports:
      - 9200:9200

  dashboards:
    image: docker.io/opensearchproject/opensearch-dashboards:3.1.0
    container_name: opensearch-dashboards
    ports:
      - 5601:5601
    environment:
      OPENSEARCH_HOSTS: '["https://opensearch:9200"]'
# cat metric_pipeline.yaml 
metric-pipeline:
  source:
    otlp:
      unframed_requests: true
      health_check_service: true
      authentication:
        unauthenticated:
      ssl: false
  sink:
    - stdout:
    - opensearch:
        hosts: [ "https://opensearch:9200" ]
        insecure: true
        username: admin
        password: Developer@123
        index: otel_metrics

# cat data-prepper-config.yaml 
ssl: false

@yhabteab
Copy link
Member Author

However, I did see some "critical" errors in the Icinga2 logs:

I don't know how OpenSearch behaves and whether its OTLP receiver fully conforms to the OTLP spec, but that looks like OpenSearch is closing the connection after some time (no persistent HTTP connection support?). I'll try to go through their docs and see if I can find something about that.

@yhabteab
Copy link
Member Author

yhabteab commented Jan 19, 2026

However, I did see some "critical" errors in the Icinga2 logs:

I don't know how OpenSearch behaves and whether its OTLP receiver fully conforms to the OTLP spec, but that looks like OpenSearch is closing the connection after some time (no persistent HTTP connection support?). I'll try to go through their docs and see if I can find something about that.

I can't find anything about persistent connections in the Data Prepper1 docs so far, but the OTel spec2 clearly says:

The client SHOULD keep the connection alive between requests.

However, OpenSearch doesn't seem to honor that, so it closes the connection after each request; I guess it's because the sentence is phrased as SHOULD and not MUST. Nonetheless, I will try to detect such cases and degrade from critical to some other log severity instead.

$ netstat -ant | grep 21893
tcp4       0      0  127.0.0.1.21893        127.0.0.1.51712        FIN_WAIT_2 # OpenSearch closed the connection and is waiting for the remote peer to close it too.
tcp4       0      0  127.0.0.1.51712        127.0.0.1.21893        CLOSE_WAIT # OpenSearch has closed the conn but Icinga 2 hasn't closed it yet.
tcp4       0      0  *.21893                *.*                    LISTEN
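A sketch of the planned downgrade, with purely illustrative names (not Icinga 2's actual API): a collector that ignores the "SHOULD keep the connection alive" recommendation closes the socket between requests, which surfaces as Beast's end_of_stream error on the next read, and only that case is treated as routine:

```cpp
#include <cassert>
#include <string>

enum class LogSeverity { Debug, Critical };

// Hypothetical helper: map a transport error (identified here by name, for
// simplicity) to a log severity. A peer closing an idle keep-alive connection
// ("end_of_stream") is expected behavior, so it's logged at debug level and
// the writer just reconnects; anything else remains a critical failure.
LogSeverity ClassifyExportError(const std::string& beastErrorName)
{
	if (beastErrorName == "end_of_stream") {
		return LogSeverity::Debug;
	}
	return LogSeverity::Critical;
}
```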

Footnotes

  1. https://docs.opensearch.org/latest/data-prepper/pipelines/configuration/sources/otel-metrics-source/#metrics

  2. https://opentelemetry.io/docs/specs/otlp/#otlphttp-connection

@martialblog
Copy link
Member

Yeah that makes sense, if the client only SHOULD keep the connection alive then a less severe log level is alright.

@yhabteab
Copy link
Member Author

I've fixed the critical logs shown in #10685 (comment) by degrading that specific http::end_of_stream error to a debug log.

@yhabteab yhabteab added the area/opentelemetry Metrics to OpenTelemetry. label Jan 20, 2026
@yhabteab yhabteab added this to the 2.16.0 milestone Jan 20, 2026
@yhabteab yhabteab changed the title Add OTelWriter Add OTLPMetricsWriter Jan 21, 2026
@yhabteab
Copy link
Member Author

The newly pushed commits include the following changes:

  • I've renamed the writer to OTLPMetricsWriter, as suggested by @martialblog in his first comment, to better reflect its purpose. The feature can now be enabled under the name otlpmetrics.
  • Instead of using a randomly generated UUID for the service.instance.id attribute (that changes on every restart), I've switched to a SHA1 hash composed of the checkable name and service namespace. This ensures uniqueness while maintaining consistency across restarts, as per OTel specifications.
  • I've added the missing docs for the writer.
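The stability property of the new service.instance.id can be sketched like this (std::hash stands in for Icinga 2's actual SHA1(PackObject(...)); std::hash output is implementation-defined, so this only illustrates that the ID is a pure function of its inputs rather than a random UUID):

```cpp
#include <cassert>
#include <functional>
#include <sstream>
#include <string>

// Sketch of the idea only: derive a deterministic instance ID from the
// checkable name and the configured service namespace, so the same checkable
// keeps the same ID across daemon restarts.
std::string MakeInstanceId(const std::string& checkableName, const std::string& serviceNamespace)
{
	// Length-prefix both parts so ("ab","c") and ("a","bc") can't collide.
	std::ostringstream packed;
	packed << checkableName.size() << ':' << checkableName
	       << '|' << serviceNamespace.size() << ':' << serviceNamespace;
	return std::to_string(std::hash<std::string>{}(packed.str()));
}
```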

@yhabteab
Copy link
Member Author

Since I had to rebase this, force push was unavoidable, so while force-pushing anyway, I've cleaned up the commits a bit.

Copy link
Contributor

@jschmidt-icinga jschmidt-icinga left a comment


Looks good to me.

Also gave it a quick test in my docker environment targeting Elasticsearch 9.3.1 directly and it worked as expected. 👍 Great feature.

@yhabteab
Copy link
Member Author

yhabteab commented Mar 10, 2026

Just rebased and squashed some commits; it seems I've changed nothing in the code, otherwise GitHub would've dismissed the approval.

@yhabteab yhabteab requested a review from julianbrost March 12, 2026 08:33
yhabteab and others added 8 commits March 16, 2026 11:16
This module is copied from CMake's official module repository[^1] and
contains only minor changes as outlined below.

```diff
--- a/third-party/cmake/protobuf/FindProtobuf.cmake
+++ b/third-party/cmake/protobuf/FindProtobuf.cmake
@@ -218,9 +218,6 @@ Example:
         GENERATE_EXTENSIONS .grpc.pb.h .grpc.pb.cc)
 #]=======================================================================]

-cmake_policy(PUSH)
-cmake_policy(SET CMP0159 NEW) # file(STRINGS) with REGEX updates CMAKE_MATCH_<n>
-
 function(protobuf_generate)
        set(_options APPEND_PATH DESCRIPTORS)
        set(_singleargs LANGUAGE OUT_VAR EXPORT_MACRO PROTOC_OUT_DIR PLUGIN PLUGIN_OPTIONS DEPENDENCIES)
@@ -503,7 +500,7 @@ if( Protobuf_USE_STATIC_LIBS )
        endif()
 endif()

-include(${CMAKE_CURRENT_LIST_DIR}/SelectLibraryConfigurations.cmake)
+include(SelectLibraryConfigurations)

 # Internal function: search for normal library as well as a debug one
 #    if the debug one is specified also include debug/optimized keywords
@@ -768,7 +765,7 @@ if(Protobuf_INCLUDE_DIR)
        endif()
 endif()

-include(${CMAKE_CURRENT_LIST_DIR}/FindPackageHandleStandardArgs.cmake)
+include(FindPackageHandleStandardArgs)
 FIND_PACKAGE_HANDLE_STANDARD_ARGS(Protobuf
        REQUIRED_VARS Protobuf_LIBRARIES Protobuf_INCLUDE_DIR
        VERSION_VAR Protobuf_VERSION
@@ -805,5 +802,3 @@ foreach(Camel
        string(TOUPPER ${Camel} UPPER)
        set(${UPPER} ${${Camel}})
 endforeach()
-
-cmake_policy(POP)
```

[^1]: https://github.com/Kitware/CMake/blob/v3.31.0/Modules/FindProtobuf.cmake
@yhabteab
Copy link
Member Author

Made two minor changes:

  • I've previously called a wrong parent method in OTLPMetricsWriter::ValidateServiceResourceAttributes, probably due to a simple copy-paste error.
  • After we've had a lot of discussions about when to move and when not to move in a function, I changed the signature of the public OTel::Export method to Export(std::unique_ptr<MetricsRequest>&& request), so that it's clear the unique pointer will be gone after calling that method, even if that means an extra std::move at the call site.

@yhabteab
Copy link
Member Author

Oh, and I don't know why GitHub is rendering the GHAs like that but they're definitely not cancelled but running in the background https://github.com/Icinga/icinga2/pull/10685/checks.

Copy link
Member

@julianbrost julianbrost left a comment


I have a bit of trouble mapping Icinga 2 concepts to OpenTelemetry concepts. I'm not yet sure if that's just confusion from some parts of the implementation or if that might actually be useful to have in the documentation, that is, how Icinga 2 Host and Service objects map to OpenTelemetry {Resource,Entity,Service,...}.

In particular, my confusion originates from how service.* attributes are set, some (service.name and service.version) are set as if OpenTelemetry Service refers to Icinga 2 (either the node writing, or even the cluster as a whole) whereas others (service.instance.id) are set as if OpenTelemetry Service refers to an Icinga 2 checkable (so Icinga 2 Host or Service object):

icinga2/lib/otel/otel.cpp

Lines 191 to 198 in 5c6a70d

auto* attr = resource->add_attributes();
SetAttribute(*attr, "service.name"sv, "Icinga 2"sv);
attr = resource->add_attributes();
SetAttribute(*attr, "service.instance.id"sv, instanceID);
attr = resource->add_attributes();
SetAttribute(*attr, "service.version"sv, Application::GetAppVersion());

if (metricsForObj.ServiceInstanceId.IsEmpty()) {
// Use instance ID composed of checkable name and service namespace to ensure uniqueness as per OTel specs.
// See https://opentelemetry.io/docs/specs/semconv/resource/service/#service-instance.
Array::Ptr data = new Array{{checkable->GetName(), GetServiceNamespace()}};
metricsForObj.ServiceInstanceId = SHA1(PackObject(data));
}
metricsForObj.ResourceMetrics = std::make_unique<opentelemetry::proto::metrics::v1::ResourceMetrics>();
metricsForObj.ResourceMetrics->add_scope_metrics(); // Pre-create ScopeMetrics entry.
OTel::PopulateResourceAttrs(metricsForObj.ResourceMetrics, metricsForObj.ServiceInstanceId);

Unfortunately, I didn't find the OpenTelemetry documentation particularly helpful here. I've read several parts of their documentation, and went trough the Protobuf definitions, and I still don't have a picture clear enough so I could say what Icinga 2 concept should map to what. @martialblog Could you perhaps help shed some light on the matter?

Apart from that, see below for some other small findings in the code.

# You can enable TLS encryption by uncommenting and configuring the following options.
# By default, the OTel writer uses unencrypted connections (plain HTTP requests).
// enable_tls = false
// tl_insecure_noverify = false
Copy link
Member


Suggested change
// tl_insecure_noverify = false
// tls_insecure_noverify = false

Comment on lines +564 to +568
for (auto it{attrs.begin()}; it != attrs.end(); /* NOPE */) {
auto* attr = dataPoint->add_attributes();
auto node = attrs.extract(it++);
SetAttribute(*attr, node.key(), node.mapped());
}
Copy link
Member


The complexity of this loop suggests that you wanted to move here (otherwise you could have used a simple for (auto& x : attrs) loop), but I don't think you're actually moving here.

Suggested change
for (auto it{attrs.begin()}; it != attrs.end(); /* NOPE */) {
auto* attr = dataPoint->add_attributes();
auto node = attrs.extract(it++);
SetAttribute(*attr, node.key(), node.mapped());
}
for (auto it{attrs.begin()}; it != attrs.end(); /* NOPE */) {
auto* attr = dataPoint->add_attributes();
auto node = attrs.extract(it++);
SetAttribute(*attr, std::move(node.key()), std::move(node.mapped()));
}

Also, "NOPE" is not really more useful than no comment at all. Maybe say that the iterator is advanced inside the loop as the loop body invalidates the iterator. Alternatively, I think this would be more readable when written like this (with no changes to the runtime complexity):

Suggested change
for (auto it{attrs.begin()}; it != attrs.end(); /* NOPE */) {
auto* attr = dataPoint->add_attributes();
auto node = attrs.extract(it++);
SetAttribute(*attr, node.key(), node.mapped());
}
while (!attrs.empty()) {
auto* attr = dataPoint->add_attributes();
auto node = attrs.extract(attrs.begin());
SetAttribute(*attr, std::move(node.key()), std::move(node.mapped()));
}

Copy link
Member Author


The complexity of this loop suggests that you wanted to move here (otherwise you could have used a simple for (auto& x : attrs) loop here, but I don't think you're actually moving here.

My initial implementation of SetAttribute did move internally, but after the discussions we had lately about when to move and when not to, the implementation was changed to only move when the caller explicitly provides an rvalue reference. I then forgot to update this loop to reflect that change, because I do, of course, want to move the attributes here.

Also, "NOPE" is not really more useful than no comment at all. Maybe say that the iterator is advanced inside the loop as the loop body invalidates the iterator. Alternatively, I think this would be more readable when written like this

The body contains literally 3 trivial lines of code, what in those lines is not readable? I'm not defending anything here to keep the for loop as it is, just wonder what exactly is not readable about it. That comment was more like, no I don't increment the iterator here, so see two lines below instead, or do you expect me to write a full sentence explaining why it's not incremented there?

Copy link
Member


The body contains literally 3 trivial lines of code, what in those lines is not readable? I'm not defending anything here to keep the for loop as it is, just wonder what exactly is not readable about it.

To understand that code, you have to jump back and forth more often. for (auto it{attrs.begin()}; it != attrs.end(); /* NOPE */) immediately raises the question why there's no ++it and you have to look at the loop body to understand it. Whereas while (!attrs.empty()) more or less directly says that a loop is following that will clear/consume attrs (or is an endless loop if it's buggy).

or do you expect me to write a full sentence explaining why it's not incremented there?

I think my suggested while loop would need no further comment. For the for loop, I was thinking of something like "loop body increments the iterator as it invalidates the previous one".
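For context on why extract() appears here at all: std::map iterators expose the key as const, so a plain range-for can never move from it, whereas a node handle owns its element and exposes a mutable key. A self-contained sketch (not the actual Icinga 2 code, and collecting into a vector instead of protobuf attributes):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Drains a map by extracting node handles one at a time. node.key() and
// node.mapped() are non-const on a node handle, so both strings can
// genuinely be moved out without copying.
std::vector<std::pair<std::string, std::string>> DrainAttrs(std::map<std::string, std::string> attrs)
{
	std::vector<std::pair<std::string, std::string>> out;
	while (!attrs.empty()) {
		auto node = attrs.extract(attrs.begin());
		out.emplace_back(std::move(node.key()), std::move(node.mapped()));
	}
	return out;
}
```

The while (!attrs.empty()) form also sidesteps the iterator-invalidation question entirely, since each iteration re-fetches attrs.begin().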

Comment on lines +294 to +303
template<typename T>
std::size_t OTLPMetricsWriter::Record(
const Checkable::Ptr& checkable,
const CheckResult::Ptr& cr,
std::string_view metric,
T value,
double startTime,
double endTime,
OTel::AttrsMap attrs
)
Copy link
Member


Am I missing something or is this only used with T = double?

Copy link
Member Author


Yes, it is for now, but we might add more metrics, like the other writers do, that don't depend on the actual perfdata but on the checkable state. Do you see anything wrong with templating it? It's a private method anyway.

@martialblog
Copy link
Member

martialblog commented Mar 19, 2026

Hi @julianbrost

My current understanding is that the service.* resource attributes refer to the running Icinga2 instance.

This follows the OTel service semantic conventions: https://opentelemetry.io/docs/specs/semconv/resource/service/

Service is a logical component of an application that produces telemetry data (events, metrics, spans, etc.).

This is to distinguish different Icinga2 instances that produce data.

The Icinga2 checkables (Host or Service object) then use icinga2.* resource attributes, to a) have a distinct namespace for all Icinga2 attributes and b) distinguish data from each checkable.

The service.instance.id might be a bit confusing, I agree; I think the intention was to represent the Icinga2 checkable that "produces" the data.
See #10685 (comment)

Now that I see it again, that might not be correct:

A service.instance is a distinct instance of a service component, e.g. a specific kubernetes container that is part of a kubernetes deployment which offers a service.

service.name <- Logical name of the service.
service.instance.id <- The string ID of the service instance. 
MUST be unique for each instance of the same service.namespace,service.name

Maybe this maps more to an "Icinga Service" that might have two master "instances"?
And thus all resource attributes for checkables should be under the icinga2.* namespace.

@julianbrost
Copy link
Member

Service is a logical component of an application that produces telemetry data (events, metrics, spans, etc.).

There's the (almost philosophical) question of what produces the metric. Is it Icinga 2 because that's where the metric first becomes an OTel metric? Or is it the actual check command being executed?1 This probably boils down to the question whether Icinga 2 should present itself as a single service with lots of metrics (where the icinga2.host.name, icinga2.service.name and perfdata_label are equal types of keys) or whether it should promote the individual Icinga 2 checkables to something else in the OTel world.

Maybe this maps more to an "Icinga Service" that might have two masters "instances"?
And thus all resource attributes for checkables should be under the icinga2.* namespace.

I'm not sure how well that would play together with HA in Icinga 2. When enable_ha = true is set, only one master will forward the metrics to OTel and if a failover happens, that would result in basically the same metrics then switching over to another service.instance.id. With enable_ha = false set, all metrics would be submitted twice with different service.instance.id. Another consideration could be to populate the instance from the check source, i.e. the Icinga 2 node that executed the check.

Footnotes

  1. In an ideal world, the check command would probably produce OTel metrics directly, and Icinga 2 would just forward them.

Copy link
Member

@Al2Klimov Al2Klimov left a comment


According to @yhabteab, we're currently writing one metric with values and one with thresholds.

@martialblog Is this separation good and if yes, what for?

I'm just asking, and I'm afraid that, depending on the backend, it may be hard or impossible to join separate metrics back together for arithmetic ops. See also #7060.

@martialblog Wouldn't it be clever and smart to instead write just one metric with everything (that the user wants)?

@martialblog
Copy link
Member

@Al2Klimov

The metrics are currently inspired by this proposal here:

Since there is no complete final semantic convention for the kind of data we have here, it's an approximation.

I guess semantically the separation makes sense, since you have two different things 1) the actual value and 2) the thresholds.

Having a single metric would conflate the meaning of both the value and the threshold.
We could compare that to, for example, the memory metrics,
where you have system.memory.usage and system.memory.limit as separate metrics, and not a single metric system.memory with attributes for usage and limit.

https://opentelemetry.io/docs/specs/semconv/system/system-metrics/#memory-metrics

We had some discussion around performance data for Nagios-compatible monitoring here: open-telemetry/semantic-conventions#3148

I don't think "one metric with everything" maps well onto the OTel data model for metrics. But I could be wrong; I don't have a definitive answer to what the best solution is.
I think the separation makes sense, and the resource attributes for service.* need another look, as @julianbrost discovered.

Bit of a stretch, BUT one could probably encode a "one entity with everything" in a log event with different fields: https://opentelemetry.io/docs/specs/otel/logs/data-model/#events

Then you could encode a "performance data point" into one entity AND you could output the thresholds "properly" with ranges instead of a single number (as is currently the case for the other Icinga2 perfdata writers). But this all feels a bit awkward, since most use cases for performance data I've encountered so far have to do with timeseries metrics.

@yhabteab
Copy link
Member Author

This probably boils down to the question whether Icinga 2 should present itself as a single service with lots of metrics (where the icinga2.host.name, icinga2.service.name and perfdata_label are equal types of keys) or whether it should promote the individual Icinga 2 checkables to something else in the OTel world.

I think this distinction isn't even that important. The main point is that Icinga 2 should be able to export metrics in a way that should be easy to distinguish between the individual checkables, since we can't directly map all the checkables to OTel concepts anyway. So as long as we have a clear (and of course non-standardised) way to distinguish our checkables, it shouldn't matter what we populate all the standard service.* keys with. If in doubt, then I would probably go with the first option, since the checkables simply don't have enough information to be mapped to all the required service.* attributes.

When enable_ha = true is set, only one master will forward the metrics to OTel and if a failover happens, that would result in basically the same metrics then switching over to another service.instance.id.

Why would that be a problem? Wouldn't the service.instance.id actually be perfectly fit in that case?

With enable_ha = false set, all metrics would be submitted twice with different service.instance.id.

Is this an Icinga 2 problem? Doesn't this apply to all the other writers too, and even the IDO? If you let multiple Icinga 2 instances write to the same backend in parallel, then it's not going to end well either way, so I don't see how this is a problem of the OTel writer in particular.

@julianbrost
Copy link
Member

When enable_ha = true is set, only one master will forward the metrics to OTel and if a failover happens, that would result in basically the same metrics then switching over to another service.instance.id.

Why would that be a problem? Wouldn't the service.instance.id actually be perfectly fit in that case?

I don't know if this would be a problem. This is basically based on my understanding of Prometheus and how these attributes are mapped to Prometheus. If I understand that correctly, service.instance.id maps to Prometheus' instance and Prometheus' job is set as <service.namespace>/<service.name>. These are attributes I'd expect to stay the same for the same metric. Though on the Prometheus side, there wouldn't be an issue; I could just happily query state_check_perfdata{icinga2_host_name="example.com", icinga2_service_name="load", perfdata_label="load1"} and not care if other labels change.

With enable_ha = false set, all metrics would be submitted twice with different service.instance.id.

Is this an Icinga 2 problem? Doesn't this apply to all the other writers too and even the IDO? If you let multiple Icinga 2 instances write to the same backend in parallel, then it's not going to end well either way, so I don't see how this is a problem of the OTel writer in particular.

I'm just saying that this will make the difference between submitting identical metrics twice and duplicating every metric and submitting it as two different metrics.

Back to my initial question: Would the following description work out in the OTel world and if my wording is correct?

Icinga 2 (as in a whole cluster) presents itself as a service to OTel receivers. Icinga 2 checkables (hosts and services) are mapped to OTel resources. Each resource has multiple metrics that represent the individual perfdata values reported by the check plugin.

That includes the assumption that a resource can have multiple metrics. One of the parts I still didn't fully grasp is what EntityRef and its id_keys are supposed to identify exactly:

// A reference to an Entity.
// Entity represents an object of interest associated with produced telemetry: e.g spans, metrics, profiles, or logs.
//
// Status: [Development]
message EntityRef {
  // The Schema URL, if known. This is the identifier of the Schema that the entity data
  // is recorded in. To learn more about Schema URL see
  // https://opentelemetry.io/docs/specs/otel/schemas/#schema-url
  //
  // This schema_url applies to the data in this message and to the Resource attributes
  // referenced by id_keys and description_keys.
  // TODO: discuss if we are happy with this somewhat complicated definition of what
  // the schema_url applies to.
  //
  // This field obsoletes the schema_url field in ResourceMetrics/ResourceSpans/ResourceLogs.
  string schema_url = 1;

  // Defines the type of the entity. MUST not change during the lifetime of the entity.
  // For example: "service" or "host". This field is required and MUST not be empty
  // for valid entities.
  string type = 2;

  // Attribute Keys that identify the entity.
  // MUST not change during the lifetime of the entity. The Id must contain at least one attribute.
  // These keys MUST exist in the containing {message}.attributes.
  repeated string id_keys = 3;

  // Descriptive (non-identifying) attribute keys of the entity.
  // MAY change over the lifetime of the entity. MAY be empty.
  // These attribute keys are not part of entity's identity.
  // These keys MUST exist in the containing {message}.attributes.
  repeated string description_keys = 4;
}

In the current implementation, id_keys contains icinga2.host.name and icinga2.service.name, but not the perfdata label, so it identifies Icinga 2 checkables:

entity->mutable_id_keys()->Add("icinga2.host.name");
if (service) {
	entity->set_type("service");
	entity->mutable_id_keys()->Add("icinga2.service.name");
As they can have multiple metrics, my description/suggestion from above should work out.

Comment on lines +38 to +40
[config] bool enable_ha {
	default {{{ return false; }}}
};
Quoting from #10685 (comment):

If you let multiple Icinga 2 instance write to the same backend in parallel, then it's not going to end well either way, so I don't see how this is a problem of the OTel writer in particular.

Wouldn't that make true the sane default?

yhabteab commented Mar 19, 2026

Back to my initial question: would the following description work out in the OTel world, and is my wording correct?

Icinga 2 (as in a whole cluster) presents itself as a service to OTel receivers. Icinga 2 checkables (hosts and services) are mapped to OTel resources. Each resource has multiple metrics that represent the individual perfdata values reported by the check plugin.

I don't see any issues with that, but mind you, I'm just as much a newbie as you are when it comes to OTel, so take my answer with a grain of salt.

One of the parts I still didn't fully grasp is what EntityRef and its id_keys are supposed to identify exactly:

Sorry, but if the descriptions given in this document or here don't make it clear, I don't think I can explain it any better. As far as I understand it, EntityRef is a way of telling the OTel system which resource attribute(s) uniquely identify the entity the metric data is about. In the case of Icinga 2, that would be the checkables, i.e. icinga2.host.name and/or icinga2.service.name.

@julianbrost

It's just that the definitions are pretty vague, but that might also be intentional to allow for some flexibility. Thus, to me, my suggestion sounds plausible, but I can't really point to specific parts of the specification to base that argument on.

If we decided to follow my suggestion, what would this imply?

  • service.name: Icinga 2 (not changed)
  • service.namespace: user configuration (default: "icinga") (not changed)
  • service.instance.id: endpoint name (the one running the exporter or the one being the check source)? constant value? or possibly even user-configurable?
  • EntityRef: references an Icinga 2 checkable by icinga2.host.name and - for services only - icinga2.service.name (not changed)

Can anyone who was involved in this PR before weigh in on which of the options (current state of the PR, my suggestions, possibly other suggestions) is the most OTel-like way?

@martialblog

@julianbrost

I think the separation of "the Icinga instance/application/service" and "the Icinga checkables" makes sense.
It also helps for the future, if you, for example, decide to have the "Icinga instance" produce its own metrics/logs/traces.

"Checkables" can then have their attributes in the icinga2.* namespace, since they are specific to Icinga 2. There are similar conventions for application-specific resource attributes in the OTel docs:

And for general attributes:

As for service.instance.id, not sure if this should be user-configurable. Then the user would have to make sure that it "MUST be unique for each instance of the same service.namespace,service.name pair". I think the application should take care of that. What would be your argument for having it user-configurable?

Endpoint name could be an idea. The DSL already makes sure they are unique, right? That would also solve the "MUST be unique for each instance of the same service.namespace,service.name pair" constraint.

Labels

area/opentelemetry (Metrics to OpenTelemetry), cla/signed

Successfully merging this pull request may close these issues:

  • Update elasticsearch writer to write to datastreams
  • OpenTelemetry Writer
  • Prometheus remote writer

5 participants