61 changes: 54 additions & 7 deletions n8n/README.md
@@ -2,15 +2,15 @@

## Overview

This check monitors [n8n][1] through the Datadog Agent.
This check monitors [n8n][1] through the Datadog Agent.

Collect n8n metrics including:
- Cache metrics: Hit and miss statistics.
- Message event bus metrics: Event-related metrics.
- Workflow metrics: Can include workflow ID labels.
- Node metrics: Can include node type labels.
- Credential metrics: Can include credential type labels.
- Queue metrics
- Cache metrics: hit, miss, and update counts.
- Workflow metrics: started, success, and failed counters; audit workflow lifecycle counters; and, in n8n 2.x, an execution-duration histogram.
- Node metrics: per-node started and finished counters emitted by worker processes in queue mode.
- Queue metrics: queue depth, enqueued/dequeued/completed/failed/stalled counters, and scaling-mode worker gauges.
- HTTP metrics: request duration histograms tagged with status code.
- Process and Node.js runtime metrics.


## Setup
@@ -40,13 +40,60 @@ N8N_METRICS_INCLUDE_CACHE_METRICS=true
N8N_METRICS_INCLUDE_MESSAGE_EVENT_BUS_METRICS=true
N8N_METRICS_INCLUDE_WORKFLOW_ID_LABEL=true
N8N_METRICS_INCLUDE_API_ENDPOINTS=true
N8N_METRICS_INCLUDE_QUEUE_METRICS=true

# Optional: n8n 2.x adds workflow_statistics gauges (workflows, users, executions, ...) - opt in
N8N_METRICS_INCLUDE_WORKFLOW_STATISTICS=true

# Optional: Customize the metric prefix (default is 'n8n_')
N8N_METRICS_PREFIX=n8n_
```

For more details, see the n8n documentation on [enabling Prometheus metrics][10].

If you change `N8N_METRICS_PREFIX` from its default of `n8n_`, you **must** also set `raw_metric_prefix` in the integration's `conf.yaml` to the same value. Otherwise the check will not recognize the exposed metric names and will silently submit nothing:

```yaml
instances:
  - openmetrics_endpoint: http://localhost:5678/metrics
    raw_metric_prefix: my_custom_prefix_
```
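To illustrate why the two prefixes have to agree, here is a minimal Python sketch of the name lookup. It is a simplified model based on the comments in `metrics.py`, not the check's actual implementation: the function name `datadog_name` is hypothetical, the map is a two-entry excerpt, and every metric is treated as a counter.

```python
from typing import Optional

# Hypothetical two-entry excerpt; the real map lives in n8n/datadog_checks/n8n/metrics.py.
METRIC_MAP = {"cache_hits": "cache.hits", "workflow_started": "workflow.started"}


def datadog_name(raw_name: str, raw_metric_prefix: str = "n8n_") -> Optional[str]:
    """Model the lookup for a raw counter name; return None when nothing matches."""
    key = raw_name
    # Strip the configured prefix only when the exposed name actually starts with it.
    if key.startswith(raw_metric_prefix):
        key = key[len(raw_metric_prefix):]
    # Counter keys in the map are written without the `_total` suffix.
    if key.endswith("_total"):
        key = key[: -len("_total")]
    mapped = METRIC_MAP.get(key)
    # The `n8n` namespace and the `.count` suffix are added on submission.
    return f"n8n.{mapped}.count" if mapped else None


print(datadog_name("n8n_cache_hits_total"))
print(datadog_name("my_custom_prefix_cache_hits_total"))
print(datadog_name("my_custom_prefix_cache_hits_total", "my_custom_prefix_"))
```

In this model, a custom prefix that is not mirrored in `raw_metric_prefix` leaves the prefix attached to the key, so the lookup misses and nothing is submitted.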

#### Event-driven counters

Some n8n counters are registered dynamically the first time the corresponding event fires. For example, `n8n.workflow.started.count`, `n8n.workflow.success.count`, `n8n.workflow.failed.count`, audit workflow lifecycle counters, and the queue and node event counters do not appear until the corresponding workflow or queue event has occurred. This is expected behavior and is not a sign of a misconfigured integration.
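
When troubleshooting, it can help to confirm which counters n8n has registered so far. The following is a minimal sketch, assuming you have already fetched the text of the `/metrics` endpoint; the helper names and the sample payload (including its `HELP` text) are hypothetical and only illustrate the idea.

```python
from typing import Iterable, List, Set


def exposed_metric_names(exposition_text: str) -> Set[str]:
    """Collect the sample names present in a Prometheus text payload."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Sample lines look like `name{labels} value` or `name value`.
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return names


def not_yet_registered(exposition_text: str, expected: Iterable[str]) -> List[str]:
    """Return the expected raw names that have not appeared in the payload."""
    return sorted(set(expected) - exposed_metric_names(exposition_text))


# Hypothetical payload from an instance where no workflow has failed yet.
SAMPLE = """\
# HELP n8n_workflow_started_total Example help text.
# TYPE n8n_workflow_started_total counter
n8n_workflow_started_total 3
n8n_active_workflow_count 2
"""

print(not_yet_registered(SAMPLE, ["n8n_workflow_started_total", "n8n_workflow_failed_total"]))
```

A name reported by `not_yet_registered` only means the corresponding event has not fired yet on that process; it does not by itself indicate a configuration problem.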

#### Queue mode and workers

In queue mode, n8n runs separate worker processes that execute jobs picked up from a Redis-backed queue. Each worker exposes its own `/metrics` endpoint and emits a different subset of metrics than the main process. Worker-observed metrics include `n8n.queue.job.dequeued.count`, `n8n.queue.job.stalled.count`, `n8n.node.started.count`, `n8n.node.finished.count`, and `n8n.runner.task.requested.count`. Main-only metrics include `n8n.instance.role.leader` and the `n8n.scaling.mode.queue.jobs.*` family.

To expose worker metrics, set `QUEUE_HEALTH_CHECK_ACTIVE=true` and `QUEUE_HEALTH_CHECK_PORT=<port>` on each worker. **In n8n 2.x, port `5679` is reserved for the task runner broker, so pick a different port (for example `5680`).**

For full coverage in queue deployments, configure one Datadog instance per n8n process exposing `/metrics`, including main and worker processes:

```yaml
instances:
  - openmetrics_endpoint: http://n8n-main:5678/metrics
  - openmetrics_endpoint: http://n8n-worker:5680/metrics
```

#### Version-specific metrics

Several metric families were introduced in n8n 2.x and are not emitted on n8n 1.x:

- `n8n.workflow.execution.duration.seconds.*` (histogram)
- `n8n.audit.workflow.activated.count`, `n8n.audit.workflow.deactivated.count`, `n8n.audit.workflow.executed.count`, `n8n.audit.workflow.resumed.count`, `n8n.audit.workflow.version.updated.count`, and `n8n.audit.workflow.waiting.count`
- `n8n.embed.login.requests.count` (tagged with `result:success`/`failure`), `n8n.embed.login.failures.count` (tagged with `reason`)
- `n8n.token.exchange.requests.count` (tagged with `result:success`/`failure`), `n8n.token.exchange.failures.count` (tagged with `reason`), `n8n.token.exchange.identity.linked.count`, `n8n.token.exchange.jit.provisioning.count`
- `n8n.process.pss.bytes` (Linux only)
- The `n8n.{production,manual,production.root}.executions`, `n8n.users.total`, `n8n.enabled.users`, `n8n.workflows.total`, and `n8n.credentials.total` family - only emitted when `N8N_METRICS_INCLUDE_WORKFLOW_STATISTICS=true` is set.

Some metrics only emit samples after the corresponding runtime event occurs. For example, failures-only counters (`*.failures.count`) need an authentication failure, audit workflow counters need the matching workflow state transition, and the libuv `n8n.nodejs.active.requests` gauge needs an in-flight libuv request. A healthy idle deployment may not produce data points for these metrics until that activity occurs.

#### Tag cardinality

When `N8N_METRICS_INCLUDE_WORKFLOW_ID_LABEL=true`, HTTP and workflow execution histograms are tagged with `workflow_id` (and similar labels for nodes). On deployments with many distinct workflows or nodes, this can produce high-cardinality metrics. To keep tag cardinality bounded, drop the label with `exclude_labels` or leave `N8N_METRICS_INCLUDE_WORKFLOW_ID_LABEL` unset.

#### Configure the Datadog Agent

1. Edit the `n8n.d/conf.yaml` file, in the `conf.d/` folder at the root of your Agent's configuration directory, to start collecting your n8n performance data. See the [sample n8n.d/conf.yaml][4] for all available configuration options.
2 changes: 1 addition & 1 deletion n8n/assets/configuration/spec.yaml
@@ -12,7 +12,7 @@ files:
openmetrics_endpoint.required: true
openmetrics_endpoint.hidden: false
openmetrics_endpoint.display_priority: 1
openmetrics_endpoint.value.example: http://localhost:5678
openmetrics_endpoint.value.example: http://localhost:5678/metrics
openmetrics_endpoint.description: |
Endpoint exposing n8n's metrics in the OpenMetrics format. For more information, refer to:
https://docs.n8n.io/hosting/logging-monitoring/monitoring/
15 changes: 15 additions & 0 deletions n8n/changelog.d/23635.added
@@ -0,0 +1,15 @@
Update the n8n metric coverage and test harness, verified live against n8n 1.118.1 and 2.19.5:

- Add missing common event-driven metrics: ``audit.workflow.archived``, ``audit.workflow.created``, ``audit.workflow.deleted``, ``audit.workflow.unarchived``, ``audit.workflow.updated``, and ``queue.job.stalled``.
- Add n8n 2.x workflow duration metrics: ``workflow.execution.duration.seconds.*``.
- Add n8n 2.x audit workflow metrics: ``audit.workflow.activated``, ``audit.workflow.deactivated``, ``audit.workflow.executed``, ``audit.workflow.resumed``, ``audit.workflow.version.updated``, and ``audit.workflow.waiting``.
- Add n8n 2.x embed login metrics: ``embed.login.requests`` and ``embed.login.failures``.
- Add n8n 2.x token exchange metrics: ``token.exchange.requests``, ``token.exchange.failures``, ``token.exchange.identity.linked``, and ``token.exchange.jit.provisioning``.
- Add n8n 2.x process memory metric: ``process.pss.bytes``.
- Add n8n 2.x workflow statistics metrics: ``production.executions``, ``production.root.executions``, ``manual.executions``, ``users.total``, ``enabled.users``, ``workflows.total``, and ``credentials.total``.
- Restore valid metrics that the integration was previously dropping: ``queue.job.dequeued``, ``nodejs.active.requests``.
- Add worker-only families ``node.started``, ``node.finished``, ``queue.job.dequeued``, and ``runner.task.requested`` and document scraping the n8n worker process as a separate Datadog instance.
- Remove the gating of OpenMetrics scraping on ``/healthz/readiness`` - ``n8n.readiness.check`` is still submitted, but metrics keep flowing when readiness reports degraded so SRE-relevant signals (queue depth, process state) are not lost during incidents.
- Document version-specific metric availability and the n8n env flags that gate them (``N8N_METRICS_INCLUDE_WORKFLOW_STATISTICS``, ``N8N_METRICS_INCLUDE_WORKFLOW_EXECUTION_DURATION``, ``N8N_METRICS_INCLUDE_QUEUE_METRICS``).
- Use the actual ``/metrics`` URL in the ``openmetrics_endpoint`` example in ``conf.yaml.example``/``spec.yaml`` (was previously the host root, which silently mismatched the scrape path the check uses).
- Document that ``raw_metric_prefix`` in ``conf.yaml`` must be kept in sync with a customised ``N8N_METRICS_PREFIX`` for the check to recognise the exposed metric names.
66 changes: 30 additions & 36 deletions n8n/datadog_checks/n8n/check.py
@@ -2,58 +2,52 @@
# All rights reserved
# Licensed under a 3-clause BSD style license (see LICENSE)

from urllib.parse import urljoin
from urllib.parse import urljoin, urlparse

from requests.exceptions import RequestException

from datadog_checks.base import OpenMetricsBaseCheckV2
from datadog_checks.n8n.metrics import METRIC_MAP, RENAME_LABELS_MAP

from .config_models import ConfigMixin

DEFAULT_READY_ENDPOINT = '/healthz/readiness'
DEFAULT_READY_PATH = '/healthz/readiness'


class N8nCheck(OpenMetricsBaseCheckV2, ConfigMixin):
    __NAMESPACE__ = 'n8n'
    DEFAULT_METRIC_LIMIT = 0

    def __init__(self, name, init_config, instances=None):
        super(N8nCheck, self).__init__(
            name,
            init_config,
            instances,
        )
        self.openmetrics_endpoint = self.instance["openmetrics_endpoint"]
        self.tags = self.instance.get('tags', [])
        self._ready_endpoint = DEFAULT_READY_ENDPOINT

    def get_default_config(self):
    def get_default_config(self) -> dict:
        return {
            'metrics': [METRIC_MAP],
            'rename_labels': RENAME_LABELS_MAP,
            'raw_metric_prefix': 'n8n_',
        }

    def _check_n8n_readiness(self):
        endpoint = urljoin(self.openmetrics_endpoint, self._ready_endpoint)
        response = self.http.get(endpoint)

        # Determine metric value and status_code tag
        if response.status_code is None:
            self.log.warning("The readiness endpoint did not return a status code")
            metric_value = 0
            metric_tags = self.tags + ['status_code:null']
        elif response.status_code == 200:
            # Ready - submit 1
            metric_value = 1
            metric_tags = self.tags + [f'status_code:{response.status_code}']
        else:
            # Not ready - submit 0
            metric_value = 0
            metric_tags = self.tags + [f'status_code:{response.status_code}']

        # Submit metric with appropriate value and status_code tag
        self.gauge('readiness.check', metric_value, tags=metric_tags)

    def check(self, instance):
        super().check(instance)
    def _readiness_endpoint(self) -> str:
        parsed = urlparse(self.config.openmetrics_endpoint)
        base = f'{parsed.scheme}://{parsed.netloc}'
        return urljoin(base, DEFAULT_READY_PATH)

    def _check_n8n_readiness(self) -> None:
        endpoint = self._readiness_endpoint()
        tags = list(self.config.tags or ())

        try:
            response = self.http.get(endpoint)
        except RequestException as e:
            self.log.warning("Could not reach n8n readiness endpoint %s: %s", endpoint, e)
            self.gauge('readiness.check', 0, tags=tags + ['status_code:none'])
            return

        is_ready = response.status_code == 200
        self.gauge(
            'readiness.check',
            1 if is_ready else 0,
            tags=tags + [f'status_code:{response.status_code}'],
        )

    def check(self, instance: dict) -> None:
        self._check_n8n_readiness()
        super().check(instance)
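
The URL handling in `_readiness_endpoint` can be exercised on its own. The sketch below is a standalone rendition using only the standard library; the free function `readiness_endpoint` and the constant `READY_PATH` are illustrative names, not part of the check.

```python
from urllib.parse import urljoin, urlparse

# Mirrors DEFAULT_READY_PATH in check.py.
READY_PATH = "/healthz/readiness"


def readiness_endpoint(openmetrics_endpoint: str) -> str:
    """Derive the readiness URL from the host root of the metrics endpoint."""
    parsed = urlparse(openmetrics_endpoint)
    # Keep only scheme and host:port, discarding any path or query string.
    base = f"{parsed.scheme}://{parsed.netloc}"
    return urljoin(base, READY_PATH)


print(readiness_endpoint("http://localhost:5678/metrics"))
print(readiness_endpoint("http://localhost:5678/metrics?debug=1"))
```

Rebuilding the base from `scheme` and `netloc` means the result does not depend on whether `openmetrics_endpoint` is configured as the host root or as the full `/metrics` URL.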
2 changes: 1 addition & 1 deletion n8n/datadog_checks/n8n/data/conf.yaml.example
@@ -18,7 +18,7 @@ instances:
## https://docs.n8n.io/hosting/logging-monitoring/monitoring/
## https://docs.n8n.io/hosting/configuration/environment-variables/endpoints/
#
- openmetrics_endpoint: http://localhost:5678
- openmetrics_endpoint: http://localhost:5678/metrics

## @param raw_metric_prefix - string - optional - default: n8n_
## The prefix prepended to all metrics from n8n.
90 changes: 55 additions & 35 deletions n8n/datadog_checks/n8n/metrics.py
@@ -2,36 +2,58 @@
# All rights reserved
# Licensed under a 3-clause BSD style license (see LICENSE)

# Metrics mapping without prefix - use raw_metric_prefix config to strip prefixes like 'n8n_', 'n8n_my_team_', etc.
# Namespace will be applied by the check
# Note: OpenMetrics automatically appends .count to counter metrics, so don't add it here
# Metrics emitted by n8n's /metrics endpoint, verified live against n8n@1.118.1
# and n8n@2.19.5.
#
# The OpenMetrics base check strips `_total` from counter names before lookup
# and appends `.count` on submission, so counter keys here are written without
# the `_total` suffix (e.g. `cache_hits_total` -> key `cache_hits`).
#
# Many counters are dynamically registered from EventBus events (event
# `n8n.<a>.<b>.<c>` becomes counter `<a>_<b>_<c>_total`) and only appear once
# the corresponding event fires at runtime. In queue mode, worker processes
# emit `node_started_total`, `node_finished_total`, `queue_job_dequeued_total`,
# `queue_job_stalled_total`, and `runner_task_requested_total`.
#
# Several families were introduced in n8n 2.x (see the README "Version-specific
# metrics" section). The workflow statistics gauges are only emitted when
# `N8N_METRICS_INCLUDE_WORKFLOW_STATISTICS` is set. The SSO/embed
# token-exchange counters always register but only emit on auth events.
METRIC_MAP = {
'active_workflow_count': 'active.workflow.count',
'api_request_duration_seconds': 'api.request.duration.seconds',
'api_requests': 'api.requests',
'cache_errors': 'cache.errors',
'audit_workflow_activated': 'audit.workflow.activated', # n8n 2.x+
'audit_workflow_archived': 'audit.workflow.archived',
'audit_workflow_created': 'audit.workflow.created',
'audit_workflow_deactivated': 'audit.workflow.deactivated', # n8n 2.x+
'audit_workflow_deleted': 'audit.workflow.deleted',
'audit_workflow_executed': 'audit.workflow.executed', # n8n 2.x+
'audit_workflow_resumed': 'audit.workflow.resumed', # n8n 2.x+
'audit_workflow_unarchived': 'audit.workflow.unarchived',
'audit_workflow_updated': 'audit.workflow.updated',
'audit_workflow_version_updated': 'audit.workflow.version.updated', # n8n 2.x+
'audit_workflow_waiting': 'audit.workflow.waiting', # n8n 2.x+
'cache_hits': 'cache.hits',
'cache_latency_seconds': 'cache.latency.seconds',
'cache_misses': 'cache.misses',
'cache_operations': 'cache.operations',
'eventbus_connections_total': 'eventbus.connections.total',
'eventbus_events_failed': 'eventbus.events.failed',
'eventbus_events_processed': 'eventbus.events.processed',
'eventbus_events': 'eventbus.events',
'eventbus_queue_size': 'eventbus.queue.size',
'cache_updates': 'cache.updates',
'credentials': 'credentials.total', # n8n 2.x+, requires N8N_METRICS_INCLUDE_WORKFLOW_STATISTICS
'embed_login_failures': 'embed.login.failures', # n8n 2.x+
'embed_login_requests': 'embed.login.requests', # n8n 2.x+
'enabled_users': 'enabled.users', # n8n 2.x+, requires N8N_METRICS_INCLUDE_WORKFLOW_STATISTICS
'http_request_duration_seconds': 'http.request.duration.seconds',
'instance_role_leader': 'instance.role.leader',
'last_activity': {
'name': 'last.activity',
'type': 'time_elapsed',
},
'manual_executions': 'manual.executions', # n8n 2.x+, requires N8N_METRICS_INCLUDE_WORKFLOW_STATISTICS
'node_finished': 'node.finished',
'node_started': 'node.started',
'nodejs_active_handles': 'nodejs.active.handles',
'nodejs_active_handles_total': 'nodejs.active.handles.total',
'nodejs_active_requests': 'nodejs.active.requests',
'nodejs_active_requests_total': 'nodejs.active.requests.total',
'nodejs_active_resources': 'nodejs.active.resources',
'nodejs_active_resources_total': 'nodejs.active.resources.total',
'nodejs_event_loop_lag_seconds': 'nodejs.event.loop.lag.seconds',
'nodejs_eventloop_lag_max_seconds': 'nodejs.eventloop.lag.max.seconds',
'nodejs_eventloop_lag_mean_seconds': 'nodejs.eventloop.lag.mean.seconds',
'nodejs_eventloop_lag_min_seconds': 'nodejs.eventloop.lag.min.seconds',
@@ -47,47 +69,45 @@
'nodejs_heap_space_size_available_bytes': 'nodejs.heap.space.size.available.bytes',
'nodejs_heap_space_size_total_bytes': 'nodejs.heap.space.size.total.bytes',
'nodejs_heap_space_size_used_bytes': 'nodejs.heap.space.size.used.bytes',
'nodejs_heap_total_bytes': 'nodejs.heap.total.bytes',
'nodejs_heap_used_bytes': 'nodejs.heap.used.bytes',
'nodejs_version_info': {'type': 'metadata', 'label': 'version', 'name': 'nodejs.version'},
'process_cpu_seconds': 'process.cpu.seconds',
'process_cpu_system_seconds': 'process.cpu.system.seconds',
'process_cpu_user_seconds': 'process.cpu.user.seconds',
'process_heap_bytes': 'process.heap.bytes',
'process_max_fds': 'process.max.fds',
'process_open_fds': 'process.open.fds',
'process_pss_bytes': 'process.pss.bytes', # n8n 2.x+
'process_resident_memory_bytes': 'process.resident.memory.bytes',
'process_start_time_seconds': {
'name': 'process.uptime.seconds',
'type': 'time_elapsed',
},
'process_virtual_memory_bytes': 'process.virtual.memory.bytes',
'queue_job_active_total': 'queue.job.active.total',
'queue_job_attempts': 'queue.job.attempts',
'production_executions': 'production.executions', # n8n 2.x+, requires N8N_METRICS_INCLUDE_WORKFLOW_STATISTICS
'production_root_executions': 'production.root.executions', # n8n 2.x+, requires flag
'queue_job_completed': 'queue.job.completed',
'queue_job_delayed_total': 'queue.job.delayed.total',
'queue_job_dequeued': 'queue.job.dequeued',
'queue_job_enqueued': 'queue.job.enqueued',
'queue_job_failed': 'queue.job.failed',
Comment on lines 88 to 91

P2: Add the stalled queue counter mapping

In n8n 2.x queue mode, a stalled job event is emitted as `n8n.queue.job.stalled`, which the Prometheus service exposes as `n8n_queue_job_stalled_total`. With this map limited to completed/dequeued/enqueued/failed, that counter is silently ignored even when message event bus metrics are enabled. Since this block adds queue-job coverage, include `queue_job_stalled` (and the corresponding metadata) so stalled jobs are collected.


'queue_job_waiting_duration_seconds': 'queue.job.waiting.duration.seconds',
'queue_job_waiting_total': 'queue.job.waiting.total',
'queue_jobs_duration_seconds': 'queue.jobs.duration.seconds',
'queue_jobs': 'queue.jobs',
'workflow_executions_active': 'workflow.executions.active',
'workflow_executions_duration_seconds': 'workflow.executions.duration.seconds',
'workflow_executions': 'workflow.executions',
'queue_job_stalled': 'queue.job.stalled',
'runner_task_requested': 'runner.task.requested',
'scaling_mode_queue_jobs_active': 'scaling.mode.queue.jobs.active',
'scaling_mode_queue_jobs_completed': 'scaling.mode.queue.jobs.completed',
'scaling_mode_queue_jobs_failed': 'scaling.mode.queue.jobs.failed',
'scaling_mode_queue_jobs_waiting': 'scaling.mode.queue.jobs.waiting',
'token_exchange_failures': 'token.exchange.failures', # n8n 2.x+
'token_exchange_identity_linked': 'token.exchange.identity.linked', # n8n 2.x+
'token_exchange_jit_provisioning': 'token.exchange.jit.provisioning', # n8n 2.x+
'token_exchange_requests': 'token.exchange.requests', # n8n 2.x+
'users': 'users.total', # n8n 2.x+, requires N8N_METRICS_INCLUDE_WORKFLOW_STATISTICS
'version_info': {'type': 'metadata', 'label': 'version', 'name': 'version'},
'workflow_execution_duration_seconds': 'workflow.execution.duration.seconds', # n8n 2.x+
'workflow_failed': 'workflow.failed',
'workflow_started': 'workflow.started',
'workflow_success': 'workflow.success',
'process_cpu_seconds': 'process.cpu.seconds',
'version_info': 'version.info',
'nodejs_version_info': 'nodejs.version.info',
'workflows': 'workflows.total', # n8n 2.x+, requires N8N_METRICS_INCLUDE_WORKFLOW_STATISTICS
}

N8N_VERSION = {'version_info': {'type': 'metadata', 'label': 'version', 'name': 'version'}}
NODEJS_VERSION = {'nodejs_version_info': {'type': 'metadata', 'label': 'version', 'name': 'nodejs.version'}}

METRIC_MAP.update(N8N_VERSION)
METRIC_MAP.update(NODEJS_VERSION)

RENAME_LABELS_MAP = {
'name': 'n8n_name',
'namespace': 'n8n_namespace',