[n8n] Fix metric mappings and add full v2 metric coverage#23635
- Drop fabricated metric names that n8n never emitted; map only what is empirically present.
- Add the n8n 2.x metric families: workflow.execution.duration histogram, audit.workflow.*, embed.login.*, token.exchange.*, process.pss.bytes, runner.task.requested, and the workflow_statistics gauges.
- Add worker-only families (node.started, node.finished, queue.job.dequeued, runner.task.requested) by introducing a worker-scrape instance.
- Stop gating the OpenMetrics scrape on /healthz/readiness; emit n8n.readiness.check unconditionally so metrics still flow when the readiness endpoint is unhealthy.
- Replace the custom Dockerfile with a direct n8nio/n8n image reference and parameterise the version via hatch.toml so the test matrix can run against both 1.118.1 and 2.19.5.
- Allocate free host ports via datadog_checks.dev.utils.find_free_ports and forward them through docker_run env_vars to avoid port collisions on re-runs.
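The port-allocation step in the last bullet can be illustrated with a minimal standard-library sketch. This is not the implementation of `datadog_checks.dev.utils.find_free_ports`; the function `allocate_free_ports` and the environment variable names are hypothetical and only show the general idea of asking the OS for unused ports and forwarding them as environment variables.

```python
import socket
from contextlib import ExitStack


def allocate_free_ports(count, host="127.0.0.1"):
    """Return `count` distinct TCP ports that were free when this was called."""
    with ExitStack() as stack:
        ports = []
        for _ in range(count):
            sock = stack.enter_context(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
            sock.bind((host, 0))  # port 0 asks the OS to pick any free port
            ports.append(sock.getsockname()[1])
        # Every socket is still bound here, so the OS cannot return the same port twice.
        return ports


def build_env_vars(main_port, worker_port):
    # Hypothetical variable names, used only for illustration.
    return {"N8N_MAIN_PORT": str(main_port), "N8N_WORKER_PORT": str(worker_port)}


main_port, worker_port = allocate_free_ports(2)
env_vars = build_env_vars(main_port, worker_port)
```

The ports are released before the containers start, so another process could in principle claim one in between; for a local or CI test environment that small window is usually an acceptable trade-off.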
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1be3b3dc6f
    'queue_job_completed': 'queue.job.completed',
    'queue_job_delayed_total': 'queue.job.delayed.total',
    'queue_job_dequeued': 'queue.job.dequeued',
    'queue_job_enqueued': 'queue.job.enqueued',
    'queue_job_failed': 'queue.job.failed',
Add the stalled queue counter mapping
In n8n 2.x queue mode, a stalled job event is emitted as n8n.queue.job.stalled, which the Prometheus service exposes as n8n_queue_job_stalled_total; with this map limited to completed/dequeued/enqueued/failed, that counter is silently ignored even when message event bus metrics are enabled. Since this block adds queue-job coverage, include queue_job_stalled (and corresponding metadata) so stalled jobs are collected.
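A rough sketch of what the suggested change could look like. The five existing entries are copied from the reviewed diff; the Datadog-side name `queue.job.stalled` is an assumption that follows the pattern of the neighbouring entries, and `datadog_name` is a hypothetical lookup helper, not how the OpenMetrics base check resolves names.

```python
# Entries quoted from the reviewed diff.
QUEUE_JOB_METRICS = {
    'queue_job_completed': 'queue.job.completed',
    'queue_job_delayed_total': 'queue.job.delayed.total',
    'queue_job_dequeued': 'queue.job.dequeued',
    'queue_job_enqueued': 'queue.job.enqueued',
    'queue_job_failed': 'queue.job.failed',
}

# Assumption: the stalled counter follows the same dotted naming pattern.
QUEUE_JOB_METRICS['queue_job_stalled'] = 'queue.job.stalled'


def datadog_name(exposed_name, prefix='n8n_', suffix='_total'):
    """Hypothetical helper: map an exposed Prometheus name to its Datadog name, or None."""
    name = exposed_name[len(prefix):] if exposed_name.startswith(prefix) else exposed_name
    if name in QUEUE_JOB_METRICS:  # keys that already carry the suffix
        return QUEUE_JOB_METRICS[name]
    if name.endswith(suffix):  # counters exposed with a trailing _total
        return QUEUE_JOB_METRICS.get(name[:-len(suffix)])
    return None
```

With the extra entry, `n8n_queue_job_stalled_total` resolves to a mapped name instead of falling through as unmapped; the matching metadata row would still need to be added separately.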
A long-running n8n simulation that layers on top of the integration test environment so a real Datadog Agent can ship metrics to a Datadog org for dashboard / monitor iteration.
- tests/lab/workflows/: five lab-only workflow JSONs covering distinct shapes (fast, slow Wait node, always-fail Code, flaky 30%, four-step chain).
- tests/lab/traffic_generator.py: click CLI (start/generate/stop) that runs ddev env start --base, copies + imports + activates the lab workflows, restarts n8n, and drives a configurable async traffic mix against the webhooks and REST API.
- tests/lab/config.yaml: webhook + REST probabilities and tick / reload intervals; hot-reloaded while the generator runs.
- tests/lab/.ddev.toml: pins the lab to an `n8nlab` ddev org.
- tests/lab/run_lab.sh: bash entrypoint with an EXIT trap so Ctrl+C always runs lab:stop.
- hatch.toml: new [envs.lab] env with click/httpx/pyyaml/rich and start/generate/stop scripts.
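The probability-driven traffic mix described above could be sketched roughly as follows. This is not the code in `tests/lab/traffic_generator.py`; the workflow names, the weights, and the `pick_target` helper are hypothetical and only illustrate choosing a target per tick from configured probabilities.

```python
import random

# Hypothetical weights for the five workflow shapes; real values would come from config.yaml.
TRAFFIC_MIX = {
    "fast": 0.5,
    "slow": 0.2,
    "always_fail": 0.1,
    "flaky": 0.1,
    "chain": 0.1,
}


def pick_target(probabilities, rng):
    """Choose one workflow name, weighted by its configured probability."""
    names = list(probabilities)
    weights = [probabilities[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded only so the sketch is repeatable
picks = [pick_target(TRAFFIC_MIX, rng) for _ in range(100)]
```

Reading the weights from a dictionary on every tick makes hot-reloading simple: swapping in a freshly parsed config changes the mix on the next pick without restarting the loop.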
Validation Report: All 20 validations passed.
What does this PR do?
Overhauls the n8n integration's metric mapping, test environment, fixtures, and public documentation so the check matches what n8n actually emits across both tested major versions.
- [...] `workflow_executions_duration_seconds` mapping, removes mappings that n8n does not emit, keeps valid runtime/queue metrics, and adds the missing families verified against n8n 1.118.1 and n8n 2.19.5.
- [...] `workflow_execution_duration_seconds`, `audit_workflow_*`, `embed_login_*`, `token_exchange_*`, `process_pss_bytes`, and the optional `workflow_statistics_*` gauges. The metadata descriptions call out metrics that require n8n 2.x or an opt-in n8n flag.
- [...] `node_started`, `node_finished`, `queue_job_dequeued`, and `runner_task_requested` are covered. Main and worker instances are tagged with `n8n_process:main` and `n8n_process:worker`.
- [...] `n8nio/n8n:${N8N_VERSION}` directly, and host ports are allocated dynamically to avoid CI/local port conflicts.
- [...] `docker_run` setup conditions. This keeps the dynamic port configuration intact and avoids running setup work again during teardown.
- [...] `nodejs.active.requests` gauge, while excluding them from live symmetric assertions that cannot reliably force those events at scrape time. Unit fixtures include synthetic samples for these metrics.
- [...] `n8n.readiness.check`, but no longer gates the OpenMetrics scrape on the readiness endpoint. This preserves metric flow when readiness degrades while the OpenMetrics health service check still reports scrape failures.

Motivation
Issue #23633 reported that the integration exposed the wrong Datadog metric name for n8n workflow execution duration. Validating the integration against live n8n containers showed a broader gap: some mapped metrics were invented or stale, several real metrics were missing, and the test environment did not exercise queue mode, worker metrics, or version-specific metric differences.
This PR makes the integration empirically grounded and keeps coverage for both the older supported n8n line and the current 2.x line.
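The readiness behaviour described in this PR (report `n8n.readiness.check`, but scrape regardless) can be pictured with a small sketch. It is not the integration's actual check code: `run_check` and its three callables are hypothetical stand-ins for the readiness probe, the OpenMetrics scrape, and the service-check submission.

```python
def run_check(probe_readiness, scrape_metrics, report_status):
    """Report readiness, then scrape regardless of the readiness result."""
    try:
        ready = bool(probe_readiness())
    except Exception:
        # A failing or unreachable readiness endpoint counts as not ready.
        ready = False

    report_status("n8n.readiness.check", "ok" if ready else "critical")

    # The scrape is no longer gated on readiness, so metrics keep flowing
    # even while the readiness endpoint is unhealthy.
    return scrape_metrics()


def failing_probe():
    raise ConnectionError("readiness endpoint unreachable")


# Usage with stand-in callables: readiness fails, metrics are still collected.
statuses = []
metrics = run_check(
    probe_readiness=failing_probe,
    scrape_metrics=lambda: ["n8n.queue.job.completed"],
    report_status=lambda name, status: statuses.append((name, status)),
)
```

Keeping the two concerns separate means a degraded readiness endpoint shows up as a failing service check rather than as a gap in every metric.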
Validation
- `ddev test -fs n8n`
- `ddev validate config -s n8n`
- `ddev validate models -s n8n`
- `ddev validate metadata n8n`
- `ddev validate readmes n8n`
- `ddev --no-interactive test n8n`
- `ddev env test --dev n8n py3.13-1`
- `ddev env test --dev n8n py3.13-2`
- `ddev validate all n8n` was also run. It did not report n8n-owned validation failures; the remaining failures were unrelated global `labeler` state for `rate_limiter` and a network timeout during `licenses`.

Review checklist
- [...] `qa/skip-qa` label if the PR doesn't need to be tested during QA.
- [...] `backport/<branch-name>` label to the PR and it will automatically open a backport PR once this one is merged