
[n8n] Fix metric mappings and add full v2 metric coverage#23635

Draft
AAraKKe wants to merge 7 commits into master from aarakke/fix-n8n-metrics

Conversation

@AAraKKe
Contributor

@AAraKKe AAraKKe commented May 8, 2026

What does this PR do?

Overhauls the n8n integration's metric mapping, test environment, fixtures, and public documentation so the check matches what n8n actually emits across both tested major versions.

  • Metric map and metadata: fixes the incorrect workflow_executions_duration_seconds mapping, removes mappings that n8n does not emit, keeps valid runtime/queue metrics, and adds the missing families verified against n8n 1.118.1 and n8n 2.19.5.
  • n8n 2.x metric coverage: adds the 2.x-only families for workflow_execution_duration_seconds, audit_workflow_*, embed_login_*, token_exchange_*, process_pss_bytes, and the optional workflow_statistics_* gauges. The metadata descriptions call out metrics that require n8n 2.x or an opt-in n8n flag.
  • Worker scrape coverage: adds a worker instance to the test environment so worker-only families such as node_started, node_finished, queue_job_dequeued, and runner_task_requested are covered. Main and worker instances are tagged with n8n_process:main and n8n_process:worker.
  • Queue-mode test environment: runs n8n in queue mode with Redis and validates both n8n 1.118.1 and 2.19.5. The compose file pulls n8nio/n8n:${N8N_VERSION} directly, and host ports are allocated dynamically to avoid CI/local port conflicts.
  • Stable E2E setup: imports workflows with stable IDs, activates them, generates traffic, and waits for workflow metrics during docker_run setup conditions. This keeps the dynamic port configuration intact and avoids running setup work again during teardown.
  • Rare-event metric handling: keeps real but timing/event-dependent metrics in mapping and metadata, including auth failure counters and the libuv nodejs.active.requests gauge, while excluding them from live symmetric assertions that cannot reliably force those events at scrape time. Unit fixtures include synthetic samples for these metrics.
  • Readiness behavior: continues to emit n8n.readiness.check, but no longer gates the OpenMetrics scrape on the readiness endpoint. This preserves metric flow when readiness degrades while the OpenMetrics health service check still reports scrape failures.
  • Documentation: updates the public README with customer-facing n8n configuration guidance, queue-mode worker scraping instructions, required n8n environment variables, and version-specific metric notes.
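To make the mapping fix concrete, here is a minimal sketch of the kind of Prometheus-to-Datadog metric map the first bullet describes. The actual contents of `datadog_checks/n8n/metrics.py` are not shown on this page; the names below are taken from the PR description, and the dict name is illustrative.

```python
# Hypothetical fragment of the check's Prometheus-to-Datadog metric map.
# Keys are raw n8n Prometheus names; values are Datadog metric names.
METRIC_MAP = {
    # The bug from issue #23633: the old map used the invented name
    # workflow_executions_duration_seconds. n8n actually emits
    # workflow_execution_duration_seconds (singular "execution").
    'workflow_execution_duration_seconds': 'workflow.execution.duration',
    # Worker-only families, covered once a worker instance is scraped.
    'node_started': 'node.started',
    'node_finished': 'node.finished',
    # n8n 2.x-only family, per the version notes above.
    'process_pss_bytes': 'process.pss.bytes',
}
```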

Motivation

Issue #23633 reported that the integration exposed the wrong Datadog metric name for n8n workflow execution duration. Validating the integration against live n8n containers showed a broader gap: some mapped metrics were invented or stale, several real metrics were missing, and the test environment did not exercise queue mode, worker metrics, or version-specific metric differences.

This PR makes the integration empirically grounded and keeps coverage for both the older supported n8n line and the current 2.x line.

Validation

  • ddev test -fs n8n
  • ddev validate config -s n8n
  • ddev validate models -s n8n
  • ddev validate metadata n8n
  • ddev validate readmes n8n
  • ddev --no-interactive test n8n
  • ddev env test --dev n8n py3.13-1
  • ddev env test --dev n8n py3.13-2
  • ddev validate all n8n was also run. It reported no n8n-owned validation failures; the remaining failures were unrelated (stale global labeler state for rate_limiter and a network timeout during the licenses check).

Review checklist

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • Add the qa/skip-qa label if the PR doesn't need to be tested during QA.
  • If you need to backport this PR to another branch, you can add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged

- Drop fabricated metric names that n8n never emitted; map only what is empirically present.
- Add the n8n 2.x metric families: workflow.execution.duration histogram, audit.workflow.*, embed.login.*, token.exchange.*, process.pss.bytes, runner.task.requested, and the workflow_statistics gauges.
- Add worker-only families (node.started, node.finished, queue.job.dequeued, runner.task.requested) by introducing a worker-scrape instance.
- Stop gating the OpenMetrics scrape on /healthz/readiness; emit n8n.readiness.check unconditionally so metrics still flow when the readiness endpoint is unhealthy.
- Replace the custom Dockerfile with a direct n8nio/n8n image reference and parameterise the version via hatch.toml so the test matrix can run against both 1.118.1 and 2.19.5.
- Allocate free host ports via datadog_checks.dev.utils.find_free_ports and forward them through docker_run env_vars to avoid port collisions on re-runs.
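The dynamic port allocation in the last bullet can be sketched with the stdlib alone. The `find_free_ports` name is taken from the bullet; this is an equivalent stand-in, not the actual `datadog_checks.dev` implementation.

```python
import socket

def find_free_ports(count):
    """Reserve `count` distinct free TCP ports by binding ephemeral sockets,
    then release them so the compose stack can claim them."""
    sockets = []
    try:
        for _ in range(count):
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            s.bind(('127.0.0.1', 0))  # port 0: the OS picks an unused port
            sockets.append(s)
        # All sockets are held open simultaneously, so the ports are distinct.
        return [s.getsockname()[1] for s in sockets]
    finally:
        for s in sockets:
            s.close()

# Hypothetical usage: forward the ports into docker_run via env_vars.
main_port, worker_port = find_free_ports(2)
env_vars = {'N8N_MAIN_PORT': str(main_port), 'N8N_WORKER_PORT': str(worker_port)}
```

There is a small race window between releasing a port and the container binding it, which is usually acceptable in CI and avoids hard-coded port collisions on re-runs.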
@AAraKKe AAraKKe added the qa/skip-qa Automatically skip this PR for the next QA label May 8, 2026
@AAraKKe AAraKKe requested review from a team as code owners May 8, 2026 10:53
@dd-octo-sts
Contributor

dd-octo-sts Bot commented May 8, 2026

⚠️ The qa/skip-qa label has been added with shippable changes

The following files, which will be shipped with the agent, were modified in this PR and
the qa/skip-qa label has been added.

You can ignore this if you are sure the changes in this PR do not require QA. Otherwise,
consider removing the label.

List of modified files that will be shipped with the agent
n8n/changelog.d/23635.added
n8n/datadog_checks/n8n/check.py
n8n/datadog_checks/n8n/data/conf.yaml.example
n8n/datadog_checks/n8n/metrics.py
n8n/hatch.toml

@AAraKKe AAraKKe marked this pull request as draft May 8, 2026 11:07
@codecov

codecov Bot commented May 8, 2026

Codecov Report

❌ Patch coverage is 94.91525% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.03%. Comparing base (1befb90) to head (43e7fc8).


@datadog-prod-us1-3

datadog-prod-us1-3 Bot commented May 8, 2026

Tests


⚠️ Warnings

❄️ 2 new flaky tests detected

  • test_all_metadata_metrics_emitted from test_integration.py: workflow_started_total never went non-zero
  • test_readiness_check_metric from test_integration.py: workflow_started_total never went non-zero
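Both flaky tests failed the same way: `workflow_started_total` never became non-zero before the assertion ran. One common remedy is to poll the scrape endpoint during setup until the metric has a positive sample. The helper names below are hypothetical, not the PR's actual test code.

```python
import time

def metric_is_nonzero(exposition, metric_name):
    """Return True if any sample of metric_name in Prometheus text format is > 0."""
    for line in exposition.splitlines():
        if line.startswith('#') or not line.startswith(metric_name):
            continue  # skip HELP/TYPE lines and other families
        try:
            if float(line.rsplit(' ', 1)[-1]) > 0:
                return True
        except ValueError:
            continue
    return False

def wait_for_nonzero(fetch, metric_name, timeout=60.0, interval=2.0):
    """Poll fetch() (returns exposition text) until metric_name goes non-zero."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if metric_is_nonzero(fetch(), metric_name):
            return True
        time.sleep(interval)
    return False
```

Generating traffic first and then gating on a helper like this (as the "Stable E2E setup" bullet describes) removes the race between workflow activation and the first scrape.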

ℹ️ Info

No other issues found

🧪 All tests passed

🎯 Code Coverage (details)
Patch Coverage: 94.92%
Overall Coverage: 95.85% (+8.60%)


@AAraKKe AAraKKe marked this pull request as ready for review May 8, 2026 12:44

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1be3b3dc6f


Comment on lines 83 to 86

```python
'queue_job_completed': 'queue.job.completed',
'queue_job_delayed_total': 'queue.job.delayed.total',
'queue_job_dequeued': 'queue.job.dequeued',
'queue_job_enqueued': 'queue.job.enqueued',
'queue_job_failed': 'queue.job.failed',
```

P2: Add the stalled queue counter mapping

In n8n 2.x queue mode, a stalled job event is emitted as n8n.queue.job.stalled, which the Prometheus service exposes as n8n_queue_job_stalled_total; with this map limited to completed/dequeued/enqueued/failed, that counter is silently ignored even when message event bus metrics are enabled. Since this block adds queue-job coverage, include queue_job_stalled (and corresponding metadata) so stalled jobs are collected.
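The suggestion amounts to one more entry in the queue-job block. The surrounding entries come from the snippet above; the Datadog-side name for the proposed addition follows the existing queue.job.* convention and is an assumption.

```python
# Queue-job block from the review snippet, with the proposed addition.
QUEUE_JOB_METRIC_MAP = {
    'queue_job_completed': 'queue.job.completed',
    'queue_job_delayed_total': 'queue.job.delayed.total',
    'queue_job_dequeued': 'queue.job.dequeued',
    'queue_job_enqueued': 'queue.job.enqueued',
    'queue_job_failed': 'queue.job.failed',
    # Proposed: n8n 2.x queue mode exposes n8n_queue_job_stalled_total when
    # message event bus metrics are enabled; map it so stalled jobs are collected.
    'queue_job_stalled': 'queue.job.stalled',
}
```

A matching metadata.csv row would be needed alongside the mapping for the metric to pass the metadata validation.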


A long-running n8n simulation that layers on top of the integration test
environment so a real Datadog Agent can ship metrics to a Datadog org for
dashboard / monitor iteration.

- tests/lab/workflows/: five lab-only workflow JSONs covering distinct shapes
  (fast, slow Wait node, always-fail Code, flaky 30%, four-step chain).
- tests/lab/traffic_generator.py: click CLI (start/generate/stop) that runs
  ddev env start --base, copies + imports + activates the lab workflows,
  restarts n8n, and drives a configurable async traffic mix against the
  webhooks and REST API.
- tests/lab/config.yaml: webhook + REST probabilities and tick / reload
  intervals; hot-reloaded while the generator runs.
- tests/lab/.ddev.toml: pins the lab to an `n8nlab` ddev org.
- tests/lab/run_lab.sh: bash entrypoint with an EXIT trap so Ctrl+C always
  runs lab:stop.
- hatch.toml: new [envs.lab] env with click/httpx/pyyaml/rich and
  start/generate/stop scripts.
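The configurable traffic mix in the lab generator can be sketched as a weighted choice over targets. The config shape and target names below are hypothetical illustrations of what tests/lab/config.yaml might hold, not its actual contents.

```python
import random

# Hypothetical shape of the loaded lab config: each traffic target (a webhook
# path or a REST action) maps to its selection probability per tick.
TRAFFIC_MIX = {
    'webhook:fast': 0.4,
    'webhook:flaky': 0.3,
    'rest:list_executions': 0.3,
}

def pick_target(mix, rng=random):
    """Pick one traffic target per tick according to its configured probability."""
    targets = list(mix)
    weights = [mix[t] for t in targets]
    return rng.choices(targets, weights=weights, k=1)[0]
```

Because the config is hot-reloaded while the generator runs, re-reading the mix each tick lets you shift webhook/REST probabilities without restarting the lab.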
@AAraKKe AAraKKe marked this pull request as draft May 8, 2026 12:59
@dd-octo-sts
Contributor

dd-octo-sts Bot commented May 8, 2026

Validation Report

All 20 validations passed.

| Validation | Description | Status |
| --- | --- | --- |
| agent-reqs | Verify check versions match the Agent requirements file | ✅ |
| ci | Validate CI configuration and Codecov settings | ✅ |
| codeowners | Validate every integration has a CODEOWNERS entry | ✅ |
| config | Validate default configuration files against spec.yaml | ✅ |
| dep | Verify dependency pins are consistent and Agent-compatible | ✅ |
| http | Validate integrations use the HTTP wrapper correctly | ✅ |
| imports | Validate check imports do not use deprecated modules | ✅ |
| integration-style | Validate check code style conventions | ✅ |
| jmx-metrics | Validate JMX metrics definition files and config | ✅ |
| labeler | Validate PR labeler config matches integration directories | ✅ |
| legacy-signature | Validate no integration uses the legacy Agent check signature | ✅ |
| license-headers | Validate Python files have proper license headers | ✅ |
| licenses | Validate third-party license attribution list | ✅ |
| metadata | Validate metadata.csv metric definitions | ✅ |
| models | Validate configuration data models match spec.yaml | ✅ |
| openmetrics | Validate OpenMetrics integrations disable the metric limit | ✅ |
| package | Validate Python package metadata and naming | ✅ |
| readmes | Validate README files have required sections | ✅ |
| saved-views | Validate saved view JSON file structure and fields | ✅ |
| version | Validate version consistency between package and changelog | ✅ |


