Skip to content

Add Apache Airflow monitoring layer (SWIP-7)#13891

Open
songzhendong wants to merge 4 commits into
apache:masterfrom
songzhendong:feature/swip-7-airflow-monitoring
Open

Add Apache Airflow monitoring layer (SWIP-7)#13891
songzhendong wants to merge 4 commits into
apache:masterfrom
songzhendong:feature/swip-7-airflow-monitoring

Conversation

@songzhendong
Copy link
Copy Markdown
Contributor

Add Apache Airflow monitoring via native OpenTelemetry metrics (SWIP-7)

  • If this is non-trivial feature, paste the links/URLs to the design doc.

  • Update the documentation to include this new feature.

    • docs/en/setup/backend/backend-airflow-monitoring.md, docs/en/swip/SWIP-7.md, hierarchy docs
  • Tests(including UT, IT, E2E) are added to verify the new feature.

    • MAL data tests (airflow-service / airflow-instance)
    • Mock OTLP e2e: 30 checks (test/e2e-v2/cases/airflow/e2e.yaml)
    • Real Celery cluster e2e: 26 checks (test/e2e-v2/cases/airflow/e2e-cluster.yaml)
  • If it's UI related, attach the screenshots below.

    • Horizon UI dashboards ship separately in apache/skywalking-horizon-ui (Workflow Scheduler / Airflow layer)
  • If this pull request closes/resolves/fixes an existing issue, replace the issue number. Closes #.

  • Update the \CHANGES\ log.

Summary

  • New \AIRFLOW\ layer with Service (cluster) and Instance (host) dimensions
  • MAL rules under \otel-rules/airflow/\ for Airflow native OTel metrics
  • \hierarchy-definition.yml\ links \AIRFLOW\ to \K8S_SERVICE`n- CI: \Airflow\ + \Airflow Cluster\ jobs in .github/workflows/skywalking.yaml`n

Test plan

  • ./mvnw checkstyle:check`n- [x] Local mock e2e (30/30)
  • Local cluster e2e (26/26)
  • CI \Airflow\ + \Airflow Cluster\ matrix jobs

Add MAL otel-rules, AIRFLOW layer, hierarchy, mock and real-cluster e2e,
setup docs with Horizon UI screenshots, and CI e2e jobs. Cluster seed
reserializes DAG metadata before triggering native OTel workload DAGs.
@songzhendong
Copy link
Copy Markdown
Contributor Author

Horizon UI dashboards: apache/skywalking-horizon-ui#42

@wu-sheng
Copy link
Copy Markdown
Member

wu-sheng commented Jun 5, 2026

Please fix the CI, license header.

@wu-sheng wu-sheng added backend OAP backend related. feature New feature labels Jun 5, 2026
Remove empty .gitkeep placeholders that fail license-eye, and stop linking to gitignored runtime report/log files in the case README.
@wu-sheng
Copy link
Copy Markdown
Member

wu-sheng commented Jun 5, 2026

Question: do the 7 pool_* e2e checks actually pass locally?

The description lists "mock e2e (30/30)" and "cluster e2e (26/26)", but I think the pool_* checks can't match their expected file as written — and I'd like to confirm I'm not missing something about the setup before CI runs.

The pool rules keep pool_name as a label that is not an entity dimension:

  • otel-rules/airflow/airflow-service.yaml: airflow_pool_queued_slots.sum(['cluster', 'pool_name']) with scope .service(['cluster'], Layer.AIRFLOW)
  • otel-rules/airflow/airflow-instance.yaml: airflow_pool_open_slots.sum(['cluster', 'host_name', 'pool_name']) with scope .instance(['cluster'], ['host_name'], Layer.AIRFLOW)

So pool_name survives as a metric label, and MQE returns metric.labels: [{key: pool_name, value: default_pool}]. But all 7 pool queries (3 service + 4 instance) in both airflow-cases.yaml and airflow-cluster-cases.yaml point at expected/metrics-has-value.yml, whose template asserts metric.labels: [].

To check this rather than guess, I ran skywalking-infra-e2e's own verifier.Verify() against synthetic swctl output:

Expected template Actual result Verdict
generic (labels: []) labeled (pool_name=default_pool) MISMATCH → FAIL
generic (labels: []) unlabeled MATCH
label-asserting (- key: pool_name) labeled (pool_name=default_pool) MATCH

This is the same single-vs-labeled split other layers already handle with a dedicated template — e.g. ActiveMQ metrics-has-value-label-serviceinstanceid.yml, ClickHouse metrics-has-value-label.yml, Flink metrics-has-value-jobManager-node-label.yml.

Questions:

  1. Did the meter_airflow_pool_* / meter_airflow_instance_pool_* cases pass in your local e2e run? If so, could you share the swctl metrics exec output for one of them — does it carry a pool_name label?
  2. If they do carry the label, would you add an expected/metrics-has-value-label-poolname.yml (asserting - key: pool_name / value: {{ notEmpty .value }}) and repoint the 7 pool queries in both case files? Keeping pool_name in .sum(...) preserves the per-pool breakdown, so I'd suggest the label-template route rather than dropping the label.

Happy to be corrected if there's something in the setup I've overlooked.

Replace the Java mock-sender with a Python OTLP replay sidecar so port
9093 is ready under CI, and narrow the cluster integration smoke to 25
checks by dropping flaky pool_queued_slots and reusing instance.yml.
@songzhendong
Copy link
Copy Markdown
Contributor Author

I'm investigating and resolving this issue, please wait for a moment.

…lates

Pool metrics retain pool_name as a metric label in MAL; generic metrics-has-value.yml asserts labels: [] and fails infra-e2e verify. Add a dedicated expected template and repoint pool queries in mock and cluster case files.
@songzhendong
Copy link
Copy Markdown
Contributor Author

Root cause identified. Thanks for the detailed review — you were right about the pool expected mismatch.Before this fix, we had not run the official infra-e2e path (e2e run) locally. We were still unfamiliar with how SkyWalking e2e is wired in CI (e2e.yaml + -cases.yaml + expected/ + infra-e2e verifier), and overlooked using infra-e2e as the local validation gate. Instead, we used custom bash verify scripts (verify-mock-e2e.sh / verify-cluster-e2e.sh) that only check whether swctl metrics exec returns TIME_SERIES_VALUES with a non-null numeric value — they do not compare output against the expected/.yml templates. The earlier “mock 30/30” / “cluster 25/25” numbers came from those scripts, not from e2e run, so the label mismatch was missed locally but caught by CI and your analysis.
Fix (e1dd30d)
Added expected/metrics-has-value-label-poolname.yml (asserts pool_name label, same pattern as ActiveMQ).
Repointed pool queries in airflow-cases.yaml (7) and airflow-cluster-cases.yaml (6; cluster still omits unstable pool_queued_slots).
Verification after fix
We ran full infra-e2e on Linux/WSL2 (same path as CI): mock 30/30, cluster 25/25 passed. We will use e2e run as the local gate going forward.

Comment on lines +44 to +45
`cluster` is required so SkyWalking can name the Airflow Service (`airflow::prod-airflow`). You
can also inject it with a Collector `resource` processor.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default works, but the hard requirement behind it isn't stated, and it's the most likely silent-failure misconfiguration. Suggest making it explicit:

Suggested change
`cluster` is required so SkyWalking can name the Airflow Service (`airflow::prod-airflow`). You
can also inject it with a Collector `resource` processor.
`cluster` is required so SkyWalking can name the Airflow Service (`airflow::prod-airflow`). You
can also inject it with a Collector `resource` processor.
> **Required — `service.name` must be `Airflow`:** OAP maps the OTLP resource `service.name` to the
> `job_name` tag, and every MAL rule under `otel-rules/airflow/` filters on `job_name == 'Airflow'`
> (case-sensitive). Airflow's default OTel service name is `Airflow` (the `[metrics] otel_service`
> default in both 2.x and 3.x), so a default install works out of the box. If you override it via
> `OTEL_SERVICE_NAME`, set it to exactly `Airflow`, and do **not** put a custom `job_name` in
> `OTEL_RESOURCE_ATTRIBUTES` — an explicit `job_name` takes precedence over the `service.name`
> fallback. Otherwise OAP silently drops every Airflow metric and the AIRFLOW layer stays empty.

@wu-sheng
Copy link
Copy Markdown
Member

wu-sheng commented Jun 5, 2026

dataset.* vs asset.* — the asset metrics only resolve on Airflow 2.x; on 3.x they'll be empty

The asset rules read the Airflow 2.x metric names, but the setup doc targets 3.x ("Example environment variables for Airflow 3.x"). I checked the Airflow source for both lines (v2-10-stable and main):

SkyWalking metric Reads source Airflow 2.10 emits Airflow 3.x (main) emits
asset_updates airflow_dataset_updates Stats.incr("dataset.updates")datasets/manager.py:149 stats.incr("asset.updates")assets/manager.py:400
asset_triggered_dagruns airflow_dataset_triggered_dagruns Stats.incr("dataset.triggered_dagruns")scheduler_job_runner.py:1494 stats.incr("asset.triggered_dagruns")scheduler_job_runner.py:2454
asset_orphaned airflow_dataset_orphaned Stats.gauge("dataset.orphaned")scheduler_job_runner.py:2200 stats.gauge("asset.orphaned")scheduler_job_runner.py:3378

This is the AIP-74/75 Dataset → Asset rename. In Airflow 3.x there are no dataset.* Stats calls left, so OAP receives airflow_asset_* and the current rules (filtering airflow_dataset_*) collect nothing → the three asset_* panels stay empty on 3.x. The other 25 metrics (scheduler / executor / pool / triggerer / triggers / dag_processing) are byte-identical across 2.x and 3.x, so only these three are affected. (The e2e cluster pins 2.10.5, which is why it passes today.)

Not a blocker — just a version-targeting mismatch. Options:

  1. Doc-only: state that the asset/dataset metrics are collected for Airflow 2.x (dataset.*) and that 3.x asset.* support is pending.
  2. Support both: add parallel rules reading airflow_asset_* alongside the airflow_dataset_* ones (the SkyWalking-side metric is already named asset_*, so it lines up) — then the layer works on 2.x and 3.x.

Either way, worth making the docs + e2e target consistent. Happy to push a suggestion for whichever direction you prefer.

@wu-sheng
Copy link
Copy Markdown
Member

wu-sheng commented Jun 5, 2026

Thanks for the quick turnaround, and no worries — the e2e run path is easy to miss locally.

Re-checked the fix on e1dd30debd: the new metrics-has-value-label-poolname.yml asserts pool_name correctly, and all 7 mock + 6 cluster pool checks are repointed to it (no check left on the labels: [] template, and no non-pool check moved by mistake). I also ran your exact template through infra-e2e's verifier.Verify() against a realistic labeled result and it matches. CI now confirms it end-to-end — both Airflow and Airflow Cluster jobs are green. 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend OAP backend related. feature New feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants