Add Apache Airflow monitoring layer (SWIP-7)#13891
Conversation
Add MAL otel-rules, AIRFLOW layer, hierarchy, mock and real-cluster e2e, setup docs with Horizon UI screenshots, and CI e2e jobs. Cluster seed reserializes DAG metadata before triggering native OTel workload DAGs.
|
Horizon UI dashboards: apache/skywalking-horizon-ui#42 |
|
Please fix the CI, license header. |
Remove empty .gitkeep placeholders that fail license-eye, and stop linking to gitignored runtime report/log files in the case README.
|
Question: do the 7 The description lists "mock e2e (30/30)" and "cluster e2e (26/26)", but I think the The pool rules keep
So To check this rather than guess, I ran
This is the same single-vs-labeled split other layers already handle with a dedicated template — e.g. ActiveMQ Questions:
Happy to be corrected if there's something in the setup I've overlooked. |
Replace the Java mock-sender with a Python OTLP replay sidecar so port 9093 is ready under CI, and narrow the cluster integration smoke to 25 checks by dropping flaky pool_queued_slots and reusing instance.yml.
|
I'm investigating and resolving this issue, please wait for a moment. |
…lates Pool metrics retain pool_name as a metric label in MAL; generic metrics-has-value.yml asserts labels: [] and fails infra-e2e verify. Add a dedicated expected template and repoint pool queries in mock and cluster case files.
|
Root cause identified. Thanks for the detailed review — you were right about the pool expected mismatch.Before this fix, we had not run the official infra-e2e path (e2e run) locally. We were still unfamiliar with how SkyWalking e2e is wired in CI (e2e.yaml + -cases.yaml + expected/ + infra-e2e verifier), and overlooked using infra-e2e as the local validation gate. Instead, we used custom bash verify scripts (verify-mock-e2e.sh / verify-cluster-e2e.sh) that only check whether swctl metrics exec returns TIME_SERIES_VALUES with a non-null numeric value — they do not compare output against the expected/.yml templates. The earlier “mock 30/30” / “cluster 25/25” numbers came from those scripts, not from e2e run, so the label mismatch was missed locally but caught by CI and your analysis. |
| `cluster` is required so SkyWalking can name the Airflow Service (`airflow::prod-airflow`). You | ||
| can also inject it with a Collector `resource` processor. |
There was a problem hiding this comment.
The default works, but the hard requirement behind it isn't stated, and it's the most likely silent-failure misconfiguration. Suggest making it explicit:
| `cluster` is required so SkyWalking can name the Airflow Service (`airflow::prod-airflow`). You | |
| can also inject it with a Collector `resource` processor. | |
| `cluster` is required so SkyWalking can name the Airflow Service (`airflow::prod-airflow`). You | |
| can also inject it with a Collector `resource` processor. | |
| > **Required — `service.name` must be `Airflow`:** OAP maps the OTLP resource `service.name` to the | |
| > `job_name` tag, and every MAL rule under `otel-rules/airflow/` filters on `job_name == 'Airflow'` | |
| > (case-sensitive). Airflow's default OTel service name is `Airflow` (the `[metrics] otel_service` | |
| > default in both 2.x and 3.x), so a default install works out of the box. If you override it via | |
| > `OTEL_SERVICE_NAME`, set it to exactly `Airflow`, and do **not** put a custom `job_name` in | |
| > `OTEL_RESOURCE_ATTRIBUTES` — an explicit `job_name` takes precedence over the `service.name` | |
| > fallback. Otherwise OAP silently drops every Airflow metric and the AIRFLOW layer stays empty. |
|
The asset rules read the Airflow 2.x metric names, but the setup doc targets 3.x ("Example environment variables for Airflow 3.x"). I checked the Airflow source for both lines (
This is the AIP-74/75 Dataset → Asset rename. In Airflow 3.x there are no Not a blocker — just a version-targeting mismatch. Options:
Either way, worth making the docs + e2e target consistent. Happy to push a suggestion for whichever direction you prefer. |
|
Thanks for the quick turnaround, and no worries — the Re-checked the fix on |
Add Apache Airflow monitoring via native OpenTelemetry metrics (SWIP-7)
If this is non-trivial feature, paste the links/URLs to the design doc.
Update the documentation to include this new feature.
docs/en/setup/backend/backend-airflow-monitoring.md,docs/en/swip/SWIP-7.md, hierarchy docsTests(including UT, IT, E2E) are added to verify the new feature.
airflow-service/airflow-instance)test/e2e-v2/cases/airflow/e2e.yaml)test/e2e-v2/cases/airflow/e2e-cluster.yaml)If it's UI related, attach the screenshots below.
apache/skywalking-horizon-ui(Workflow Scheduler / Airflow layer)If this pull request closes/resolves/fixes an existing issue, replace the issue number. Closes #.
Update the \CHANGES\ log.
Summary
Test plan