4 changes: 4 additions & 0 deletions .github/workflows/skywalking.yaml
@@ -713,6 +713,10 @@ jobs:
    config: test/e2e-v2/cases/kong/e2e.yaml
  - name: Flink
    config: test/e2e-v2/cases/flink/e2e.yaml
  - name: Airflow
    config: test/e2e-v2/cases/airflow/e2e.yaml
  - name: Airflow Cluster
    config: test/e2e-v2/cases/airflow/e2e-cluster.yaml

  - name: OTLP Trace
    config: test/e2e-v2/cases/otlp-traces/e2e.yaml
2 changes: 2 additions & 0 deletions docs/en/changes/changes.md
@@ -276,6 +276,7 @@
* Fix: TTL query add metadata TTL.
* Fix: PersistentWorker used wrong TTL for metrics cache if the storage is BanyanDB.
* Add iOS/iPadOS app monitoring via OpenTelemetry Swift SDK (SWIP-11). Includes the `IOS` layer, `IOSHTTPSpanListener` for outbound HTTP client metrics (supports OTel Swift `.old`/`.stable`/`.httpDup` semantic-convention modes via stable-then-legacy attribute fallback), `IOSMetricKitSpanListener` for daily MetricKit metrics (exit counts split by foreground/background, app-launch / hang-time percentile histograms with finite 30 s overflow ceiling), LAL rules for crash/hang diagnostics, Mobile menu, and iOS dashboards.
* Add Apache Airflow monitoring via native OpenTelemetry metrics (SWIP-7). New `AIRFLOW` layer with Service (cluster) and Instance (host) dimensions, MAL rules under `otel-rules/airflow/`, setup docs, mock OTLP e2e (`cases/airflow/e2e.yaml`: full SWIP-7, 30 checks), and real Celery-cluster integration smoke (`e2e-cluster.yaml`: scheduler + two workers + triggerer; deferrable and dataset DAGs with ~4-minute live workload; 25 checks — native scheduler/executor/triggerer OTLP plus e2e Celery sidecar pool gauges on one worker; metrics needing synthetic OTLP or rare failure events such as `pool_queued_slots` are mock-only). See `test/e2e-v2/cases/airflow/README.md`. Horizon UI dashboards ship separately in `apache/skywalking-horizon-ui` under the Workflow Scheduler menu group.
* Fix LAL `layer: auto` mode dropping logs after extractor set the layer. Codegen now propagates `layer "..."` assignments to `LogMetadata.layer` so `FilterSpec.doSink()` sees the script-decided layer.
* Fix MetricKit histogram percentile metrics being reported at 1000× their true value — the listener now marks its `SampleFamily` with `defaultHistogramBucketUnit(MILLISECONDS)` so MAL's default SECONDS→MS rescale of `le` labels is not applied.
* Add WeChat and Alipay Mini Program monitoring via the SkyAPM mini-program-monitor SDK (SWIP-12). Two new layers (`WECHAT_MINI_PROGRAM`, `ALIPAY_MINI_PROGRAM`); two new JavaScript componentIds (`WeChat-MiniProgram: 10002`, `AliPay-MiniProgram: 10003`). Service / instance / endpoint entities are produced by MAL + LAL, not trace analysis — mini-programs are client-side (exit-only) so `RPCAnalysisListener` stays unchanged (same pattern as browser and iOS). MAL rules per platform × scope under `otel-rules/miniprogram/` with explicit `.service(...)` / `.endpoint(...)` chains (empty `expSuffix` so endpoint-scope rules aren't overridden), histogram percentile via `.histogram("le", TimeUnit.MILLISECONDS)` to keep ms bucket bounds intact, and request-cpm derived from the histogram `_count` family. LAL `layer: auto` rule produces both layers via `miniprogram.platform` dispatch and emits error-count samples consumed by per-platform log-MAL rules. Per-layer menu entries and service / instance / endpoint dashboards with Trace and Log sub-tabs.
@@ -300,6 +301,7 @@
#### Documentation
* Update LAL documentation with `sourceAttribute()` function and `layer: auto` mode.
* Add iOS app monitoring setup documentation.
* Add Apache Airflow monitoring setup documentation (SWIP-7).
* Add WeChat / Alipay Mini Program monitoring setup documentation, plus a client-side-monitoring section in the security guide covering public-internet ingress (OTLP + `/v3/segments`) for mobile / browser / mini-program SDKs.
* Improve downsampling documentation.

9 changes: 9 additions & 0 deletions docs/en/concepts-and-designs/service-hierarchy.md
@@ -38,6 +38,7 @@ If you want to customize it according to your own needs, please refer to [Servic
| PULSAR | K8S_SERVICE | [PULSAR On K8S_SERVICE](#pulsar-on-k8s_service) |
| SO11Y_OAP | K8S_SERVICE | [SO11Y_OAP On K8S_SERVICE](#so11y_oap-on-k8s_service) |
| KONG | K8S_SERVICE | [KONG On K8S_SERVICE](#kong-on-k8s_service) |
| AIRFLOW | K8S_SERVICE | [AIRFLOW On K8S_SERVICE](#airflow-on-k8s_service) |

- The following sections will describe the **default matching rules** in detail and use the `upper-layer On lower-layer` format.
- The example service names are based on the SkyWalking [Showcase](https://github.com/apache/skywalking-showcase) default deployment.
@@ -229,6 +230,14 @@ If you want to customize it according to your own needs, please refer to [Servic
- KONG.service.name: `kong::kong.skywalking-showcase`
- K8S_SERVICE.service.name: `skywalking-showcase::kong.skywalking-showcase`

#### AIRFLOW On K8S_SERVICE
- Rule name: `short-name`
- Matching expression: `{ (u, l) -> u.shortName == l.shortName }`
- Description: AIRFLOW.service.shortName == K8S_SERVICE.service.shortName
- Matched Example:
    - AIRFLOW.service.name: `airflow::airflow.skywalking-showcase`
    - K8S_SERVICE.service.name: `skywalking-showcase::airflow.skywalking-showcase`
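
The `short-name` rule can be illustrated with a small Python sketch. It assumes that a service name has the form `group::shortName`, as the matched example suggests; `short_name` and `matches` are hypothetical helper names for illustration, not OAP APIs.

```python
def short_name(service_name: str) -> str:
    """Return the part after the first '::', or the whole name if no group prefix exists."""
    _, sep, rest = service_name.partition("::")
    return rest if sep else service_name


def matches(upper: str, lower: str) -> bool:
    """Python mirror of the rule `{ (u, l) -> u.shortName == l.shortName }`."""
    return short_name(upper) == short_name(lower)


airflow_service = "airflow::airflow.skywalking-showcase"
k8s_service = "skywalking-showcase::airflow.skywalking-showcase"
print(matches(airflow_service, k8s_service))  # True
```

Both names reduce to `airflow.skywalking-showcase`, so the two services are linked even though their group prefixes differ.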

### Build Through Specific Agents
Use agent technology (such as eBPF) and deployment tools (such as the operator and agent injector) to detect service hierarchy relations.

239 changes: 239 additions & 0 deletions docs/en/setup/backend/backend-airflow-monitoring.md
@@ -0,0 +1,239 @@
# Airflow monitoring

## Airflow metrics via native OpenTelemetry

SkyWalking receives Airflow metrics through Airflow's native OpenTelemetry exporter and the
[OpenTelemetry receiver](opentelemetry-receiver.md), then aggregates them with
[MAL](../../concepts-and-designs/mal.md).

## Data flow

1. Enable OpenTelemetry metrics in Airflow (`pip install 'apache-airflow[otel]'`, then `otel_on = True`
   or the standard `OTEL_EXPORTER_OTLP_*` environment variables).
2. Airflow **pushes** OTLP metrics to the OpenTelemetry Collector.
3. The OpenTelemetry Collector forwards the metrics to SkyWalking OAP via its OTLP gRPC exporter.
4. OAP applies the MAL rules under `otel-rules/airflow/` and stores Service / Instance entities on
   `Layer: AIRFLOW`.

```mermaid
graph LR;
Airflow("Airflow") --> Collector("OTel Collector")
Collector --> OAP("SkyWalking OAP")
OAP --> UI("Horizon UI")
```

In the Horizon UI, Airflow appears under the **Workflow Scheduler** menu group.

## Setup

### 1. Enable Airflow OpenTelemetry metrics

Install the OTel extra and enable metrics export. See the
[Airflow metrics documentation](https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/metrics.html).

Example environment variables for Airflow 3.x:

```bash
pip install 'apache-airflow[otel]'

export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
export OTEL_RESOURCE_ATTRIBUTES=cluster=prod-airflow
```

`cluster` is required so SkyWalking can name the Airflow Service (`airflow::prod-airflow`). You
can also inject it with a Collector `resource` processor.
Comment on lines +44 to +45 (Member):

The default works, but the hard requirement behind it isn't stated, and it's the most likely silent-failure misconfiguration. Suggest making it explicit:

Suggested change
`cluster` is required so SkyWalking can name the Airflow Service (`airflow::prod-airflow`). You
can also inject it with a Collector `resource` processor.
> **Required — `service.name` must be `Airflow`:** OAP maps the OTLP resource `service.name` to the
> `job_name` tag, and every MAL rule under `otel-rules/airflow/` filters on `job_name == 'Airflow'`
> (case-sensitive). Airflow's default OTel service name is `Airflow` (the `[metrics] otel_service`
> default in both 2.x and 3.x), so a default install works out of the box. If you override it via
> `OTEL_SERVICE_NAME`, set it to exactly `Airflow`, and do **not** put a custom `job_name` in
> `OTEL_RESOURCE_ATTRIBUTES` — an explicit `job_name` takes precedence over the `service.name`
> fallback. Otherwise OAP silently drops every Airflow metric and the AIRFLOW layer stays empty.


Legacy `airflow.cfg` keys (`otel_host`, `otel_port`, `otel_prefix`, …) still work on older
releases but are deprecated in favor of standard OTel SDK variables.
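
The attribute requirements in this step can be checked before rollout. The sketch below is a hypothetical pre-flight check, not part of Airflow or SkyWalking. It assumes the standard comma-separated `key=value` format of `OTEL_RESOURCE_ATTRIBUTES` and, following the review suggestion above, that the MAL rules only accept `job_name == 'Airflow'`. The function names are illustrative.

```python
def parse_resource_attributes(raw: str) -> dict:
    """Parse the comma-separated key=value format of OTEL_RESOURCE_ATTRIBUTES."""
    attrs = {}
    for pair in raw.split(","):
        key, sep, value = pair.partition("=")
        if sep and key.strip():
            attrs[key.strip()] = value.strip()
    return attrs


def check_env(env: dict) -> list:
    """Return a list of problems; an empty list means the check passed."""
    problems = []
    attrs = parse_resource_attributes(env.get("OTEL_RESOURCE_ATTRIBUTES", ""))
    if not attrs.get("cluster"):
        problems.append("OTEL_RESOURCE_ATTRIBUTES has no `cluster` attribute")
    # Assumption: an unset OTEL_SERVICE_NAME leaves Airflow's default, `Airflow`.
    if env.get("OTEL_SERVICE_NAME", "Airflow") != "Airflow":
        problems.append("OTEL_SERVICE_NAME must be exactly `Airflow`")
    # Assumption: an explicit job_name overrides the service.name fallback.
    if attrs.get("job_name", "Airflow") != "Airflow":
        problems.append("custom `job_name` would hide metrics from the MAL rules")
    return problems


print(check_env({"OTEL_RESOURCE_ATTRIBUTES": "cluster=prod-airflow"}))  # []
```

In practice the function could be called with `dict(os.environ)` inside the container that runs the Airflow component.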

### 2. Configure OpenTelemetry Collector

Example pipeline:

```yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:

exporters:
  otlp:
    endpoint: oap:11800
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```
```

Refer to [test/e2e-v2/cases/airflow/otel-collector-config.yaml](../../../../test/e2e-v2/cases/airflow/otel-collector-config.yaml)
for a minimal Collector pipeline without hard-coded service or instance labels.
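
A common mistake when adapting the pipeline is to reference a component in `service.pipelines` without declaring it in the matching top-level section. The following Python sketch shows that wiring check on a dictionary that mirrors the example pipeline. The dictionary is written by hand here to avoid assuming a YAML library, and `undeclared_components` is a hypothetical helper, not a Collector API.

```python
# Hand-written mirror of the example Collector pipeline above.
config = {
    "receivers": {"otlp": {"protocols": {"http": {}, "grpc": {}}}},
    "processors": {"batch": None},
    "exporters": {"otlp": {"endpoint": "oap:11800", "tls": {"insecure": True}}},
    "service": {
        "pipelines": {
            "metrics": {
                "receivers": ["otlp"],
                "processors": ["batch"],
                "exporters": ["otlp"],
            }
        }
    },
}


def undeclared_components(cfg: dict) -> list:
    """List pipeline references that have no matching top-level declaration."""
    missing = []
    for pipeline_name, pipeline in cfg["service"]["pipelines"].items():
        for kind in ("receivers", "processors", "exporters"):
            declared = cfg.get(kind) or {}
            for component in pipeline.get(kind, []):
                if component not in declared:
                    missing.append(f"{pipeline_name}: {kind}/{component}")
    return missing


print(undeclared_components(config))  # []
```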

### 3. Enable SkyWalking OpenTelemetry receiver

Ensure `airflow/*` is listed in `SW_OTEL_RECEIVER_ENABLED_OTEL_METRICS_RULES` (enabled by default
in the distribution).

## Entity model

| SkyWalking entity | Airflow mapping |
|-------------------|-----------------|
| Service | `airflow::{cluster}` from OTLP resource attribute `cluster` |
| Instance | `{host.name}` — scheduler, worker, or triggerer hostname |
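
The mapping in the table can be sketched in Python as follows. This is an illustration of the naming scheme only, not OAP code: `group_components` is a hypothetical name, and skipping exports that lack either attribute is an assumption made for the example.

```python
from collections import defaultdict


def group_components(resources: list) -> dict:
    """Group OTLP resource blocks into Service -> sorted Instance names."""
    services = defaultdict(set)
    for resource in resources:
        cluster = resource.get("cluster")
        host = resource.get("host.name")
        # Assumption: exports without both attributes cannot be mapped to an entity.
        if cluster and host:
            services[f"airflow::{cluster}"].add(host)
    return {name: sorted(hosts) for name, hosts in services.items()}


pushes = [
    {"cluster": "prod-airflow", "host.name": "airflow-scheduler"},
    {"cluster": "prod-airflow", "host.name": "airflow-worker-1"},
    {"cluster": "prod-airflow", "host.name": "airflow-worker-2"},
]
print(group_components(pushes))
# {'airflow::prod-airflow': ['airflow-scheduler', 'airflow-worker-1', 'airflow-worker-2']}
```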

### Components vs SkyWalking Instance vs Airflow Task Instance

In OAP and MAL, the second entity is the standard SkyWalking **Instance** (see
`Layer: AIRFLOW`, `airflow-instance.yaml`). In the Horizon UI, the AIRFLOW layer uses the
display alias **Components** (Chinese: **组件**) for that tab instead of the generic label
**Instance**.

This is intentional:

1. **Avoid confusion with Airflow [Task Instance](https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/tasks.html#task-instance).**
In Airflow, a *task instance* is one execution of a task within a single DAG run (for example
`daily_etl · 2026-06-01 · extract_data · try_number=1`). It is short-lived, stored in the
Airflow metadata database, and unrelated to SkyWalking's Instance entity. Airflow operators
already use the word *instance* heavily in that sense; labeling the scheduler / worker /
triggerer tab **Instance** in the UI would suggest task-level drill-down rather than
long-running component processes.

2. **Match the deployment model.** Each row under **Components** is a long-running Airflow
role — scheduler, Celery worker, triggerer, and optionally webserver — identified by OTLP
resource `host.name` (pod hostname or an operator-defined name). Multiple worker replicas
appear as multiple component rows under one Service (`airflow::{cluster}`).

3. **Align with other Horizon layers.** Flink uses **TaskManagers**, Kubernetes uses **Pods**;
AIRFLOW uses **Components** for the same pattern: a domain-specific name for what SkyWalking
stores as Instance.

| Term | Meaning |
|------|---------|
| SkyWalking **Service** | One Airflow cluster (`airflow::{cluster}`) |
| SkyWalking **Instance** (OAP) / **Components** (UI) | One scheduler / worker / triggerer process or pod (`host.name`) |
| Airflow **Task Instance** | One run of one task in one DAG run — **not** shown on this dashboard tab |

Service-level panels aggregate cluster-wide samples. Instance-level (component-level) panels are
scoped per `host.name`. Do not sum instance-scoped samples into service dashboards when each
component exports the same instrument independently.
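
A short arithmetic sketch shows why summing is wrong in that case. The host names and the value 128 are hypothetical, and taking the maximum is only one possible way to collapse identical cluster-wide readings; the shipped MAL rules may aggregate differently.

```python
# Hypothetical samples: each component reports the same cluster-wide gauge.
samples = {
    "airflow-scheduler": 128,
    "airflow-worker-1": 128,
    "airflow-worker-2": 128,
}

# Summing across instances counts the same cluster-wide value three times.
naive_sum = sum(samples.values())

# Collapsing the identical readings keeps the true cluster-wide value.
cluster_value = max(samples.values())

print(naive_sum, cluster_value)  # 384 128
```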

Airflow pushes OTLP metrics; SkyWalking does not pull them. The Collector only receives
push exports and forwards them to OAP. Do not hard-code service or instance names in
Collector processors — derive them from resource attributes that Airflow (or your
deployment) attaches to each export batch.

Required resource attributes:

| Attribute | Purpose |
|-----------|---------|
| `cluster` | Names the logical Airflow cluster (`airflow::{cluster}` service) |
| `host.name` | Identifies the scheduler / worker / triggerer host (SkyWalking instance) |

On Kubernetes, set `cluster` to your deployment name (for example via
`OTEL_RESOURCE_ATTRIBUTES=cluster=prod-airflow`) and rely on the OTel SDK's default
`host.name` (pod hostname) for instance identity. When a single Collector receives
metrics from multiple Airflow pods, each pod's push carries its own resource block, so
no per-instance relabeling is required.

### Kubernetes sidecar deployment (recommended)

For production Kubernetes deployments, run OpenTelemetry Collector as a **sidecar**
alongside each Airflow component (scheduler, worker, triggerer). Airflow pushes to
`localhost:4318`; the sidecar forwards to a cluster-wide Collector or directly to OAP.
This matches the push model and keeps `cluster` / `host.name` aligned with the pod that
emitted the metrics.

Two e2e cases cover Airflow monitoring (full coverage matrix and latest verify report:
[test/e2e-v2/cases/airflow/README.md](../../../../test/e2e-v2/cases/airflow/README.md)):

- **Mock (CI default, fast):** `test/e2e-v2/cases/airflow/e2e.yaml` replays OTLP JSON via a
  Python sidecar ([`otlp_replay_server.py`](../../../../test/e2e-v2/cases/airflow/scripts/otlp_replay_server.py),
  built from [`Dockerfile.mock-sender`](../../../../test/e2e-v2/cases/airflow/Dockerfile.mock-sender))
  with realistic `cluster` and `host.name` resource attributes.
- **Real Celery cluster (production-like integration smoke):** `test/e2e-v2/cases/airflow/e2e-cluster.yaml`
  starts a scheduler, two workers, and a triggerer (`cluster=airflow-e2e-cluster`), seeds deferrable
  and dataset DAGs plus a load workload (~4 minutes), then verifies **25 integration checks**
  (native scheduler / executor / triggerer OTLP plus e2e Celery sidecar pool gauges on
  `airflow-worker-1`). Metrics that need synthetic OTLP or rare Airflow events
  (`asset_updates`, `pool_queued_slots`, `triggers_failed`, `triggers_blocked_main_thread`)
  are covered only in the mock suite. See the
  [e2e README](../../../../test/e2e-v2/cases/airflow/README.md).

## Supported metrics

MAL rule definitions live in:

- `otel-rules/airflow/airflow-service.yaml` — cluster service metrics
- `otel-rules/airflow/airflow-instance.yaml` — per-host instance metrics

Metric names follow Airflow's OTel export (`airflow.{stat}` with dots escaped to underscores in
MAL). See [SWIP-7](../../swip/SWIP-7.md) for the full panel list.
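
As a rough illustration of the naming convention, the dot-to-underscore escaping can be written as a one-line Python helper. The input name below is illustrative, and `to_mal_name` is a hypothetical function; the rule files listed above remain the authoritative source for metric names.

```python
def to_mal_name(otel_name: str) -> str:
    """Escape the dots of an OTel metric name to underscores, as MAL rules reference it."""
    return otel_name.replace(".", "_")


print(to_mal_name("airflow.pool.queued_slots"))  # airflow_pool_queued_slots
```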

## Horizon UI

After OAP ingests OTLP metrics, open **Workflow Scheduler → Airflow** in Horizon UI.

When Airflow runs on Kubernetes with [service hierarchy](../../concepts-and-designs/service-hierarchy.md)
(`AIRFLOW` ↔ `K8S_SERVICE`, matched by `shortName`), use the **3D Infrastructure Map** and
**Kubernetes Services** layer pages together with the AIRFLOW dashboards below.

Screenshots include a local Kubernetes validation stack (`airflow-dev::airflow.airflow-dev` on
`Layer: K8S_SERVICE`) and a Celery cluster layout matching
[`docker-compose-cluster.yml`](../../../../test/e2e-v2/cases/airflow/docker-compose-cluster.yml).

**3D Infrastructure Map** — Live OAP topology (`#/3d/map`): middleware tier **Airflow**; infra tier
groups **Kubernetes Services** and **Kubernetes** by namespace.

![Horizon UI — 3D Infrastructure Map with Airflow and Kubernetes tiers](images/airflow/horizon-infra-3d-map-airflow-dev.png)

**Kubernetes Services — Service** — HTTP RPM, latency, success rate, and pod counts for
`airflow.airflow-dev`.

![Horizon UI — K8S_SERVICE service dashboard](images/airflow/horizon-k8s-service-service.png)

**Kubernetes Services — Instances** — Pod instances under the service.

![Horizon UI — K8S_SERVICE instances](images/airflow/horizon-k8s-service-instances.png)

**Kubernetes Services — Endpoints** — Per-endpoint HTTP metrics (example: `GET:/health`).

![Horizon UI — K8S_SERVICE endpoint GET:/health](images/airflow/horizon-k8s-service-endpoints.png)

**Kubernetes Services — Topology** — Inbound traffic chain observed by Rover (example: Unknown →
kube-dns → airflow).

![Horizon UI — K8S_SERVICE topology](images/airflow/horizon-k8s-service-topology.png)

**AIRFLOW — Service** — Cluster-level SWIP-7 panels (tasks, pool slots, scheduler heartbeat, DAG
queue).

![Horizon UI — Airflow service dashboard](images/airflow/horizon-airflow-service.png)

**AIRFLOW — Components** — Scheduler, triggerer, and workers under one Service (four-node local
Celery layout).

![Horizon UI — Airflow components list](images/airflow/horizon-airflow-components.png)

**AIRFLOW — Component detail** — Instance-scoped metrics for a selected host (example:
`airflow-scheduler`).

![Horizon UI — Airflow scheduler component metrics](images/airflow/horizon-airflow-component-scheduler.png)

More e2e coverage and verify reports:
[test/e2e-v2/cases/airflow/README.md](../../../../test/e2e-v2/cases/airflow/README.md).

## Customization

You can extend or override MAL rules under `otel-rules/airflow/` and add UI dashboards in the
Horizon UI bundle. Restart OAP after rule changes.