diff --git a/content/en/data_observability/jobs_monitoring/glue.md b/content/en/data_observability/jobs_monitoring/glue.md
index 95dd41c2ec5..1e4c4da25fd 100644
--- a/content/en/data_observability/jobs_monitoring/glue.md
+++ b/content/en/data_observability/jobs_monitoring/glue.md
@@ -108,6 +108,34 @@ This helps ensure the logs are searchable and available under the {{< ui >}}Glue
 
 Enable the [Glue Integration][4] tile for Glue metrics collection. Metrics should be available under the {{< ui >}}Glue{{< /ui >}} job tab in **Data Observability: Jobs Monitoring**.
 
+## (Optional) Enable dataset lineage
+
+Glue jobs that run with the Spark engine can emit OpenLineage events directly to Datadog. This provides dataset-level lineage, showing which datasets your job reads and writes.
+
+**Note**: AWS Glue includes the Spark OpenLineage connector in its default class path. To use a more recent version, add the connector JAR manually through the `--extra-jars` Glue job parameter and set `--user-jars-first=true` to override the bundled version. For example: `--extra-jars s3:///openlineage-spark-.jar` and `--user-jars-first true`.
+
+### Configure the SparkSession
+
+In your Glue job script, configure the `SparkSession` with the following settings:
+
+```python
+spark = SparkSession.builder \
+    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener") \
+    .config("spark.openlineage.transport.type", "http") \
+    .config("spark.openlineage.transport.url", "") \
+    .config("spark.openlineage.transport.auth.type", "api_key") \
+    .config("spark.openlineage.transport.auth.apiKey", "") \
+    .config("spark.redaction.regex", "(?i)secret|password|token|access[.]key|apikey") \
+    .config("spark.openlineage.capturedProperties", "spark.glue.JOB_RUN_ID") \
+    .getOrCreate()
+```
+
+Replace `` with `https://data-obs-intake.`{{< region-param key="dd_site" code="true" >}}. Replace `` with your Datadog API key. `spark.glue.JOB_RUN_ID` is the Spark configuration property automatically set by AWS Glue with the current job run ID; use it verbatim.
+
+### Validate
+
+After enabling OpenLineage, open a job run in [Data Observability: Jobs Monitoring][6]. In the flame graph, additional spans such as `spark.application` or `spark.sql_job` should appear. The payloads of these spans should be helpful when debugging dataset extraction.
+
 ## Next steps
 
 The crawler runs every few minutes.
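
Reviewer note: the `SparkSession` settings added in this hunk could also be assembled programmatically, which keeps the intake URL and API key out of the literal builder chain. The following is only a sketch under stated assumptions, not part of the patch. The helper names `openlineage_spark_conf` and `apply_conf` are hypothetical, and the URL is built as `https://data-obs-intake.` followed by the Datadog site, as the added text describes.

```python
from typing import Dict

# Values copied from the settings in the patch above.
LISTENER = "io.openlineage.spark.agent.OpenLineageSparkListener"
REDACTION_REGEX = "(?i)secret|password|token|access[.]key|apikey"


def openlineage_spark_conf(dd_site: str, api_key: str) -> Dict[str, str]:
    """Return the Spark settings from the patch as a plain dict.

    Hypothetical helper: `dd_site` is the Datadog site (for example
    "datadoghq.com") and `api_key` is the Datadog API key.
    """
    if not dd_site or not api_key:
        raise ValueError("dd_site and api_key are both required")
    return {
        "spark.extraListeners": LISTENER,
        "spark.openlineage.transport.type": "http",
        "spark.openlineage.transport.url": f"https://data-obs-intake.{dd_site}",
        "spark.openlineage.transport.auth.type": "api_key",
        "spark.openlineage.transport.auth.apiKey": api_key,
        "spark.redaction.regex": REDACTION_REGEX,
        # Property set by AWS Glue with the current job run ID; used verbatim.
        "spark.openlineage.capturedProperties": "spark.glue.JOB_RUN_ID",
    }


def apply_conf(builder, conf: Dict[str, str]):
    """Apply each key/value pair to a builder exposing .config(key, value)."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder
```

In a Glue job script this would be used as `apply_conf(SparkSession.builder, openlineage_spark_conf(site, key)).getOrCreate()`, where `site` and `key` come from wherever the job reads its parameters or secrets.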