From ef197ce2c55f462344e6ac845cd1e19eae6a3757 Mon Sep 17 00:00:00 2001 From: Shelly Matskel Date: Wed, 13 May 2026 15:44:38 -0400 Subject: [PATCH 1/6] Add Monitors and Automation page to DSM Kafka docs Documents the monitor templates Data Streams Monitoring ships at cluster and topic level, and gives concrete examples of automating responses with webhooks (consumer lag, data-loss risk, broker disk, partition health). Co-Authored-By: Claude Opus 4.7 (1M context) --- content/en/data_streams/kafka/_index.md | 3 +- .../kafka/monitors_and_automation.md | 86 +++++++++++++++++++ content/en/data_streams/kafka/setup.md | 1 + 3 files changed, 89 insertions(+), 1 deletion(-) create mode 100644 content/en/data_streams/kafka/monitors_and_automation.md diff --git a/content/en/data_streams/kafka/_index.md b/content/en/data_streams/kafka/_index.md index f0ac2417471..403fa316525 100644 --- a/content/en/data_streams/kafka/_index.md +++ b/content/en/data_streams/kafka/_index.md @@ -14,7 +14,7 @@ With Data Streams Monitoring's Kafka Monitoring, a Datadog Agent check connects - **Connect services to topics**: See which producers and consumers interact with each topic, with linked owners, repos, on-call rotations, traces, and error logs - **Inspect topic schemas and messages**: View schemas, compare versions, and access messages to debug poison payloads or explore the topic -To get started, see [Kafka Monitoring Setup][2]. +To get started, see [Kafka Monitoring Setup][2]. Once your clusters and topics are connected, see [Monitors and Automation][4] for recommended monitor templates and examples of automating responses with webhooks. 
## Workflows @@ -58,3 +58,4 @@ The {{< ui >}}Messages{{< /ui >}} section lets you retrieve messages by partitio [2]: /data_streams/kafka/setup/ [3]: /data_streams/kafka/setup/#enable-message-inspection +[4]: /data_streams/kafka/monitors_and_automation/ diff --git a/content/en/data_streams/kafka/monitors_and_automation.md b/content/en/data_streams/kafka/monitors_and_automation.md new file mode 100644 index 00000000000..f27a9513f0b --- /dev/null +++ b/content/en/data_streams/kafka/monitors_and_automation.md @@ -0,0 +1,86 @@ +--- +title: Monitors and Automation +description: Recommended Datadog monitors for Kafka clusters and topics tracked by Data Streams Monitoring, and examples of using webhooks to automate a response when a monitor triggers. +weight: 2 +further_reading: +- link: "/integrations/webhooks/" + tag: "Documentation" + text: "Webhooks integration" +- link: "/monitors/types/metric/" + tag: "Documentation" + text: "Metric monitors" +- link: "/data_streams/kafka/setup/" + tag: "Documentation" + text: "Kafka Monitoring setup" +--- + +Once your Kafka clusters are connected to Data Streams Monitoring (see [Kafka Monitoring Setup][1]), the next step is to alert on the conditions that put your pipelines at risk and, where possible, automate the response. This page covers the monitor templates that Data Streams Monitoring provides out of the box, and shows how to wire those monitors to a webhook so a triggered alert can take action automatically. + +## Recommended monitors + +Data Streams Monitoring ships with monitor templates that you can create directly from a cluster or topic detail page — no query writing required. On any cluster or topic, open the {{< ui >}}Monitors{{< /ui >}} side panel and click {{< ui >}}Start{{< /ui >}} on a template to pre-fill the query with the right scope. 
+ +### Cluster-level templates + +| Template | Metric | Condition | +|-----------------------------------------|-------------------------------------|--------------------------------------------| +| Offline partitions detected | `kafka.partition.offline` | Any partition in the cluster is offline. | +| Under-replicated partitions detected | `kafka.partition.under_replicated` | Any partition in the cluster is under-replicated, which puts data at risk if a broker fails. | + +Both monitors are grouped by `kafka_cluster_id` so each cluster alerts its own owner. + +### Topic-level templates + +| Template | Metric | Condition | +|------------------------------------------------------------|-------------------------------------------------------------------------|-----------| +| Incoming message rate has dropped | `kafka.topic.message_rate` | Produce rate to the topic drops below a threshold (default `< 1` msg/sec). Catches silent producer failures. | +| Consumer lag is high for topic | `kafka.estimated_consumer_lag` | Consumer lag exceeds a threshold for a topic and consumer group (default `> 1000` messages). | +| Offline partitions on topic | `kafka.partition.offline` | Any partition for this specific topic goes offline, indicating data unavailability for that topic. | +| Consumer lag is approaching time retention limit | `kafka.estimated_consumer_lag` / `kafka.topic.config.retention_ms` | Estimated lag exceeds 80% of the topic's time-based retention. Beyond 100% means the consumer cannot recover lost data. | +| Consumer lag is approaching bytes retention limit | `kafka.consumer_lag` × throughput / `kafka.topic.config.retention_bytes` | Estimated lag exceeds 80% of the topic's bytes-based retention. Requires Kafka broker metrics to be available. | + +The two "approaching retention" monitors are the most important guardrails against data loss: when lag exceeds retention, the broker deletes messages before the consumer reads them. 
+ +## Automate responses with webhooks + +Datadog monitors can call any HTTP endpoint when they trigger, recover, or change state. Use this to take action automatically when a Kafka condition fires, rather than waiting for a human to triage. See [Webhooks integration][2] for how to configure the webhook destination, payload variables, and authentication. + +A common pattern is to add `@webhook-` to a monitor's notification message — Datadog calls the webhook each time the monitor changes state, with monitor metadata available as template variables (`{{topic.name}}`, `{{kafka_cluster_id.name}}`, `{{value}}`, and so on). + +The examples below show conditions where automation is particularly valuable in a Kafka pipeline. + +### Consumer lag increasing + +When the **Consumer lag is high for topic** monitor triggers, automate one of the following: + +- **Scale the consumer group.** Call a CI/CD or autoscaler webhook to increase the consumer deployment's replica count. +- **Notify the owning team.** Post to the consumer service's on-call channel with the topic, consumer group, and current lag pulled from the monitor's template variables. +- **Open an incident** for the consumer service if lag stays elevated past a recovery window. + +### Risk of data loss (lag approaching retention) + +When **Consumer lag is approaching time retention limit** or **bytes retention limit** triggers, the topic is within 80% of the point where unread messages are deleted. This is the highest-severity automation: + +- **Page on-call immediately**, with the topic and remaining retention budget in the payload. +- **Trigger an emergency runbook** that can, for example, temporarily extend retention on the affected topic, pause the upstream producer, or scale the consumer group ahead of the threshold. + +### Disk space running out on brokers + +A broker that runs out of disk goes offline and takes its partitions with it. 
Create a monitor on the broker host's `system.disk.in_use` (or your Kafka deployment's equivalent) and have its webhook: + +- **Notify the platform team's on-call** with the broker hostname and current disk usage. +- **Trigger a capacity workflow** to add storage, expand the cluster, or compact a candidate topic, depending on your operational model. + +### Offline or under-replicated partitions + +When the cluster-level **Offline partitions detected** or **Under-replicated partitions detected** monitor triggers: + +- **Notify the cluster owner's on-call**, including the affected broker IDs and partition counts from the monitor's template variables. +- **Trigger a broker-health workflow** (for example, restart a stuck broker or rebalance partitions) if your runbook supports it. + +## Further reading + +{{< partial name="whats-next/whats-next.html" >}} + +[1]: /data_streams/kafka/setup/ +[2]: /integrations/webhooks/ diff --git a/content/en/data_streams/kafka/setup.md b/content/en/data_streams/kafka/setup.md index aef7759d61f..b0c54ed241c 100644 --- a/content/en/data_streams/kafka/setup.md +++ b/content/en/data_streams/kafka/setup.md @@ -1,6 +1,7 @@ --- title: Kafka Monitoring Setup description: Set up Data Streams Monitoring's Kafka Monitoring, including prerequisites, Agent configuration, and the additional steps required to inspect Kafka messages. +weight: 1 --- This page covers the prerequisites and setup steps for Data Streams Monitoring's Kafka Monitoring. From dc8d96db72871c502c743bc3ed35c32e462a6e81 Mon Sep 17 00:00:00 2001 From: Shelly Matskel Date: Wed, 13 May 2026 17:32:46 -0400 Subject: [PATCH 2/6] Surface Workflow Automation alongside webhooks The automation section previously framed webhooks as the only path. Workflow Automation is the better fit for the "trigger a runbook" patterns described below (Kubernetes, AWS, Slack, Jira actions). Reframe the section to present both options, and flag the broker-disk example as a non-DSM monitor. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- .../kafka/monitors_and_automation.md | 24 +++++++++++++------ 1 file changed, 17 insertions(+), 7 deletions(-) diff --git a/content/en/data_streams/kafka/monitors_and_automation.md b/content/en/data_streams/kafka/monitors_and_automation.md index f27a9513f0b..ec7e80973c6 100644 --- a/content/en/data_streams/kafka/monitors_and_automation.md +++ b/content/en/data_streams/kafka/monitors_and_automation.md @@ -1,8 +1,14 @@ --- title: Monitors and Automation -description: Recommended Datadog monitors for Kafka clusters and topics tracked by Data Streams Monitoring, and examples of using webhooks to automate a response when a monitor triggers. +description: Recommended Datadog monitors for Kafka clusters and topics tracked by Data Streams Monitoring, and examples of automating responses with Workflow Automation or webhooks when a monitor triggers. weight: 2 further_reading: +- link: "/actions/workflows/" + tag: "Documentation" + text: "Workflow Automation" +- link: "/actions/workflows/trigger/#add-a-monitor-trigger-to-your-workflow" + tag: "Documentation" + text: "Trigger a workflow from a monitor" - link: "/integrations/webhooks/" tag: "Documentation" text: "Webhooks integration" @@ -14,7 +20,7 @@ further_reading: text: "Kafka Monitoring setup" --- -Once your Kafka clusters are connected to Data Streams Monitoring (see [Kafka Monitoring Setup][1]), the next step is to alert on the conditions that put your pipelines at risk and, where possible, automate the response. This page covers the monitor templates that Data Streams Monitoring provides out of the box, and shows how to wire those monitors to a webhook so a triggered alert can take action automatically. +After your Kafka clusters are connected to Data Streams Monitoring (see [Kafka Monitoring Setup][1]), the next step is to alert on the conditions that put your pipelines at risk and, where possible, automate the response. 
This page covers the monitor templates that Data Streams Monitoring provides, and shows how to connect those monitors to Datadog Workflow Automation or a webhook so a triggered alert can take action automatically. ## Recommended monitors @@ -41,11 +47,14 @@ Both monitors are grouped by `kafka_cluster_id` so each cluster alerts its own o The two "approaching retention" monitors are the most important guardrails against data loss: when lag exceeds retention, the broker deletes messages before the consumer reads them. -## Automate responses with webhooks +## Automate responses to triggered monitors + +When a monitor triggers, Datadog can take action automatically rather than waiting for a human to triage. Two options: -Datadog monitors can call any HTTP endpoint when they trigger, recover, or change state. Use this to take action automatically when a Kafka condition fires, rather than waiting for a human to triage. See [Webhooks integration][2] for how to configure the webhook destination, payload variables, and authentication. +- **Workflow Automation** — Build a Datadog Workflow that chains pre-built actions across your infrastructure and tools (PagerDuty, Slack, Jira, AWS, Kubernetes, and so on), and run it from a monitor trigger. Best for the "trigger a runbook" patterns below. See [Trigger a workflow from a monitor][3]. +- **Webhooks** — Call any HTTP endpoint when a monitor triggers, recovers, or changes state. Best when the action lives in a system outside Datadog and you already have an HTTPS callback. See [Webhooks integration][2]. -A common pattern is to add `@webhook-` to a monitor's notification message — Datadog calls the webhook each time the monitor changes state, with monitor metadata available as template variables (`{{topic.name}}`, `{{kafka_cluster_id.name}}`, `{{value}}`, and so on). +Either option can be added to a monitor by mentioning it in the notification message — `@workflow-` for Workflow Automation, `@webhook-` for a webhook. 
Monitor metadata is available as template variables (`{{topic.name}}`, `{{kafka_cluster_id.name}}`, `{{value}}`, and so on) and can be passed to the workflow or webhook payload. The examples below show conditions where automation is particularly valuable in a Kafka pipeline. @@ -53,7 +62,7 @@ The examples below show conditions where automation is particularly valuable in When the **Consumer lag is high for topic** monitor triggers, automate one of the following: -- **Scale the consumer group.** Call a CI/CD or autoscaler webhook to increase the consumer deployment's replica count. +- **Scale the consumer group.** Run a workflow that increases the consumer deployment's replica count (for example, with the Kubernetes or AWS actions in Workflow Automation), or call a CI/CD or autoscaler webhook. - **Notify the owning team.** Post to the consumer service's on-call channel with the topic, consumer group, and current lag pulled from the monitor's template variables. - **Open an incident** for the consumer service if lag stays elevated past a recovery window. @@ -66,7 +75,7 @@ When **Consumer lag is approaching time retention limit** or **bytes retention l ### Disk space running out on brokers -A broker that runs out of disk goes offline and takes its partitions with it. Create a monitor on the broker host's `system.disk.in_use` (or your Kafka deployment's equivalent) and have its webhook: +A broker that runs out of disk goes offline and takes its partitions with it. This is not a Data Streams Monitoring template — create a monitor on the broker host's `system.disk.in_use` (or your Kafka deployment's equivalent) and have its automation: - **Notify the platform team's on-call** with the broker hostname and current disk usage. - **Trigger a capacity workflow** to add storage, expand the cluster, or compact a candidate topic, depending on your operational model. 
@@ -84,3 +93,4 @@ When the cluster-level **Offline partitions detected** or **Under-replicated par [1]: /data_streams/kafka/setup/ [2]: /integrations/webhooks/ +[3]: /actions/workflows/trigger/#add-a-monitor-trigger-to-your-workflow From 224c4e48c8b2d3cf05093bff6e4fc493b98b0d23 Mon Sep 17 00:00:00 2001 From: Shelly Matskel Date: Thu, 14 May 2026 09:44:11 -0400 Subject: [PATCH 3/6] Restructure Monitors and Automation page after review - Add page to left nav under Kafka via main.en.yaml (below Setup). - Convert intro to a bulleted overview linking to in-page sections. - Reorder topic-level monitor templates table (consumer lag first) and drop default thresholds that may change. - Reformat per-scenario automation guidance: detection sentence followed by a single "Potential action" line. - Cover Datadog Workflow Automation alongside webhooks. - Replace the prose call-out on the Kafka index with a capability bullet that matches the existing list format. Co-Authored-By: Claude Opus 4.7 (1M context) --- config/_default/menus/main.en.yaml | 5 ++ content/en/data_streams/kafka/_index.md | 3 +- .../kafka/monitors_and_automation.md | 46 +++++++++---------- 3 files changed, 29 insertions(+), 25 deletions(-) diff --git a/config/_default/menus/main.en.yaml b/config/_default/menus/main.en.yaml index c3376639166..a47e38a431c 100644 --- a/config/_default/menus/main.en.yaml +++ b/config/_default/menus/main.en.yaml @@ -5038,6 +5038,11 @@ menu: identifier: data_streams_kafka_setup parent: data_streams_kafka weight: 1 + - name: Monitors and Automation + url: data_streams/kafka/monitors_and_automation + identifier: data_streams_kafka_monitors_and_automation + parent: data_streams_kafka + weight: 2 - name: Schema Tracking url: data_streams/schema_tracking identifier: data_streams_schema_tracking diff --git a/content/en/data_streams/kafka/_index.md b/content/en/data_streams/kafka/_index.md index 403fa316525..ed776556e65 100644 --- a/content/en/data_streams/kafka/_index.md +++ 
b/content/en/data_streams/kafka/_index.md @@ -13,8 +13,9 @@ With Data Streams Monitoring's Kafka Monitoring, a Datadog Agent check connects - **Pinpoint root cause**: Correlate configuration and schema changes with lag, throughput, and errors, and trace issues to the exact topic, schema version, or configuration change - **Connect services to topics**: See which producers and consumers interact with each topic, with linked owners, repos, on-call rotations, traces, and error logs - **Inspect topic schemas and messages**: View schemas, compare versions, and access messages to debug poison payloads or explore the topic +- **Alert and automate responses**: Use [recommended monitor templates][4] and trigger Workflow Automation or webhooks when a Kafka condition fires -To get started, see [Kafka Monitoring Setup][2]. Once your clusters and topics are connected, see [Monitors and Automation][4] for recommended monitor templates and examples of automating responses with webhooks. +To get started, see [Kafka Monitoring Setup][2]. ## Workflows diff --git a/content/en/data_streams/kafka/monitors_and_automation.md b/content/en/data_streams/kafka/monitors_and_automation.md index ec7e80973c6..ef042518a30 100644 --- a/content/en/data_streams/kafka/monitors_and_automation.md +++ b/content/en/data_streams/kafka/monitors_and_automation.md @@ -20,11 +20,16 @@ further_reading: text: "Kafka Monitoring setup" --- -After your Kafka clusters are connected to Data Streams Monitoring (see [Kafka Monitoring Setup][1]), the next step is to alert on the conditions that put your pipelines at risk and, where possible, automate the response. This page covers the monitor templates that Data Streams Monitoring provides, and shows how to connect those monitors to Datadog Workflow Automation or a webhook so a triggered alert can take action automatically. 
+After your Kafka clusters are connected to Data Streams Monitoring (see [Kafka Monitoring Setup][1]), the next step is to alert on the conditions that put your pipelines at risk and, where possible, automate the response. + +This page covers: + +- [Recommended monitors](#recommended-monitors): available out-of-the-box monitor templates for cluster and topic health +- [Automate responses to triggered monitors](#automate-responses-to-triggered-monitors): using Datadog Workflow Automation or a webhook to take action when a monitor triggers ## Recommended monitors -Data Streams Monitoring ships with monitor templates that you can create directly from a cluster or topic detail page — no query writing required. On any cluster or topic, open the {{< ui >}}Monitors{{< /ui >}} side panel and click {{< ui >}}Start{{< /ui >}} on a template to pre-fill the query with the right scope. +Data Streams Monitoring ships with monitor templates that you can create directly from a cluster or topic detail page. ### Cluster-level templates @@ -39,13 +44,11 @@ Both monitors are grouped by `kafka_cluster_id` so each cluster alerts its own o | Template | Metric | Condition | |------------------------------------------------------------|-------------------------------------------------------------------------|-----------| -| Incoming message rate has dropped | `kafka.topic.message_rate` | Produce rate to the topic drops below a threshold (default `< 1` msg/sec). Catches silent producer failures. | -| Consumer lag is high for topic | `kafka.estimated_consumer_lag` | Consumer lag exceeds a threshold for a topic and consumer group (default `> 1000` messages). | +| Consumer lag is high for topic | `kafka.estimated_consumer_lag` | Consumer lag exceeds a threshold for a topic and consumer group. | +| Incoming message rate has dropped | `kafka.topic.message_rate` | Produce rate to the topic drops below a threshold. Catches silent producer failures. 
| | Offline partitions on topic | `kafka.partition.offline` | Any partition for this specific topic goes offline, indicating data unavailability for that topic. | -| Consumer lag is approaching time retention limit | `kafka.estimated_consumer_lag` / `kafka.topic.config.retention_ms` | Estimated lag exceeds 80% of the topic's time-based retention. Beyond 100% means the consumer cannot recover lost data. | -| Consumer lag is approaching bytes retention limit | `kafka.consumer_lag` × throughput / `kafka.topic.config.retention_bytes` | Estimated lag exceeds 80% of the topic's bytes-based retention. Requires Kafka broker metrics to be available. | - -The two "approaching retention" monitors are the most important guardrails against data loss: when lag exceeds retention, the broker deletes messages before the consumer reads them. +| Consumer lag is approaching time retention limit | `kafka.estimated_consumer_lag` / `kafka.topic.config.retention_ms` | Estimated lag approaches the topic's time-based retention. Beyond the retention limit, the consumer cannot recover lost data. | +| Consumer lag is approaching bytes retention limit | `kafka.consumer_lag` × throughput / `kafka.topic.config.retention_bytes` | Estimated lag approaches the topic's bytes-based retention. Requires Kafka broker metrics to be available. | ## Automate responses to triggered monitors @@ -58,34 +61,29 @@ Either option can be added to a monitor by mentioning it in the notification mes The examples below show conditions where automation is particularly valuable in a Kafka pipeline. -### Consumer lag increasing +### Consumer lag is high -When the **Consumer lag is high for topic** monitor triggers, automate one of the following: +Signals that a consumer group is falling behind its producer, with messages accumulating in the topic faster than they can be read. 
-- **Scale the consumer group.** Run a workflow that increases the consumer deployment's replica count (for example, with the Kubernetes or AWS actions in Workflow Automation), or call a CI/CD or autoscaler webhook. -- **Notify the owning team.** Post to the consumer service's on-call channel with the topic, consumer group, and current lag pulled from the monitor's template variables. -- **Open an incident** for the consumer service if lag stays elevated past a recovery window. +**Potential action:** Run a workflow that scales the consumer group's replica count (for example, with the Kubernetes or AWS actions in Workflow Automation), or call a CI/CD or autoscaler webhook. -### Risk of data loss (lag approaching retention) +### Lag approaching retention limit -When **Consumer lag is approaching time retention limit** or **bytes retention limit** triggers, the topic is within 80% of the point where unread messages are deleted. This is the highest-severity automation: +Signals that unread messages are approaching the topic's retention window. If lag exceeds retention, those messages get deleted before the consumer can read them. -- **Page on-call immediately**, with the topic and remaining retention budget in the payload. -- **Trigger an emergency runbook** that can, for example, temporarily extend retention on the affected topic, pause the upstream producer, or scale the consumer group ahead of the threshold. +**Potential action:** Trigger an emergency runbook that can temporarily extend retention on the affected topic, pause the upstream producer, or scale the consumer group ahead of the threshold. -### Disk space running out on brokers +### Broker disk filling up -A broker that runs out of disk goes offline and takes its partitions with it. 
This is not a Data Streams Monitoring template — create a monitor on the broker host's `system.disk.in_use` (or your Kafka deployment's equivalent) and have its automation: +Signals that a broker host is running low on disk space. If the disk fills up, the broker goes offline and its partitions become unavailable. -- **Notify the platform team's on-call** with the broker hostname and current disk usage. -- **Trigger a capacity workflow** to add storage, expand the cluster, or compact a candidate topic, depending on your operational model. +**Potential action:** Trigger a capacity workflow to add storage, expand the cluster, or compact a candidate topic. ### Offline or under-replicated partitions -When the cluster-level **Offline partitions detected** or **Under-replicated partitions detected** monitor triggers: +Signals that one or more partitions in the cluster are offline (unavailable) or under-replicated, which puts data durability at risk if a broker fails. -- **Notify the cluster owner's on-call**, including the affected broker IDs and partition counts from the monitor's template variables. -- **Trigger a broker-health workflow** (for example, restart a stuck broker or rebalance partitions) if your runbook supports it. +**Potential action:** Trigger a broker-health workflow — for example, restart a stuck broker or rebalance partitions. ## Further reading From 6cd89c9336f11dfc7ad81ec3300c0b81b9ae1425 Mon Sep 17 00:00:00 2001 From: Shelly Matskel Date: Thu, 14 May 2026 10:42:19 -0400 Subject: [PATCH 4/6] Address Piotr's review comments - Link "Kafka broker metrics" to the Kafka integration page. - "Compact a candidate topic" -> "reduce retention on a candidate topic" for the disk-filling-up remediation, since reducing retention is the more immediate capacity recovery action. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- content/en/data_streams/kafka/monitors_and_automation.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/en/data_streams/kafka/monitors_and_automation.md b/content/en/data_streams/kafka/monitors_and_automation.md index ef042518a30..d82740224f0 100644 --- a/content/en/data_streams/kafka/monitors_and_automation.md +++ b/content/en/data_streams/kafka/monitors_and_automation.md @@ -48,7 +48,7 @@ Both monitors are grouped by `kafka_cluster_id` so each cluster alerts its own o | Incoming message rate has dropped | `kafka.topic.message_rate` | Produce rate to the topic drops below a threshold. Catches silent producer failures. | | Offline partitions on topic | `kafka.partition.offline` | Any partition for this specific topic goes offline, indicating data unavailability for that topic. | | Consumer lag is approaching time retention limit | `kafka.estimated_consumer_lag` / `kafka.topic.config.retention_ms` | Estimated lag approaches the topic's time-based retention. Beyond the retention limit, the consumer cannot recover lost data. | -| Consumer lag is approaching bytes retention limit | `kafka.consumer_lag` × throughput / `kafka.topic.config.retention_bytes` | Estimated lag approaches the topic's bytes-based retention. Requires Kafka broker metrics to be available. | +| Consumer lag is approaching bytes retention limit | `kafka.consumer_lag` × throughput / `kafka.topic.config.retention_bytes` | Estimated lag approaches the topic's bytes-based retention. Requires [Kafka broker metrics](/integrations/kafka/?tab=host#overview) to be available. | ## Automate responses to triggered monitors @@ -77,7 +77,7 @@ Signals that unread messages are approaching the topic's retention window. If la Signals that a broker host is running low on disk space. If the disk fills up, the broker goes offline and its partitions become unavailable. 
-**Potential action:** Trigger a capacity workflow to add storage, expand the cluster, or compact a candidate topic. +**Potential action:** Trigger a capacity workflow to add storage, expand the cluster, or reduce retention on a candidate topic. ### Offline or under-replicated partitions From 21499bc5ef68e5d13bee10d349e8215f57de7cf5 Mon Sep 17 00:00:00 2001 From: Shelly Matskel Date: Thu, 14 May 2026 14:20:07 -0400 Subject: [PATCH 5/6] Add Description column to monitor template tables Co-Authored-By: Claude Opus 4.7 (1M context) --- .../kafka/monitors_and_automation.md | 22 +++++++++---------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/content/en/data_streams/kafka/monitors_and_automation.md b/content/en/data_streams/kafka/monitors_and_automation.md index d82740224f0..cd5a72fd80f 100644 --- a/content/en/data_streams/kafka/monitors_and_automation.md +++ b/content/en/data_streams/kafka/monitors_and_automation.md @@ -33,22 +33,22 @@ Data Streams Monitoring ships with monitor templates that you can create directl ### Cluster-level templates -| Template | Metric | Condition | -|-----------------------------------------|-------------------------------------|--------------------------------------------| -| Offline partitions detected | `kafka.partition.offline` | Any partition in the cluster is offline. | -| Under-replicated partitions detected | `kafka.partition.under_replicated` | Any partition in the cluster is under-replicated, which puts data at risk if a broker fails. 
|
+| Template | Description | Metric | Condition |
+|-----------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------|----------------------------------------------------------------|
+| Offline partitions detected | Topic data is unavailable for both reads and writes, risking message loss, consumer lag, and service outages until leadership is reassigned | `kafka.partition.offline` | Any partition in the cluster is offline |
+| Under-replicated partitions detected | Topic data has fewer in-sync replicas than configured, increasing risk of data loss if the leader broker fails before replication catches up | `kafka.partition.under_replicated` | Any partition in the cluster is under-replicated |
 
 Both monitors are grouped by `kafka_cluster_id` so each cluster alerts its own owner.
 
 ### Topic-level templates
 
-| Template | Metric | Condition |
-|------------------------------------------------------------|-------------------------------------------------------------------------|-----------|
-| Consumer lag is high for topic | `kafka.estimated_consumer_lag` | Consumer lag exceeds a threshold for a topic and consumer group. |
-| Incoming message rate has dropped | `kafka.topic.message_rate` | Produce rate to the topic drops below a threshold. Catches silent producer failures. |
-| Offline partitions on topic | `kafka.partition.offline` | Any partition for this specific topic goes offline, indicating data unavailability for that topic. |
-| Consumer lag is approaching time retention limit | `kafka.estimated_consumer_lag` / `kafka.topic.config.retention_ms` | Estimated lag approaches the topic's time-based retention. Beyond the retention limit, the consumer cannot recover lost data. |
-| Consumer lag is approaching bytes retention limit | `kafka.consumer_lag` × throughput / `kafka.topic.config.retention_bytes` | Estimated lag approaches the topic's bytes-based retention. Requires [Kafka broker metrics](/integrations/kafka/?tab=host#overview) to be available. |
+| Template | Description | Metric | Condition |
+|------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------|-----------|
+| Consumer lag is high for topic | Measured in seconds, indicating stale data served to customers, message backlog buildup, and delayed downstream processing | `kafka.estimated_consumer_lag` | Consumer lag in seconds exceeds a threshold for a topic and consumer group |
+| Incoming message rate has dropped | Catches silent producer failures | `kafka.topic.message_rate` | Produce rate to the topic drops below a threshold |
+| Offline partitions on topic | Topic data is unavailable for both reads and writes, risking message loss, consumer lag, and service outages until leadership is reassigned | `kafka.partition.offline` | Any partition for this specific topic goes offline |
+| Consumer lag is approaching time retention limit | Increased risk of data loss. Beyond the retention limit, the consumer cannot recover lost data | `kafka.estimated_consumer_lag` / `kafka.topic.config.retention_ms` | Estimated lag approaches the topic's time-based retention |
+| Consumer lag is approaching bytes retention limit | Increased risk of data loss. Beyond the retention limit, the consumer cannot recover lost data | `kafka.consumer_lag` × throughput / `kafka.topic.config.retention_bytes` | Estimated lag approaches the topic's bytes-based retention. Requires [Kafka broker metrics](/integrations/kafka/?tab=host#overview) to be available |
 
 ## Automate responses to triggered monitors

From 8d3d313811b6b83288f10bac37eb044a273b3c25 Mon Sep 17 00:00:00 2001
From: cswatt
Date: Thu, 14 May 2026 14:04:01 -0700
Subject: [PATCH 6/6] edits

---
 content/en/data_streams/kafka/monitors_and_automation.md | 5 ++---
 content/en/data_streams/kafka/setup.md | 1 -
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/content/en/data_streams/kafka/monitors_and_automation.md b/content/en/data_streams/kafka/monitors_and_automation.md
index cd5a72fd80f..2cf6283b241 100644
--- a/content/en/data_streams/kafka/monitors_and_automation.md
+++ b/content/en/data_streams/kafka/monitors_and_automation.md
@@ -1,7 +1,6 @@
 ---
 title: Monitors and Automation
 description: Recommended Datadog monitors for Kafka clusters and topics tracked by Data Streams Monitoring, and examples of automating responses with Workflow Automation or webhooks when a monitor triggers.
-weight: 2
 further_reading:
 - link: "/actions/workflows/"
   tag: "Documentation"
@@ -57,9 +56,9 @@ When a monitor triggers, Datadog can take action automatically rather than waiti
 - **Workflow Automation** — Build a Datadog Workflow that chains pre-built actions across your infrastructure and tools (PagerDuty, Slack, Jira, AWS, Kubernetes, and so on), and run it from a monitor trigger. Best for the "trigger a runbook" patterns below. See [Trigger a workflow from a monitor][3].
 - **Webhooks** — Call any HTTP endpoint when a monitor triggers, recovers, or changes state. Best when the action lives in a system outside Datadog and you already have an HTTPS callback. See [Webhooks integration][2].
 
-Either option can be added to a monitor by mentioning it in the notification message — `@workflow-` for Workflow Automation, `@webhook-` for a webhook. Monitor metadata is available as template variables (`{{topic.name}}`, `{{kafka_cluster_id.name}}`, `{{value}}`, and so on) and can be passed to the workflow or webhook payload.
+Either option can be added to a monitor by mentioning it in the notification message: `@workflow-` for Workflow Automation, `@webhook-` for a webhook. Monitor metadata is available as template variables (`{{topic.name}}`, `{{kafka_cluster_id.name}}`, `{{value}}`, etc.) and can be passed to the workflow or webhook payload.
 
-The examples below show conditions where automation is particularly valuable in a Kafka pipeline.
+The following examples show conditions where automation is particularly valuable in a Kafka pipeline.
 
 ### Consumer lag is high
 
diff --git a/content/en/data_streams/kafka/setup.md b/content/en/data_streams/kafka/setup.md
index b0c54ed241c..aef7759d61f 100644
--- a/content/en/data_streams/kafka/setup.md
+++ b/content/en/data_streams/kafka/setup.md
@@ -1,7 +1,6 @@
 ---
 title: Kafka Monitoring Setup
 description: Set up Data Streams Monitoring's Kafka Monitoring, including prerequisites, Agent configuration, and the additional steps required to inspect Kafka messages.
-weight: 1
 ---
 
 This page covers the prerequisites and setup steps for Data Streams Monitoring's Kafka Monitoring.