
Add Monitors and Automation page to DSM Kafka docs#36667

Open
shellymat-dd wants to merge 6 commits into master from shelly/dsm-kafka-monitors-and-automation

Add Monitors and Automation page to DSM Kafka docs#36667
shellymat-dd wants to merge 6 commits into
masterfrom
shelly/dsm-kafka-monitors-and-automation

Conversation

@shellymat-dd
Contributor

What does this PR do? What is the motivation?

Adds a new page, Monitors and Automation, under data_streams/kafka/ so customers landing on the Kafka docs can find:

  • The cluster- and topic-level monitor templates that Data Streams Monitoring ships out of the box (with the exact metrics, scopes, and default thresholds the product uses).
  • Concrete examples of using webhooks to automate a response when a Kafka monitor triggers — consumer lag rising, lag approaching retention (data-loss risk), broker disk exhaustion, and offline/under-replicated partitions.
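To make the webhook automation pattern concrete, here is a minimal sketch of a receiver-side handler for one of these alerts. The payload field names (`alert_title`, `alert_transition`, `tags`) are assumptions modeled on Datadog's webhook template variables (`$ALERT_TITLE`, `$ALERT_TRANSITION`, `$TAGS`), not the exact shipped schema, and the "scale consumers" action is a placeholder:

```python
import json

# Hypothetical webhook body; field names mirror Datadog webhook template
# variables but are illustrative, not the exact shipped schema.
SAMPLE_PAYLOAD = json.dumps({
    "alert_title": "Consumer lag is high for topic orders",
    "alert_transition": "Triggered",
    "tags": "kafka_cluster:prod,topic:orders,consumer_group:billing",
})


def parse_alert(body: str) -> dict:
    """Parse a webhook body and pull out the Kafka tags a responder needs."""
    event = json.loads(body)
    # The tags arrive as a single comma-separated "key:value" string.
    tags = dict(
        pair.split(":", 1)
        for pair in event.get("tags", "").split(",")
        if ":" in pair
    )
    return {
        "title": event["alert_title"],
        "firing": event["alert_transition"] == "Triggered",
        "topic": tags.get("topic"),
        "consumer_group": tags.get("consumer_group"),
    }


alert = parse_alert(SAMPLE_PAYLOAD)
if alert["firing"]:
    # Placeholder for a real remediation, e.g. scaling a consumer deployment.
    print(f"Scaling consumers for {alert['topic']} / {alert['consumer_group']}")
```

The same parsing step applies regardless of which remediation (Kubernetes scale-up, retention change, paging) the webhook drives.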

The monitor template details (titles, descriptions, metrics, default thresholds) are taken from the in-product templates defined in web-ui (getClusterRecommendedMonitors / getTopicRecommendedMonitors), so the docs match what users see in the {{< ui >}}Monitors{{< /ui >}} side panel on a cluster or topic page.

Also adds a weight field to setup.md and the new page so the left nav reads Setup → Monitors and Automation, and adds a pointer to the new page from the Kafka section's _index.md.

Merge instructions

Merge readiness:

  • Ready for merge

AI assistance

Drafted with Claude Code. The monitor templates section was rewritten against the source of truth in web-ui/packages/apps/data-streams/private/runtime/kafka/components/KafkaMonitorsSidePanel/kafka-monitors-side-panel.utils.ts to make sure metric names, scopes, and the 80% retention threshold match what's actually shipped. Webhook examples were authored with Shelly's input on which "when" conditions to cover.

Additional notes

  • The Consumer lag is approaching bytes retention limit template is conditional in-product (only shown when hasKafkaBrokerMetrics is true). The doc notes this requirement inline rather than omitting the template.
  • No screenshots in this first pass — happy to add a screenshot of the Monitor Templates side panel in a follow-up if reviewers want one.

Documents the monitor templates Data Streams Monitoring ships at cluster
and topic level, and gives concrete examples of automating responses with
webhooks (consumer lag, data-loss risk, broker disk, partition health).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@shellymat-dd shellymat-dd requested a review from a team as a code owner May 13, 2026 19:45
@maycmlee maycmlee added the editorial review Waiting on a more in-depth review label May 13, 2026
@maycmlee
Contributor

Created https://datadoghq.atlassian.net/browse/DOCS-14396 for docs review.

@shellymat-dd shellymat-dd marked this pull request as draft May 13, 2026 21:05
shellymat-dd and others added 2 commits May 13, 2026 17:32
The automation section previously framed webhooks as the only path. Workflow
Automation is the better fit for the "trigger a runbook" patterns described
below (Kubernetes, AWS, Slack, Jira actions). Reframe the section to present
both options, and flag the broker-disk example as a non-DSM monitor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Add page to left nav under Kafka via main.en.yaml (below Setup).
- Convert intro to a bulleted overview linking to in-page sections.
- Reorder topic-level monitor templates table (consumer lag first)
  and drop default thresholds that may change.
- Reformat per-scenario automation guidance: detection sentence
  followed by a single "Potential action" line.
- Cover Datadog Workflow Automation alongside webhooks.
- Replace the prose call-out on the Kafka index with a capability
  bullet that matches the existing list format.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the Architecture Everything related to the Doc backend label May 14, 2026
@shellymat-dd shellymat-dd marked this pull request as ready for review May 14, 2026 14:09
Contributor

@piochelepiotr piochelepiotr left a comment


Looks good to me. I see we're going pretty in-depth on technologies not owned by the DSM team (like webhooks and workflows). The downside is that these docs will go out of date if the webhook/workflow docs get updated. Also, my guess is that most users only want to get paged when an issue happens, not take automatic actions.

Comment thread content/en/data_streams/kafka/monitors_and_automation.md Outdated
- Link "Kafka broker metrics" to the Kafka integration page.
- "Compact a candidate topic" -> "reduce retention on a candidate topic"
  for the disk-filling-up remediation, since reducing retention is the
  more immediate capacity recovery action.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

@johannbotha johannbotha left a comment


I like that we're suggesting which monitors to have at each level. As a user, I'd still be confused about which metrics I get with KA and which are free with DD. I suggest adding a metrics section under Kafka that lists all the metrics you get in addition to the free metrics.

Then we can easily link to those in our recommended monitors section.

Comment thread content/en/data_streams/kafka/monitors_and_automation.md Outdated

| Template | Metric | Condition |
|------------------------------------------------------------|-------------------------------------------------------------------------|-----------|
| Consumer lag is high for topic | `kafka.estimated_consumer_lag` | Consumer lag exceeds a threshold for a topic and consumer group. |
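For orientation, a monitor query against that metric might look roughly like the following. The tag values are illustrative and the 300-second threshold is an assumption for the sketch, not the template's default (the PR deliberately drops default thresholds that may change); per the review discussion, the metric estimates lag in seconds, not offsets:

```
avg(last_10m):avg:kafka.estimated_consumer_lag{kafka_cluster:prod,topic:orders} by {consumer_group} > 300
```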
Contributor


The important part of this metric is that it's estimated in seconds, not offsets.

Contributor Author


Added a "description" column to explain the impact, so we can highlight that it's in seconds. Want to take a look at all the descriptions? The preview is still WIP, but you should be able to see the code change easily.

Contributor


Yes, that makes it clearer that this one is in seconds.

Contributor Author


Did you also read the other descriptions? Would love your review if anything is unclear.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>