Replies: 2 comments 3 replies
This is a great idea! Some other interesting thoughts popped into my mind about this:
Posting an update following a discussion between @platinummonkey and the teams involved here at Datadog: We've agreed that we're not planning to implement/ship That said, we do intend to carry forward elements of this proposal. You should expect parts of the approach to be integrated into other Datadog features and/or APIs where they can be maintained and supported responsibly. We'll use this discussion as a place to collect feedback and requirements so they're easy to track and route to the right owners.
RFC: `pup vet` - Observability health checks
Moved from #129
The Problem
Observability configurations degrade silently. Monitors lose their notification channels. Traces break across service boundaries. Logs stop correlating with APM. These aren't matters of preference or org-specific style — they're genuinely broken configurations that are invisible until something goes wrong.
No one finds these without a dedicated audit, and no one does a dedicated audit until it's too late.
The Idea
`pup vet`: a health check for your Datadog setup. Surfaces things that are universally broken or misconfigured, without assuming how your org has chosen to structure Datadog.
The name fits: take your pup to the vet for a checkup.
Design Principles
Universally useful, not opinionated. Every check should surface something that's broken regardless of how an org uses Datadog. Silent monitors are broken everywhere. Broken trace correlation is broken everywhere. We don't tell you how to tag or organize — we find things that aren't working as intended.
Agent-first, human-readable. In agent mode, output is structured JSON with severity, affected resources, and actionable recommendations. In human mode, a clear terminal summary. An AI agent can run `pup vet` as a starting point for any observability task.
Scoped or full. `pup vet` audits everything. `pup vet --service=web-api` or `pup vet --tags=team:platform` narrows the scope.
Initial Checks
Starting with things that are universally broken — no assumptions about org structure:
- `silent-monitors`
- `stale-monitors`
- `untagged-monitors`
- `muted-forgotten`
- `no-recovery-threshold`

Natural expansions as we learn what's valuable: broken trace/log correlation, USM gaps, SLOs without error budget alerts, services without any monitoring.
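To make the agent-mode output and tag scoping described above more concrete, here is a minimal Python sketch of what a `silent-monitors` check might look like. It is not pup's implementation. The `Finding` fields, the severity value, the function names, and the sample monitors are all hypothetical, and it assumes that "silent" means the monitor message contains no `@handle` notification target. The input dicts only borrow the `id`, `name`, `message`, and `tags` fields from Datadog's monitor objects; no API calls are made.

```python
import json
import re
from dataclasses import dataclass, asdict

# Assumption: a notification target is any "@handle" in the monitor message.
HANDLE = re.compile(r"@[\w-]+")


@dataclass
class Finding:
    """Hypothetical shape of one result; not a published pup schema."""
    check: str           # check name, e.g. "silent-monitors"
    severity: str        # hypothetical levels such as "critical" or "warning"
    resource: str        # identifier of the affected resource
    recommendation: str  # actionable next step


def in_scope(monitor, tags=None):
    """Keep a monitor if it carries every requested tag (no tags = full audit)."""
    return set(tags or []) <= set(monitor.get("tags", []))


def check_silent_monitors(monitors, tags=None):
    """Flag in-scope monitors whose message contains no @handle."""
    findings = []
    for m in monitors:
        if not in_scope(m, tags):
            continue
        if not HANDLE.search(m.get("message", "")):
            findings.append(Finding(
                check="silent-monitors",
                severity="critical",
                resource=f"monitor:{m['id']}",
                recommendation=f"Add a notification handle to '{m['name']}'.",
            ))
    return findings


def render_agent(findings):
    """Serialize findings as JSON, as agent mode might."""
    return json.dumps({"findings": [asdict(f) for f in findings]}, indent=2)


# Made-up sample data for illustration only.
SAMPLE_MONITORS = [
    {"id": 1, "name": "High CPU", "message": "CPU is high @slack-ops", "tags": ["team:platform"]},
    {"id": 2, "name": "Disk full", "message": "Disk almost full", "tags": ["team:platform"]},
    {"id": 3, "name": "Queue lag", "message": "", "tags": ["team:data"]},
]

if __name__ == "__main__":
    # Roughly what `pup vet --tags=team:platform` could emit in agent mode.
    print(render_agent(check_silent_monitors(SAMPLE_MONITORS, tags=["team:platform"])))
```

Keeping each check a pure function over already-fetched resources would let the same findings feed both the JSON output and a human-readable summary, though that split is a design choice of this sketch rather than something the proposal specifies.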
Where This Goes
`pup vet` is the foundation. Once you can audit what exists and what's broken, the next natural question during an incident becomes: "what's actually happening right now across all my data?" That's a cross-domain investigation capability that builds on top of vet, but it's a separate proposal, and one that needs more careful design to avoid being org-specific.
Design Detail
A more detailed design doc (shared types, output schemas, phasing) is available at `docs/plans/2026-02-27-vet-investigate-design.md`.
Open Questions
`pup vet` existed today?
This design was developed collaboratively with Claude Code.