Skip to content

Commit 4c2b7a5

Browse files
MatteoMoriclaude
andauthored
feat(metrics): implement Prometheus observability (#45)
* feat(metrics): implement Prometheus observability with dedicated server Replace generateRuntimeMetrics() with prometheus/client_golang and add flexible metrics server architecture supporting same-port or dedicated port deployment. Changes: - Add internal/metrics package with custom Prometheus registry - Configurable metrics port via --metrics-port flag (default: 8084) - Two-server architecture with proper WaitGroup coordination - Graceful shutdown for both main and metrics servers - Export kagent_tools_mcp_server_info (version metadata) - Export kagent_tools_mcp_registered_tools (tool providers) - Include Go runtime metrics (goroutines, memory, GC stats) - Include process metrics (CPU, memory, file descriptors) Architecture improvement: Move http.Server instantiation outside goroutines to prevent race condition between assignment and shutdown. Test coverage: 5 unit tests validating registry, collectors, and metrics. Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: MatteoMori <morimatteo14@gmail.com> * feat(metrics): auto-register tool metrics using ListTools() diff Use MCPServer.ListTools() to automatically detect which tools each provider registers, eliminating the need to modify individual tool packages. The approach snapshots the tool list before and after each provider's RegisterTools() call, then records the newly added tools in Prometheus with the correct tool_provider label. This means: - Zero changes required in any pkg/ file - Future tools are automatically tracked - No risk of forgetting to add a metric for a new tool Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: MatteoMori <morimatteo14@gmail.com> * feat(metrics): instrument tool handlers with invocation counters Add kagent_tools_mcp_invocations_total and kagent_tools_mcp_invocations_failure_total counters using the wrapper/middleware pattern. All handlers are centrally instrumented in wrapToolHandlersWithMetrics with zero changes to pkg/ files. Update README with Observability section and CLI flags reference. Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: MatteoMori <morimatteo14@gmail.com> * feat(observability): add Helm chart support and Grafana dashboard Add comprehensive Prometheus Operator integration via Helm chart: - ServiceMonitor resource for automatic target discovery - Dedicated metrics service (kagent-tools-metrics) - Deployment args for --metrics-port configuration - Configurable scrape interval, timeout, and labels Include Grafana dashboard with 8 panels visualizing: - Server version and health metrics - Tool invocation rates by provider - Success/failure rates and trends - Top invoked tools table with heat mapping Add CLAUDE.md with architecture documentation covering: - Tool provider pattern and MCP server lifecycle - Observability architecture (metrics wrapper pattern) - Development commands and key implementation patterns - Helm chart structure and troubleshooting guide Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: MatteoMori <morimatteo14@gmail.com> * fix(metrics): default metrics-port to 0 (same as --port) Previously --metrics-port defaulted to 8084, causing a mismatch when the server ran on any other port (e.g. E2E tests use port 18190). The metrics server would start on 8084 instead of sharing the main port, so /metrics was unreachable at the expected address. Change the default to 0, resolved at runtime as "same as --port". Update Helm templates to fall back to the main targetPort when tools.metrics.port is unset. Signed-off-by: MatteoMori <morimatteo14@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> * fix(metrics): count result.IsError as invocation failure The failure counter previously only incremented on non-nil Go errors. Handlers in this codebase signal tool-level failures by returning NewToolResultError(...), nil — result.IsError=true, err=nil — a pattern used 214 times across pkg/. This meant the failure metric was always 0 for tool-level errors. Fix the wrapper condition to check both: err != nil || (result != nil && result.IsError) Add three tests in cmd/metrics_wrap_test.go: - IsError=true increments failure counter (regression test) - Successful call does not increment failure counter - Real Go error increments failure counter Remove CLAUDE.md from the repository. Signed-off-by: MatteoMori <morimatteo14@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> --------- Signed-off-by: MatteoMori <morimatteo14@gmail.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
1 parent eaaefde commit 4c2b7a5

12 files changed

Lines changed: 1516 additions & 60 deletions

File tree

README.md

Lines changed: 34 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -188,9 +188,21 @@ go build -o kagent-tools .
188188

189189
The server runs using sse transport for MCP communication.
190190

191+
#### CLI Flags
192+
193+
| Flag | Default | Description |
194+
|------|---------|-------------|
195+
| `--port`, `-p` | `8084` | Port to run the MCP server on |
196+
| `--metrics-port` | `8084` | Port to run the Prometheus metrics server on |
197+
| `--stdio` | `false` | Use stdio for communication instead of HTTP |
198+
| `--tools` | `[]` (all) | Comma-separated list of tool providers to register |
199+
| `--read-only` | `false` | Disable tools that perform write operations |
200+
| `--kubeconfig` | `""` | Path to kubeconfig file (defaults to in-cluster config) |
201+
| `--version`, `-v` | `false` | Show version information and exit |
202+
191203
### Testing
192204
```bash
193-
go test -v
205+
go test -v ./...
194206
```
195207

196208
## Tool Implementation Details
@@ -243,6 +255,25 @@ Tools can be configured through environment variables:
243255
- `GRAFANA_URL`: Default Grafana server URL
244256
- `GRAFANA_API_KEY`: Default Grafana API key
245257

258+
## Observability
259+
260+
The MCP server exposes Prometheus metrics on a configurable HTTP endpoint (`/metrics`). By default, the metrics endpoint runs on the same port as the MCP server. To run it on a separate port:
261+
262+
```bash
263+
./kagent-tools --port 8084 --metrics-port 9090
264+
```
265+
266+
### Exposed Metrics
267+
268+
| Metric | Type | Labels | Description |
269+
|--------|------|--------|-------------|
270+
| `kagent_tools_mcp_server_info` | Gauge | `server_name`, `version`, `git_commit`, `build_date`, `server_mode` | Server metadata (always set to 1) |
271+
| `kagent_tools_mcp_registered_tools` | Gauge | `tool_name`, `tool_provider` | Set to 1 for each registered tool |
272+
| `kagent_tools_mcp_invocations_total` | Counter | `tool_name`, `tool_provider` | Total number of tool invocations |
273+
| `kagent_tools_mcp_invocations_failure_total` | Counter | `tool_name`, `tool_provider` | Total number of failed tool invocations |
274+
275+
Standard Go runtime and process metrics are also included (goroutines, memory, CPU, file descriptors, etc.).
276+
246277
## Error Handling and Debugging
247278

248279
The tools provide detailed error messages and support verbose output. When debugging issues:
@@ -258,9 +289,8 @@ Potential areas for future improvement:
258289
1. **Native Client Libraries**: Replace CLI calls with native Go client libraries where possible
259290
2. **Advanced Documentation Search**: Implement full vector search for documentation queries
260291
3. **Caching**: Add caching for frequently accessed data
261-
4. **Metrics and Observability**: Add metrics and tracing for tool usage
262-
5. **Configuration Management**: Enhanced configuration management and validation
263-
6. **Parallel Execution**: Support for parallel execution of related operations
292+
4. **Configuration Management**: Enhanced configuration management and validation
293+
5. **Parallel Execution**: Support for parallel execution of related operations
264294

265295
## Contributing
266296

cmd/main.go

Lines changed: 140 additions & 54 deletions
Original file line numberDiff line numberDiff line change
@@ -8,13 +8,15 @@ import (
88
"os"
99
"os/signal"
1010
"runtime"
11+
"strconv"
1112
"strings"
1213
"sync"
1314
"syscall"
1415
"time"
1516

1617
"github.com/joho/godotenv"
1718
"github.com/kagent-dev/tools/internal/logger"
19+
"github.com/kagent-dev/tools/internal/metrics"
1820
"github.com/kagent-dev/tools/internal/telemetry"
1921
"github.com/kagent-dev/tools/internal/version"
2022
"github.com/kagent-dev/tools/pkg/argo"
@@ -25,16 +27,19 @@ import (
2527
"github.com/kagent-dev/tools/pkg/kubescape"
2628
"github.com/kagent-dev/tools/pkg/prometheus"
2729
"github.com/kagent-dev/tools/pkg/utils"
30+
"github.com/prometheus/client_golang/prometheus/promhttp"
2831
"github.com/spf13/cobra"
2932
"go.opentelemetry.io/otel"
3033
"go.opentelemetry.io/otel/attribute"
3134
"go.opentelemetry.io/otel/codes"
3235

36+
"github.com/mark3labs/mcp-go/mcp"
3337
"github.com/mark3labs/mcp-go/server"
3438
)
3539

3640
var (
3741
port int
42+
metricsPort int
3843
stdio bool
3944
tools []string
4045
kubeconfig *string
@@ -56,6 +61,7 @@ var rootCmd = &cobra.Command{
5661

5762
func init() {
5863
rootCmd.Flags().IntVarP(&port, "port", "p", 8084, "Port to run the server on")
64+
rootCmd.Flags().IntVarP(&metricsPort, "metrics-port", "m", 0, "Port to run the metrics server on (default 0: same as --port)")
5965
rootCmd.Flags().BoolVar(&stdio, "stdio", false, "Use stdio for communication instead of HTTP")
6066
rootCmd.Flags().StringSliceVar(&tools, "tools", []string{}, "List of tools to register. If empty, all tools are registered.")
6167
rootCmd.Flags().BoolVarP(&showVersion, "version", "v", false, "Show version information and exit")
@@ -92,6 +98,11 @@ func run(cmd *cobra.Command, args []string) {
9298
return
9399
}
94100

101+
// 0 means "same as --port" - resolve it before any server logic uses it
102+
if metricsPort == 0 {
103+
metricsPort = port
104+
}
105+
95106
logger.Init(stdio)
96107
defer logger.Sync()
97108

@@ -134,8 +145,11 @@ func run(cmd *cobra.Command, args []string) {
134145
Version,
135146
)
136147

137-
// Register tools
138-
registerMCP(mcp, tools, *kubeconfig, readOnly)
148+
// Register tools and wrap handlers with metrics instrumentation.
149+
// registerMCP returns a map of tool_name -> tool_provider so that
150+
// wrapToolHandlersWithMetrics knows which provider each tool belongs to.
151+
toolProviders := registerMCP(mcp, tools, *kubeconfig, readOnly)
152+
wrapToolHandlersWithMetrics(mcp, toolProviders)
139153

140154
// Create wait group for server goroutines
141155
var wg sync.WaitGroup
@@ -146,6 +160,7 @@ func run(cmd *cobra.Command, args []string) {
146160

147161
// HTTP server reference (only used when not in stdio mode)
148162
var httpServer *http.Server
163+
var metricsServer *http.Server // Separate server for metrics if metricsPort is different from main port
149164

150165
// Start server based on chosen mode
151166
wg.Add(1)
@@ -170,17 +185,40 @@ func run(cmd *cobra.Command, args []string) {
170185
}
171186
})
172187

173-
// Add metrics endpoint (basic implementation for e2e tests)
174-
mux.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
175-
w.Header().Set("Content-Type", "text/plain")
176-
w.WriteHeader(http.StatusOK)
177-
178-
// Generate real runtime metrics instead of hardcoded values
179-
metrics := generateRuntimeMetrics()
180-
if err := writeResponse(w, []byte(metrics)); err != nil {
181-
logger.Get().Error("Failed to write metrics response", "error", err)
188+
// Add metrics endpoint
189+
registry := metrics.InitServer() // Initialize Prometheus metrics before starting the server
190+
191+
if metricsPort != port { // Only start a separate metrics server if the metrics port is different from the main server port
192+
// Create the metrics server outside the goroutine to avoid a race condition
193+
// between the goroutine assigning metricsServer and the shutdown handler reading it
194+
metricsMux := http.NewServeMux()
195+
metricsMux.Handle("/metrics", promhttp.HandlerFor(registry, promhttp.HandlerOpts{}))
196+
metricsServer = &http.Server{
197+
Addr: fmt.Sprintf(":%d", metricsPort),
198+
Handler: metricsMux,
182199
}
183-
})
200+
201+
wg.Add(1)
202+
go func() {
203+
defer wg.Done()
204+
logger.Get().Info("Starting Prometheus metrics endpoint on /metrics", "port", strconv.Itoa(metricsPort))
205+
if err := metricsServer.ListenAndServe(); err != nil {
206+
if !errors.Is(err, http.ErrServerClosed) {
207+
logger.Get().Error("Metrics endpoint failed", "error", err)
208+
} else {
209+
logger.Get().Info("Metrics server closed gracefully.")
210+
}
211+
}
212+
}()
213+
} else {
214+
logger.Get().Info("Starting Prometheus metrics endpoint on /metrics", "port", strconv.Itoa(port))
215+
mux.Handle("/metrics", promhttp.HandlerFor(registry, promhttp.HandlerOpts{}))
216+
}
217+
serverMode := "read-write"
218+
if readOnly {
219+
serverMode = "read-only"
220+
}
221+
metrics.KagentToolsMCPServerInfo.WithLabelValues(Name, Version, GitCommit, BuildDate, serverMode).Set(1)
184222

185223
// Handle all other routes with the MCP server wrapped in telemetry middleware
186224
mux.Handle("/", telemetry.HTTPMiddleware(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
@@ -229,6 +267,19 @@ func run(cmd *cobra.Command, args []string) {
229267
rootSpan.AddEvent("server.shutdown.completed")
230268
}
231269
}
270+
271+
// Gracefully shutdown metrics server if running separately
272+
if !stdio && metricsServer != nil {
273+
shutdownCtx, shutdownCancel := context.WithTimeout(context.Background(), 5*time.Second)
274+
defer shutdownCancel()
275+
276+
if err := metricsServer.Shutdown(shutdownCtx); err != nil {
277+
logger.Get().Error("Failed to shutdown metrics server gracefully", "error", err)
278+
rootSpan.RecordError(err)
279+
} else {
280+
logger.Get().Info("Metrics server shutdown completed")
281+
}
282+
}
232283
}()
233284

234285
// Wait for all server operations to complete
@@ -242,47 +293,6 @@ func writeResponse(w http.ResponseWriter, data []byte) error {
242293
return err
243294
}
244295

245-
// generateRuntimeMetrics generates real runtime metrics for the /metrics endpoint
246-
func generateRuntimeMetrics() string {
247-
var m runtime.MemStats
248-
runtime.ReadMemStats(&m)
249-
250-
now := time.Now().Unix()
251-
252-
// Build metrics in Prometheus format
253-
metrics := strings.Builder{}
254-
255-
// Go runtime info
256-
metrics.WriteString("# HELP go_info Information about the Go environment.\n")
257-
metrics.WriteString("# TYPE go_info gauge\n")
258-
metrics.WriteString(fmt.Sprintf("go_info{version=\"%s\"} 1\n", runtime.Version()))
259-
260-
// Process start time
261-
metrics.WriteString("# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.\n")
262-
metrics.WriteString("# TYPE process_start_time_seconds gauge\n")
263-
metrics.WriteString(fmt.Sprintf("process_start_time_seconds %d\n", now))
264-
265-
// Memory metrics
266-
metrics.WriteString("# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.\n")
267-
metrics.WriteString("# TYPE go_memstats_alloc_bytes gauge\n")
268-
metrics.WriteString(fmt.Sprintf("go_memstats_alloc_bytes %d\n", m.Alloc))
269-
270-
metrics.WriteString("# HELP go_memstats_total_alloc_bytes Total number of bytes allocated, even if freed.\n")
271-
metrics.WriteString("# TYPE go_memstats_total_alloc_bytes counter\n")
272-
metrics.WriteString(fmt.Sprintf("go_memstats_total_alloc_bytes %d\n", m.TotalAlloc))
273-
274-
metrics.WriteString("# HELP go_memstats_sys_bytes Number of bytes obtained from system.\n")
275-
metrics.WriteString("# TYPE go_memstats_sys_bytes gauge\n")
276-
metrics.WriteString(fmt.Sprintf("go_memstats_sys_bytes %d\n", m.Sys))
277-
278-
// Goroutine count
279-
metrics.WriteString("# HELP go_goroutines Number of goroutines that currently exist.\n")
280-
metrics.WriteString("# TYPE go_goroutines gauge\n")
281-
metrics.WriteString(fmt.Sprintf("go_goroutines %d\n", runtime.NumGoroutine()))
282-
283-
return metrics.String()
284-
}
285-
286296
func runStdioServer(ctx context.Context, mcp *server.MCPServer) {
287297
logger.Get().Info("Running KAgent Tools Server STDIO:", "tools", strings.Join(tools, ","))
288298
stdioServer := server.NewStdioServer(mcp)
@@ -291,7 +301,11 @@ func runStdioServer(ctx context.Context, mcp *server.MCPServer) {
291301
}
292302
}
293303

294-
func registerMCP(mcp *server.MCPServer, enabledToolProviders []string, kubeconfig string, readOnly bool) {
304+
// registerMCP registers tool providers with the MCP server and returns a mapping
305+
// of tool_name -> tool_provider. This mapping is built using the ListTools() diff
306+
// technique: we snapshot the tool list before and after each provider registers,
307+
// so we know exactly which tools belong to which provider.
308+
func registerMCP(mcp *server.MCPServer, enabledToolProviders []string, kubeconfig string, readOnly bool) map[string]string {
295309
// A map to hold tool providers and their registration functions
296310
toolProviderMap := map[string]func(*server.MCPServer){
297311
"argo": func(s *server.MCPServer) { argo.RegisterTools(s, readOnly) },
@@ -310,11 +324,83 @@ func registerMCP(mcp *server.MCPServer, enabledToolProviders []string, kubeconfi
310324
enabledToolProviders = append(enabledToolProviders, name)
311325
}
312326
}
327+
328+
// toolToProvider maps each tool name to its provider (e.g., "kubectl_get" -> "k8s").
329+
// This is used later by wrapToolHandlersWithMetrics to set the correct tool_provider label.
330+
toolToProvider := make(map[string]string)
331+
313332
for _, toolProviderName := range enabledToolProviders {
314333
if registerFunc, ok := toolProviderMap[toolProviderName]; ok {
334+
// Snapshot the tool list before this provider registers its tools.
335+
// We need this because ListTools() returns ALL tools from ALL providers,
336+
// so the only way to know which tools belong to THIS provider is to compare
337+
// the list before and after registration.
338+
toolsBefore := mcp.ListTools()
339+
315340
registerFunc(mcp)
341+
342+
// Determine which tools were just registered by this provider
343+
// by finding tools that exist now but didn't exist before.
344+
// Record each one in Prometheus so we can observe the full tool inventory.
345+
for toolName := range mcp.ListTools() {
346+
if _, existed := toolsBefore[toolName]; !existed {
347+
metrics.KagentToolsMCPRegisteredTools.WithLabelValues(toolName, toolProviderName).Set(1)
348+
toolToProvider[toolName] = toolProviderName
349+
}
350+
}
316351
} else {
317352
logger.Get().Error("Unknown tool specified", "provider", toolProviderName)
318353
}
319354
}
355+
356+
return toolToProvider
357+
}
358+
359+
// wrapToolHandlersWithMetrics applies the wrapper/middleware pattern to instrument
360+
// all registered MCP tool handlers with Prometheus invocation counters.
361+
//
362+
// How it works:
363+
// 1. Grab all registered tools from the MCP server using ListTools()
364+
// 2. For each tool, wrap its handler with a function that increments metrics
365+
// 3. Replace all tools in the MCP server using SetTools()
366+
//
367+
// The wrapper function:
368+
// - Increments kagent_tools_mcp_invocations_total on every call
369+
// - Increments kagent_tools_mcp_invocations_failure_total when the handler returns a
370+
// non-nil Go error OR when result.IsError is true (the MCP convention for tool-level
371+
// failures - handlers return NewToolResultError(...), nil, not a Go error)
372+
// - Calls the original handler unchanged - the tool's behaviour is not affected
373+
//
374+
// This uses the standard middleware/decorator pattern: the original handler and the
375+
// wrapped handler have the same function signature, so they are interchangeable.
376+
// No changes are required in any pkg/ file - all instrumentation happens centrally here.
377+
func wrapToolHandlersWithMetrics(mcpServer *server.MCPServer, toolToProvider map[string]string) {
378+
allTools := mcpServer.ListTools()
379+
wrapped := make([]server.ServerTool, 0, len(allTools))
380+
381+
for name, st := range allTools {
382+
originalHandler := st.Handler
383+
toolName := name // capture for closure
384+
provider := toolToProvider[toolName]
385+
386+
wrapped = append(wrapped, server.ServerTool{
387+
Tool: st.Tool,
388+
Handler: func(ctx context.Context, req mcp.CallToolRequest) (*mcp.CallToolResult, error) {
389+
metrics.KagentToolsMCPInvocationsTotal.WithLabelValues(toolName, provider).Inc()
390+
391+
result, err := originalHandler(ctx, req)
392+
393+
// Count as failure if the Go error is non-nil OR if the tool returned
394+
// a result with IsError=true (the MCP convention for tool-level failures,
395+
// which always return nil for the Go error).
396+
if err != nil || (result != nil && result.IsError) {
397+
metrics.KagentToolsMCPInvocationsFailureTotal.WithLabelValues(toolName, provider).Inc()
398+
}
399+
400+
return result, err
401+
},
402+
})
403+
}
404+
405+
mcpServer.SetTools(wrapped...)
320406
}

0 commit comments

Comments
 (0)