Metrics Integration Guide

This document describes the Prometheus metrics integration across all CryptoFunk components.

Overview

All components expose Prometheus metrics on dedicated HTTP endpoints for monitoring system health, performance, and trading activity.

Metrics Endpoints

Component                    Port  Endpoint  Status
Orchestrator                 8081  /metrics  ✅ Complete
API Server                   8080  /metrics  ✅ Complete
Market Data Server           9201  /metrics  ✅ Complete
Technical Indicators Server  9202  /metrics  🔄 Pending
Risk Analyzer Server         9203  /metrics  🔄 Pending
Order Executor Server        9204  /metrics  🔄 Pending
Technical Agent              9101  /metrics  ✅ Complete (via BaseAgent)
Orderbook Agent              9102  /metrics  ✅ Complete (via BaseAgent)
Sentiment Agent              9103  /metrics  ✅ Complete (via BaseAgent)
Trend Agent                  9104  /metrics  ✅ Complete (via BaseAgent)
Reversion Agent              9105  /metrics  ✅ Complete (via BaseAgent)
Arbitrage Agent              9106  /metrics  ✅ Complete (via BaseAgent)
Risk Agent                   9107  /metrics  ✅ Complete (via BaseAgent)

Metrics Categories

1. MCP Server Metrics

Request Metrics:

  • cryptofunk_mcp_requests_total - Total MCP requests by method and status
  • cryptofunk_mcp_request_duration_seconds - MCP request latency distribution

Tool Call Metrics:

  • cryptofunk_mcp_tool_calls_total - Total tool calls by tool name and status
  • cryptofunk_mcp_tool_call_duration_seconds - Tool call latency distribution

2. Agent Metrics

Activity Metrics:

  • cryptofunk_agent_signals_total - Signals generated by agent and type
  • cryptofunk_agent_analysis_duration_seconds - Analysis duration
  • cryptofunk_agent_confidence - Signal confidence distribution
  • cryptofunk_agent_healthy - Agent health status (1=healthy, 0=unhealthy)

LLM Metrics:

  • cryptofunk_agent_llm_calls_total - LLM calls by provider and status
  • cryptofunk_agent_llm_duration_seconds - LLM call latency
  • cryptofunk_agent_llm_tokens_total - Tokens used (prompt/completion)

3. Trading Metrics

Trade Execution:

  • cryptofunk_trades_total - Total trades by symbol, side, status, and mode
  • cryptofunk_trade_value_usd - Trade value distribution
  • cryptofunk_positions_open - Number of open positions
  • cryptofunk_portfolio_value_usd - Total portfolio value

Performance:

  • cryptofunk_total_pnl - Total profit/loss
  • cryptofunk_win_rate - Win rate ratio
  • cryptofunk_sharpe_ratio - Risk-adjusted return
  • cryptofunk_current_drawdown - Current drawdown percentage

4. Risk Metrics

Risk Management:

  • cryptofunk_risk_limit_breaches_total - Risk limit violations by type
  • cryptofunk_circuit_breaker_tripped - Circuit breaker status
  • cryptofunk_var_value_usd - Value at Risk

5. System Metrics

Component Health:

  • cryptofunk_component_uptime_seconds - Component uptime
  • cryptofunk_component_healthy - Component health status
  • cryptofunk_database_connections_active - Active database connections
  • cryptofunk_redis_connections_active - Active Redis connections

Messaging:

  • cryptofunk_nats_messages_published_total - NATS messages published
  • cryptofunk_nats_messages_received_total - NATS messages received
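All of these metrics are exposed in the standard Prometheus text exposition format when an endpoint is scraped. As an illustration (label and sample values here are made up), the MCP request counter and its latency histogram might render as:

```text
# HELP cryptofunk_mcp_requests_total Total MCP requests by method and status
# TYPE cryptofunk_mcp_requests_total counter
cryptofunk_mcp_requests_total{server="market-data",method="tools/call",status="success"} 42
# HELP cryptofunk_mcp_request_duration_seconds MCP request latency distribution
# TYPE cryptofunk_mcp_request_duration_seconds histogram
cryptofunk_mcp_request_duration_seconds_bucket{server="market-data",method="tools/call",le="0.1"} 40
cryptofunk_mcp_request_duration_seconds_sum{server="market-data",method="tools/call"} 1.9
cryptofunk_mcp_request_duration_seconds_count{server="market-data",method="tools/call"} 42
```

Histograms expand into `_bucket`, `_sum`, and `_count` series, which is why the latency queries later in this document reference `_bucket` rather than the bare metric name.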

Adding Metrics to MCP Servers

All MCP servers should follow this pattern:

1. Import Metrics Package

import (
    // ... other imports
    "github.com/ajitpratap0/cryptofunk/internal/metrics"
)

2. Define Server Name Constant

const (
    serverName = "market-data"  // or "technical-indicators", "risk-analyzer", "order-executor"
)

3. Start Metrics Server in main()

func main() {
    // ... logger setup ...

    // Start metrics server on assigned port
    metricsServer := metrics.NewServer(9201, logger) // Use appropriate port
    if err := metricsServer.Start(); err != nil {
        logger.Fatal().Err(err).Msg("Failed to start metrics server")
    }
    logger.Info().Msg("Metrics server started on :9201")

    // ... rest of main() ...
}

4. Record Request Metrics in handleRequest()

func (s *MCPServer) handleRequest(req *MCPRequest) *MCPResponse {
    startTime := time.Now()

    response := &MCPResponse{
        JSONRPC: "2.0",
        ID:      req.ID,
    }

    defer func() {
        status := "success"
        if response.Error != nil {
            status = "error"
        }
        metrics.MCPRequestsTotal.WithLabelValues(serverName, req.Method, status).Inc()
        metrics.MCPRequestDuration.WithLabelValues(serverName, req.Method).Observe(time.Since(startTime).Seconds())
    }()

    // ... handle request ...
}

5. Record Tool Call Metrics in callTool()

func (s *MCPServer) callTool(ctx context.Context, name string, args map[string]interface{}) (interface{}, error) {
    startTime := time.Now()

    var result interface{}
    var err error

    // Execute the requested tool
    switch name {
    case "tool_name":
        result, err = s.service.handleTool(ctx, args)
    // ... other cases ...
    }

    // Record metrics
    status := "success"
    if err != nil {
        status = "error"
    }
    metrics.MCPToolCallsTotal.WithLabelValues(serverName, name, status).Inc()
    metrics.MCPToolCallDuration.WithLabelValues(serverName, name).Observe(time.Since(startTime).Seconds())

    return result, err
}

Implementation Status

Completed ✅

  1. Metrics Infrastructure (internal/metrics/)

    • Comprehensive metrics definitions
    • HTTP server for exposing metrics
    • Helper functions for recording metrics
  2. Prometheus Configuration (deployments/prometheus/prometheus.yml)

    • All scrape targets configured
    • Appropriate scrape intervals (10-15s)
    • Dedicated job for MCP servers and agents
  3. Agent Metrics (internal/agents/base.go)

    • All agents inherit metrics from BaseAgent
    • Automatic health tracking
    • MCP call instrumentation
  4. Orchestrator Metrics (internal/orchestrator/)

    • Voting and consensus metrics
    • Session management metrics
  5. Market Data Server (cmd/mcp-servers/market-data/main.go)

    • Metrics server on port 9201
    • Request and tool call instrumentation

Pending 🔄

The following MCP servers need metrics integration following the pattern above:

  1. Technical Indicators Server (Port 9202)

    • Location: cmd/mcp-servers/technical-indicators/main.go
    • Tools to instrument: RSI, MACD, Bollinger, EMA, ADX
  2. Risk Analyzer Server (Port 9203)

    • Location: cmd/mcp-servers/risk-analyzer/main.go
    • Tools to instrument: calculate_var, check_limits, kelly_criterion
  3. Order Executor Server (Port 9204)

    • Location: cmd/mcp-servers/order-executor/main.go
    • Tools to instrument: place_market_order, place_limit_order, cancel_order

Testing Metrics

Local Testing

# Start component with metrics
./bin/market-data-server &

# Check metrics endpoint
curl http://localhost:9201/metrics

# Check health endpoint
curl http://localhost:9201/health

Prometheus Integration

# Start Prometheus
docker-compose -f deployments/docker-compose.yml up prometheus

# Access Prometheus UI
open http://localhost:9090

# Example PromQL queries (run in the Prometheus UI, not the shell)
cryptofunk_mcp_requests_total{server="market-data"}
rate(cryptofunk_mcp_tool_calls_total[5m])

Grafana Dashboards

After deploying Prometheus, create Grafana dashboards for:

  • MCP server performance
  • Agent activity and health
  • Trading performance
  • Risk metrics
  • System health

See docs/GRAFANA_DASHBOARDS.md for dashboard JSON templates (to be created in T276).

Example Prometheus Queries

MCP Server Performance

# Request rate by server
rate(cryptofunk_mcp_requests_total[5m])

# P95 request latency
histogram_quantile(0.95, rate(cryptofunk_mcp_request_duration_seconds_bucket[5m]))

# Error rate by server
rate(cryptofunk_mcp_requests_total{status="error"}[5m]) /
rate(cryptofunk_mcp_requests_total[5m])

# Tool call latency by tool
histogram_quantile(0.99, rate(cryptofunk_mcp_tool_call_duration_seconds_bucket[5m]))

Agent Metrics

# Signals generated per minute
rate(cryptofunk_agent_signals_total[1m]) * 60

# Agent health status
cryptofunk_agent_healthy

# LLM call latency
histogram_quantile(0.95, rate(cryptofunk_agent_llm_duration_seconds_bucket[5m]))

# LLM tokens per hour
rate(cryptofunk_agent_llm_tokens_total[1h]) * 3600

Trading Performance

# Win rate
cryptofunk_win_rate

# Current drawdown
cryptofunk_current_drawdown

# Trades per hour
rate(cryptofunk_trades_total[1h]) * 3600

# Portfolio value
cryptofunk_portfolio_value_usd

System Health

# Component uptime in hours
cryptofunk_component_uptime_seconds / 3600

# Database connection pool usage
cryptofunk_database_connections_active / cryptofunk_database_connections_max

# NATS message rate
rate(cryptofunk_nats_messages_published_total[1m])

Alerting Rules

Recommended Prometheus alerting rules:

groups:
  - name: cryptofunk_alerts
    rules:
      - alert: MCPServerDown
        expr: up{job="mcp-servers"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "MCP Server {{ $labels.instance }} is down"

      - alert: HighMCPErrorRate
        expr: rate(cryptofunk_mcp_requests_total{status="error"}[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on {{ $labels.server }}"

      - alert: AgentUnhealthy
        expr: cryptofunk_agent_healthy == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Agent {{ $labels.agent }} is unhealthy"

      - alert: HighDrawdown
        expr: cryptofunk_current_drawdown > 0.15
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Current drawdown exceeds 15%"

      - alert: CircuitBreakerTripped
        expr: cryptofunk_circuit_breaker_tripped == 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Circuit breaker tripped: {{ $labels.reason }}"

Next Steps

  1. Complete MCP Server Instrumentation (T277 - in progress)

    • Add metrics to technical-indicators server
    • Add metrics to risk-analyzer server
    • Add metrics to order-executor server
  2. Create Grafana Dashboards (T276 - pending)

    • System overview dashboard
    • Trading performance dashboard
    • Agent performance dashboard
    • Risk metrics dashboard
  3. Add AlertManager Integration (T278 - pending)

    • Configure alert routing
    • Set up notification channels (Slack, email, PagerDuty)
    • Define alert escalation policies
  4. Production Dry-Run (T286 - pending)

    • Deploy full stack with metrics
    • Verify all metrics are being collected
    • Test alerting rules
    • Validate dashboard accuracy
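For the AlertManager work in T278, routing typically splits on the severity label already attached to the rules above. A hedged sketch of what such a config might look like (receiver names, channel, and routing key are placeholders, not the project's actual settings):

```yaml
route:
  receiver: default
  routes:
    - matchers:
        - severity = "critical"
      receiver: pagerduty

receivers:
  - name: default
    slack_configs:
      - channel: "#cryptofunk-alerts"
  - name: pagerduty
    pagerduty_configs:
      - routing_key: "<integration-key>"
```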

References