Metrics Integration Guide

This document describes the Prometheus metrics integration across all CryptoFunk components.

Overview

All components expose Prometheus metrics on dedicated HTTP endpoints for monitoring system health, performance, and trading activity.

Metrics Endpoints

Component                    Port  Endpoint  Status
Orchestrator                 8081  /metrics  ✅ Complete
API Server                   8080  /metrics  ✅ Complete
Market Data Server           9201  /metrics  ✅ Complete
Technical Indicators Server  9202  /metrics  🔄 Pending
Risk Analyzer Server         9203  /metrics  🔄 Pending
Order Executor Server        9204  /metrics  🔄 Pending
Technical Agent              9101  /metrics  ✅ Complete (via BaseAgent)
Orderbook Agent              9102  /metrics  ✅ Complete (via BaseAgent)
Sentiment Agent              9103  /metrics  ✅ Complete (via BaseAgent)
Trend Agent                  9104  /metrics  ✅ Complete (via BaseAgent)
Reversion Agent              9105  /metrics  ✅ Complete (via BaseAgent)
Arbitrage Agent              9106  /metrics  ✅ Complete (via BaseAgent)
Risk Agent                   9107  /metrics  ✅ Complete (via BaseAgent)

Metrics Categories

1. MCP Server Metrics

Request Metrics:

  • cryptofunk_mcp_requests_total - Total MCP requests by method and status
  • cryptofunk_mcp_request_duration_seconds - MCP request latency distribution

Tool Call Metrics:

  • cryptofunk_mcp_tool_calls_total - Total tool calls by tool name and status
  • cryptofunk_mcp_tool_call_duration_seconds - Tool call latency distribution

2. Agent Metrics

Activity Metrics:

  • cryptofunk_agent_signals_total - Signals generated by agent and type
  • cryptofunk_agent_analysis_duration_seconds - Analysis duration
  • cryptofunk_agent_confidence - Signal confidence distribution
  • cryptofunk_agent_healthy - Agent health status (1=healthy, 0=unhealthy)

LLM Metrics:

  • cryptofunk_agent_llm_calls_total - LLM calls by provider and status
  • cryptofunk_agent_llm_duration_seconds - LLM call latency
  • cryptofunk_agent_llm_tokens_total - Tokens used (prompt/completion)

3. Trading Metrics

Trade Execution:

  • cryptofunk_trades_total - Total trades by symbol, side, status, and mode
  • cryptofunk_trade_value_usd - Trade value distribution
  • cryptofunk_positions_open - Number of open positions
  • cryptofunk_portfolio_value_usd - Total portfolio value

Performance:

  • cryptofunk_total_pnl - Total profit/loss
  • cryptofunk_win_rate - Win rate ratio
  • cryptofunk_sharpe_ratio - Risk-adjusted return
  • cryptofunk_current_drawdown - Current drawdown percentage

4. Risk Metrics

Risk Management:

  • cryptofunk_risk_limit_breaches_total - Risk limit violations by type
  • cryptofunk_circuit_breaker_tripped - Circuit breaker status
  • cryptofunk_var_value_usd - Value at Risk

5. System Metrics

Component Health:

  • cryptofunk_component_uptime_seconds - Component uptime
  • cryptofunk_component_healthy - Component health status
  • cryptofunk_database_connections_active - Active database connections
  • cryptofunk_redis_connections_active - Active Redis connections

Messaging:

  • cryptofunk_nats_messages_published_total - NATS messages published
  • cryptofunk_nats_messages_received_total - NATS messages received
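All of these metrics are exposed in the standard Prometheus text exposition format when an endpoint is scraped. As an illustration (label and sample values here are made up), the MCP request counter and its latency histogram might render as:

```text
# HELP cryptofunk_mcp_requests_total Total MCP requests by method and status
# TYPE cryptofunk_mcp_requests_total counter
cryptofunk_mcp_requests_total{server="market-data",method="tools/call",status="success"} 42
# HELP cryptofunk_mcp_request_duration_seconds MCP request latency distribution
# TYPE cryptofunk_mcp_request_duration_seconds histogram
cryptofunk_mcp_request_duration_seconds_bucket{server="market-data",method="tools/call",le="0.1"} 40
cryptofunk_mcp_request_duration_seconds_sum{server="market-data",method="tools/call"} 1.9
cryptofunk_mcp_request_duration_seconds_count{server="market-data",method="tools/call"} 42
```

Histograms expand into `_bucket`, `_sum`, and `_count` series, which is why the latency queries later in this document reference `_bucket` rather than the bare metric name.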

Adding Metrics to MCP Servers

All MCP servers should follow this pattern:

1. Import Metrics Package

import (
    // ... other imports
    "github.com/ajitpratap0/cryptofunk/internal/metrics"
)

2. Define Server Name Constant

const (
    serverName = "market-data"  // or "technical-indicators", "risk-analyzer", "order-executor"
)

3. Start Metrics Server in main()

func main() {
    // ... logger setup ...

    // Start metrics server on assigned port
    metricsServer := metrics.NewServer(9201, logger) // Use appropriate port
    if err := metricsServer.Start(); err != nil {
        logger.Fatal().Err(err).Msg("Failed to start metrics server")
    }
    logger.Info().Msg("Metrics server started on :9201")

    // ... rest of main() ...
}

4. Record Request Metrics in handleRequest()

func (s *MCPServer) handleRequest(req *MCPRequest) *MCPResponse {
    startTime := time.Now()

    response := &MCPResponse{
        JSONRPC: "2.0",
        ID:      req.ID,
    }

    defer func() {
        status := "success"
        if response.Error != nil {
            status = "error"
        }
        metrics.MCPRequestsTotal.WithLabelValues(serverName, req.Method, status).Inc()
        metrics.MCPRequestDuration.WithLabelValues(serverName, req.Method).Observe(time.Since(startTime).Seconds())
    }()

    // ... handle request ...
}

5. Record Tool Call Metrics in callTool()

func (s *MCPServer) callTool(ctx context.Context, name string, args map[string]interface{}) (interface{}, error) {
    startTime := time.Now()

    var result interface{}
    var err error

    // Execute the requested tool
    switch name {
    case "tool_name":
        result, err = s.service.handleTool(ctx, args)
    // ... other cases ...
    }

    // Record metrics
    status := "success"
    if err != nil {
        status = "error"
    }
    metrics.MCPToolCallsTotal.WithLabelValues(serverName, name, status).Inc()
    metrics.MCPToolCallDuration.WithLabelValues(serverName, name).Observe(time.Since(startTime).Seconds())

    return result, err
}

Implementation Status

Completed ✅

  1. Metrics Infrastructure (internal/metrics/)

    • Comprehensive metrics definitions
    • HTTP server for exposing metrics
    • Helper functions for recording metrics
  2. Prometheus Configuration (deployments/prometheus/prometheus.yml)

    • All scrape targets configured
    • Appropriate scrape intervals (10-15s)
    • Dedicated job for MCP servers and agents
  3. Agent Metrics (internal/agents/base.go)

    • All agents inherit metrics from BaseAgent
    • Automatic health tracking
    • MCP call instrumentation
  4. Orchestrator Metrics (internal/orchestrator/)

    • Voting and consensus metrics
    • Session management metrics
  5. Market Data Server (cmd/mcp-servers/market-data/main.go)

    • Metrics server on port 9201
    • Request and tool call instrumentation

Pending 🔄

The following MCP servers need metrics integration following the pattern above:

  1. Technical Indicators Server (Port 9202)

    • Location: cmd/mcp-servers/technical-indicators/main.go
    • Tools to instrument: RSI, MACD, Bollinger, EMA, ADX
  2. Risk Analyzer Server (Port 9203)

    • Location: cmd/mcp-servers/risk-analyzer/main.go
    • Tools to instrument: calculate_var, check_limits, kelly_criterion
  3. Order Executor Server (Port 9204)

    • Location: cmd/mcp-servers/order-executor/main.go
    • Tools to instrument: place_market_order, place_limit_order, cancel_order

Testing Metrics

Local Testing

# Start component with metrics
./bin/market-data-server &

# Check metrics endpoint
curl http://localhost:9201/metrics

# Check health endpoint
curl http://localhost:9201/health

Prometheus Integration

# Start Prometheus
docker-compose -f deployments/docker-compose.yml up prometheus

# Access Prometheus UI
open http://localhost:9090

# Example PromQL queries (run in the Prometheus UI, not the shell)
cryptofunk_mcp_requests_total{server="market-data"}
rate(cryptofunk_mcp_tool_calls_total[5m])

Grafana Dashboards

After deploying Prometheus, create Grafana dashboards for:

  • MCP server performance
  • Agent activity and health
  • Trading performance
  • Risk metrics
  • System health

See docs/GRAFANA_DASHBOARDS.md for dashboard JSON templates (to be created in T276).

Example Prometheus Queries

MCP Server Performance

# Request rate by server
rate(cryptofunk_mcp_requests_total[5m])

# P95 request latency
histogram_quantile(0.95, rate(cryptofunk_mcp_request_duration_seconds_bucket[5m]))

# Error rate by server
rate(cryptofunk_mcp_requests_total{status="error"}[5m]) /
rate(cryptofunk_mcp_requests_total[5m])

# Tool call latency by tool
histogram_quantile(0.99, rate(cryptofunk_mcp_tool_call_duration_seconds_bucket[5m]))

Agent Metrics

# Signals generated per minute
rate(cryptofunk_agent_signals_total[1m]) * 60

# Agent health status
cryptofunk_agent_healthy

# LLM call latency
histogram_quantile(0.95, rate(cryptofunk_agent_llm_duration_seconds_bucket[5m]))

# LLM tokens per hour
rate(cryptofunk_agent_llm_tokens_total[1h]) * 3600

Trading Performance

# Win rate
cryptofunk_win_rate

# Current drawdown
cryptofunk_current_drawdown

# Trades per hour
rate(cryptofunk_trades_total[1h]) * 3600

# Portfolio value
cryptofunk_portfolio_value_usd

System Health

# Component uptime in hours
cryptofunk_component_uptime_seconds / 3600

# Database connection pool usage
cryptofunk_database_connections_active / cryptofunk_database_connections_max

# NATS message rate
rate(cryptofunk_nats_messages_published_total[1m])

Alerting Rules

Recommended Prometheus alerting rules:

groups:
  - name: cryptofunk_alerts
    rules:
      - alert: MCPServerDown
        expr: up{job="mcp-servers"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "MCP Server {{ $labels.instance }} is down"

      - alert: HighMCPErrorRate
        expr: rate(cryptofunk_mcp_requests_total{status="error"}[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on {{ $labels.server }}"

      - alert: AgentUnhealthy
        expr: cryptofunk_agent_healthy == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Agent {{ $labels.agent }} is unhealthy"

      - alert: HighDrawdown
        expr: cryptofunk_current_drawdown > 0.15
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Current drawdown exceeds 15%"

      - alert: CircuitBreakerTripped
        expr: cryptofunk_circuit_breaker_tripped == 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Circuit breaker tripped: {{ $labels.reason }}"

Next Steps

  1. Complete MCP Server Instrumentation (T277 - in progress)

    • Add metrics to technical-indicators server
    • Add metrics to risk-analyzer server
    • Add metrics to order-executor server
  2. Create Grafana Dashboards (T276 - pending)

    • System overview dashboard
    • Trading performance dashboard
    • Agent performance dashboard
    • Risk metrics dashboard
  3. Add AlertManager Integration (T278 - pending)

    • Configure alert routing
    • Set up notification channels (Slack, email, PagerDuty)
    • Define alert escalation policies
  4. Production Dry-Run (T286 - pending)

    • Deploy full stack with metrics
    • Verify all metrics are being collected
    • Test alerting rules
    • Validate dashboard accuracy
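For the AlertManager work in T278, routing typically splits on the severity label already attached to the rules above. A hedged sketch of what such a config might look like (receiver names, channel, and routing key are placeholders, not the project's actual settings):

```yaml
route:
  receiver: default
  routes:
    - matchers:
        - severity = "critical"
      receiver: pagerduty

receivers:
  - name: default
    slack_configs:
      - channel: "#cryptofunk-alerts"
  - name: pagerduty
    pagerduty_configs:
      - routing_key: "<integration-key>"
```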

References