This document describes the Prometheus metrics integration across all CryptoFunk components.
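Each component exposes Prometheus metrics on a dedicated HTTP port at /metrics for monitoring system health, performance, and trading activity. The table below lists the assigned ports and the current integration status; three MCP servers are still pending.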
| Component | Port | Endpoint | Status |
|---|---|---|---|
| Orchestrator | 8081 | /metrics | ✅ Complete |
| API Server | 8080 | /metrics | ✅ Complete |
| Market Data Server | 9201 | /metrics | ✅ Complete |
| Technical Indicators Server | 9202 | /metrics | 🔄 Pending |
| Risk Analyzer Server | 9203 | /metrics | 🔄 Pending |
| Order Executor Server | 9204 | /metrics | 🔄 Pending |
| Technical Agent | 9101 | /metrics | ✅ Complete (via BaseAgent) |
| Orderbook Agent | 9102 | /metrics | ✅ Complete (via BaseAgent) |
| Sentiment Agent | 9103 | /metrics | ✅ Complete (via BaseAgent) |
| Trend Agent | 9104 | /metrics | ✅ Complete (via BaseAgent) |
| Reversion Agent | 9105 | /metrics | ✅ Complete (via BaseAgent) |
| Arbitrage Agent | 9106 | /metrics | ✅ Complete (via BaseAgent) |
| Risk Agent | 9107 | /metrics | ✅ Complete (via BaseAgent) |
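The port assignments above lend themselves to a quick smoke check. The following is a minimal sketch, not part of the CryptoFunk codebase: `metricNames` and `fetchMetrics` are hypothetical helpers, and the program assumes every component listens on localhost. It fetches `/metrics` for each port given on the command line and counts the distinct metric names in the response.

```go
package main

import (
    "bufio"
    "fmt"
    "io"
    "net/http"
    "os"
    "strings"
    "time"
)

// metricNames returns the distinct sample names found in a Prometheus
// text exposition body. Comment lines (# HELP, # TYPE) are skipped.
// Histogram series show up under their _bucket, _sum and _count names.
func metricNames(body string) map[string]bool {
    names := make(map[string]bool)
    sc := bufio.NewScanner(strings.NewReader(body))
    for sc.Scan() {
        line := strings.TrimSpace(sc.Text())
        if line == "" || strings.HasPrefix(line, "#") {
            continue
        }
        // The name ends at the label block or at the first space.
        end := strings.IndexAny(line, "{ ")
        if end < 0 {
            end = len(line)
        }
        names[line[:end]] = true
    }
    return names
}

// fetchMetrics downloads one /metrics page with a short timeout.
func fetchMetrics(url string) (string, error) {
    client := &http.Client{Timeout: 3 * time.Second}
    resp, err := client.Get(url)
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return "", fmt.Errorf("unexpected status %s", resp.Status)
    }
    b, err := io.ReadAll(resp.Body)
    return string(b), err
}

func main() {
    // Ports come from the command line, e.g. `go run . 8081 9201 9101`.
    // With no arguments nothing is fetched.
    for _, port := range os.Args[1:] {
        url := "http://localhost:" + port + "/metrics"
        body, err := fetchMetrics(url)
        if err != nil {
            fmt.Printf("%s: FAIL (%v)\n", url, err)
            continue
        }
        fmt.Printf("%s: OK, %d metric names\n", url, len(metricNames(body)))
    }
}
```

Ports marked Pending in the table would be expected to report FAIL until their instrumentation lands.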
Request Metrics:
- cryptofunk_mcp_requests_total - Total MCP requests by method and status
- cryptofunk_mcp_request_duration_seconds - MCP request latency distribution

Tool Call Metrics:
- cryptofunk_mcp_tool_calls_total - Total tool calls by tool name and status
- cryptofunk_mcp_tool_call_duration_seconds - Tool call latency distribution

Activity Metrics:
- cryptofunk_agent_signals_total - Signals generated by agent and type
- cryptofunk_agent_analysis_duration_seconds - Analysis duration
- cryptofunk_agent_confidence - Signal confidence distribution
- cryptofunk_agent_healthy - Agent health status (1=healthy, 0=unhealthy)

LLM Metrics:
- cryptofunk_agent_llm_calls_total - LLM calls by provider and status
- cryptofunk_agent_llm_duration_seconds - LLM call latency
- cryptofunk_agent_llm_tokens_total - Tokens used (prompt/completion)

Trade Execution:
- cryptofunk_trades_total - Total trades by symbol, side, status, and mode
- cryptofunk_trade_value_usd - Trade value distribution
- cryptofunk_positions_open - Number of open positions
- cryptofunk_portfolio_value_usd - Total portfolio value

Performance:
- cryptofunk_total_pnl - Total profit/loss
- cryptofunk_win_rate - Win rate ratio
- cryptofunk_sharpe_ratio - Risk-adjusted return
- cryptofunk_current_drawdown - Current drawdown percentage

Risk Management:
- cryptofunk_risk_limit_breaches_total - Risk limit violations by type
- cryptofunk_circuit_breaker_tripped - Circuit breaker status
- cryptofunk_var_value_usd - Value at Risk

Component Health:
- cryptofunk_component_uptime_seconds - Component uptime
- cryptofunk_component_healthy - Component health status
- cryptofunk_database_connections_active - Active database connections
- cryptofunk_redis_connections_active - Active Redis connections

Messaging:
- cryptofunk_nats_messages_published_total - NATS messages published
- cryptofunk_nats_messages_received_total - NATS messages received
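Phrases such as "by method and status" mean the metric carries those dimensions as labels, so each label combination becomes its own time series on the `/metrics` page. The sketch below renders sample lines in the Prometheus text exposition format to show what that looks like. `sampleLine` is a hypothetical helper written for illustration, and the label names `method` and `status` are assumptions (only `server` and `status` appear in the queries in this document).

```go
package main

import (
    "fmt"
    "sort"
    "strconv"
    "strings"
)

// labelEscaper applies the escaping the text format requires inside
// label values: backslash, newline and double quote.
var labelEscaper = strings.NewReplacer(`\`, `\\`, "\n", `\n`, `"`, `\"`)

// sampleLine renders one sample as `name{k="v",...} value`.
// Labels are sorted by name so the output is deterministic.
func sampleLine(name string, labels map[string]string, value float64) string {
    keys := make([]string, 0, len(labels))
    for k := range labels {
        keys = append(keys, k)
    }
    sort.Strings(keys)

    pairs := make([]string, 0, len(keys))
    for _, k := range keys {
        pairs = append(pairs, k+`="`+labelEscaper.Replace(labels[k])+`"`)
    }

    var sb strings.Builder
    sb.WriteString(name)
    if len(pairs) > 0 {
        sb.WriteString("{" + strings.Join(pairs, ",") + "}")
    }
    sb.WriteString(" " + strconv.FormatFloat(value, 'g', -1, 64))
    return sb.String()
}

func main() {
    // A labeled counter: one series per (server, method, status) combination.
    fmt.Println(sampleLine("cryptofunk_mcp_requests_total", map[string]string{
        "server": "market-data",
        "method": "tools/call",
        "status": "success",
    }, 3))

    // An unlabeled gauge has a single series.
    fmt.Println(sampleLine("cryptofunk_win_rate", nil, 0.62))
}
```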
All MCP servers should follow this pattern:
import (
    "context"
    "fmt"
    "time"

    // ... other imports
    "github.com/ajitpratap0/cryptofunk/internal/metrics"
)

const (
    serverName = "market-data" // or "technical-indicators", "risk-analyzer", "order-executor"
)

func main() {
    // ... logger setup ...

    // Start metrics server on assigned port
    metricsServer := metrics.NewServer(9201, logger) // Use appropriate port
    if err := metricsServer.Start(); err != nil {
        logger.Fatal().Err(err).Msg("Failed to start metrics server")
    }
    logger.Info().Msg("Metrics server started on :9201")

    // ... rest of main() ...
}

func (s *MCPServer) handleRequest(req *MCPRequest) *MCPResponse {
    startTime := time.Now()
    response := &MCPResponse{
        JSONRPC: "2.0",
        ID:      req.ID,
    }

    // Record count and latency on every return path. The closure runs after
    // the handler below has had the chance to set response.Error.
    defer func() {
        status := "success"
        if response.Error != nil {
            status = "error"
        }
        metrics.MCPRequestsTotal.WithLabelValues(serverName, req.Method, status).Inc()
        metrics.MCPRequestDuration.WithLabelValues(serverName, req.Method).Observe(time.Since(startTime).Seconds())
    }()

    // ... handle request ...
    return response
}

// ctx is supplied by the caller and passed through to the tool handler.
func (s *MCPServer) callTool(ctx context.Context, name string, args map[string]interface{}) (interface{}, error) {
    startTime := time.Now()

    // ... execute tool ...
    var result interface{}
    var err error
    switch name {
    case "tool_name":
        result, err = s.service.handleTool(ctx, args)
    // ... other cases ...
    default:
        // Count unknown tools as errors rather than silent successes
        err = fmt.Errorf("unknown tool: %s", name)
    }

    // Record metrics
    status := "success"
    if err != nil {
        status = "error"
    }
    metrics.MCPToolCallsTotal.WithLabelValues(serverName, name, status).Inc()
    metrics.MCPToolCallDuration.WithLabelValues(serverName, name).Observe(time.Since(startTime).Seconds())
    return result, err
}

- Metrics Infrastructure (internal/metrics/)
  - Comprehensive metrics definitions
  - HTTP server for exposing metrics
  - Helper functions for recording metrics
- Prometheus Configuration (deployments/prometheus/prometheus.yml)
  - All scrape targets configured
  - Appropriate scrape intervals (10-15s)
  - Dedicated job for MCP servers and agents
- Agent Metrics (internal/agents/base.go)
  - All agents inherit metrics from BaseAgent
  - Automatic health tracking
  - MCP call instrumentation
- Orchestrator Metrics (internal/orchestrator/)
  - Voting and consensus metrics
  - Session management metrics
- Market Data Server (cmd/mcp-servers/market-data/main.go)
  - Metrics server on port 9201
  - Request and tool call instrumentation
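To make the endpoint layout concrete, here is a standard-library sketch of a mux serving `/metrics` and `/health`. It is an illustration only: `newMetricsMux` and `probe` are hypothetical names, the hand-written exposition output stands in for whatever handler `internal/metrics` actually registers (presumably one from the Prometheus Go client), and the `ok` health body is an assumption.

```go
package main

import (
    "fmt"
    "net/http"
    "net/http/httptest"
    "time"
)

// newMetricsMux builds a mux with the two endpoints this document refers to.
// The /metrics handler writes a single uptime gauge by hand.
func newMetricsMux() *http.ServeMux {
    start := time.Now()
    mux := http.NewServeMux()
    mux.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
        w.Header().Set("Content-Type", "text/plain; version=0.0.4")
        fmt.Fprintf(w, "# TYPE cryptofunk_component_uptime_seconds gauge\n")
        fmt.Fprintf(w, "cryptofunk_component_uptime_seconds %g\n", time.Since(start).Seconds())
    })
    mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusOK)
        fmt.Fprintln(w, "ok")
    })
    return mux
}

// probe issues an in-memory GET against a handler and returns the
// status code and body, so no port has to be opened.
func probe(h http.Handler, path string) (int, string) {
    rec := httptest.NewRecorder()
    h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
    return rec.Code, rec.Body.String()
}

func main() {
    mux := newMetricsMux()
    for _, path := range []string{"/metrics", "/health"} {
        code, body := probe(mux, path)
        fmt.Printf("GET %s -> %d\n%s", path, code, body)
    }
    // To serve for real on the Market Data Server's port:
    //   http.ListenAndServe(":9201", mux)
}
```

Running each component's metrics endpoint on its own port, separate from its main traffic, keeps scraping independent of request handling.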
The following MCP servers still need metrics integration, using the pattern above:
- Technical Indicators Server (Port 9202)
  - Location: cmd/mcp-servers/technical-indicators/main.go
  - Tools to instrument: RSI, MACD, Bollinger, EMA, ADX
- Risk Analyzer Server (Port 9203)
  - Location: cmd/mcp-servers/risk-analyzer/main.go
  - Tools to instrument: calculate_var, check_limits, kelly_criterion
- Order Executor Server (Port 9204)
  - Location: cmd/mcp-servers/order-executor/main.go
  - Tools to instrument: place_market_order, place_limit_order, cancel_order
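Since the three pending servers all need the same timing and status bookkeeping around each tool, one option is to factor it into a small wrapper. The sketch below is an assumption about how that could look, not existing CryptoFunk code: `Recorder`, `memRecorder`, `instrumentTool` and `errDemo` are hypothetical, and the in-memory recorder stands in for the `MCPToolCallsTotal` and `MCPToolCallDuration` metrics so the example runs without Prometheus.

```go
package main

import (
    "errors"
    "fmt"
    "time"
)

// Recorder is a hypothetical stand-in for the counter and histogram pair
// that a real server would update.
type Recorder interface {
    Observe(server, tool, status string, seconds float64)
}

// memRecorder counts observations per "server/tool/status" key.
type memRecorder struct {
    calls map[string]int
}

func (m *memRecorder) Observe(server, tool, status string, seconds float64) {
    m.calls[server+"/"+tool+"/"+status]++
}

// errDemo is a sample failure used in the demonstration below.
var errDemo = errors.New("limit exceeded")

// instrumentTool runs fn and records its status and duration.
// The named return value lets the deferred closure see the final error.
func instrumentTool(rec Recorder, server, tool string, fn func() (interface{}, error)) (result interface{}, err error) {
    start := time.Now()
    defer func() {
        status := "success"
        if err != nil {
            status = "error"
        }
        rec.Observe(server, tool, status, time.Since(start).Seconds())
    }()
    return fn()
}

func main() {
    rec := &memRecorder{calls: map[string]int{}}

    // One successful and one failing call on the risk-analyzer server.
    instrumentTool(rec, "risk-analyzer", "calculate_var", func() (interface{}, error) {
        return 0.05, nil
    })
    instrumentTool(rec, "risk-analyzer", "check_limits", func() (interface{}, error) {
        return nil, errDemo
    })

    fmt.Println(rec.calls)
}
```

With a wrapper like this, each `case` in `callTool` only supplies the tool body, and the status label cannot drift between servers.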
# Start component with metrics
./bin/market-data-server &

# Check metrics endpoint
curl http://localhost:9201/metrics

# Check health endpoint
curl http://localhost:9201/health

# Start Prometheus
docker-compose -f deployments/docker-compose.yml up prometheus

# Access Prometheus UI
open http://localhost:9090

# Query metrics
cryptofunk_mcp_requests_total{server="market-data"}
rate(cryptofunk_mcp_tool_calls_total[5m])

After deploying Prometheus, create Grafana dashboards for:
- MCP server performance
- Agent activity and health
- Trading performance
- Risk metrics
- System health
See docs/GRAFANA_DASHBOARDS.md for dashboard JSON templates (to be created in T276).
# Request rate by server
rate(cryptofunk_mcp_requests_total[5m])
# P95 request latency
histogram_quantile(0.95, rate(cryptofunk_mcp_request_duration_seconds_bucket[5m]))
# Error rate by server
sum by (server) (rate(cryptofunk_mcp_requests_total{status="error"}[5m])) /
sum by (server) (rate(cryptofunk_mcp_requests_total[5m]))
# Tool call latency by tool
histogram_quantile(0.99, rate(cryptofunk_mcp_tool_call_duration_seconds_bucket[5m]))
# Signals generated per minute
rate(cryptofunk_agent_signals_total[1m]) * 60
# Agent health status
cryptofunk_agent_healthy
# LLM call latency
histogram_quantile(0.95, rate(cryptofunk_agent_llm_duration_seconds_bucket[5m]))
# LLM tokens per hour
rate(cryptofunk_agent_llm_tokens_total[1h]) * 3600
# Win rate
cryptofunk_win_rate
# Current drawdown
cryptofunk_current_drawdown
# Trades per hour
rate(cryptofunk_trades_total[1h]) * 3600
# Portfolio value
cryptofunk_portfolio_value_usd
# Component uptime
cryptofunk_component_uptime_seconds / 3600 # hours
# Database connection pool usage
cryptofunk_database_connections_active / cryptofunk_database_connections_max
# NATS message rate
rate(cryptofunk_nats_messages_published_total[1m])
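The `histogram_quantile` queries above estimate percentiles from cumulative bucket counters, so their precision depends on the bucket boundaries. The following simplified sketch shows the linear interpolation involved. It is not Prometheus's implementation: `bucket`, `quantile` and `posInf` are names chosen here, the bucket layout is invented for the example, and out-of-range or malformed input simply yields NaN.

```go
package main

import (
    "fmt"
    "math"
    "sort"
)

// posInf is the upper bound of the catch-all +Inf bucket.
var posInf = math.Inf(1)

// bucket mirrors one `_bucket` series: an upper bound (the `le` label)
// and the cumulative count of observations at or below it.
type bucket struct {
    le    float64
    count float64
}

// quantile estimates the q-quantile (0 <= q <= 1) from cumulative buckets,
// interpolating linearly inside the bucket that contains the target rank.
func quantile(q float64, buckets []bucket) float64 {
    if len(buckets) < 2 || q < 0 || q > 1 {
        return math.NaN()
    }
    sort.Slice(buckets, func(i, j int) bool { return buckets[i].le < buckets[j].le })

    last := len(buckets) - 1
    total := buckets[last].count
    if total == 0 || !math.IsInf(buckets[last].le, 1) {
        return math.NaN()
    }

    rank := q * total
    // First bucket whose cumulative count reaches the rank.
    b := sort.Search(last, func(i int) bool { return buckets[i].count >= rank })
    if b == last {
        // Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[last-1].le
    }
    if b == 0 && buckets[0].le <= 0 {
        return buckets[0].le
    }

    lower, prev := 0.0, 0.0
    if b > 0 {
        lower, prev = buckets[b-1].le, buckets[b-1].count
    }
    inBucket := buckets[b].count - prev
    if inBucket == 0 {
        return lower
    }
    return lower + (buckets[b].le-lower)*(rank-prev)/inBucket
}

func main() {
    // 100 requests: 50 took <= 0.1s, 90 took <= 0.5s, all took <= 1s.
    buckets := []bucket{{0.1, 50}, {0.5, 90}, {1, 100}, {posInf, 100}}

    // Rank 95 lies halfway through the (0.5, 1] bucket.
    fmt.Println("p95:", quantile(0.95, buckets))
    fmt.Println("p50:", quantile(0.50, buckets))
}
```

Because the estimate can never be finer than a bucket's width, latency histograms are most useful when their boundaries bracket the thresholds you intend to alert on.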
Recommended Prometheus alerting rules:
groups:
  - name: cryptofunk_alerts
    rules:
      - alert: MCPServerDown
        expr: up{job="mcp-servers"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "MCP Server {{ $labels.instance }} is down"

      - alert: HighMCPErrorRate
        expr: rate(cryptofunk_mcp_requests_total{status="error"}[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on {{ $labels.server }}"

      - alert: AgentUnhealthy
        expr: cryptofunk_agent_healthy == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Agent {{ $labels.agent }} is unhealthy"

      - alert: HighDrawdown
        expr: cryptofunk_current_drawdown > 0.15
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Current drawdown exceeds 15%"

      - alert: CircuitBreakerTripped
        expr: cryptofunk_circuit_breaker_tripped == 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Circuit breaker tripped: {{ $labels.reason }}"

- Complete MCP Server Instrumentation (T277 - in progress)
  - Add metrics to technical-indicators server
  - Add metrics to risk-analyzer server
  - Add metrics to order-executor server
- Create Grafana Dashboards (T276 - pending)
  - System overview dashboard
  - Trading performance dashboard
  - Agent performance dashboard
  - Risk metrics dashboard
- Add AlertManager Integration (T278 - pending)
  - Configure alert routing
  - Set up notification channels (Slack, email, PagerDuty)
  - Define alert escalation policies
- Production Dry-Run (T286 - pending)
  - Deploy full stack with metrics
  - Verify all metrics are being collected
  - Test alerting rules
  - Validate dashboard accuracy
- Prometheus Documentation: https://prometheus.io/docs/
- Prometheus Client Golang: https://github.com/prometheus/client_golang
- Grafana Dashboards: https://grafana.com/docs/grafana/latest/dashboards/
- AlertManager: https://prometheus.io/docs/alerting/latest/alertmanager/