This runbook provides operational guidance for responding to alerts generated by the CryptoFunk trading system.
Contents:
- Alert Severity Levels
- Alert Categories
- Alert Response Procedures
- Circuit Breaker Alerts
- Trading Alerts
- Agent Health Alerts
- System Resource Alerts
- Vector Search Alerts
- Alert Notification Channels
- Escalation Procedures
Critical:
- Response Time: Immediate (within 15 minutes)
- Impact: Trading system is down or at risk of significant financial loss
- Actions:
  - Check alert details in Slack #cryptofunk-critical
  - Access the AlertManager UI at http://alertmanager-service:9093
  - Follow the specific alert procedure below
  - Escalate to the on-call engineer if not resolved within 30 minutes
Warning:
- Response Time: Within 1 hour
- Impact: System degradation or potential future issues
- Actions:
  - Review the alert in Slack #cryptofunk-alerts
  - Investigate the root cause during business hours
  - Monitor for escalation to critical
Info:
- Response Time: Best effort
- Impact: Informational, no immediate action required
- Actions:
  - Review during regular maintenance windows
  - Use for trend analysis and capacity planning
Alert: CircuitBreakerOpen
Severity: Warning (can escalate to Critical)
Description: A circuit breaker has opened for a critical service (exchange, LLM, or database).
Impact:
- Exchange circuit breaker: Cannot place new orders
- LLM circuit breaker: Agent decision-making degraded
- Database circuit breaker: Cannot persist data
Response Steps:

1. Identify the affected service:

```bash
# Check AlertManager
curl http://alertmanager-service:9093/api/v2/alerts | \
  jq '.[] | select(.labels.alertname == "CircuitBreakerOpen")'

# Check Prometheus metrics
curl 'http://prometheus-service:9090/api/v1/query?query=circuit_breaker_state' | jq
```
2. Check service health:

```bash
# For exchange circuit breaker
kubectl logs -n cryptofunk deployment/order-executor-server --tail=100

# For LLM circuit breaker
kubectl logs -n cryptofunk deployment/bifrost --tail=100

# For database circuit breaker
kubectl logs -n cryptofunk deployment/postgres --tail=100
kubectl exec -it -n cryptofunk deployment/postgres -- psql -U postgres -c "SELECT 1;"
```
3. Check failure rate (see the sketch after this step for how these counters are assumed to be emitted):

```promql
# View recent failures
circuit_breaker_failures_total{service="exchange"}
circuit_breaker_requests_total{service="exchange",result="failure"}
```
4. Remediation:
   - Exchange: Check API keys, rate limits, network connectivity
   - LLM: Check Bifrost logs, verify API keys, check provider status
   - Database: Check connection pool, disk space, PostgreSQL logs
5. Recovery:
   - Circuit breakers automatically transition to the half-open state after a timeout
   - Monitor metrics for successful requests
   - If failures persist, investigate the root cause before forcing a reset
6. Prevention (see the sketch below for the state machine these thresholds drive):
   - Review error logs for patterns
   - Adjust circuit breaker thresholds if needed (internal/risk/circuit_breaker.go)
   - Implement additional retry logic or fallback mechanisms
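The actual implementation lives in internal/risk/circuit_breaker.go; the sketch below is only a minimal illustration of the closed → open → half-open cycle described above, with invented names and thresholds:

```go
package risk

import (
	"errors"
	"sync"
	"time"
)

type state int

const (
	closed state = iota
	open
	halfOpen
)

// Breaker is a minimal circuit breaker: it opens after maxFailures
// consecutive errors and probes again (half-open) after cooldown.
// Field names and thresholds here are illustrative, not the real ones.
type Breaker struct {
	mu          sync.Mutex
	st          state
	failures    int
	maxFailures int
	cooldown    time.Duration
	openedAt    time.Time
}

var ErrOpen = errors.New("circuit breaker open")

func (b *Breaker) Call(fn func() error) error {
	b.mu.Lock()
	if b.st == open {
		if time.Since(b.openedAt) < b.cooldown {
			b.mu.Unlock()
			return ErrOpen // still cooling down: fail fast
		}
		b.st = halfOpen // timeout elapsed: allow one probe request
	}
	b.mu.Unlock()

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.st == halfOpen || b.failures >= b.maxFailures {
			b.st = open // probe failed or threshold hit: (re)open
			b.openedAt = time.Now()
		}
		return err
	}
	b.failures = 0
	b.st = closed // success closes the breaker
	return nil
}
```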
Severity: Critical
Description: Error rate exceeds threshold (>5% for 5 minutes).
Impact: System functionality degraded, potential financial loss.
Response Steps:

1. Check error logs:

```bash
kubectl logs -n cryptofunk deployment/orchestrator --tail=200 | grep -i error
kubectl logs -n cryptofunk deployment/api --tail=200 | grep -i error
```
2. Identify error patterns:

```promql
# Check Prometheus for error metrics
rate(http_requests_total{status=~"5.."}[5m])
rate(mcp_tool_errors_total[5m])
```
3. Check dependencies:
   - Database connectivity
   - Redis availability
   - NATS messaging
   - External APIs (CoinGecko, exchange)
4. Remediation:
   - If database: Check connections, run migrations
   - If external API: Check rate limits, verify credentials
   - If internal service: Check resource limits, restart if needed
5. Communication:
   - Update #cryptofunk-critical with status
   - Notify stakeholders if trading is affected
Severity: Critical
Description: Trading session failed to start or crashed.
Impact: No trading activity, potential missed opportunities.
Response Steps:

1. Check orchestrator status:

```bash
kubectl logs -n cryptofunk deployment/orchestrator --tail=100
kubectl get pods -n cryptofunk -l app.kubernetes.io/name=orchestrator
```
2. Verify agent health:

```bash
kubectl get pods -n cryptofunk -l app.kubernetes.io/component=trading-agent

# Check agent_status table
kubectl exec -it -n cryptofunk deployment/postgres -- \
  psql -U postgres -d cryptofunk -c \
  "SELECT agent_name, status, last_heartbeat FROM agent_status ORDER BY last_heartbeat DESC;"
```
3. Check MCP server connectivity:

```bash
# Test market data server
kubectl exec -it -n cryptofunk deployment/orchestrator -- \
  curl http://market-data-server:9201/health

# Test other MCP servers (NOTE: 920X is a placeholder; substitute each server's port)
for server in technical-indicators risk-analyzer order-executor; do
  echo "Checking $server..."
  kubectl exec -it -n cryptofunk deployment/orchestrator -- \
    curl http://${server}-server:920X/health
done
```
4. Restart trading session:

```bash
# Gracefully restart orchestrator
kubectl rollout restart deployment/orchestrator -n cryptofunk

# Monitor startup
kubectl logs -f -n cryptofunk deployment/orchestrator
```
5. Verify recovery:
   - Check trading_sessions table for new session
   - Verify agent signals are being generated
   - Monitor order placement
Severity: Warning (escalates to Critical if repeated)
Description: Position size or exposure exceeded configured limits.
Impact: Risk management violation, potential for excessive loss.
Response Steps:

1. Check current positions:

```bash
kubectl exec -it -n cryptofunk deployment/postgres -- \
  psql -U postgres -d cryptofunk -c \
  "SELECT symbol, side, quantity, entry_price, current_price, unrealized_pnl FROM positions WHERE status = 'OPEN' ORDER BY unrealized_pnl;"
```
2. Review risk limits:

```bash
# Check risk analyzer configuration
kubectl logs -n cryptofunk deployment/risk-analyzer-server | grep -i "limit"
```
3. Manual intervention:
   - If the position is truly too large, consider reducing exposure
   - Review the strategy agent that opened the position
   - Update risk limits if appropriate
4. Investigation:
   - Check why the risk agent allowed the position
   - Review the agent_signals table for decision reasoning
   - Verify risk calculation logic
5. Prevention (a sketch of a pre-trade check follows this list):
   - Adjust position sizing parameters
   - Implement stricter pre-trade risk checks
   - Review and update risk limits in configs/config.yaml
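What a stricter pre-trade check might look like: a hedged sketch with invented types and field names, not the project's actual risk logic (the real limits live in configs/config.yaml):

```go
package risk

import "fmt"

// Limits mirrors the kind of values kept in configs/config.yaml.
// Field names here are assumptions for illustration.
type Limits struct {
	MaxPositionValue float64 // max notional per position
	MaxTotalExposure float64 // max notional across all open positions
}

// CheckPreTrade rejects an order before it reaches the exchange if it
// would breach either limit. quantity*price is the order's notional.
func CheckPreTrade(l Limits, openExposure, quantity, price float64) error {
	notional := quantity * price
	if notional > l.MaxPositionValue {
		return fmt.Errorf("order notional %.2f exceeds per-position limit %.2f",
			notional, l.MaxPositionValue)
	}
	if openExposure+notional > l.MaxTotalExposure {
		return fmt.Errorf("total exposure %.2f would exceed limit %.2f",
			openExposure+notional, l.MaxTotalExposure)
	}
	return nil
}
```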
Severity: Critical
Description: Portfolio drawdown exceeded maximum threshold.
Impact: Circuit breaker triggered, trading halted.
Response Steps:

1. Verify drawdown:

```bash
kubectl exec -it -n cryptofunk deployment/postgres -- \
  psql -U postgres -d cryptofunk -c \
  "SELECT session_id, total_pnl, max_drawdown, roi_percentage FROM trading_sessions ORDER BY created_at DESC LIMIT 1;"
```
2. Review losing trades:

```bash
kubectl exec -it -n cryptofunk deployment/postgres -- \
  psql -U postgres -d cryptofunk -c \
  "SELECT symbol, side, quantity, entry_price, exit_price, realized_pnl FROM positions WHERE status = 'CLOSED' AND realized_pnl < 0 ORDER BY realized_pnl LIMIT 20;"
```
3. Assess market conditions:
   - Check for unusual market volatility
   - Review recent price movements
   - Determine whether the drawdown is strategy-specific or market-wide
4. Decision point:
   - Continue trading: If the drawdown is temporary and market conditions are normalizing
   - Pause trading: If market conditions are adverse or the strategy is flawed
   - Modify strategy: Adjust parameters if a specific strategy is underperforming
5. Recovery:

```bash
# If resuming trading, restart orchestrator
kubectl rollout restart deployment/orchestrator -n cryptofunk

# Monitor closely for continued losses
kubectl logs -f -n cryptofunk deployment/risk-agent
```
6. Post-mortem:
   - Analyze what led to the drawdown
   - Review strategy performance metrics
   - Update risk parameters if needed
Severity: Warning
Description: Trading agent is not responding or failing health checks.
Impact: Reduced decision-making capacity, potential missed trading opportunities.
Response Steps:

1. Identify affected agent:

```bash
# Check pod status
kubectl get pods -n cryptofunk -l app.kubernetes.io/component=trading-agent

# Check agent health in database
kubectl exec -it -n cryptofunk deployment/postgres -- \
  psql -U postgres -d cryptofunk -c \
  "SELECT agent_name, status, last_heartbeat, error_message FROM agent_status WHERE status != 'ACTIVE';"
```
2. Check agent logs:

```bash
# Replace <agent-name> with specific agent (e.g., technical-agent)
kubectl logs -n cryptofunk deployment/<agent-name> --tail=100
```
3. Common issues:
   - LLM API failure: Check Bifrost logs and API key validity
   - MCP server connectivity: Verify MCP servers are healthy
   - Database connection: Check connection pool and database health
   - Resource limits: Check if the pod is OOM-killed or CPU throttled
4. Restart agent:

```bash
kubectl rollout restart deployment/<agent-name> -n cryptofunk
kubectl logs -f -n cryptofunk deployment/<agent-name>
```
5. Verify recovery (the sketch below shows how heartbeats are assumed to be written):
   - Check agent_status table shows ACTIVE
   - Verify agent is generating signals
   - Monitor for repeated failures
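For context when reading agent_status: a minimal sketch of the heartbeat loop an agent might run. The table and columns come from the queries above, but the exact UPDATE statement and interval are assumptions:

```go
package agent

import (
	"context"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
)

// heartbeat marks this agent ACTIVE every interval so the agent_status
// queries in this runbook see a fresh last_heartbeat. The SQL shape is
// an assumption based on the columns queried above.
func heartbeat(ctx context.Context, pool *pgxpool.Pool, agentName string, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// A few consecutive failures here will surface as this alert.
			_, _ = pool.Exec(ctx,
				`UPDATE agent_status SET status = 'ACTIVE', last_heartbeat = NOW()
				 WHERE agent_name = $1`, agentName)
		}
	}
}
```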
Severity: Warning
Description: Agent decision-making taking longer than expected (>5 seconds).
Impact: Delayed trading decisions, potential missed opportunities.
Response Steps:

1. Check agent performance metrics:

```promql
# Query Prometheus
histogram_quantile(0.95, rate(agent_decision_duration_seconds_bucket[5m]))
```

2. Identify the bottleneck:
   - LLM latency: Check Bifrost response times
   - MCP tool latency: Check tool call durations
   - Database queries: Check slow query log
   - Resource constraints: Check CPU/memory usage
3. Investigate LLM performance:

```bash
kubectl logs -n cryptofunk deployment/bifrost --tail=100 | grep -i latency
```

4. Database optimization:

```bash
# Check for slow queries
kubectl exec -it -n cryptofunk deployment/postgres -- \
  psql -U postgres -d cryptofunk -c \
  "SELECT query, mean_exec_time, calls FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;"
```
5. Remediation:
   - If LLM: Switch to a faster model or increase the timeout
   - If database: Add indexes, optimize queries
   - If resources: Increase pod limits
Severity: Warning
Description: Pod memory or CPU usage exceeding 80%.
Impact: Performance degradation, potential pod eviction or OOM kill.
Response Steps:

1. Identify affected pod:

```bash
kubectl top pods -n cryptofunk --sort-by=memory
kubectl top pods -n cryptofunk --sort-by=cpu
```
2. Check resource limits:

```bash
kubectl describe pod -n cryptofunk <pod-name> | grep -A 10 "Limits:"
```
3. Investigate memory leak (the sketch below shows how the pprof endpoint is typically exposed):

```bash
# For Go applications, check heap profile
kubectl exec -it -n cryptofunk <pod-name> -- \
  curl http://localhost:6060/debug/pprof/heap > heap.prof

# Analyze with pprof
go tool pprof heap.prof
```
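The heap profile above assumes each Go service serves net/http/pprof on port 6060; whether CryptoFunk does exactly this is an assumption, but the standard wiring looks like this:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on DefaultServeMux
)

func main() {
	// Serve pprof on a localhost-only port, separate from application traffic,
	// so profiles are reachable via kubectl exec but not from outside the pod.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... application startup continues here ...
	select {}
}
```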
4. Remediation:
   - Short-term: Increase resource limits
   - Long-term: Fix the memory leak or optimize the code
5. Increase resources:

```bash
# Edit deployment to increase limits
kubectl edit deployment <deployment-name> -n cryptofunk

# Or use kubectl patch
kubectl patch deployment <deployment-name> -n cryptofunk -p \
  '{"spec":{"template":{"spec":{"containers":[{"name":"<container>","resources":{"limits":{"memory":"2Gi"}}}]}}}}'
```
Severity: Warning
Description: Persistent volume disk usage exceeding 80%.
Impact: Database or storage failures if disk fills completely.
Response Steps:

1. Check disk usage:

```bash
kubectl exec -it -n cryptofunk deployment/postgres -- df -h
```

2. Identify large tables:

```bash
kubectl exec -it -n cryptofunk deployment/postgres -- \
  psql -U postgres -d cryptofunk -c \
  "SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size FROM pg_tables WHERE schemaname NOT IN ('pg_catalog', 'information_schema') ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC LIMIT 10;"
```
3. Apply TimescaleDB compression:

```bash
# Compress old data
kubectl exec -it -n cryptofunk deployment/postgres -- \
  psql -U postgres -d cryptofunk -c \
  "SELECT compress_chunk(i, if_not_compressed => true) FROM show_chunks('candlesticks', older_than => INTERVAL '7 days') i;"
```
4. Archive old data:

```bash
# Export old trading sessions
kubectl exec -it -n cryptofunk deployment/postgres -- \
  pg_dump -U postgres -d cryptofunk -t trading_sessions \
  --where="created_at < NOW() - INTERVAL '90 days'" > archive.sql

# Delete archived data
kubectl exec -it -n cryptofunk deployment/postgres -- \
  psql -U postgres -d cryptofunk -c \
  "DELETE FROM trading_sessions WHERE created_at < NOW() - INTERVAL '90 days';"
```
5. Expand volume:

```bash
# Increase PVC size (requires a storage class with allowVolumeExpansion: true)
kubectl patch pvc postgres-pvc -n cryptofunk -p \
  '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'
```
Severity: Warning
Description: Vector search p95 latency exceeds 2 seconds for 5 minutes.
Impact: Slow explainability queries, degraded user experience in decision similarity searches.
Response Steps:

1. Check current latency:

```bash
# Query Prometheus for p95 latency by operation
curl -G http://prometheus-service:9090/api/v1/query \
  --data-urlencode 'query=histogram_quantile(0.95, rate(cryptofunk_vector_search_latency_seconds_bucket[5m]))' | jq
```
2. Identify slow operations:

```bash
# Check which operation type is slow
kubectl exec -it -n cryptofunk deployment/postgres -- \
  psql -U postgres -d cryptofunk -c \
  "SELECT 'semantic_search' as operation, COUNT(*) as total_searches, AVG(EXTRACT(EPOCH FROM (NOW() - created_at))) as avg_age_seconds FROM llm_decisions WHERE created_at > NOW() - INTERVAL '15 minutes';"
```
3. Check pgvector index health:

```bash
# Note: pg_stat_user_indexes exposes relname/indexrelname, not tablename/indexname
kubectl exec -it -n cryptofunk deployment/postgres -- \
  psql -U postgres -d cryptofunk -c \
  "SELECT schemaname, relname, indexrelname, pg_size_pretty(pg_relation_size(indexrelid)) as index_size, idx_scan, idx_tup_read, idx_tup_fetch FROM pg_stat_user_indexes WHERE indexrelname LIKE '%embedding%' OR indexrelname LIKE '%vector%';"
```
4. Common causes and remediation (a sketch of a bounded similarity query follows this list):

a. Large result sets:

```bash
# Check if returning too many results
kubectl logs -n cryptofunk deployment/api --tail=100 | grep -i "vector search"
```

   Solution: Reduce LIMIT in vector search queries (currently should be 10)

b. Missing or outdated index:

```bash
# Rebuild IVFFlat index if needed
kubectl exec -it -n cryptofunk deployment/postgres -- \
  psql -U postgres -d cryptofunk -c \
  "REINDEX INDEX CONCURRENTLY llm_decisions_prompt_embedding_idx;"
```

c. Database resource contention:

```bash
# Check active connections and locks
kubectl exec -it -n cryptofunk deployment/postgres -- \
  psql -U postgres -d cryptofunk -c \
  "SELECT COUNT(*) as active_queries, COUNT(*) FILTER (WHERE state = 'active' AND wait_event_type = 'Lock') as waiting_on_locks FROM pg_stat_activity WHERE state = 'active';"
```
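For reference, a minimal Go sketch of a bounded similarity query. The table, column, and `<=>` operator come from this runbook; the helper, its parameters, and the `id` column are hypothetical:

```go
package explain

import (
	"context"
	"fmt"
	"strings"

	"github.com/jackc/pgx/v5/pgxpool"
)

// vectorLiteral renders an embedding in pgvector's text format, e.g. "[0.1,0.2]",
// so no extra driver types are needed.
func vectorLiteral(embedding []float32) string {
	parts := make([]string, len(embedding))
	for i, v := range embedding {
		parts[i] = fmt.Sprintf("%g", v)
	}
	return "[" + strings.Join(parts, ",") + "]"
}

// similarDecisions returns at most `limit` decision IDs ordered by cosine
// distance (<=>), keeping result sets small as recommended above.
func similarDecisions(ctx context.Context, pool *pgxpool.Pool, embedding []float32, limit int) ([]int64, error) {
	rows, err := pool.Query(ctx,
		`SELECT id FROM llm_decisions
		 ORDER BY prompt_embedding <=> $1::vector
		 LIMIT $2`,
		vectorLiteral(embedding), limit)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var ids []int64
	for rows.Next() {
		var id int64
		if err := rows.Scan(&id); err != nil {
			return nil, err
		}
		ids = append(ids, id)
	}
	return ids, rows.Err()
}
```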
5. Immediate mitigation:

```bash
# If the index rebuild doesn't help, tune the IVFFlat parameters
# (requires downtime - use during a maintenance window).
# Note: DROP INDEX CONCURRENTLY cannot run inside a transaction block,
# so issue the two statements as separate psql commands.
kubectl exec -it -n cryptofunk deployment/postgres -- \
  psql -U postgres -d cryptofunk -c \
  "DROP INDEX CONCURRENTLY llm_decisions_prompt_embedding_idx;"
kubectl exec -it -n cryptofunk deployment/postgres -- \
  psql -U postgres -d cryptofunk -c \
  "CREATE INDEX llm_decisions_prompt_embedding_idx ON llm_decisions USING ivfflat (prompt_embedding vector_cosine_ops) WITH (lists = 200);"  # increased from 100 for better performance
```
6. Monitor recovery:

```bash
# Watch latency metrics
watch -n 5 'curl -s -G http://prometheus-service:9090/api/v1/query \
  --data-urlencode "query=histogram_quantile(0.95, rate(cryptofunk_vector_search_latency_seconds_bucket[5m]))" | jq'
```
Severity: Critical
Description: Vector search p99 latency exceeds 5 seconds for 2 minutes.
Impact: Explainability features effectively broken, user-facing timeouts.
Response Steps:

1. Immediate assessment:

```bash
# Check if vector searches are timing out
kubectl logs -n cryptofunk deployment/api --tail=50 | grep -i "timeout\|error"
```
2. Check database health:

```bash
# Check for long-running queries
kubectl exec -it -n cryptofunk deployment/postgres -- \
  psql -U postgres -d cryptofunk -c \
  "SELECT pid, now() - query_start as duration, state, query FROM pg_stat_activity WHERE query LIKE '%<=>%' OR query LIKE '%vector%' ORDER BY duration DESC LIMIT 5;"
```
3. Emergency mitigation - kill slow queries:

```bash
# If queries are stuck, terminate them
kubectl exec -it -n cryptofunk deployment/postgres -- \
  psql -U postgres -d cryptofunk -c \
  "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE query LIKE '%<=>%' AND state = 'active' AND now() - query_start > interval '10 seconds';"
```
4. Check disk I/O:

```bash
# High disk I/O can cause slowdowns
kubectl exec -it -n cryptofunk deployment/postgres -- iostat -x 2 5
```
5. Temporarily disable explainability if needed (see the sketch below for how this flag is assumed to gate the feature):

```bash
# If blocking critical trading functionality, temporarily disable
kubectl set env deployment/api -n cryptofunk ENABLE_VECTOR_SEARCH=false
kubectl rollout status deployment/api -n cryptofunk
```
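How the API honors ENABLE_VECTOR_SEARCH is not shown in this runbook; a minimal sketch of the kind of gate assumed, with hypothetical handler names:

```go
package api

import (
	"net/http"
	"os"
)

// vectorSearchEnabled reads the same flag toggled by `kubectl set env` above.
// Defaulting to enabled when the variable is unset is an assumption.
func vectorSearchEnabled() bool {
	return os.Getenv("ENABLE_VECTOR_SEARCH") != "false"
}

// similarHandler short-circuits with 503 while the feature is disabled,
// so slow vector queries cannot back up the rest of the API.
func similarHandler(w http.ResponseWriter, r *http.Request) {
	if !vectorSearchEnabled() {
		http.Error(w, "vector search temporarily disabled", http.StatusServiceUnavailable)
		return
	}
	// ... run the similarity query ...
}
```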
6. Escalate:
   - This is a critical issue requiring immediate database expert involvement
   - Notify #cryptofunk-critical with details
   - Prepare for potential database restart or index rebuild
Severity: Warning
Description: Vector search error rate exceeds 5% for 5 minutes.
Impact: Partial failure of explainability features, some queries failing.
Response Steps:

1. Check error patterns:

```bash
# Review API logs for error details
kubectl logs -n cryptofunk deployment/api --tail=200 | grep -i "vector\|error" | tail -20
```
2. Common error types (a dimension-check sketch follows this list):

a. pgvector extension issues:

```bash
kubectl exec -it -n cryptofunk deployment/postgres -- \
  psql -U postgres -d cryptofunk -c \
  "SELECT * FROM pg_extension WHERE extname = 'vector';"
```

   If missing: Reinstall the pgvector extension

b. Invalid embedding dimensions:

```bash
# Check for dimension mismatches
kubectl logs -n cryptofunk deployment/api --tail=100 | grep -i "dimension\|1536"
```

   Solution: Verify LLM embeddings are 1536 dimensions (OpenAI ada-002)

c. Connection pool exhaustion:

```bash
# Check if database connections are available
curl http://api-service:8080/metrics | grep database_connections
```
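A cheap guard against cause (b) is to validate the embedding length before writing it. The 1536 figure comes from this runbook; the function itself is a hypothetical sketch:

```go
package explain

import "fmt"

// expectedDims matches the OpenAI ada-002 embedding size noted above.
const expectedDims = 1536

// validateEmbedding rejects vectors that would fail inserts into a
// vector(1536) column before they reach the database.
func validateEmbedding(embedding []float32) error {
	if len(embedding) != expectedDims {
		return fmt.Errorf("embedding has %d dimensions, expected %d",
			len(embedding), expectedDims)
	}
	return nil
}
```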
3. Verify data integrity:

```bash
# Check for NULL embeddings
kubectl exec -it -n cryptofunk deployment/postgres -- \
  psql -U postgres -d cryptofunk -c \
  "SELECT COUNT(*) as null_embeddings FROM llm_decisions WHERE prompt_embedding IS NULL AND created_at > NOW() - INTERVAL '1 hour';"
```
4. Remediation:

```bash
# Restart API if connection issues
kubectl rollout restart deployment/api -n cryptofunk

# Monitor recovery
kubectl logs -f -n cryptofunk deployment/api | grep -i "vector"
```
5. Verify recovery:

```bash
# Check error rate has dropped
watch -n 10 'curl -s http://prometheus-service:9090/api/v1/query \
  --data-urlencode "query=rate(cryptofunk_vector_search_operations_total{status=\"error\"}[5m])" | jq'
```
Severity: Critical
Description: Vector search error rate exceeds 20% for 2 minutes.
Impact: Vector search effectively broken, explainability features unavailable.
Response Steps:

1. Immediate triage:

```bash
# Get recent error samples
kubectl logs -n cryptofunk deployment/api --tail=50 | grep -A 5 "vector.*error"
```
2. Check pgvector extension:

```bash
kubectl exec -it -n cryptofunk deployment/postgres -- \
  psql -U postgres -d cryptofunk -c \
  "SELECT extname, extversion FROM pg_extension WHERE extname = 'vector';"
```

   If not found:

```bash
kubectl exec -it -n cryptofunk deployment/postgres -- \
  psql -U postgres -d cryptofunk -c "CREATE EXTENSION IF NOT EXISTS vector;"
```
3. Check index corruption:

```bash
kubectl exec -it -n cryptofunk deployment/postgres -- \
  psql -U postgres -d cryptofunk -c \
  "SELECT pg_relation_size('llm_decisions_prompt_embedding_idx') as index_size;"
```

   If 0 bytes, rebuild the index:

```bash
kubectl exec -it -n cryptofunk deployment/postgres -- \
  psql -U postgres -d cryptofunk -c \
  "REINDEX INDEX CONCURRENTLY llm_decisions_prompt_embedding_idx;"
```
4. Emergency response - disable the feature:

```bash
# Prevent cascading failures
kubectl set env deployment/api -n cryptofunk ENABLE_VECTOR_SEARCH=false
```
5. Escalate to DBA:
   - Critical database issue requiring expert intervention
   - Prepare for potential restore from backup
   - Document all error messages and steps taken
Severity: Info
Description: No vector search operations detected for 30 minutes.
Impact: Potential issue with explainability features or simply no usage.
Response Steps:

1. Verify whether this is expected:
   - Check if users are actively using the system
   - Review the time of day (may be normal during off-hours)
2. Test vector search endpoint:

```bash
# Test similarity search
curl -X POST http://api-service:8080/api/v1/decisions/similar \
  -H "Content-Type: application/json" \
  -d '{"query": "test query", "limit": 5}'
```
3. Check API routing:

```bash
# Verify endpoint is registered
kubectl logs -n cryptofunk deployment/api --tail=100 | grep -i "route\|endpoint"
```
4. Verify database connectivity:

```bash
kubectl exec -it -n cryptofunk deployment/api -- \
  wget -O- http://localhost:8080/health
```

5. Review recent deployments:
   - Check if a recent API deployment broke the feature
   - Review the git log for changes to decision API handlers
Severity: Warning
Description: More than 90% of database connections are in use (9+/10).
Impact: New queries will be queued, including vector searches. Increased latency.
Response Steps:

1. Check current connection usage:

```bash
# From metrics
curl http://api-service:8080/metrics | grep database_connections

# From the database
kubectl exec -it -n cryptofunk deployment/postgres -- \
  psql -U postgres -d cryptofunk -c \
  "SELECT COUNT(*) as total_connections, COUNT(*) FILTER (WHERE state = 'active') as active, COUNT(*) FILTER (WHERE state = 'idle') as idle FROM pg_stat_activity WHERE datname = 'cryptofunk';"
```
2. Identify connection consumers:

```bash
kubectl exec -it -n cryptofunk deployment/postgres -- \
  psql -U postgres -d cryptofunk -c \
  "SELECT application_name, state, COUNT(*) FROM pg_stat_activity WHERE datname = 'cryptofunk' GROUP BY application_name, state ORDER BY COUNT(*) DESC;"
```
3. Check for connection leaks:

```bash
# Look for long-idle connections
kubectl exec -it -n cryptofunk deployment/postgres -- \
  psql -U postgres -d cryptofunk -c \
  "SELECT pid, usename, application_name, state, now() - state_change as idle_duration FROM pg_stat_activity WHERE state = 'idle' AND datname = 'cryptofunk' ORDER BY state_change LIMIT 10;"
```
4. Immediate mitigation:

```bash
# Terminate old idle connections (>10 minutes)
kubectl exec -it -n cryptofunk deployment/postgres -- \
  psql -U postgres -d cryptofunk -c \
  "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND now() - state_change > interval '10 minutes' AND datname = 'cryptofunk';"
```
5. Long-term fix (a pool-configuration sketch follows this list):
   - Review the connection pool configuration in internal/db/db.go (MaxConns: 10)
   - Consider increasing MaxConns if the server has the resources
   - Implement connection pooling with PgBouncer
   - Fix connection leaks in application code
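The real pool setup lives in internal/db/db.go; for orientation, here is a hedged sketch of a pgxpool configuration. MaxConns: 10 matches the value referenced in this runbook, while the idle/lifetime settings are illustrative assumptions that help reclaim leaked or long-idle connections:

```go
package db

import (
	"context"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
)

// NewPool builds a pgx connection pool. Only MaxConns is grounded in this
// runbook; the other knobs are assumptions shown for illustration.
func NewPool(ctx context.Context, dsn string) (*pgxpool.Pool, error) {
	cfg, err := pgxpool.ParseConfig(dsn)
	if err != nil {
		return nil, err
	}
	cfg.MaxConns = 10                      // the limit this alert fires against
	cfg.MaxConnIdleTime = 5 * time.Minute  // close idle connections early
	cfg.MaxConnLifetime = 30 * time.Minute // recycle connections periodically
	return pgxpool.NewWithConfig(ctx, cfg)
}
```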
Severity: Critical
Description: Zero idle database connections available.
Impact: All new queries blocked, including vector searches. System effectively frozen.
Response Steps:

1. Emergency assessment:

```bash
kubectl exec -it -n cryptofunk deployment/postgres -- \
  psql -U postgres -d cryptofunk -c \
  "SELECT pid, application_name, state, now() - query_start as duration, LEFT(query, 100) as query_preview FROM pg_stat_activity WHERE datname = 'cryptofunk' ORDER BY query_start LIMIT 20;"
```
2. Identify blocking queries:

```bash
# Simplified lock join; the pid != pid predicate excludes self-matches
kubectl exec -it -n cryptofunk deployment/postgres -- \
  psql -U postgres -d cryptofunk -c \
  "SELECT blocked_locks.pid AS blocked_pid, blocking_locks.pid AS blocking_pid, blocked_activity.query AS blocked_query, blocking_activity.query AS blocking_query FROM pg_locks blocked_locks JOIN pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid JOIN pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype AND blocking_locks.pid != blocked_locks.pid JOIN pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid WHERE NOT blocked_locks.granted AND blocking_locks.granted;"
```
3. Emergency termination:

```bash
# Kill the blocking query (replace <PID> with a blocking_pid from above)
kubectl exec -it -n cryptofunk deployment/postgres -- \
  psql -U postgres -d cryptofunk -c "SELECT pg_terminate_backend(<PID>);"
```
4. Restart services if needed:

```bash
# Last resort - restart API to release connections
kubectl rollout restart deployment/api -n cryptofunk
kubectl rollout status deployment/api -n cryptofunk
```

5. Post-incident (a query-timeout sketch follows this list):
   - Review the query that caused the blockage
   - Add query timeouts if missing
   - Consider implementing a PgBouncer connection pooler
   - Increase MaxConns in internal/db/db.go if appropriate
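One way to "add query timeouts" is to bound every database call with a context deadline, so a stuck query can never hold a pool connection indefinitely. A hedged sketch (the timeout value and function name are assumptions; the SQL is the recent-orders query from this runbook):

```go
package db

import (
	"context"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
)

// queryTimeout is illustrative; tune it per query class.
const queryTimeout = 5 * time.Second

// CountRecentOrders shows the pattern: derive a bounded context for each
// call. pgx cancels the running PostgreSQL query when the context expires,
// returning the connection to the pool instead of leaving it stuck.
func CountRecentOrders(ctx context.Context, pool *pgxpool.Pool) (int64, error) {
	ctx, cancel := context.WithTimeout(ctx, queryTimeout)
	defer cancel()

	var n int64
	err := pool.QueryRow(ctx,
		`SELECT COUNT(*) FROM orders WHERE created_at > NOW() - INTERVAL '1 hour'`,
	).Scan(&n)
	return n, err
}
```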
Severity: Warning
Description: P95 vector search latency >1s while database connections >8/10.
Impact: Contention causing slow queries, potential database bottleneck.
Response Steps:

1. Confirm contention:

```bash
# Check for lock waits
kubectl exec -it -n cryptofunk deployment/postgres -- \
  psql -U postgres -d cryptofunk -c \
  "SELECT wait_event_type, wait_event, COUNT(*) FROM pg_stat_activity WHERE state = 'active' GROUP BY wait_event_type, wait_event;"
```
2. Identify the resource bottleneck:

```bash
# Check CPU and I/O wait
kubectl exec -it -n cryptofunk deployment/postgres -- top -n 1
kubectl exec -it -n cryptofunk deployment/postgres -- iostat -x 1 5
```
3. Review query patterns:

```bash
# Check for sequential scans on vector searches
# Note: pg_stat_user_tables exposes relname, not tablename
kubectl exec -it -n cryptofunk deployment/postgres -- \
  psql -U postgres -d cryptofunk -c \
  "SELECT schemaname, relname, seq_scan, idx_scan FROM pg_stat_user_tables WHERE relname = 'llm_decisions';"
```
4. Remediation:
   - If high seq_scan: Rebuild indexes during maintenance
   - If high I/O: Consider read replicas for vector searches
   - If high CPU: Optimize query plans or scale the database
5. Monitor improvement:

```bash
watch -n 5 'curl -s http://prometheus-service:9090/api/v1/query \
  --data-urlencode "query=cryptofunk_database_connections_active" | jq'
```
Slack channels:
- #cryptofunk-alerts: All alerts (info, warning, critical)
- #cryptofunk-critical: Critical alerts only (@channel mentions)
- #cryptofunk-trading: Trading-specific alerts
- #cryptofunk-agents: Agent health alerts
- #cryptofunk-ops: System and circuit breaker alerts
Email recipients:
- team@example.com: Regular alerts (configured via ALERT_EMAIL_TO)
- oncall@example.com: Critical alerts (configured via CRITICAL_ALERT_EMAIL_TO)
Alert notification settings are configured in:
- Docker Compose: docker-compose.yml environment variables
- Kubernetes: deployments/k8s/base/secrets.yaml
Required environment variables:
```bash
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/YOUR/WEBHOOK/URL
SMTP_HOST=smtp.gmail.com
SMTP_PORT=587
SMTP_FROM=cryptofunk-alerts@example.com
SMTP_USERNAME=your-username
SMTP_PASSWORD=your-password
ALERT_EMAIL_TO=team@example.com
CRITICAL_ALERT_EMAIL_TO=oncall@example.com
```

Escalation Procedures:

Initial response:
- On-call engineer receives the alert
- Follows runbook procedures
- Attempts immediate remediation
- Updates #cryptofunk-critical with status
If the issue is not resolved within 30 minutes:
- Senior engineer joins incident response
- Reviews remediation attempts
- Coordinates with external dependencies (exchange, LLM provider)
If the trading system is down for >1 hour:
- Escalate to the engineering lead and product manager
- Consider an emergency maintenance window
- Prepare external communication for stakeholders
Update with your team's contact information:
- On-Call Engineer: [Slack: @oncall] [Phone: +1-XXX-XXX-XXXX]
- Senior Engineer: [Slack: @senior-eng] [Phone: +1-XXX-XXX-XXXX]
- Engineering Lead: [Slack: @eng-lead] [Phone: +1-XXX-XXX-XXXX]
- Product Manager: [Slack: @pm] [Phone: +1-XXX-XXX-XXXX]
Useful commands:

```bash
# View active alerts
curl http://alertmanager-service:9093/api/v2/alerts | jq

# Create silence for 1 hour
amtool silence add alertname=CircuitBreakerOpen --duration=1h \
  --comment="Investigating circuit breaker issue"

# Query Prometheus for alert history
curl -G http://prometheus-service:9090/api/v1/query \
  --data-urlencode 'query=ALERTS{alertname="CircuitBreakerOpen"}[1h]' | jq

# Overall system health
kubectl get pods -n cryptofunk

# Database health
kubectl exec -it -n cryptofunk deployment/postgres -- \
  psql -U postgres -c "SELECT 1;"

# Recent trading activity
kubectl exec -it -n cryptofunk deployment/postgres -- \
  psql -U postgres -d cryptofunk -c \
  "SELECT COUNT(*) FROM orders WHERE created_at > NOW() - INTERVAL '1 hour';"
```

Dashboards and endpoints:
- Prometheus: http://prometheus-service:9090
- AlertManager: http://alertmanager-service:9093
- Grafana: http://grafana-service:3000
- API Health: http://api-service:8080/health
- Orchestrator Health: http://orchestrator-service:8080/health
For detailed architecture information, see:
- docs/MCP_INTEGRATION.md
- docs/LLM_AGENT_ARCHITECTURE.md
- CLAUDE.md