AlertManager Runbook

This runbook provides operational guidance for responding to alerts generated by the CryptoFunk trading system.

Table of Contents

  • Alert Severity Levels
  • Alert Categories
  • Alert Response Procedures
  • Alert Notification Channels
  • Escalation Procedures
  • Useful Commands Reference
  • Additional Resources

Alert Severity Levels

Critical

  • Response Time: Immediate (within 15 minutes)
  • Impact: Trading system is down or at risk of significant financial loss
  • Actions:
    • Check alert details in Slack #cryptofunk-critical
    • Access AlertManager UI at http://alertmanager-service:9093
    • Follow specific alert procedure below
    • Escalate to on-call engineer if not resolved within 30 minutes

Warning

  • Response Time: Within 1 hour
  • Impact: System degradation or potential future issues
  • Actions:
    • Review alert in Slack #cryptofunk-alerts
    • Investigate root cause during business hours
    • Monitor for escalation to critical

Info

  • Response Time: Best effort
  • Impact: Informational, no immediate action required
  • Actions:
    • Review during regular maintenance windows
    • Use for trend analysis and capacity planning

Alert Categories

1. Circuit Breaker Alerts

2. Trading Alerts

3. Agent Health Alerts

4. System Resource Alerts

5. Infrastructure Alerts

6. Vector Search Alerts

Alert Response Procedures

CircuitBreakerOpen

Severity: Warning (can escalate to Critical)

Description: A circuit breaker has opened for a critical service (exchange, LLM, or database).

Impact:

  • Exchange circuit breaker: Cannot place new orders
  • LLM circuit breaker: Agent decision-making degraded
  • Database circuit breaker: Cannot persist data

Response Steps:

  1. Identify the affected service:

    # Check AlertManager
    curl http://alertmanager-service:9093/api/v2/alerts | jq '.[] | select(.labels.alertname == "CircuitBreakerOpen")'
    
    # Check Prometheus metrics
    curl http://prometheus-service:9090/api/v1/query?query=circuit_breaker_state | jq
  2. Check service health:

    # For exchange circuit breaker
    kubectl logs -n cryptofunk deployment/order-executor-server --tail=100
    
    # For LLM circuit breaker
    kubectl logs -n cryptofunk deployment/bifrost --tail=100
    
    # For database circuit breaker
    kubectl logs -n cryptofunk deployment/postgres --tail=100
    kubectl exec -it -n cryptofunk deployment/postgres -- psql -U postgres -c "SELECT 1;"
  3. Check failure rate:

    # View recent failures
    circuit_breaker_failures_total{service="exchange"}
    circuit_breaker_requests_total{service="exchange",result="failure"}
  4. Remediation:

    • Exchange: Check API keys, rate limits, network connectivity
    • LLM: Check Bifrost logs, verify API keys, check provider status
    • Database: Check connection pool, disk space, PostgreSQL logs
  5. Recovery:

    • Circuit breakers automatically transition to half-open state after timeout
    • Monitor metrics for successful requests
    • If failures persist, investigate root cause before forcing reset
  6. Prevention:

    • Review error logs for patterns
    • Adjust circuit breaker thresholds if needed (internal/risk/circuit_breaker.go)
    • Implement additional retry logic or fallback mechanisms
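The automatic closed → open → half-open cycle described in step 5 can be sketched as a small state machine. This is a simplified illustration with made-up thresholds; the real logic and configuration live in internal/risk/circuit_breaker.go:

```shell
# Simplified circuit-breaker state machine: opens after FAILURE_THRESHOLD
# consecutive failures, permits a trial request (half-open) once the reset
# timeout elapses, and closes again on success. Thresholds are illustrative.
FAILURE_THRESHOLD=3
state="closed"
failures=0

record_result() {  # $1 = "success" or "failure"
  case "$1" in
    failure)
      failures=$((failures + 1))
      # A failure in half-open (count still at/over threshold) reopens too
      if [ "$failures" -ge "$FAILURE_THRESHOLD" ]; then state="open"; fi
      ;;
    success)
      failures=0
      state="closed"
      ;;
  esac
}

timeout_elapsed() {  # invoked when the reset timeout expires
  if [ "$state" = "open" ]; then state="half-open"; fi
}

record_result failure; record_result failure; record_result failure
echo "after 3 failures: $state"   # open
timeout_elapsed
echo "after timeout:    $state"   # half-open
record_result success
echo "after success:    $state"   # closed
```

The takeaway for remediation: forcing a reset skips the half-open probe, which is why persistent failures should be diagnosed before any manual reset.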

HighErrorRate

Severity: Critical

Description: Error rate exceeds threshold (>5% for 5 minutes).

Impact: System functionality degraded, potential financial loss.

Response Steps:

  1. Check error logs:

    kubectl logs -n cryptofunk deployment/orchestrator --tail=200 | grep -i error
    kubectl logs -n cryptofunk deployment/api --tail=200 | grep -i error
  2. Identify error patterns:

    # Check Prometheus for error metrics
    rate(http_requests_total{status=~"5.."}[5m])
    rate(mcp_tool_errors_total[5m])
  3. Check dependencies:

    • Database connectivity
    • Redis availability
    • NATS messaging
    • External APIs (CoinGecko, exchange)
  4. Remediation:

    • If database: Check connections, run migrations
    • If external API: Check rate limits, verify credentials
    • If internal service: Check resource limits, restart if needed
  5. Communication:

    • Update #cryptofunk-critical with status
    • Notify stakeholders if trading is affected
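The 5% threshold in this alert is a ratio of two counter rates. The sketch below reproduces that arithmetic by hand with illustrative counter samples taken 5 minutes apart, mirroring what `rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])` evaluates:

```shell
# Illustrative error-rate check: counter values sampled 5 minutes apart.
total_start=10000; total_end=10400   # http_requests_total
err_start=120;     err_end=150       # http_requests_total{status=~"5.."}

errors=$((err_end - err_start))          # 30 failed requests
requests=$((total_end - total_start))    # 400 total requests
rate_pct=$((100 * errors / requests))    # integer math: 7 (true value 7.5%)

echo "error rate: ${rate_pct}%"
if [ "$rate_pct" -gt 5 ]; then
  echo "ALERT: above 5% threshold"
fi
```

Checking the ratio manually like this is useful when deciding whether a spike is sustained (alert-worthy) or a transient blip caused by a single dependency hiccup.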

TradingSessionFailure

Severity: Critical

Description: Trading session failed to start or crashed.

Impact: No trading activity, potential missed opportunities.

Response Steps:

  1. Check orchestrator status:

    kubectl logs -n cryptofunk deployment/orchestrator --tail=100
    kubectl get pods -n cryptofunk -l app.kubernetes.io/name=orchestrator
  2. Verify agent health:

    kubectl get pods -n cryptofunk -l app.kubernetes.io/component=trading-agent
    
    # Check agent_status table
    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT agent_name, status, last_heartbeat FROM agent_status ORDER BY last_heartbeat DESC;"
  3. Check MCP server connectivity:

    # Test market data server
    kubectl exec -it -n cryptofunk deployment/orchestrator -- \
      curl http://market-data-server:9201/health
    
    # Test other MCP servers (substitute each server's port, from its
    # Service definition, for 920X)
    for server in technical-indicators risk-analyzer order-executor; do
      echo "Checking $server..."
      kubectl exec -it -n cryptofunk deployment/orchestrator -- \
        curl http://${server}-server:920X/health
    done
  4. Restart trading session:

    # Gracefully restart orchestrator
    kubectl rollout restart deployment/orchestrator -n cryptofunk
    
    # Monitor startup
    kubectl logs -f -n cryptofunk deployment/orchestrator
  5. Verify recovery:

    • Check trading_sessions table for new session
    • Verify agent signals are being generated
    • Monitor order placement

PositionLimitExceeded

Severity: Warning (escalates to Critical if repeated)

Description: Position size or exposure exceeded configured limits.

Impact: Risk management violation, potential for excessive loss.

Response Steps:

  1. Check current positions:

    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT symbol, side, quantity, entry_price, current_price, unrealized_pnl
       FROM positions WHERE status = 'OPEN' ORDER BY unrealized_pnl;"
  2. Review risk limits:

    # Check risk analyzer configuration
    kubectl logs -n cryptofunk deployment/risk-analyzer-server | grep -i "limit"
  3. Manual intervention:

    • If position is truly too large, consider reducing exposure
    • Review strategy agent that opened the position
    • Update risk limits if appropriate
  4. Investigation:

    • Check why risk agent allowed the position
    • Review agent_signals table for decision reasoning
    • Verify risk calculation logic
  5. Prevention:

    • Adjust position sizing parameters
    • Implement stricter pre-trade risk checks
    • Review and update risk limits in configs/config.yaml
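The stricter pre-trade check suggested in step 5 can be sketched as a notional-exposure gate. The limit name, prices, and quantities below are hypothetical; the real limits belong in configs/config.yaml:

```shell
# Hypothetical pre-trade gate: reject an order whose notional value would
# push a symbol's total exposure past MAX_POSITION_USD. All figures are
# illustrative, not values from the actual configuration.
MAX_POSITION_USD=50000
current_exposure=42000   # open position notional for the symbol, USD
order_qty=2              # units in the new order
order_price=5000         # USD per unit

notional=$((order_qty * order_price))          # 10000
projected=$((current_exposure + notional))     # 52000

if [ "$projected" -gt "$MAX_POSITION_USD" ]; then
  echo "REJECT: projected exposure \$$projected exceeds limit \$$MAX_POSITION_USD"
else
  echo "ACCEPT: projected exposure \$$projected"
fi
```

Running this gate before order submission (rather than relying on the risk agent's post-hoc review) prevents the limit breach instead of merely alerting on it.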

MaxDrawdownExceeded

Severity: Critical

Description: Portfolio drawdown exceeded maximum threshold.

Impact: Circuit breaker triggered, trading halted.

Response Steps:

  1. Verify drawdown:

    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT session_id, total_pnl, max_drawdown, roi_percentage
       FROM trading_sessions ORDER BY created_at DESC LIMIT 1;"
  2. Review losing trades:

    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT symbol, side, quantity, entry_price, exit_price, realized_pnl
       FROM positions WHERE status = 'CLOSED' AND realized_pnl < 0
       ORDER BY realized_pnl LIMIT 20;"
  3. Assess market conditions:

    • Check for unusual market volatility
    • Review recent price movements
    • Identify if drawdown is strategy-specific or market-wide
  4. Decision point:

    • Continue trading: If drawdown is temporary and market conditions are normalizing
    • Pause trading: If market conditions are adverse or strategy is flawed
    • Modify strategy: Adjust parameters if specific strategy is underperforming
  5. Recovery:

    # If resuming trading, restart orchestrator
    kubectl rollout restart deployment/orchestrator -n cryptofunk
    
    # Monitor closely for continued losses
    kubectl logs -f -n cryptofunk deployment/risk-agent
  6. Post-mortem:

    • Analyze what led to drawdown
    • Review strategy performance metrics
    • Update risk parameters if needed
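When verifying the drawdown figure in step 1, it helps to remember how max drawdown is defined: the largest peak-to-trough decline, as a percentage of the running peak. A worked example with illustrative equity values:

```shell
# Max drawdown from an equity curve: track the running peak and record the
# deepest percentage decline from it. Values are illustrative.
equity="100000 104000 101000 98000 96000 99000"

awk -v vals="$equity" 'BEGIN {
  n = split(vals, v, " ")
  peak = v[1]; maxdd = 0
  for (i = 1; i <= n; i++) {
    if (v[i] > peak) peak = v[i]          # new high-water mark
    dd = (peak - v[i]) / peak * 100       # decline from peak, percent
    if (dd > maxdd) maxdd = dd
  }
  printf "max drawdown: %.2f%%\n", maxdd
}'
# prints: max drawdown: 7.69%
```

Note the drawdown is measured from the 104000 peak, not the 100000 starting equity, so a session can show a drawdown breach while its overall PnL is still near flat.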

AgentDown / AgentUnhealthy

Severity: Warning

Description: Trading agent is not responding or failing health checks.

Impact: Reduced decision-making capacity, potential missed trading opportunities.

Response Steps:

  1. Identify affected agent:

    # Check pod status
    kubectl get pods -n cryptofunk -l app.kubernetes.io/component=trading-agent
    
    # Check agent health in database
    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT agent_name, status, last_heartbeat, error_message
       FROM agent_status WHERE status != 'ACTIVE';"
  2. Check agent logs:

    # Replace <agent-name> with specific agent (e.g., technical-agent)
    kubectl logs -n cryptofunk deployment/<agent-name> --tail=100
  3. Common issues:

    • LLM API failure: Check Bifrost logs and API key validity
    • MCP server connectivity: Verify MCP servers are healthy
    • Database connection: Check connection pool and database health
    • Resource limits: Check if pod is OOM or CPU throttled
  4. Restart agent:

    kubectl rollout restart deployment/<agent-name> -n cryptofunk
    kubectl logs -f -n cryptofunk deployment/<agent-name>
  5. Verify recovery:

    • Check agent_status table shows ACTIVE
    • Verify agent is generating signals
    • Monitor for repeated failures
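The health check behind this alert amounts to heartbeat-staleness detection. A minimal sketch, with a hypothetical timeout and illustrative epoch timestamps (in practice the timestamps come from the agent_status query in step 1):

```shell
# Hypothetical staleness check: flag an agent whose last heartbeat is older
# than HEARTBEAT_TIMEOUT seconds. Epoch values are illustrative.
HEARTBEAT_TIMEOUT=60
now=1700000200
last_heartbeat=1700000050   # e.g. last_heartbeat from agent_status

age=$((now - last_heartbeat))
if [ "$age" -gt "$HEARTBEAT_TIMEOUT" ]; then
  echo "STALE: last heartbeat ${age}s ago (timeout ${HEARTBEAT_TIMEOUT}s)"
else
  echo "OK: last heartbeat ${age}s ago"
fi
```

A heartbeat that is stale but whose pod is Running usually points at an internal hang (LLM call, DB wait) rather than a crash, which is why checking the logs in step 2 before restarting matters.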

AgentHighLatency

Severity: Warning

Description: Agent decision-making taking longer than expected (>5 seconds).

Impact: Delayed trading decisions, potential missed opportunities.

Response Steps:

  1. Check agent performance metrics:

    # Query Prometheus
    histogram_quantile(0.95, rate(agent_decision_duration_seconds_bucket[5m]))
  2. Identify bottleneck:

    • LLM latency: Check Bifrost response times
    • MCP tool latency: Check tool call durations
    • Database queries: Check slow query log
    • Resource constraints: Check CPU/memory usage
  3. Investigate LLM performance:

    kubectl logs -n cryptofunk deployment/bifrost --tail=100 | grep -i latency
  4. Database optimization:

    # Check for slow queries
    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT query, mean_exec_time, calls
       FROM pg_stat_statements
       ORDER BY mean_exec_time DESC LIMIT 10;"
  5. Remediation:

    • If LLM: Switch to faster model or increase timeout
    • If database: Add indexes, optimize queries
    • If resource: Increase pod limits
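To interpret the `histogram_quantile` query in step 1, it helps to see the interpolation it performs: find the cumulative bucket containing the target rank, then interpolate linearly inside it. The bucket bounds and counts below are made up for illustration:

```shell
# How histogram_quantile(0.95, ...) estimates p95 from cumulative buckets
# (le = upper bound of each bucket). Counts are illustrative.
awk 'BEGIN {
  le[1] = 1;   c[1] = 40
  le[2] = 2.5; c[2] = 70
  le[3] = 5;   c[3] = 95
  le[4] = 10;  c[4] = 100
  rank = 0.95 * c[4]       # the 95th observation out of 100
  prev_le = 0; prev_c = 0
  for (i = 1; i <= 4; i++) {
    if (c[i] >= rank) {
      # linear interpolation within bucket i
      q = prev_le + (le[i] - prev_le) * (rank - prev_c) / (c[i] - prev_c)
      printf "p95 ~ %.2fs\n", q
      exit
    }
    prev_le = le[i]; prev_c = c[i]
  }
}'
```

Because the estimate is interpolated within bucket bounds, a p95 that sits exactly on a bucket edge (as here) can mask wide variation inside that bucket; widening the bucket layout improves the estimate.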

HighMemoryUsage / HighCPUUsage

Severity: Warning

Description: Pod memory or CPU usage exceeding 80%.

Impact: Performance degradation, potential pod eviction or OOM kill.

Response Steps:

  1. Identify affected pod:

    kubectl top pods -n cryptofunk --sort-by=memory
    kubectl top pods -n cryptofunk --sort-by=cpu
  2. Check resource limits:

    kubectl describe pod -n cryptofunk <pod-name> | grep -A 10 "Limits:"
  3. Investigate memory leak:

    # For Go applications, capture a heap profile (use -i, not -it, so the
    # TTY doesn't mangle the binary profile during redirection)
    kubectl exec -i -n cryptofunk <pod-name> -- \
      curl -s http://localhost:6060/debug/pprof/heap > heap.prof
    
    # Analyze with pprof
    go tool pprof heap.prof
  4. Remediation:

    • Short-term: Increase resource limits
    • Long-term: Fix memory leak or optimize code
  5. Increase resources:

    # Edit deployment to increase limits
    kubectl edit deployment <deployment-name> -n cryptofunk
    
    # Or use kubectl patch
    kubectl patch deployment <deployment-name> -n cryptofunk -p \
      '{"spec":{"template":{"spec":{"containers":[{"name":"<container>","resources":{"limits":{"memory":"2Gi"}}}]}}}}'

DiskSpaceWarning

Severity: Warning

Description: Persistent volume disk usage exceeding 80%.

Impact: Database or storage failures if disk fills completely.

Response Steps:

  1. Check disk usage:

    kubectl exec -it -n cryptofunk deployment/postgres -- df -h
  2. Identify large tables:

    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT schemaname, tablename,
              pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
       FROM pg_tables
       WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
       ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
       LIMIT 10;"
  3. Apply TimescaleDB compression:

    # Compress old data
    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT compress_chunk(i, if_not_compressed => true)
       FROM show_chunks('candlesticks', older_than => INTERVAL '7 days') i;"
  4. Archive old data:

    # Export old trading sessions (pg_dump has no row-level filter, so use
    # COPY with a query; -i avoids TTY mangling of the redirected output)
    kubectl exec -i -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "COPY (SELECT * FROM trading_sessions
             WHERE created_at < NOW() - INTERVAL '90 days')
       TO STDOUT WITH CSV HEADER;" > archive.csv
    
    # Delete archived data
    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "DELETE FROM trading_sessions WHERE created_at < NOW() - INTERVAL '90 days';"
  5. Expand volume:

    # Increase PVC size (requires storage class with allowVolumeExpansion: true)
    kubectl patch pvc postgres-pvc -n cryptofunk -p \
      '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'

VectorSearchHighLatency

Severity: Warning

Description: Vector search p95 latency exceeds 2 seconds for 5 minutes.

Impact: Slow explainability queries, degraded user experience in decision similarity searches.

Response Steps:

  1. Check current latency:

    # Query Prometheus for p95 latency by operation
    curl -G http://prometheus-service:9090/api/v1/query \
      --data-urlencode 'query=histogram_quantile(0.95, rate(cryptofunk_vector_search_latency_seconds_bucket[5m]))' | jq
  2. Identify slow operations:

    # Gauge recent decision volume (a rough proxy for search load)
    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT
         'semantic_search' as operation,
         COUNT(*) as total_searches,
         AVG(EXTRACT(EPOCH FROM (NOW() - created_at))) as avg_age_seconds
       FROM llm_decisions
       WHERE created_at > NOW() - INTERVAL '15 minutes';"
  3. Check pgvector index health:

    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT
         schemaname, tablename, indexname,
         pg_size_pretty(pg_relation_size(indexrelid)) as index_size,
         idx_scan, idx_tup_read, idx_tup_fetch
       FROM pg_stat_user_indexes
       WHERE indexname LIKE '%embedding%' OR indexname LIKE '%vector%';"
  4. Common causes and remediation:

    a. Large result sets:

    # Check if returning too many results
    kubectl logs -n cryptofunk deployment/api --tail=100 | grep -i "vector search"
    • Solution: Reduce the LIMIT in vector search queries (the expected default is 10)

    b. Missing or outdated index:

    # Rebuild IVFFlat index if needed
    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "REINDEX INDEX CONCURRENTLY llm_decisions_prompt_embedding_idx;"

    c. Database resource contention:

    # Check active connections and locks
    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT COUNT(*) as active_queries,
              COUNT(*) FILTER (WHERE state = 'active' AND wait_event_type = 'Lock') as waiting_on_locks
       FROM pg_stat_activity WHERE state = 'active';"
  5. Immediate mitigation:

    # If the rebuild doesn't help, tune the IVFFlat list count. DROP/CREATE
    # INDEX CONCURRENTLY avoid blocking writes but cannot run inside a
    # transaction block, so issue them as separate statements.
    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "DROP INDEX CONCURRENTLY IF EXISTS llm_decisions_prompt_embedding_idx;"
    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "CREATE INDEX CONCURRENTLY llm_decisions_prompt_embedding_idx ON llm_decisions
       USING ivfflat (prompt_embedding vector_cosine_ops)
       WITH (lists = 200);"  # Increased from 100 for better performance
  6. Monitor recovery:

    # Watch latency metrics
    watch -n 5 'curl -s -G http://prometheus-service:9090/api/v1/query \
      --data-urlencode "query=histogram_quantile(0.95, rate(cryptofunk_vector_search_latency_seconds_bucket[5m]))" | jq'

VectorSearchCriticalLatency

Severity: Critical

Description: Vector search p99 latency exceeds 5 seconds for 2 minutes.

Impact: Explainability features effectively broken, user-facing timeouts.

Response Steps:

  1. Immediate assessment:

    # Check if vector searches are timing out
    kubectl logs -n cryptofunk deployment/api --tail=50 | grep -i "timeout\|error"
  2. Check database health:

    # Check for long-running queries
    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT pid, now() - query_start as duration, state, query
       FROM pg_stat_activity
       WHERE query LIKE '%<=>%' OR query LIKE '%vector%'
       ORDER BY duration DESC LIMIT 5;"
  3. Emergency mitigation - Kill slow queries:

    # If queries are stuck, terminate them
    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT pg_terminate_backend(pid)
       FROM pg_stat_activity
       WHERE query LIKE '%<=>%'
       AND state = 'active'
       AND now() - query_start > interval '10 seconds';"
  4. Check disk I/O:

    # High disk I/O can cause slowdowns
    kubectl exec -it -n cryptofunk deployment/postgres -- iostat -x 2 5
  5. Temporary disable explainability if needed:

    # If blocking critical trading functionality, temporarily disable
    kubectl set env deployment/api -n cryptofunk ENABLE_VECTOR_SEARCH=false
    kubectl rollout status deployment/api -n cryptofunk
  6. Escalate:

    • This is a critical issue requiring immediate database expert involvement
    • Notify #cryptofunk-critical with details
    • Prepare for potential database restart or index rebuild

VectorSearchHighErrorRate

Severity: Warning

Description: Vector search error rate exceeds 5% for 5 minutes.

Impact: Partial failure of explainability features, some queries failing.

Response Steps:

  1. Check error patterns:

    # Review API logs for error details
    kubectl logs -n cryptofunk deployment/api --tail=200 | grep -i "vector\|error" | tail -20
  2. Common error types:

    a. pgvector extension issues:

    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT * FROM pg_extension WHERE extname = 'vector';"
    • If missing: Reinstall pgvector extension

    b. Invalid embedding dimensions:

    # Check for dimension mismatches
    kubectl logs -n cryptofunk deployment/api --tail=100 | grep -i "dimension\|1536"
    • Solution: Verify LLM embeddings are 1536 dimensions (OpenAI ada-002)

    c. Connection pool exhaustion:

    # Check if database connections are available
    curl http://api-service:8080/metrics | grep database_connections
  3. Verify data integrity:

    # Check for NULL embeddings
    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT COUNT(*) as null_embeddings
       FROM llm_decisions
       WHERE prompt_embedding IS NULL
       AND created_at > NOW() - INTERVAL '1 hour';"
  4. Remediation:

    # Restart API if connection issues
    kubectl rollout restart deployment/api -n cryptofunk
    
    # Monitor recovery
    kubectl logs -f -n cryptofunk deployment/api | grep -i "vector"
  5. Verify recovery:

    # Check error rate has dropped
    watch -n 10 'curl -s http://prometheus-service:9090/api/v1/query \
      --data-urlencode "query=rate(cryptofunk_vector_search_operations_total{status=\"error\"}[5m])" | jq'

VectorSearchCriticalErrorRate

Severity: Critical

Description: Vector search error rate exceeds 20% for 2 minutes.

Impact: Vector search effectively broken, explainability features unavailable.

Response Steps:

  1. Immediate triage:

    # Get recent error samples
    kubectl logs -n cryptofunk deployment/api --tail=50 | grep -A 5 "vector.*error"
  2. Check pgvector extension:

    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT extname, extversion FROM pg_extension WHERE extname = 'vector';"

    If not found:

    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c "CREATE EXTENSION IF NOT EXISTS vector;"
  3. Check index corruption:

    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT pg_relation_size('llm_decisions_prompt_embedding_idx') as index_size;"

    If 0 bytes, rebuild index:

    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "REINDEX INDEX CONCURRENTLY llm_decisions_prompt_embedding_idx;"
  4. Emergency response - Disable feature:

    # Prevent cascading failures
    kubectl set env deployment/api -n cryptofunk ENABLE_VECTOR_SEARCH=false
  5. Escalate to DBA:

    • Critical database issue requiring expert intervention
    • Prepare for potential restore from backup
    • Document all error messages and steps taken

VectorSearchNoOperations

Severity: Info

Description: No vector search operations detected for 30 minutes.

Impact: Potential issue with explainability features or simply no usage.

Response Steps:

  1. Verify if expected:

    • Check if users are actively using the system
    • Review time of day (may be normal during off-hours)
  2. Test vector search endpoint:

    # Test similarity search
    curl -X POST http://api-service:8080/api/v1/decisions/similar \
      -H "Content-Type: application/json" \
      -d '{"query": "test query", "limit": 5}'
  3. Check API routing:

    # Verify endpoint is registered
    kubectl logs -n cryptofunk deployment/api --tail=100 | grep -i "route\|endpoint"
  4. Verify database connectivity:

    kubectl exec -it -n cryptofunk deployment/api -- \
      wget -O- http://localhost:8080/health
  5. Review recent deployments:

    • Check if recent API deployment broke the feature
    • Review git log for changes to decision API handlers

DatabaseConnectionPoolExhausted

Severity: Warning

Description: More than 90% of the database connection pool is in use (9 or more of 10 connections).

Impact: New queries will be queued, including vector searches. Increased latency.

Response Steps:

  1. Check current connection usage:

    # From metrics
    curl http://api-service:8080/metrics | grep database_connections
    
    # From database
    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT COUNT(*) as total_connections,
              COUNT(*) FILTER (WHERE state = 'active') as active,
              COUNT(*) FILTER (WHERE state = 'idle') as idle
       FROM pg_stat_activity WHERE datname = 'cryptofunk';"
  2. Identify connection consumers:

    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT application_name, state, COUNT(*)
       FROM pg_stat_activity
       WHERE datname = 'cryptofunk'
       GROUP BY application_name, state
       ORDER BY COUNT(*) DESC;"
  3. Check for connection leaks:

    # Look for long-idle connections
    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT pid, usename, application_name, state,
              now() - state_change as idle_duration
       FROM pg_stat_activity
       WHERE state = 'idle' AND datname = 'cryptofunk'
       ORDER BY state_change LIMIT 10;"
  4. Immediate mitigation:

    # Terminate old idle connections (>10 minutes)
    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT pg_terminate_backend(pid)
       FROM pg_stat_activity
       WHERE state = 'idle'
       AND now() - state_change > interval '10 minutes'
       AND datname = 'cryptofunk';"
  5. Long-term fix:

    • Review connection pool configuration in internal/db/db.go (MaxConns: 10)
    • Consider increasing MaxConns if server has resources
    • Deploy PgBouncer as a server-side connection pooler
    • Fix connection leaks in application code
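When weighing an increase to MaxConns, check that the aggregate client-side pools still fit under the server's max_connections. A back-of-envelope sanity check with illustrative figures (the real MaxConns is in internal/db/db.go; max_connections is a PostgreSQL server setting):

```shell
# Pool-sizing sanity check: total pool slots across replicas must stay
# below the server's max_connections minus admin headroom. Figures are
# illustrative, not the deployment's actual settings.
max_connections=100   # PostgreSQL server limit
reserved=10           # headroom for superuser / maintenance sessions
replicas=4            # pods that each hold a connection pool
pool_per_replica=10   # MaxConns per pool

needed=$((replicas * pool_per_replica))
available=$((max_connections - reserved))
echo "pool slots needed: $needed, available: $available"
if [ "$needed" -gt "$available" ]; then
  echo "WARNING: pools can exhaust server connections"
fi
```

This is why raising MaxConns without counting replicas can convert a client-side "pool exhausted" warning into the far worse server-side "too many connections" failure.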

DatabaseConnectionPoolCritical

Severity: Critical

Description: Zero idle database connections available.

Impact: All new queries blocked, including vector searches. System effectively frozen.

Response Steps:

  1. Emergency assessment:

    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT pid, application_name, state,
              now() - query_start as duration,
              LEFT(query, 100) as query_preview
       FROM pg_stat_activity
       WHERE datname = 'cryptofunk'
       ORDER BY query_start LIMIT 20;"
  2. Identify blocking queries:

    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT blocked_locks.pid AS blocked_pid,
              blocking_locks.pid AS blocking_pid,
              blocked_activity.query AS blocked_query,
              blocking_activity.query AS blocking_query
       FROM pg_locks blocked_locks
       JOIN pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
       JOIN pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype
       JOIN pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
       WHERE NOT blocked_locks.granted AND blocking_locks.granted;"
  3. Emergency termination:

    # Kill blocking query (replace <PID> from above)
    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c "SELECT pg_terminate_backend(<PID>);"
  4. Restart services if needed:

    # Last resort - restart API to release connections
    kubectl rollout restart deployment/api -n cryptofunk
    kubectl rollout status deployment/api -n cryptofunk
  5. Post-incident:

    • Review query that caused blockage
    • Add query timeouts if missing
    • Consider implementing PgBouncer connection pooler
    • Increase MaxConns in internal/db/db.go if appropriate

VectorSearchSlowWithHighConnections

Severity: Warning

Description: P95 vector search latency >1s while database connections >8/10.

Impact: Contention causing slow queries, potential database bottleneck.

Response Steps:

  1. Confirm contention:

    # Check for lock waits
    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT wait_event_type, wait_event, COUNT(*)
       FROM pg_stat_activity
       WHERE state = 'active'
       GROUP BY wait_event_type, wait_event;"
  2. Identify resource bottleneck:

    # Check CPU and I/O wait
    kubectl exec -it -n cryptofunk deployment/postgres -- top -b -n 1
    kubectl exec -it -n cryptofunk deployment/postgres -- iostat -x 1 5
  3. Review query patterns:

    # Check for sequential scans on vector searches
    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT schemaname, tablename, seq_scan, idx_scan
       FROM pg_stat_user_tables
       WHERE tablename = 'llm_decisions';"
  4. Remediation:

    • If high seq_scan: Rebuild indexes during maintenance
    • If high I/O: Consider read replicas for vector searches
    • If high CPU: Optimize query plans or scale database
  5. Monitor improvement:

    watch -n 5 'curl -s http://prometheus-service:9090/api/v1/query \
      --data-urlencode "query=cryptofunk_database_connections_active" | jq'

Alert Notification Channels

Slack Channels

  • #cryptofunk-alerts: All alerts (info, warning, critical)
  • #cryptofunk-critical: Critical alerts only (@channel mentions)
  • #cryptofunk-trading: Trading-specific alerts
  • #cryptofunk-agents: Agent health alerts
  • #cryptofunk-ops: System and circuit breaker alerts

Email

Configuration

Alert notification settings are configured in:

  • Docker Compose: docker-compose.yml environment variables
  • Kubernetes: deployments/k8s/base/secrets.yaml

Required environment variables:

SLACK_WEBHOOK_URL=https://hooks.slack.com/services/YOUR/WEBHOOK/URL
SMTP_HOST=smtp.gmail.com
SMTP_PORT=587
SMTP_FROM=cryptofunk-alerts@example.com
SMTP_USERNAME=your-username
SMTP_PASSWORD=your-password
ALERT_EMAIL_TO=team@example.com
CRITICAL_ALERT_EMAIL_TO=oncall@example.com

Escalation Procedures

Level 1: Team Response (0-30 minutes)

  • On-call engineer receives alert
  • Follows runbook procedures
  • Attempts immediate remediation
  • Updates #cryptofunk-critical with status

Level 2: Senior Engineer Escalation (30-60 minutes)

  • If issue not resolved within 30 minutes
  • Senior engineer joins incident response
  • Reviews remediation attempts
  • Coordinates with external dependencies (exchange, LLM provider)

Level 3: Emergency Escalation (60+ minutes)

  • If trading system down for >1 hour
  • Escalate to engineering lead and product manager
  • Consider emergency maintenance window
  • Prepare external communication for stakeholders

Escalation Contacts

Update with your team's contact information:

  • On-Call Engineer: [Slack: @oncall] [Phone: +1-XXX-XXX-XXXX]
  • Senior Engineer: [Slack: @senior-eng] [Phone: +1-XXX-XXX-XXXX]
  • Engineering Lead: [Slack: @eng-lead] [Phone: +1-XXX-XXX-XXXX]
  • Product Manager: [Slack: @pm] [Phone: +1-XXX-XXX-XXXX]

Useful Commands Reference

Check All Alert Status

curl http://alertmanager-service:9093/api/v2/alerts | jq

Silence an Alert

# Create silence for 1 hour
amtool silence add alertname=CircuitBreakerOpen --duration=1h \
  --comment="Investigating circuit breaker issue"

View Alert History

# Query Prometheus for alert history
curl -G http://prometheus-service:9090/api/v1/query \
  --data-urlencode 'query=ALERTS{alertname="CircuitBreakerOpen"}[1h]' | jq

Check Trading System Health

# Overall system health
kubectl get pods -n cryptofunk

# Database health
kubectl exec -it -n cryptofunk deployment/postgres -- \
  psql -U postgres -c "SELECT 1;"

# Recent trading activity
kubectl exec -it -n cryptofunk deployment/postgres -- \
  psql -U postgres -d cryptofunk -c \
  "SELECT COUNT(*) FROM orders WHERE created_at > NOW() - INTERVAL '1 hour';"

Additional Resources

For detailed architecture information, see:

  • docs/MCP_INTEGRATION.md
  • docs/LLM_AGENT_ARCHITECTURE.md
  • CLAUDE.md