AlertManager Runbook

This runbook provides operational guidance for responding to alerts generated by the CryptoFunk trading system.

Table of Contents

  • Alert Severity Levels
  • Alert Categories
  • Alert Response Procedures
  • Alert Notification Channels
  • Escalation Procedures
  • Useful Commands Reference
  • Additional Resources

Alert Severity Levels

Critical

  • Response Time: Immediate (within 15 minutes)
  • Impact: Trading system is down or at risk of significant financial loss
  • Actions:
    • Check alert details in Slack #cryptofunk-critical
    • Access AlertManager UI at http://alertmanager-service:9093
    • Follow specific alert procedure below
    • Escalate to on-call engineer if not resolved within 30 minutes

Warning

  • Response Time: Within 1 hour
  • Impact: System degradation or potential future issues
  • Actions:
    • Review alert in Slack #cryptofunk-alerts
    • Investigate root cause during business hours
    • Monitor for escalation to critical

Info

  • Response Time: Best effort
  • Impact: Informational, no immediate action required
  • Actions:
    • Review during regular maintenance windows
    • Use for trend analysis and capacity planning

Alert Categories

1. Circuit Breaker Alerts

2. Trading Alerts

3. Agent Health Alerts

4. System Resource Alerts

5. Infrastructure Alerts

6. Vector Search Alerts

Alert Response Procedures

CircuitBreakerOpen

Severity: Warning (can escalate to Critical)

Description: A circuit breaker has opened for a critical service (exchange, LLM, or database).

Impact:

  • Exchange circuit breaker: Cannot place new orders
  • LLM circuit breaker: Agent decision-making degraded
  • Database circuit breaker: Cannot persist data

Response Steps:

  1. Identify the affected service:

    # Check AlertManager
    curl http://alertmanager-service:9093/api/v2/alerts | jq '.[] | select(.labels.alertname == "CircuitBreakerOpen")'
    
    # Check Prometheus metrics
    curl http://prometheus-service:9090/api/v1/query?query=circuit_breaker_state | jq
  2. Check service health:

    # For exchange circuit breaker
    kubectl logs -n cryptofunk deployment/order-executor-server --tail=100
    
    # For LLM circuit breaker
    kubectl logs -n cryptofunk deployment/bifrost --tail=100
    
    # For database circuit breaker
    kubectl logs -n cryptofunk deployment/postgres --tail=100
    kubectl exec -it -n cryptofunk deployment/postgres -- psql -U postgres -c "SELECT 1;"
  3. Check failure rate:

    # View recent failures
    circuit_breaker_failures_total{service="exchange"}
    circuit_breaker_requests_total{service="exchange",result="failure"}
  4. Remediation:

    • Exchange: Check API keys, rate limits, network connectivity
    • LLM: Check Bifrost logs, verify API keys, check provider status
    • Database: Check connection pool, disk space, PostgreSQL logs
  5. Recovery:

    • Circuit breakers automatically transition to half-open state after timeout
    • Monitor metrics for successful requests
    • If failures persist, investigate root cause before forcing reset
  6. Prevention:

    • Review error logs for patterns
    • Adjust circuit breaker thresholds if needed (internal/risk/circuit_breaker.go)
    • Implement additional retry logic or fallback mechanisms
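The automatic closed → open → half-open cycle described in step 5 can be sketched as a small state machine. This is a simplified illustration with made-up thresholds; the real logic and configuration live in internal/risk/circuit_breaker.go:

```shell
# Simplified circuit-breaker state machine: opens after FAILURE_THRESHOLD
# consecutive failures, permits a trial request (half-open) once the reset
# timeout elapses, and closes again on success. Thresholds are illustrative.
FAILURE_THRESHOLD=3
state="closed"
failures=0

record_result() {  # $1 = "success" or "failure"
  case "$1" in
    failure)
      failures=$((failures + 1))
      # A failure in half-open (count still at/over threshold) reopens too
      if [ "$failures" -ge "$FAILURE_THRESHOLD" ]; then state="open"; fi
      ;;
    success)
      failures=0
      state="closed"
      ;;
  esac
}

timeout_elapsed() {  # invoked when the reset timeout expires
  if [ "$state" = "open" ]; then state="half-open"; fi
}

record_result failure; record_result failure; record_result failure
echo "after 3 failures: $state"   # open
timeout_elapsed
echo "after timeout:    $state"   # half-open
record_result success
echo "after success:    $state"   # closed
```

The takeaway for remediation: forcing a reset skips the half-open probe, which is why persistent failures should be diagnosed before any manual reset.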

HighErrorRate

Severity: Critical

Description: Error rate exceeds threshold (>5% for 5 minutes).

Impact: System functionality degraded, potential financial loss.

Response Steps:

  1. Check error logs:

    kubectl logs -n cryptofunk deployment/orchestrator --tail=200 | grep -i error
    kubectl logs -n cryptofunk deployment/api --tail=200 | grep -i error
  2. Identify error patterns:

    # Check Prometheus for error metrics
    rate(http_requests_total{status=~"5.."}[5m])
    rate(mcp_tool_errors_total[5m])
  3. Check dependencies:

    • Database connectivity
    • Redis availability
    • NATS messaging
    • External APIs (CoinGecko, exchange)
  4. Remediation:

    • If database: Check connections, run migrations
    • If external API: Check rate limits, verify credentials
    • If internal service: Check resource limits, restart if needed
  5. Communication:

    • Update #cryptofunk-critical with status
    • Notify stakeholders if trading is affected
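The 5% threshold in this alert is a ratio of two counter rates. The sketch below reproduces that arithmetic by hand with illustrative counter samples taken 5 minutes apart, mirroring what `rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])` evaluates:

```shell
# Illustrative error-rate check: counter values sampled 5 minutes apart.
total_start=10000; total_end=10400   # http_requests_total
err_start=120;     err_end=150       # http_requests_total{status=~"5.."}

errors=$((err_end - err_start))          # 30 failed requests
requests=$((total_end - total_start))    # 400 total requests
rate_pct=$((100 * errors / requests))    # integer math: 7 (true value 7.5%)

echo "error rate: ${rate_pct}%"
if [ "$rate_pct" -gt 5 ]; then
  echo "ALERT: above 5% threshold"
fi
```

Checking the ratio manually like this is useful when deciding whether a spike is sustained (alert-worthy) or a transient blip caused by a single dependency hiccup.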

TradingSessionFailure

Severity: Critical

Description: Trading session failed to start or crashed.

Impact: No trading activity, potential missed opportunities.

Response Steps:

  1. Check orchestrator status:

    kubectl logs -n cryptofunk deployment/orchestrator --tail=100
    kubectl get pods -n cryptofunk -l app.kubernetes.io/name=orchestrator
  2. Verify agent health:

    kubectl get pods -n cryptofunk -l app.kubernetes.io/component=trading-agent
    
    # Check agent_status table
    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT agent_name, status, last_heartbeat FROM agent_status ORDER BY last_heartbeat DESC;"
  3. Check MCP server connectivity:

    # Test market data server
    kubectl exec -it -n cryptofunk deployment/orchestrator -- \
      curl http://market-data-server:9201/health
    
    # Test other MCP servers (substitute each server's port, from its
    # Service definition, for 920X)
    for server in technical-indicators risk-analyzer order-executor; do
      echo "Checking $server..."
      kubectl exec -it -n cryptofunk deployment/orchestrator -- \
        curl http://${server}-server:920X/health
    done
  4. Restart trading session:

    # Gracefully restart orchestrator
    kubectl rollout restart deployment/orchestrator -n cryptofunk
    
    # Monitor startup
    kubectl logs -f -n cryptofunk deployment/orchestrator
  5. Verify recovery:

    • Check trading_sessions table for new session
    • Verify agent signals are being generated
    • Monitor order placement

PositionLimitExceeded

Severity: Warning (escalates to Critical if repeated)

Description: Position size or exposure exceeded configured limits.

Impact: Risk management violation, potential for excessive loss.

Response Steps:

  1. Check current positions:

    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT symbol, side, quantity, entry_price, current_price, unrealized_pnl
       FROM positions WHERE status = 'OPEN' ORDER BY unrealized_pnl;"
  2. Review risk limits:

    # Check risk analyzer configuration
    kubectl logs -n cryptofunk deployment/risk-analyzer-server | grep -i "limit"
  3. Manual intervention:

    • If position is truly too large, consider reducing exposure
    • Review strategy agent that opened the position
    • Update risk limits if appropriate
  4. Investigation:

    • Check why risk agent allowed the position
    • Review agent_signals table for decision reasoning
    • Verify risk calculation logic
  5. Prevention:

    • Adjust position sizing parameters
    • Implement stricter pre-trade risk checks
    • Review and update risk limits in configs/config.yaml
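The stricter pre-trade check suggested in step 5 can be sketched as a notional-exposure gate. The limit name, prices, and quantities below are hypothetical; the real limits belong in configs/config.yaml:

```shell
# Hypothetical pre-trade gate: reject an order whose notional value would
# push a symbol's total exposure past MAX_POSITION_USD. All figures are
# illustrative, not values from the actual configuration.
MAX_POSITION_USD=50000
current_exposure=42000   # open position notional for the symbol, USD
order_qty=2              # units in the new order
order_price=5000         # USD per unit

notional=$((order_qty * order_price))          # 10000
projected=$((current_exposure + notional))     # 52000

if [ "$projected" -gt "$MAX_POSITION_USD" ]; then
  echo "REJECT: projected exposure \$$projected exceeds limit \$$MAX_POSITION_USD"
else
  echo "ACCEPT: projected exposure \$$projected"
fi
```

Running this gate before order submission (rather than relying on the risk agent's post-hoc review) prevents the limit breach instead of merely alerting on it.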

MaxDrawdownExceeded

Severity: Critical

Description: Portfolio drawdown exceeded maximum threshold.

Impact: Circuit breaker triggered, trading halted.

Response Steps:

  1. Verify drawdown:

    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT session_id, total_pnl, max_drawdown, roi_percentage
       FROM trading_sessions ORDER BY created_at DESC LIMIT 1;"
  2. Review losing trades:

    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT symbol, side, quantity, entry_price, exit_price, realized_pnl
       FROM positions WHERE status = 'CLOSED' AND realized_pnl < 0
       ORDER BY realized_pnl LIMIT 20;"
  3. Assess market conditions:

    • Check for unusual market volatility
    • Review recent price movements
    • Identify if drawdown is strategy-specific or market-wide
  4. Decision point:

    • Continue trading: If drawdown is temporary and market conditions are normalizing
    • Pause trading: If market conditions are adverse or strategy is flawed
    • Modify strategy: Adjust parameters if specific strategy is underperforming
  5. Recovery:

    # If resuming trading, restart orchestrator
    kubectl rollout restart deployment/orchestrator -n cryptofunk
    
    # Monitor closely for continued losses
    kubectl logs -f -n cryptofunk deployment/risk-agent
  6. Post-mortem:

    • Analyze what led to drawdown
    • Review strategy performance metrics
    • Update risk parameters if needed
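When verifying the drawdown figure in step 1, it helps to remember how max drawdown is defined: the largest peak-to-trough decline, as a percentage of the running peak. A worked example with illustrative equity values:

```shell
# Max drawdown from an equity curve: track the running peak and record the
# deepest percentage decline from it. Values are illustrative.
equity="100000 104000 101000 98000 96000 99000"

awk -v vals="$equity" 'BEGIN {
  n = split(vals, v, " ")
  peak = v[1]; maxdd = 0
  for (i = 1; i <= n; i++) {
    if (v[i] > peak) peak = v[i]          # new high-water mark
    dd = (peak - v[i]) / peak * 100       # decline from peak, percent
    if (dd > maxdd) maxdd = dd
  }
  printf "max drawdown: %.2f%%\n", maxdd
}'
# prints: max drawdown: 7.69%
```

Note the drawdown is measured from the 104000 peak, not the 100000 starting equity, so a session can show a drawdown breach while its overall PnL is still near flat.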

AgentDown / AgentUnhealthy

Severity: Warning

Description: Trading agent is not responding or failing health checks.

Impact: Reduced decision-making capacity, potential missed trading opportunities.

Response Steps:

  1. Identify affected agent:

    # Check pod status
    kubectl get pods -n cryptofunk -l app.kubernetes.io/component=trading-agent
    
    # Check agent health in database
    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT agent_name, status, last_heartbeat, error_message
       FROM agent_status WHERE status != 'ACTIVE';"
  2. Check agent logs:

    # Replace <agent-name> with specific agent (e.g., technical-agent)
    kubectl logs -n cryptofunk deployment/<agent-name> --tail=100
  3. Common issues:

    • LLM API failure: Check Bifrost logs and API key validity
    • MCP server connectivity: Verify MCP servers are healthy
    • Database connection: Check connection pool and database health
    • Resource limits: Check if pod is OOM or CPU throttled
  4. Restart agent:

    kubectl rollout restart deployment/<agent-name> -n cryptofunk
    kubectl logs -f -n cryptofunk deployment/<agent-name>
  5. Verify recovery:

    • Check agent_status table shows ACTIVE
    • Verify agent is generating signals
    • Monitor for repeated failures
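The health check behind this alert amounts to heartbeat-staleness detection. A minimal sketch, with a hypothetical timeout and illustrative epoch timestamps (in practice the timestamps come from the agent_status query in step 1):

```shell
# Hypothetical staleness check: flag an agent whose last heartbeat is older
# than HEARTBEAT_TIMEOUT seconds. Epoch values are illustrative.
HEARTBEAT_TIMEOUT=60
now=1700000200
last_heartbeat=1700000050   # e.g. last_heartbeat from agent_status

age=$((now - last_heartbeat))
if [ "$age" -gt "$HEARTBEAT_TIMEOUT" ]; then
  echo "STALE: last heartbeat ${age}s ago (timeout ${HEARTBEAT_TIMEOUT}s)"
else
  echo "OK: last heartbeat ${age}s ago"
fi
```

A heartbeat that is stale but whose pod is Running usually points at an internal hang (LLM call, DB wait) rather than a crash, which is why checking the logs in step 2 before restarting matters.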

AgentHighLatency

Severity: Warning

Description: Agent decision-making taking longer than expected (>5 seconds).

Impact: Delayed trading decisions, potential missed opportunities.

Response Steps:

  1. Check agent performance metrics:

    # Query Prometheus
    histogram_quantile(0.95, rate(agent_decision_duration_seconds_bucket[5m]))
  2. Identify bottleneck:

    • LLM latency: Check Bifrost response times
    • MCP tool latency: Check tool call durations
    • Database queries: Check slow query log
    • Resource constraints: Check CPU/memory usage
  3. Investigate LLM performance:

    kubectl logs -n cryptofunk deployment/bifrost --tail=100 | grep -i latency
  4. Database optimization:

    # Check for slow queries
    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT query, mean_exec_time, calls
       FROM pg_stat_statements
       ORDER BY mean_exec_time DESC LIMIT 10;"
  5. Remediation:

    • If LLM: Switch to faster model or increase timeout
    • If database: Add indexes, optimize queries
    • If resource: Increase pod limits
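To interpret the `histogram_quantile` query in step 1, it helps to see the interpolation it performs: find the cumulative bucket containing the target rank, then interpolate linearly inside it. The bucket bounds and counts below are made up for illustration:

```shell
# How histogram_quantile(0.95, ...) estimates p95 from cumulative buckets
# (le = upper bound of each bucket). Counts are illustrative.
awk 'BEGIN {
  le[1] = 1;   c[1] = 40
  le[2] = 2.5; c[2] = 70
  le[3] = 5;   c[3] = 95
  le[4] = 10;  c[4] = 100
  rank = 0.95 * c[4]       # the 95th observation out of 100
  prev_le = 0; prev_c = 0
  for (i = 1; i <= 4; i++) {
    if (c[i] >= rank) {
      # linear interpolation within bucket i
      q = prev_le + (le[i] - prev_le) * (rank - prev_c) / (c[i] - prev_c)
      printf "p95 ~ %.2fs\n", q
      exit
    }
    prev_le = le[i]; prev_c = c[i]
  }
}'
```

Because the estimate is interpolated within bucket bounds, a p95 that sits exactly on a bucket edge (as here) can mask wide variation inside that bucket; widening the bucket layout improves the estimate.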

HighMemoryUsage / HighCPUUsage

Severity: Warning

Description: Pod memory or CPU usage exceeding 80%.

Impact: Performance degradation, potential pod eviction or OOM kill.

Response Steps:

  1. Identify affected pod:

    kubectl top pods -n cryptofunk --sort-by=memory
    kubectl top pods -n cryptofunk --sort-by=cpu
  2. Check resource limits:

    kubectl describe pod -n cryptofunk <pod-name> | grep -A 10 "Limits:"
  3. Investigate memory leak:

    # For Go applications, capture a heap profile (use -i, not -it, so the
    # TTY doesn't mangle the binary profile during redirection)
    kubectl exec -i -n cryptofunk <pod-name> -- \
      curl -s http://localhost:6060/debug/pprof/heap > heap.prof
    
    # Analyze with pprof
    go tool pprof heap.prof
  4. Remediation:

    • Short-term: Increase resource limits
    • Long-term: Fix memory leak or optimize code
  5. Increase resources:

    # Edit deployment to increase limits
    kubectl edit deployment <deployment-name> -n cryptofunk
    
    # Or use kubectl patch
    kubectl patch deployment <deployment-name> -n cryptofunk -p \
      '{"spec":{"template":{"spec":{"containers":[{"name":"<container>","resources":{"limits":{"memory":"2Gi"}}}]}}}}'

DiskSpaceWarning

Severity: Warning

Description: Persistent volume disk usage exceeding 80%.

Impact: Database or storage failures if disk fills completely.

Response Steps:

  1. Check disk usage:

    kubectl exec -it -n cryptofunk deployment/postgres -- df -h
  2. Identify large tables:

    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT schemaname, tablename,
              pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
       FROM pg_tables
       WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
       ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
       LIMIT 10;"
  3. Apply TimescaleDB compression:

    # Compress old data
    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT compress_chunk(i, if_not_compressed => true)
       FROM show_chunks('candlesticks', older_than => INTERVAL '7 days') i;"
  4. Archive old data:

    # Export old trading sessions (pg_dump has no row-level filter, so use
    # COPY with a query; -i avoids TTY mangling of the redirected output)
    kubectl exec -i -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "COPY (SELECT * FROM trading_sessions
             WHERE created_at < NOW() - INTERVAL '90 days')
       TO STDOUT WITH CSV HEADER;" > archive.csv
    
    # Delete archived data
    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "DELETE FROM trading_sessions WHERE created_at < NOW() - INTERVAL '90 days';"
  5. Expand volume:

    # Increase PVC size (requires storage class with allowVolumeExpansion: true)
    kubectl patch pvc postgres-pvc -n cryptofunk -p \
      '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'

VectorSearchHighLatency

Severity: Warning

Description: Vector search p95 latency exceeds 2 seconds for 5 minutes.

Impact: Slow explainability queries, degraded user experience in decision similarity searches.

Response Steps:

  1. Check current latency:

    # Query Prometheus for p95 latency by operation
    curl -G http://prometheus-service:9090/api/v1/query \
      --data-urlencode 'query=histogram_quantile(0.95, rate(cryptofunk_vector_search_latency_seconds_bucket[5m]))' | jq
  2. Identify slow operations:

    # Gauge recent decision volume (a rough proxy for search load)
    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT
         'semantic_search' as operation,
         COUNT(*) as total_searches,
         AVG(EXTRACT(EPOCH FROM (NOW() - created_at))) as avg_age_seconds
       FROM llm_decisions
       WHERE created_at > NOW() - INTERVAL '15 minutes';"
  3. Check pgvector index health:

    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT
         schemaname, tablename, indexname,
         pg_size_pretty(pg_relation_size(indexrelid)) as index_size,
         idx_scan, idx_tup_read, idx_tup_fetch
       FROM pg_stat_user_indexes
       WHERE indexname LIKE '%embedding%' OR indexname LIKE '%vector%';"
  4. Common causes and remediation:

    a. Large result sets:

    # Check if returning too many results
    kubectl logs -n cryptofunk deployment/api --tail=100 | grep -i "vector search"
    • Solution: Reduce the LIMIT in vector search queries (the expected default is 10)

    b. Missing or outdated index:

    # Rebuild IVFFlat index if needed
    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "REINDEX INDEX CONCURRENTLY llm_decisions_prompt_embedding_idx;"

    c. Database resource contention:

    # Check active connections and locks
    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT COUNT(*) as active_queries,
              COUNT(*) FILTER (WHERE state = 'active' AND wait_event_type = 'Lock') as waiting_on_locks
       FROM pg_stat_activity WHERE state = 'active';"
  5. Immediate mitigation:

    # If the rebuild doesn't help, tune the IVFFlat list count. DROP/CREATE
    # INDEX CONCURRENTLY avoid blocking writes but cannot run inside a
    # transaction block, so issue them as separate statements.
    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "DROP INDEX CONCURRENTLY IF EXISTS llm_decisions_prompt_embedding_idx;"
    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "CREATE INDEX CONCURRENTLY llm_decisions_prompt_embedding_idx ON llm_decisions
       USING ivfflat (prompt_embedding vector_cosine_ops)
       WITH (lists = 200);"  # Increased from 100 for better performance
  6. Monitor recovery:

    # Watch latency metrics
    watch -n 5 'curl -s -G http://prometheus-service:9090/api/v1/query \
      --data-urlencode "query=histogram_quantile(0.95, rate(cryptofunk_vector_search_latency_seconds_bucket[5m]))" | jq'

VectorSearchCriticalLatency

Severity: Critical

Description: Vector search p99 latency exceeds 5 seconds for 2 minutes.

Impact: Explainability features effectively broken, user-facing timeouts.

Response Steps:

  1. Immediate assessment:

    # Check if vector searches are timing out
    kubectl logs -n cryptofunk deployment/api --tail=50 | grep -i "timeout\|error"
  2. Check database health:

    # Check for long-running queries
    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT pid, now() - query_start as duration, state, query
       FROM pg_stat_activity
       WHERE query LIKE '%<=>%' OR query LIKE '%vector%'
       ORDER BY duration DESC LIMIT 5;"
  3. Emergency mitigation - Kill slow queries:

    # If queries are stuck, terminate them
    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT pg_terminate_backend(pid)
       FROM pg_stat_activity
       WHERE query LIKE '%<=>%'
       AND state = 'active'
       AND now() - query_start > interval '10 seconds';"
  4. Check disk I/O:

    # High disk I/O can cause slowdowns
    kubectl exec -it -n cryptofunk deployment/postgres -- iostat -x 2 5
  5. Temporary disable explainability if needed:

    # If blocking critical trading functionality, temporarily disable
    kubectl set env deployment/api -n cryptofunk ENABLE_VECTOR_SEARCH=false
    kubectl rollout status deployment/api -n cryptofunk
  6. Escalate:

    • This is a critical issue requiring immediate database expert involvement
    • Notify #cryptofunk-critical with details
    • Prepare for potential database restart or index rebuild

VectorSearchHighErrorRate

Severity: Warning

Description: Vector search error rate exceeds 5% for 5 minutes.

Impact: Partial failure of explainability features, some queries failing.

Response Steps:

  1. Check error patterns:

    # Review API logs for error details
    kubectl logs -n cryptofunk deployment/api --tail=200 | grep -i "vector\|error" | tail -20
  2. Common error types:

    a. pgvector extension issues:

    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT * FROM pg_extension WHERE extname = 'vector';"
    • If missing: Reinstall pgvector extension

    b. Invalid embedding dimensions:

    # Check for dimension mismatches
    kubectl logs -n cryptofunk deployment/api --tail=100 | grep -i "dimension\|1536"
    • Solution: Verify LLM embeddings are 1536 dimensions (OpenAI ada-002)

    c. Connection pool exhaustion:

    # Check if database connections are available
    curl http://api-service:8080/metrics | grep database_connections
  3. Verify data integrity:

    # Check for NULL embeddings
    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT COUNT(*) as null_embeddings
       FROM llm_decisions
       WHERE prompt_embedding IS NULL
       AND created_at > NOW() - INTERVAL '1 hour';"
  4. Remediation:

    # Restart API if connection issues
    kubectl rollout restart deployment/api -n cryptofunk
    
    # Monitor recovery
    kubectl logs -f -n cryptofunk deployment/api | grep -i "vector"
  5. Verify recovery:

    # Check error rate has dropped
    watch -n 10 'curl -s http://prometheus-service:9090/api/v1/query \
      --data-urlencode "query=rate(cryptofunk_vector_search_operations_total{status=\"error\"}[5m])" | jq'

VectorSearchCriticalErrorRate

Severity: Critical

Description: Vector search error rate exceeds 20% for 2 minutes.

Impact: Vector search effectively broken, explainability features unavailable.

Response Steps:

  1. Immediate triage:

    # Get recent error samples
    kubectl logs -n cryptofunk deployment/api --tail=50 | grep -A 5 "vector.*error"
  2. Check pgvector extension:

    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT extname, extversion FROM pg_extension WHERE extname = 'vector';"

    If not found:

    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c "CREATE EXTENSION IF NOT EXISTS vector;"
  3. Check index corruption:

    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT pg_relation_size('llm_decisions_prompt_embedding_idx') as index_size;"

    If 0 bytes, rebuild index:

    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "REINDEX INDEX CONCURRENTLY llm_decisions_prompt_embedding_idx;"
  4. Emergency response - Disable feature:

    # Prevent cascading failures
    kubectl set env deployment/api -n cryptofunk ENABLE_VECTOR_SEARCH=false
  5. Escalate to DBA:

    • Critical database issue requiring expert intervention
    • Prepare for potential restore from backup
    • Document all error messages and steps taken

VectorSearchNoOperations

Severity: Info

Description: No vector search operations detected for 30 minutes.

Impact: Potential issue with explainability features or simply no usage.

Response Steps:

  1. Verify if expected:

    • Check if users are actively using the system
    • Review time of day (may be normal during off-hours)
  2. Test vector search endpoint:

    # Test similarity search
    curl -X POST http://api-service:8080/api/v1/decisions/similar \
      -H "Content-Type: application/json" \
      -d '{"query": "test query", "limit": 5}'
  3. Check API routing:

    # Verify endpoint is registered
    kubectl logs -n cryptofunk deployment/api --tail=100 | grep -i "route\|endpoint"
  4. Verify database connectivity:

    kubectl exec -it -n cryptofunk deployment/api -- \
      wget -O- http://localhost:8080/health
  5. Review recent deployments:

    • Check if recent API deployment broke the feature
    • Review git log for changes to decision API handlers

DatabaseConnectionPoolExhausted

Severity: Warning

Description: More than 90% of the database connection pool is in use (9 or more of 10 connections).

Impact: New queries will be queued, including vector searches. Increased latency.

Response Steps:

  1. Check current connection usage:

    # From metrics
    curl http://api-service:8080/metrics | grep database_connections
    
    # From database
    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT COUNT(*) as total_connections,
              COUNT(*) FILTER (WHERE state = 'active') as active,
              COUNT(*) FILTER (WHERE state = 'idle') as idle
       FROM pg_stat_activity WHERE datname = 'cryptofunk';"
  2. Identify connection consumers:

    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT application_name, state, COUNT(*)
       FROM pg_stat_activity
       WHERE datname = 'cryptofunk'
       GROUP BY application_name, state
       ORDER BY COUNT(*) DESC;"
  3. Check for connection leaks:

    # Look for long-idle connections
    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT pid, usename, application_name, state,
              now() - state_change as idle_duration
       FROM pg_stat_activity
       WHERE state = 'idle' AND datname = 'cryptofunk'
       ORDER BY state_change LIMIT 10;"
  4. Immediate mitigation:

    # Terminate old idle connections (>10 minutes)
    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT pg_terminate_backend(pid)
       FROM pg_stat_activity
       WHERE state = 'idle'
       AND now() - state_change > interval '10 minutes'
       AND datname = 'cryptofunk';"
  5. Long-term fix:

    • Review connection pool configuration in internal/db/db.go (MaxConns: 10)
    • Consider increasing MaxConns if server has resources
    • Deploy PgBouncer as a server-side connection pooler
    • Fix connection leaks in application code
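When weighing an increase to MaxConns, check that the aggregate client-side pools still fit under the server's max_connections. A back-of-envelope sanity check with illustrative figures (the real MaxConns is in internal/db/db.go; max_connections is a PostgreSQL server setting):

```shell
# Pool-sizing sanity check: total pool slots across replicas must stay
# below the server's max_connections minus admin headroom. Figures are
# illustrative, not the deployment's actual settings.
max_connections=100   # PostgreSQL server limit
reserved=10           # headroom for superuser / maintenance sessions
replicas=4            # pods that each hold a connection pool
pool_per_replica=10   # MaxConns per pool

needed=$((replicas * pool_per_replica))
available=$((max_connections - reserved))
echo "pool slots needed: $needed, available: $available"
if [ "$needed" -gt "$available" ]; then
  echo "WARNING: pools can exhaust server connections"
fi
```

This is why raising MaxConns without counting replicas can convert a client-side "pool exhausted" warning into the far worse server-side "too many connections" failure.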

DatabaseConnectionPoolCritical

Severity: Critical

Description: Zero idle database connections available.

Impact: All new queries blocked, including vector searches. System effectively frozen.

Response Steps:

  1. Emergency assessment:

    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT pid, application_name, state,
              now() - query_start as duration,
              LEFT(query, 100) as query_preview
       FROM pg_stat_activity
       WHERE datname = 'cryptofunk'
       ORDER BY query_start LIMIT 20;"
  2. Identify blocking queries:

    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT blocked_locks.pid AS blocked_pid,
              blocking_locks.pid AS blocking_pid,
              blocked_activity.query AS blocked_query,
              blocking_activity.query AS blocking_query
       FROM pg_locks blocked_locks
       JOIN pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
       JOIN pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype
       JOIN pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
       WHERE NOT blocked_locks.granted AND blocking_locks.granted;"
  3. Emergency termination:

    # Kill blocking query (replace <PID> from above)
    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c "SELECT pg_terminate_backend(<PID>);"
  4. Restart services if needed:

    # Last resort - restart API to release connections
    kubectl rollout restart deployment/api -n cryptofunk
    kubectl rollout status deployment/api -n cryptofunk
  5. Post-incident:

    • Review query that caused blockage
    • Add query timeouts if missing
    • Consider implementing PgBouncer connection pooler
    • Increase MaxConns in internal/db/db.go if appropriate

VectorSearchSlowWithHighConnections

Severity: Warning

Description: P95 vector search latency >1s while database connections >8/10.

Impact: Contention causing slow queries, potential database bottleneck.

Response Steps:

  1. Confirm contention:

    # Check for lock waits
    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT wait_event_type, wait_event, COUNT(*)
       FROM pg_stat_activity
       WHERE state = 'active'
       GROUP BY wait_event_type, wait_event;"
  2. Identify resource bottleneck:

    # Check CPU and I/O wait
    kubectl exec -it -n cryptofunk deployment/postgres -- top -b -n 1
    kubectl exec -it -n cryptofunk deployment/postgres -- iostat -x 1 5
  3. Review query patterns:

    # Check for sequential scans on vector searches
    kubectl exec -it -n cryptofunk deployment/postgres -- \
      psql -U postgres -d cryptofunk -c \
      "SELECT schemaname, tablename, seq_scan, idx_scan
       FROM pg_stat_user_tables
       WHERE tablename = 'llm_decisions';"
  4. Remediation:

    • If high seq_scan: Rebuild indexes during maintenance
    • If high I/O: Consider read replicas for vector searches
    • If high CPU: Optimize query plans or scale database
  5. Monitor improvement:

    watch -n 5 'curl -s http://prometheus-service:9090/api/v1/query \
      --data-urlencode "query=cryptofunk_database_connections_active" | jq'

Alert Notification Channels

Slack Channels

  • #cryptofunk-alerts: All alerts (info, warning, critical)
  • #cryptofunk-critical: Critical alerts only (@channel mentions)
  • #cryptofunk-trading: Trading-specific alerts
  • #cryptofunk-agents: Agent health alerts
  • #cryptofunk-ops: System and circuit breaker alerts

Email

Configuration

Alert notification settings are configured in:

  • Docker Compose: docker-compose.yml environment variables
  • Kubernetes: deployments/k8s/base/secrets.yaml

Required environment variables:

SLACK_WEBHOOK_URL=https://hooks.slack.com/services/YOUR/WEBHOOK/URL
SMTP_HOST=smtp.gmail.com
SMTP_PORT=587
SMTP_FROM=cryptofunk-alerts@example.com
SMTP_USERNAME=your-username
SMTP_PASSWORD=your-password
ALERT_EMAIL_TO=team@example.com
CRITICAL_ALERT_EMAIL_TO=oncall@example.com

Escalation Procedures

Level 1: Team Response (0-30 minutes)

  • On-call engineer receives alert
  • Follows runbook procedures
  • Attempts immediate remediation
  • Updates #cryptofunk-critical with status

Level 2: Senior Engineer Escalation (30-60 minutes)

  • If issue not resolved within 30 minutes
  • Senior engineer joins incident response
  • Reviews remediation attempts
  • Coordinates with external dependencies (exchange, LLM provider)

Level 3: Emergency Escalation (60+ minutes)

  • If trading system down for >1 hour
  • Escalate to engineering lead and product manager
  • Consider emergency maintenance window
  • Prepare external communication for stakeholders

Escalation Contacts

Update with your team's contact information:

  • On-Call Engineer: [Slack: @oncall] [Phone: +1-XXX-XXX-XXXX]
  • Senior Engineer: [Slack: @senior-eng] [Phone: +1-XXX-XXX-XXXX]
  • Engineering Lead: [Slack: @eng-lead] [Phone: +1-XXX-XXX-XXXX]
  • Product Manager: [Slack: @pm] [Phone: +1-XXX-XXX-XXXX]

Useful Commands Reference

Check All Alert Status

curl http://alertmanager-service:9093/api/v2/alerts | jq

Silence an Alert

# Create silence for 1 hour
amtool silence add alertname=CircuitBreakerOpen --duration=1h \
  --comment="Investigating circuit breaker issue"

View Alert History

# Query Prometheus for alert history
curl -G http://prometheus-service:9090/api/v1/query \
  --data-urlencode 'query=ALERTS{alertname="CircuitBreakerOpen"}[1h]' | jq

Check Trading System Health

# Overall system health
kubectl get pods -n cryptofunk

# Database health
kubectl exec -it -n cryptofunk deployment/postgres -- \
  psql -U postgres -c "SELECT 1;"

# Recent trading activity
kubectl exec -it -n cryptofunk deployment/postgres -- \
  psql -U postgres -d cryptofunk -c \
  "SELECT COUNT(*) FROM orders WHERE created_at > NOW() - INTERVAL '1 hour';"

Additional Resources

For detailed architecture information, see:

  • docs/MCP_INTEGRATION.md
  • docs/LLM_AGENT_ARCHITECTURE.md
  • CLAUDE.md