Quick reference for common issues, fixes, and troubleshooting procedures for the Facebook Messenger AI Bot.
This runbook should be updated whenever new issues are discovered or procedures change. Keep it synchronized with the codebase and operational reality. See AGENTS.md and TESTING.md for instructions on when and how to update this runbook.
Symptoms:
- Agent responses are not based on the reference document
- Agent provides information outside the knowledge base
- High escalation rate due to out-of-scope queries
Diagnosis:
- Check agent confidence scores in Logfire logs:
# If using Logfire cloud: view logs in the Logfire dashboard
# If local: check structured logs for confidence scores
grep -i "confidence" logs/*.log | grep -E "confidence=[0-9.]+"
- Review agent service logs for low confidence patterns:
grep "low confidence\|escalat\|requires_escalation" logs/*.log
- Check reference document is properly loaded:
uv run python -m src.cli.setup_cli verify # If CLI has verify command
Fix:
- Review reference document content for completeness
- Update reference document if source website changed:
  - Re-run CLI setup:
    uv run python -m src.cli.setup_cli setup
  - Or manually update via Supabase dashboard
- Verify agent system prompt in src/services/agent_service.py (line 49)
- Adjust confidence threshold (currently 0.7) in agent logic if needed
- Test with evaluation set:
uv run pytest tests/e2e/test_webhook_message_flow.py -v
Symptoms:
- Health check returns non-200 status
- Response latency increases significantly
- Fallback to OpenAI is being used frequently
Diagnosis:
- Check Copilot SDK availability:
curl -X GET http://localhost:5909/health -v
- Check if Copilot CLI is running:
ps aux | grep copilot
- Review Logfire logs for health check failures:
# Search for Copilot health check logs
grep -i "copilot.*health\|copilot.*available\|copilot.*unavailable" logs/*.log
# Or check Logfire dashboard for structured logs with correlation IDs
- Check Copilot service logs in src/services/copilot_service.py initialization
Fix:
- Verify GitHub Copilot CLI is installed:
github-copilot --version
- Start Copilot CLI if not running:
github-copilot-cli start
- Check COPILOT_CLI_HOST environment variable:
echo $COPILOT_CLI_HOST
- Verify port 5909 is not blocked:
lsof -i :5909
- Restart the FastAPI application if health check still fails
- Monitor fallback rate - if > 5%, investigate Copilot CLI stability
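The 5% fallback check above can be estimated from local logs. The following is a minimal sketch, assuming that fallback events and completed responses are logged with the phrases used elsewhere in this runbook ("using OpenAI fallback", "Agent response generated"); the function name `fallback_rate` and the sample lines are hypothetical.

```python
import re

# Assumed log phrases; adjust them to whatever the services actually emit.
FALLBACK_RE = re.compile(r"fallback.*openai|using.*openai.*fallback", re.IGNORECASE)
RESPONSE_RE = re.compile(r"agent response generated", re.IGNORECASE)


def fallback_rate(lines):
    """Return fallback events divided by generated responses (0.0 if none)."""
    fallbacks = sum(1 for line in lines if FALLBACK_RE.search(line))
    responses = sum(1 for line in lines if RESPONSE_RE.search(line))
    return fallbacks / responses if responses else 0.0


# Hypothetical sample lines for illustration only.
sample = [
    "INFO Agent response generated correlation_id=a1",
    "INFO Copilot SDK unavailable, using OpenAI fallback correlation_id=a2",
    "INFO Agent response generated correlation_id=a2",
    "INFO Agent response generated correlation_id=a3",
    "INFO Agent response generated correlation_id=a4",
]
rate = fallback_rate(sample)
print(f"fallback rate: {rate:.0%}, investigate: {rate > 0.05}")
```

To run it against real logs, pass the lines of the relevant file, for example `fallback_rate(open(path))`.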
Symptoms:
- Messages not being sent to users
- HTTP 429 (Too Many Requests) errors
- HTTP 401 (Unauthorized) errors
- HTTP 403 (Forbidden) errors
Diagnosis:
- Check Facebook API response in logs:
grep "facebook_service\|send_message" logs/*.log | grep -i "error\|429\|401\|403"
- Verify Facebook page access token:
echo $FACEBOOK_PAGE_ACCESS_TOKEN | wc -c # Should be ~100+ characters
- Check token expiration with Facebook Graph API:
curl "https://graph.facebook.com/v18.0/debug_token?input_token={token}&access_token={token}"
- Check rate limit status:
grep "rate_limit\|429" logs/*.log
Fix:
- For 401/403 errors (invalid or expired token):
  - Generate new page access token from Facebook App Dashboard
  - Update FACEBOOK_PAGE_ACCESS_TOKEN in Railway environment
  - Restart application
- For 429 errors (rate limiting):
  - Implement exponential backoff (already done in src/services/facebook_service.py)
  - Reduce concurrent message sends if applicable
  - Check if another process is sending messages on same page
  - Contact Facebook support if limit is too restrictive
- For other HTTP errors:
  - Check Facebook API documentation for error codes
  - Verify page ID and recipient IDs are correct
  - Check message content (too long, invalid format)
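For reference when reviewing the 429 handling, exponential backoff can be sketched as below. This is an illustration under stated assumptions, not the code in src/services/facebook_service.py: the `RateLimited` exception, the `send_with_backoff` helper, and the 1-second base delay are all hypothetical, and the retry cap of 3 mirrors the limit mentioned in this runbook.

```python
import time


class RateLimited(Exception):
    """Hypothetical error raised by the caller's send function on HTTP 429."""


def send_with_backoff(send, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call send(); on RateLimited wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_retries + 1):
        try:
            return send()
        except RateLimited:
            if attempt == max_retries:
                raise  # retries exhausted, surface the error
            sleep(base_delay * 2 ** attempt)


# Demo with a fake sender that is rate limited twice, then succeeds.
calls = {"n": 0}


def fake_send():
    calls["n"] += 1
    if calls["n"] <= 2:
        raise RateLimited()
    return "sent"


waits = []  # record the delays instead of actually sleeping
result = send_with_backoff(fake_send, sleep=waits.append)
print(result, waits)
```

Injecting `sleep` keeps the sketch testable without real waiting; production code would typically also add jitter to the delay.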
Symptoms:
- "Failed to connect to database" errors
- Queries timing out
- Connection pool exhausted errors
- 500 Internal Server Error on webhook requests
Diagnosis:
- Check Supabase connection status:
curl -X POST https://<supabase-url>/rest/v1/rpc/health_check \
  -H "apikey: $SUPABASE_SERVICE_KEY"
- Verify Supabase credentials:
echo "URL: $SUPABASE_URL"
echo "KEY length: ${#SUPABASE_SERVICE_KEY}"
- Check database connection logs:
grep "database\|connection\|timeout" logs/*.log | grep -i error
- Verify network connectivity:
ping <supabase-url>
Fix:
- Verify SUPABASE_URL and SUPABASE_SERVICE_KEY in environment:
# In Railway dashboard: Settings → Environment
- Check Supabase project status in Supabase dashboard
- Restart application to refresh connection pool:
# In Railway: Deployment → Restart
- Check if connection pool is exhausted:
  - Review repository.py connection pooling settings
  - May need to increase pool size if high concurrency
  - Check for long-running queries blocking connections
- Review Supabase logs in Supabase dashboard for errors
Symptoms:
- Escalation rate > 20% (alert threshold)
- Agent confidence consistently < 0.7
- Many messages being routed to human review
Diagnosis:
- Check escalation rate in logs:
grep "requires_escalation" logs/*.log | grep true | wc -l
- Review escalation reasons:
grep "escalation_reason" logs/*.log
- Analyze confidence score distribution:
grep "confidence" logs/*.log | cut -d= -f2 | sort -n | tail -20
- Check if reference document is empty or corrupted:
# Via Supabase: SELECT * FROM reference_documents LIMIT 1;
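The grep count above gives the number of escalations but not the rate. A minimal sketch for computing the rate against the 20% alert threshold follows, assuming log lines carry a `requires_escalation=true` or `requires_escalation=false` field; the function name and sample lines are hypothetical.

```python
import re

# Assumed field format: requires_escalation=true / requires_escalation=false
ESCALATION_RE = re.compile(r"requires_escalation=(true|false)", re.IGNORECASE)


def escalation_rate(lines):
    """Return escalated decisions divided by all logged decisions."""
    total = 0
    escalated = 0
    for line in lines:
        match = ESCALATION_RE.search(line)
        if match:
            total += 1
            if match.group(1).lower() == "true":
                escalated += 1
    return escalated / total if total else 0.0


# Hypothetical sample lines for illustration only.
sample = [
    "confidence=0.91 requires_escalation=false",
    "confidence=0.42 requires_escalation=true",
    "confidence=0.88 requires_escalation=false",
    "confidence=0.55 requires_escalation=true",
    "confidence=0.79 requires_escalation=false",
]
rate = escalation_rate(sample)
print(f"escalation rate: {rate:.0%}, above alert threshold: {rate > 0.20}")
```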
Fix:
- If reference document is incomplete:
  - Run setup again:
    uv run python -m src.cli.setup_cli setup
  - Verify website is still accessible and content is relevant
  - Check for website structure changes
- If confidence threshold is too strict:
  - Review current threshold (0.7) in src/services/agent_service.py (line 57)
  - Consider lowering if threshold is unreasonably high
  - Document any threshold changes in RUNBOOK.md
- If question types are out of scope:
  - Update agent system prompt with new use cases
  - Add examples to the agent context
  - Review GUARDRAILS.md for escalation rules
- Monitor after changes:
uv run pytest tests/integration/test_agent_integration.py -v
Symptoms:
- Webhook returns 200 but message not processed
- Agent service times out
- Message appears in logs but no response sent
Diagnosis:
- Check webhook processing logs:
grep "POST /webhook\|process_message" logs/*.log
- Verify message payload format:
grep "MessengerWebhookPayload" logs/*.log | grep -i error
- Check for async task completion:
grep "background_task\|BackgroundTasks" logs/*.log
- Check agent service performance:
grep "Agent response latency\|timeout" logs/*.log
Fix:
- Verify webhook payload is valid JSON:
  - Check Facebook webhook documentation
  - Test with manual webhook call:
    uv run pytest tests/e2e/test_webhook_verification.py
- Check agent service latency:
  - If > 2s, investigate Copilot SDK response time
  - May need to increase timeouts in config
- Verify background tasks are completing:
  - Check FastAPI app state: app.state.background_tasks
  - Ensure no unhandled exceptions in task
- Review message processing flow in src/api/webhook.py
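When a webhook returns 200 but nothing is processed, it can help to check a captured payload by hand. The sketch below assumes the payload shape used in this runbook's test webhook call (`object`, `entry`, `messaging`, `sender.id`, `message.text`); `extract_messages` is a hypothetical helper, not the project's MessengerWebhookPayload model.

```python
import json


def extract_messages(raw_body):
    """Return a list of (sender_id, text) pairs from a webhook POST body."""
    payload = json.loads(raw_body)  # raises json.JSONDecodeError on invalid JSON
    if payload.get("object") != "page":
        return []
    pairs = []
    for entry in payload.get("entry", []):
        for event in entry.get("messaging", []):
            sender_id = event.get("sender", {}).get("id")
            text = event.get("message", {}).get("text")
            # Events without text (for example delivery receipts) are skipped.
            if sender_id and text:
                pairs.append((sender_id, text))
    return pairs


body = (
    '{"object":"page","entry":[{"messaging":'
    '[{"sender":{"id":"test"},"message":{"text":"hello"}}]}]}'
)
print(extract_messages(body))
```

An empty result for a body that looks valid usually points at an event type without `message.text`, which is worth ruling out before investigating the agent service.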
Symptoms:
- Facebook webhook setup fails during verification
- 403 Forbidden on GET /webhook
- Facebook shows "Webhook URL couldn't be validated"
Diagnosis:
- Check verify token:
echo $FACEBOOK_VERIFY_TOKEN
- Check webhook endpoint is responding:
curl -X GET "http://localhost:8000/webhook?hub.mode=subscribe&hub.challenge=test&hub.verify_token=$FACEBOOK_VERIFY_TOKEN"
- Check webhook logs:
grep "verify_webhook\|webhook verification" logs/*.log
Fix:
- Verify FACEBOOK_VERIFY_TOKEN matches in all of:
  - .env file locally
  - Railway environment variables
  - Facebook App Webhook Settings
- Ensure webhook URL is publicly accessible:
  - Railway URL should be: https://<railway-url>/webhook
  - Test with curl from another machine
- Check webhook endpoint code in src/api/webhook.py:
  - Verify verification logic is correct
  - Ensure token comparison is exact (case-sensitive)
- Test webhook verification:
uv run pytest tests/e2e/test_webhook_verification.py -v
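When reviewing the verification logic, it may help to compare it against a reference shape. The sketch below is an assumption about what such logic typically looks like, not the code in src/api/webhook.py: the `verify_webhook` function and its return convention are hypothetical. It echoes `hub.challenge` on success and returns 403 otherwise, using an exact, case-sensitive token comparison.

```python
import hmac


def verify_webhook(params, expected_token):
    """Return (status_code, body) for a GET /webhook verification request."""
    mode = params.get("hub.mode")
    token = params.get("hub.verify_token", "")
    challenge = params.get("hub.challenge", "")
    # compare_digest is an exact, case-sensitive, constant-time comparison.
    token_ok = hmac.compare_digest(token.encode(), expected_token.encode())
    if mode == "subscribe" and token_ok:
        return 200, challenge
    return 403, "Forbidden"


ok = verify_webhook(
    {"hub.mode": "subscribe", "hub.verify_token": "Secret", "hub.challenge": "test123"},
    "Secret",
)
wrong_case = verify_webhook(
    {"hub.mode": "subscribe", "hub.verify_token": "secret", "hub.challenge": "test123"},
    "Secret",
)
print(ok, wrong_case)
```

A token that differs only in case or has trailing whitespace fails this check, which is a common cause of the "Webhook URL couldn't be validated" message.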
# Check application health
curl http://localhost:8000/health
# Check Copilot SDK health
curl http://localhost:5909/health
# Check Supabase connection
curl -X POST https://<supabase-url>/rest/v1/rpc/health_check \
-H "apikey: $SUPABASE_SERVICE_KEY"
# Check Facebook API connectivity
curl "https://graph.facebook.com/v18.0/me?access_token=$FACEBOOK_PAGE_ACCESS_TOKEN"
Local Development (Console Logs):
# View application logs with debug level
LOG_LEVEL=DEBUG uv run uvicorn src.main:app --reload
# Search logs for errors
grep -i error logs/*.log
# Search logs for specific service
grep "CopilotService\|FacebookService\|AgentService" logs/*.log
# Filter by log level
grep "\[ERROR\]\|\[WARNING\]" logs/*.log
# Real-time log monitoring
tail -f logs/*.log
Logfire Structured Logs:
# Logfire automatically instruments FastAPI, Pydantic, and services
# All logs are structured with correlation IDs for request tracing
# Search for correlation IDs to trace a request
grep "correlation_id" logs/*.log
# View request/response traces
grep "POST /webhook\|GET /health" logs/*.log
# View agent execution traces
grep "Processing agent response\|Agent response generated" logs/*.log
# View database operation timing
grep "database_query\|query_duration" logs/*.log
# If using Logfire cloud (with LOGFIRE_TOKEN set):
# - View logs in Logfire dashboard
# - Use correlation IDs to trace complete request flows
# - Filter by service, log level, or time range
# Connect to Supabase (via Supabase CLI)
supabase db pull
# Query bot configurations
# Via Supabase dashboard: SELECT * FROM bot_configurations;
# Query recent messages
# Via Supabase dashboard: SELECT * FROM message_history ORDER BY created_at DESC LIMIT 100;
# Query reference documents
# Via Supabase dashboard: SELECT id, bot_config_id, LENGTH(content) as content_size FROM reference_documents;
Test conversations from "Test the bot" (in-flow) or uv run python -m src.cli.setup_cli test are stored in Supabase.
- Tables: test_sessions (one per REPL run: reference_doc_id, source_url, tone) and test_messages (each user/bot exchange).
- Session ID: When a test REPL starts, the CLI prints "Session ID: <uuid> — view in Supabase: test_sessions / test_messages". Use that UUID to filter.
- View in Supabase: Open Table Editor (or SQL) → test_sessions for config, test_messages for history. Filter test_messages by test_session_id = the echoed session ID to see the current run.
- If Supabase is unavailable during a test run, the CLI warns and the REPL continues without persisting.
# Run agent with debug output
LOG_LEVEL=DEBUG uv run uvicorn src.main:app --reload
# Test agent with sample message
uv run python -c "
from src.services.agent_service import MessengerAgentService
from src.services.copilot_service import CopilotService
from src.models.agent_models import AgentContext
import asyncio
async def test():
    copilot = CopilotService('http://localhost:5909')
    agent = MessengerAgentService(copilot)
    context = AgentContext(
        bot_config_id='test-123',
        reference_doc='# Test\nThis is a test document.',
        tone='professional',
        recent_messages=[]
    )
    response = await agent.respond(context, 'What is this about?')
    print(f'Response: {response}')

asyncio.run(test())
"
# Run evaluation tests
uv run pytest tests/unit/test_agent_service.py -v --hypothesis-show-statistics
# View agent execution traces in Logfire
grep -i "processing agent response\|agent response generated\|confidence\|escalation" logs/*.log
# Trace a specific request by correlation ID
# (Get correlation_id from webhook logs, then search)
grep "correlation_id=<id>" logs/*.log
# Check Copilot SDK status
uv run python -c "
from src.services.copilot_service import CopilotService
import asyncio
async def check():
    copilot = CopilotService('http://localhost:5909')
    available = await copilot.is_available()
    print(f'Copilot available: {available}')

asyncio.run(check())
"
# Test Copilot fallback to OpenAI
COPILOT_ENABLED=false uv run uvicorn src.main:app
# Monitor Copilot response times in Logfire logs
grep -i "copilot.*response\|copilot.*timing\|copilot.*duration" logs/*.log
# Check for fallback events
grep -i "fallback.*openai\|using.*openai.*fallback" logs/*.log
# View Copilot health check logs
grep -i "copilot.*health\|copilot.*available" logs/*.log
# Test webhook by sending test message
uv run pytest tests/e2e/test_webhook_message_flow.py -v
# Verify page token is valid
curl "https://graph.facebook.com/v18.0/debug_token?input_token=$FACEBOOK_PAGE_ACCESS_TOKEN&access_token=$FACEBOOK_PAGE_ACCESS_TOKEN"
# Send test message to Facebook
# The access token goes in the query string; a second -d would be joined to the JSON body with "&" and corrupt it
curl -X POST "https://graph.facebook.com/v18.0/me/messages?access_token=$FACEBOOK_PAGE_ACCESS_TOKEN" \
  -H "Content-Type: application/json" \
  -d "{
    \"recipient\": {\"id\": \"<user_id>\"},
    \"message\": {\"text\": \"Test message\"}
  }"
# Run all tests
uv run pytest
# Run with coverage report
uv run pytest --cov=src --cov-report=term-missing
# Run specific test category
uv run pytest tests/unit/ -v
uv run pytest tests/integration/ -v
uv run pytest tests/e2e/ -v
# Run Hypothesis tests with statistics
uv run pytest tests/unit/test_hypothesis.py -v --hypothesis-show-statistics
# Run stateful tests
uv run pytest tests/stateful/ -v
| Alert | Threshold | Action | Monitoring |
|---|---|---|---|
| Response Latency (p95) | > 2 seconds | Investigate Copilot SDK or database | Check Logfire traces for agent_response_latency_seconds or request timing |
| Error Rate | > 2% for 5 min | Page on-call engineer | Monitor HTTP 5xx errors in Logfire logs or Sentry |
| Escalation Rate | > 20% | Review agent prompt & reference doc | Count requires_escalation=true in Logfire logs |
| Copilot SDK Availability | < 99% uptime | Restart Copilot CLI or fallback to OpenAI | Monitor health check frequency in Logfire logs |
| Copilot Fallback Rate | > 5% | Investigate Copilot SDK stability | Count OpenAI fallback events in Logfire logs |
| Facebook API Error Rate | > 10% (any HTTP error) | Check token/rate limits | Monitor HTTP errors in Logfire facebook_service logs |
| Database Connection Failures | > 3 consecutive | Restart app or check Supabase | Monitor connection timeouts in Logfire logs |
| Message Processing Timeout | > 30 seconds | Investigate bottleneck (Copilot, DB, FB API) | Check task processing latency in Logfire traces |
| Logfire Logging Failures | Any | Check Logfire configuration | Monitor for missing correlation IDs or structured log format issues |
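To check the p95 latency threshold from the table against a set of measured response times, a nearest-rank percentile is usually sufficient. This is a minimal sketch; the `p95` helper and the sample latencies are hypothetical, and Logfire may compute percentiles with a different interpolation method.

```python
def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of latencies (seconds)."""
    ordered = sorted(samples)
    # Integer ceiling of 0.95 * n, which avoids floating-point rounding issues.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical latency samples in seconds.
latencies = [0.4, 0.6, 0.5, 0.7, 0.9, 1.1, 0.8, 0.6, 0.5, 3.2]
value = p95(latencies)
print(f"p95={value}s, breaches 2 second threshold: {value > 2.0}")
```

With only ten samples a single slow request dominates the p95, so the threshold is more meaningful over a window with many requests.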
Health Check:
curl http://localhost:5909/health
Common Issues:
- Service unavailable: Check Copilot CLI is running, port 5909 is accessible
- Timeout errors: Increase timeout (default 2 seconds) in is_available() method
- Fallback behavior: Automatically uses OpenAI when unavailable, logged with logfire.info()
Fallback Logic:
- If is_available() returns False, agent uses OpenAI instead of Copilot
- Fallback is logged with structured Logfire logs, monitor with:
grep -i "fallback.*openai\|using.*openai.*fallback" logs/*.log
- Logfire traces include timing, health check status, and fallback events with correlation IDs
Logfire Logging:
- Health check timing and availability status logged
- API call success/failure with response times
- Fallback events include context (reason, timing, response)
- All logs include correlation IDs for request tracing
Common Issues:
- Low confidence scores: Reference document incomplete or question out of scope
- Off-topic responses: Agent system prompt needs examples of in-scope queries
- Response too long: Responses are truncated to 300 characters (Facebook Messenger limit)
- Escalation loops: Check escalation reason in logs
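To reproduce the "Response too long" behavior outside the service, truncation to the 300-character limit stated above can be sketched as follows. The `truncate_response` helper and the "..." suffix are assumptions; the actual service may cut text differently.

```python
MAX_RESPONSE_CHARS = 300  # limit stated in this runbook


def truncate_response(text, limit=MAX_RESPONSE_CHARS, suffix="..."):
    """Return text unchanged if it fits, else cut it so the result fits limit."""
    if len(text) <= limit:
        return text
    # Reserve room for the suffix so the final length never exceeds the limit.
    return text[: limit - len(suffix)].rstrip() + suffix


long_reply = "a" * 500
short_reply = "Our store opens at 9am."
print(len(truncate_response(long_reply)), truncate_response(short_reply))
```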
Debugging:
- Add debug output: LOG_LEVEL=DEBUG in environment
- Review agent logs:
grep -i "messenger.*agent\|processing agent response\|agent response generated" logs/*.log
- Check confidence threshold: Currently 0.7 in respond() method
- View structured Logfire logs for confidence scores, escalation decisions, and timing
- Trace complete request flow using correlation IDs from Logfire logs
Common Issues:
- 401 Unauthorized: Token expired or invalid
- 429 Rate Limited: Too many requests in short time
- 400 Bad Request: Message format invalid or recipient ID wrong
Debugging:
# Check token validity
curl "https://graph.facebook.com/v18.0/me?access_token=$FACEBOOK_PAGE_ACCESS_TOKEN"
# Test message send
curl -X POST "https://graph.facebook.com/v18.0/me/messages?access_token=$FACEBOOK_PAGE_ACCESS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"recipient":{"id":"<user_id>"},"message":{"text":"test"}}'
Retry Logic:
- Implements exponential backoff (max 3 retries)
- Check retry logs:
grep "retry\|backoff" logs/*.log
Common Issues:
- Timeout errors: Website too slow, network issues
- Parse errors: HTML structure changed or unexpected format
- Empty content: Website blocks scraping or content not found
Debugging:
# Test scraping manually
uv run python -c "
from src.services.scraper import scrape_website
import asyncio
async def test():
    chunks = await scrape_website('https://example.com')
    print(f'Chunks: {len(chunks)}')

asyncio.run(test())
"
# Check scraper logs
grep "ScraperService\|scrape_website\|chunk" logs/*.log
Common Issues:
- Connection pool exhausted: Too many concurrent requests
- Query timeouts: Large dataset or slow query
- Foreign key constraints: Data integrity issues
Debugging:
# Check active connections in Supabase dashboard
# Settings → Database → Connections
# Monitor query performance in Logfire logs
grep -i "database.*query\|query.*duration\|database.*timing" logs/*.log
# Check for connection pool issues
grep -i "pool.*exhausted\|connection.*failed\|database.*error" logs/*.log
# View database operation traces with correlation IDs
grep "correlation_id" logs/*.log | grep -i "database\|repository"
Logfire Logging:
- Database operation timing logged for all queries
- Query success/failure rates tracked
- Bot configuration lookups include timing
- Message history storage metrics logged
- All operations include correlation IDs for request tracing
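Since every operation carries a correlation ID, local log lines can be grouped per request to see the full flow in one place. The sketch below assumes the `correlation_id=<id>` key/value format used in the grep commands in this runbook; the helper name and sample lines are hypothetical.

```python
import re
from collections import defaultdict

# Assumed format: correlation_id=<id>, where the id is word characters or hyphens.
CID_RE = re.compile(r"correlation_id=([\w-]+)")


def group_by_correlation_id(lines):
    """Map each correlation ID to its log lines, preserving line order."""
    groups = defaultdict(list)
    for line in lines:
        match = CID_RE.search(line)
        if match:
            groups[match.group(1)].append(line)
    return dict(groups)


# Hypothetical sample lines for illustration only.
sample = [
    "POST /webhook correlation_id=req-1",
    "database_query duration=12ms correlation_id=req-1",
    "POST /webhook correlation_id=req-2",
    "Agent response generated correlation_id=req-1",
    "startup complete",
]
for cid, entries in group_by_correlation_id(sample).items():
    print(cid, len(entries))
```

A request whose group stops before a response line is a good starting point when a message appears in the logs but no reply was sent.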
Configuration:
- Centralized logging setup in src/logging_config.py
- Environment-aware: Console formatting for local, JSON for production
- FastAPI and Pydantic instrumentation enabled automatically
- Optional cloud logging with LOGFIRE_TOKEN environment variable
Common Issues:
- Logs not appearing: Check LOG_LEVEL environment variable (default: INFO)
- Missing correlation IDs: Ensure CorrelationIDMiddleware is first middleware
- PII in logs: Verify logfire_enable_pii_masking=True in settings
- Cloud logging not working: Check LOGFIRE_TOKEN is set correctly
Debugging:
# Check Logfire configuration
uv run python -c "
from src.config import get_settings
settings = get_settings()
print(f'Log Level: {settings.log_level}')
print(f'PII Masking: {settings.logfire_enable_pii_masking}')
print(f'Request Logging: {settings.logfire_enable_request_logging}')
print(f'Logfire Token Set: {bool(settings.logfire_token)}')
"
# Verify Logfire initialization
grep -i "logfire.*configure\|logfire.*initialized" logs/*.log
# Check for correlation IDs in logs
grep "correlation_id" logs/*.log | head -5
# View structured log format
tail -20 logs/*.log | grep -v "^$"
Logfire Features:
- Request Tracing: Automatic FastAPI request/response tracing with timing
- Pydantic Validation: Model validation errors logged automatically
- PydanticAI Tracing: Agent execution and decision logging
- Correlation IDs: Request tracing across all services
- Structured Logs: JSON format in production for log aggregation
- PII Masking: Automatic masking of sensitive data (tokens, PII)
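To illustrate what token masking does to a log line, here is a minimal regex-based sketch. It is not the project's masking implementation: the `mask_secrets` helper and the pattern (keys containing "token", "key", or "secret") are assumptions chosen for the example.

```python
import re

# Hypothetical pattern: key=value or key: value pairs whose key looks secret.
SECRET_RE = re.compile(
    r"(?i)\b([\w-]*(?:token|key|secret)[\w-]*)(\s*[=:]\s*)(\S+)"
)


def mask_secrets(line):
    """Replace the value of any secret-looking key with ***."""
    return SECRET_RE.sub(lambda m: f"{m.group(1)}{m.group(2)}***", line)


print(mask_secrets("access_token=EAAB123 user=42"))
```

Running unmasked sample lines through a check like this is a quick way to spot fields that the configured masking might miss.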
# Required variables (will fail without these)
✓ FACEBOOK_PAGE_ACCESS_TOKEN # Long-lived page token
✓ FACEBOOK_VERIFY_TOKEN # Custom webhook token
✓ SUPABASE_URL # Supabase project URL
✓ SUPABASE_SERVICE_KEY # Service role key
# Optional but recommended
✓ COPILOT_CLI_HOST # Default: http://localhost:5909
✓ COPILOT_ENABLED # Default: True
✓ OPENAI_API_KEY # Fallback LLM
✓ LOGFIRE_TOKEN # Logfire cloud logging (optional, enables cloud dashboard)
✓ LOG_LEVEL # Default: INFO (DEBUG, INFO, WARNING, ERROR, CRITICAL)
✓ LOGFIRE_ENABLE_PII_MASKING # Default: True (mask sensitive data in logs)
✓ LOGFIRE_ENABLE_REQUEST_LOGGING # Default: True (HTTP request/response logging)
✓ SENTRY_DSN # Error tracking (optional)
# Check all are set
env | grep -i facebook
env | grep -i supabase
env | grep -i copilot
env | grep -i openai
| Error | Cause | Solution |
|---|---|---|
| "Failed to connect to Supabase" | Invalid credentials or network issue | Verify SUPABASE_URL and SUPABASE_SERVICE_KEY |
| "Copilot SDK unavailable, using OpenAI fallback" | Copilot CLI not running or unreachable | Start GitHub Copilot CLI: github-copilot-cli start |
| "Facebook API returned 401" | Invalid or expired token | Generate new page access token in Facebook App |
| "Webhook verification failed" | Token mismatch | Verify FACEBOOK_VERIFY_TOKEN matches Facebook settings |
| "Agent confidence too low" | Question out of scope or poor reference doc | Review reference document and update if needed |
| "Database connection timeout" | Connection pool exhausted or network issue | Restart app or check Supabase status |
| "Logfire configuration failed" | Invalid Logfire token or network issue | Verify LOGFIRE_TOKEN (if using cloud) or check local logging setup |
| "Missing correlation ID in logs" | CorrelationIDMiddleware not properly configured | Ensure middleware is added first in src/main.py |
| "PII detected in logs" | PII masking disabled or misconfigured | Set LOGFIRE_ENABLE_PII_MASKING=True and verify mask_pii() usage |
- Pre-deployment checklist:
# Format code
uv run ruff format .
# Run linter
uv run ruff check .
# Run all tests
uv run pytest --cov=src --cov-report=term-missing
- Post-deployment (on Railway):
# Health check
curl https://<railway-url>/health
# Verify webhook endpoint (quote the URL so the shell does not interpret "&")
curl "https://<railway-url>/webhook?hub.mode=subscribe&hub.verify_token=test&hub.challenge=test123"
- Environment variable verification:
  - Check Railway Environment variables are set
  - Verify no secrets in logs: grep -i token logs/*.log (should show masked tokens if PII masking enabled)
  - Verify Logfire configuration: Check LOG_LEVEL, LOGFIRE_TOKEN (if using cloud), LOGFIRE_ENABLE_PII_MASKING
- Facebook App Setup:
  - Go to Facebook App → Messenger → Settings
  - Webhook URL: https://<railway-url>/webhook
  - Verify Token: Matches FACEBOOK_VERIFY_TOKEN
  - Subscribe to: messages, messaging_postbacks
- Test Webhook:
# Local testing
curl -X POST http://localhost:8000/webhook \
  -H "Content-Type: application/json" \
  -d '{"object":"page","entry":[{"messaging":[{"sender":{"id":"test"},"message":{"text":"hello"}}]}]}'
- Verify in Facebook:
  - Check webhook status in App Dashboard
  - Should show "Verified" status
  - Review recent requests/errors
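The environment variable verification step can also be scripted. The sketch below checks the four variables this runbook lists as required; the `missing_vars` helper is hypothetical, and the example mapping stands in for the real environment.

```python
import os

# The variables this runbook lists as required.
REQUIRED = [
    "FACEBOOK_PAGE_ACCESS_TOKEN",
    "FACEBOOK_VERIFY_TOKEN",
    "SUPABASE_URL",
    "SUPABASE_SERVICE_KEY",
]


def missing_vars(env=os.environ, required=REQUIRED):
    """Return the required variable names that are unset or empty."""
    return [name for name in required if not env.get(name)]


# Example with an explicit mapping; call missing_vars() to check the real environment.
example_env = {"SUPABASE_URL": "https://example.supabase.co", "FACEBOOK_VERIFY_TOKEN": ""}
print(missing_vars(example_env))
```

Empty values are reported as missing because an empty token fails in the same way as an unset one.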
This runbook is a living document and should be updated when:
- New operational issues are discovered — Add to "Common Issues & Fixes" section
- New debugging procedures are developed — Add to "Debug Commands" section
- New alert thresholds are established — Update "Alert Thresholds" table
- Service behavior changes — Update "Service-Specific Troubleshooting" section
- Environment or deployment changes — Update relevant sections
For instructions on when and how to update this runbook, see:
- AGENTS.md: "Operational Documentation Maintenance" section
- TESTING.md: "Operational Issue Documentation" section