- Service Health Checks
- Common Issues and Resolution
- Incident Response Procedures
- Performance Troubleshooting
- Disaster Recovery
- Scaling Procedures
- Security Incidents
# Check all services
curl -s https://api.domain.com/api/v1/health | jq .
# Check specific components
docker compose ps
docker compose exec api curl -s localhost:8000/api/v1/health
docker compose exec postgres pg_isready
docker compose exec redis redis-cli ping
# API response times
curl -w "@curl-format.txt" -o /dev/null -s https://api.domain.com/api/v1/health
# Database connections
docker compose exec postgres psql -U rendiff_user -d rendiff -c \
"SELECT count(*) FROM pg_stat_activity WHERE datname = 'rendiff';"
# Queue depth
docker compose exec redis redis-cli llen celery
# Worker status
docker compose exec worker-cpu celery -A worker.main inspect active
Symptoms:
- P95 latency > 5 seconds
- Timeouts on /convert endpoint
- User complaints about slow processing
Diagnosis:
# Check CPU usage
docker stats --no-stream
# Check database slow queries
docker compose exec postgres psql -U rendiff_user -d rendiff -c \
"SELECT query, mean_exec_time, calls FROM pg_stat_statements
WHERE mean_exec_time > 1000 ORDER BY mean_exec_time DESC LIMIT 10;"
# Check Redis memory
docker compose exec redis redis-cli info memory
Resolution:
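- Scale API containers:
docker compose up -d --scale api=4
- Optimize slow queries:
# Refresh planner statistics and rebuild indexes on the jobs table
docker compose exec postgres psql -U rendiff_user -d rendiff -c \
  "ANALYZE jobs; REINDEX TABLE jobs;"
- Increase connection pool:
# Update DATABASE_POOL_SIZE in .env
DATABASE_POOL_SIZE=40
# Recreate the container so it picks up the new value (a plain restart reuses the old environment);
# repeat any --scale flag you applied earlier
docker compose up -d api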
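The "P95 latency > 5 seconds" symptom can be checked from a list of raw response times. The following is a minimal sketch, assuming the timings have already been collected in seconds; the function names `p95` and `latency_breached` are illustrative and not part of the project.

```python
def p95(samples):
    """Return the 95th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer ceiling of 0.95 * n, giving a 1-based rank.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def latency_breached(samples, threshold_seconds=5.0):
    """True when the P95 of samples exceeds the threshold (5 s in the symptom list)."""
    return p95(samples) > threshold_seconds
```

Nearest-rank is used here because it always returns an observed value; monitoring systems that interpolate between samples may report a slightly different figure.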
Symptoms:
- Jobs remain in "queued" status
- Queue depth increasing
- No worker activity
Diagnosis:
# Check worker status
docker compose logs --tail=100 worker-cpu | grep ERROR
# Check queue status
docker compose exec redis redis-cli llen high
docker compose exec redis redis-cli llen default
docker compose exec redis redis-cli llen low
# Check worker processes
docker compose exec worker-cpu ps aux | grep celery
Resolution:
- Restart workers:
docker compose restart worker-cpu worker-gpu
- Scale workers:
docker compose up -d --scale worker-cpu=6
- Clear stuck jobs:
# Move stuck jobs back to queue
docker compose exec api python -c "
from datetime import datetime, timedelta
from api.models.job import Job, JobStatus
from api.database import SessionLocal

db = SessionLocal()
# Jobs still marked as processing after more than one hour are treated as stuck
stuck_jobs = db.query(Job).filter(
    Job.status == JobStatus.PROCESSING,
    Job.updated_at < datetime.now() - timedelta(hours=1)
).all()
for job in stuck_jobs:
    job.status = JobStatus.QUEUED
db.commit()
"
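The selection rule used for stuck jobs (status is processing and the last update is more than one hour old) can be tried out without a database. This is a rough sketch that assumes each job is a plain dict with `status` and `updated_at` keys; that shape is hypothetical and stands in for the real `Job` model.

```python
from datetime import datetime, timedelta


def find_stuck(jobs, now, max_age=timedelta(hours=1)):
    """Return the jobs that are still processing and were last updated before now - max_age."""
    cutoff = now - max_age
    return [
        job for job in jobs
        if job["status"] == "processing" and job["updated_at"] < cutoff
    ]
```

Passing `now` explicitly keeps the rule deterministic, which makes it easy to dry-run against an export of the jobs table before changing any statuses.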
Symptoms:
- "No space left on device" errors
- Jobs failing during output write
- Upload failures
Diagnosis:
# Check disk usage
df -h /storage
# Find large files
du -sh /storage/* | sort -hr | head -20
# Check for orphaned files
find /storage -type f -mtime +7 -name "*.tmp" -ls
Resolution:
- Clean temporary files:
# Remove old temporary files
find /storage/tmp -type f -mtime +1 -delete
# Clean orphaned job files
docker compose exec api python scripts/cleanup-storage.py
- Archive old files to S3:
# Archive .mp4 and .webm outputs, skipping paths that start with the current year and month
aws s3 sync /storage/output/ s3://archive-bucket/output/ \
  --exclude "*" --include "*.mp4" --include "*.webm" \
  --exclude "$(date +%Y%m)*"
- Expand storage:
# Resize volume (AWS)
aws ec2 modify-volume --volume-id vol-xxx --size 500
# Resize filesystem
sudo resize2fs /dev/xvdf
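Before deleting anything, it can help to list what a cleanup would remove. The sketch below is a dry-run helper under stated assumptions: `old_temp_files` is a hypothetical name, and the age rule is a simple cutoff in days, which is close to but not identical to the rounding that `find -mtime` applies.

```python
import time
from pathlib import Path


def old_temp_files(root, max_age_days, now=None, pattern="*.tmp"):
    """Return (path, size_in_bytes) for files under root older than max_age_days, largest first."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    matches = []
    for path in Path(root).rglob(pattern):
        if path.is_file():
            info = path.stat()
            if info.st_mtime < cutoff:
                matches.append((path, info.st_size))
    # Largest files first, so the biggest space savings are visible at the top.
    return sorted(matches, key=lambda item: item[1], reverse=True)
```

The function only reads metadata and never deletes, so it is safe to run against `/storage` while jobs are active.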
| Level | Response Time | Examples |
|---|---|---|
| SEV1 | 15 minutes | Complete outage, data loss |
| SEV2 | 30 minutes | Degraded performance, partial outage |
| SEV3 | 2 hours | Minor issues, single component failure |
| SEV4 | Next business day | Cosmetic issues, documentation |
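The response times in the table can be turned into a concrete deadline for a given detection time. This is a minimal sketch; `RESPONSE_TIMES` and `response_deadline` are illustrative names, and SEV4 is left as `None` because "next business day" is not a fixed interval.

```python
from datetime import datetime, timedelta

# Response windows copied from the severity table.
RESPONSE_TIMES = {
    "SEV1": timedelta(minutes=15),
    "SEV2": timedelta(minutes=30),
    "SEV3": timedelta(hours=2),
    "SEV4": None,  # next business day: depends on the team calendar
}


def response_deadline(severity, detected_at):
    """Return the latest acceptable response time, or None when no fixed window applies."""
    window = RESPONSE_TIMES[severity.upper()]
    return None if window is None else detected_at + window
```

An unknown severity raises `KeyError`, which is deliberate: a typo in the level should fail loudly instead of silently picking a window.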
Initial Response (0-15 min):
- Acknowledge incident:
# Send initial notification
./scripts/notify-incident.sh SEV1 "FFmpeg API Complete Outage"
- Quick diagnostics:
# Check all services
docker compose ps
# Check recent deployments
git log --oneline -10
# Check system resources
free -m
df -h
- Immediate mitigation:
# Restart all services
docker compose down
docker compose up -d
# Enable maintenance mode
docker compose exec api redis-cli set maintenance_mode true
Investigation (15-30 min):
- Collect logs:
# Aggregate recent logs (run from the project directory so docker compose finds compose.yml)
INCIDENT_DIR=/tmp/incident-$(date +%Y%m%d-%H%M%S)
mkdir -p "$INCIDENT_DIR"
docker compose logs --since 1h > "$INCIDENT_DIR/docker-logs.txt"
journalctl --since "1 hour ago" > "$INCIDENT_DIR/system-logs.txt"
- Check metrics:
  - Open Grafana dashboard
  - Look for anomalies in last 2 hours
  - Check error rates and latency
- Root cause analysis:
# Check for OOM kills
dmesg | grep -i "killed process"
# Check for disk issues
grep -i "error\|fail" /var/log/syslog
# Database issues
docker compose exec postgres tail -100 /var/log/postgresql/postgresql.log
Recovery (30-60 min):
- Restore service:
# If configuration issue, rollback
git checkout HEAD~1 -- compose.yml
docker compose up -d
# If database issue, restore from backup
./scripts/disaster-recovery.sh --mode latest
- Verify recovery:
# Run smoke tests
./scripts/smoke-test.sh
# Check metrics
curl -s http://localhost:9090/metrics | grep up
- Post-incident:
# Disable maintenance mode
docker compose exec api redis-cli del maintenance_mode
# Send recovery notification
./scripts/notify-incident.sh RESOLVED "FFmpeg API Service Restored"
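The OOM check during root cause analysis greps kernel messages for "killed process". When the log is long, extracting the process ID and name makes the victims easier to compare against the worker containers. The sketch below assumes the kernel line contains the text `Killed process <pid> (<name>)`; the exact wording varies between kernel versions, so treat the pattern as an assumption to verify on your hosts.

```python
import re

# Assumed shape of the kernel message; adjust if your kernel words it differently.
OOM_PATTERN = re.compile(r"killed process (\d+) \(([^)]+)\)", re.IGNORECASE)


def oom_victims(lines):
    """Return (pid, process_name) pairs for every line that reports an OOM kill."""
    victims = []
    for line in lines:
        match = OOM_PATTERN.search(line)
        if match:
            victims.append((int(match.group(1)), match.group(2)))
    return victims
```

Feeding it the saved `dmesg` output from the incident directory gives a short list that can go straight into the incident timeline.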
# Incident Report: [INCIDENT-ID]
**Date:** [DATE]
**Severity:** [SEV1/2/3/4]
**Duration:** [START] - [END]
**Impact:** [# of users affected, % of requests failed]
## Summary
[Brief description of what happened]
## Timeline
- **[TIME]** - Initial detection
- **[TIME]** - Incident acknowledged
- **[TIME]** - Root cause identified
- **[TIME]** - Fix implemented
- **[TIME]** - Service restored
## Root Cause
[Detailed explanation of why this happened]
## Resolution
[What was done to fix the issue]
## Impact
- **Users affected:** [number]
- **Requests failed:** [number]
- **Data loss:** [yes/no]
## Lessons Learned
1. [What went well]
2. [What went poorly]
3. [What was lucky]
## Action Items
- [ ] [Preventive measure 1]
- [ ] [Preventive measure 2]
- [ ] [Process improvement]
Check processing metrics:
# Average processing time by operation
docker compose exec postgres psql -U rendiff_user -d rendiff -c "
SELECT
operations->0->>'type' as operation,
AVG(EXTRACT(EPOCH FROM (completed_at - started_at))) as avg_seconds,
COUNT(*) as job_count
FROM jobs
WHERE status = 'completed'
AND completed_at > NOW() - INTERVAL '1 day'
GROUP BY operations->0->>'type'
ORDER BY avg_seconds DESC;"
Optimize FFmpeg settings:
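# Check the CPU count visible to the worker (upper bound for FFmpeg threads)
docker compose exec worker-cpu cat /proc/cpuinfo | grep processor | wc -l
# Update worker concurrency where the worker reads its environment (for example .env)
WORKER_CONCURRENCY=2 # Reduce to give more CPU per job
# Recreate the worker so the new value is applied; a plain restart reuses the old environment
docker compose up -d worker-cpu
Check slow queries: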
# Enable query logging
docker compose exec postgres psql -U rendiff_user -d rendiff -c \
"ALTER SYSTEM SET log_min_duration_statement = 1000;"
docker compose exec postgres psql -U rendiff_user -d rendiff -c \
"SELECT pg_reload_conf();"
# View slow query log
docker compose exec postgres tail -f /var/log/postgresql/postgresql.log | grep duration
Optimize database:
# Update statistics
docker compose exec postgres vacuumdb -U rendiff_user -d rendiff -z
# Reindex tables
docker compose exec postgres reindexdb -U rendiff_user -d rendiff
# Check table sizes
docker compose exec postgres psql -U rendiff_user -d rendiff -c "
SELECT
schemaname AS table_schema,
tablename AS table_name,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 10;"
- Stop application:
docker compose stop api worker-cpu worker-gpu
- List available backups:
./scripts/disaster-recovery.sh --mode list
- Restore from backup:
# Restore latest
./scripts/disaster-recovery.sh --mode latest
# Restore specific backup
./scripts/disaster-recovery.sh --mode specific \
  --timestamp 20250127_120000
- Verify restoration:
# Check data integrity
docker compose exec postgres psql -U rendiff_user -d rendiff -c \
  "SELECT COUNT(*) FROM jobs;"
# Run application tests
docker compose run --rm api pytest tests/
- Resume service:
docker compose up -d api worker-cpu worker-gpu
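When restoring a specific backup, the usual question is which backup is the newest one taken before the incident started. The sketch below assumes backup names follow the `YYYYMMDD_HHMMSS` format shown in the `--timestamp` example; `pick_backup` is a hypothetical helper, and the list of timestamps would come from the `--mode list` output.

```python
from datetime import datetime

# Format inferred from the --timestamp example (20250127_120000).
STAMP_FORMAT = "%Y%m%d_%H%M%S"


def pick_backup(timestamps, target):
    """Return the newest timestamp string that is not later than target, or None."""
    candidates = [
        stamp for stamp in timestamps
        if datetime.strptime(stamp, STAMP_FORMAT) <= target
    ]
    return max(
        candidates,
        key=lambda stamp: datetime.strptime(stamp, STAMP_FORMAT),
        default=None,
    )
```

A return value of `None` means every available backup is newer than the target time, in which case point-in-time recovery from WAL archives is the remaining option.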
# Enable WAL archiving (preventive)
docker compose exec postgres psql -U postgres -c "
ALTER SYSTEM SET wal_level = replica;
ALTER SYSTEM SET archive_mode = on;
ALTER SYSTEM SET archive_command = 'aws s3 cp %p s3://backup-bucket/wal/%f';
"
# Take a base backup (the starting point for PITR)
pg_basebackup -h localhost -D /recovery -U postgres -Fp -Xs -P
- Plan maintenance window:
# Enable maintenance mode
docker compose exec api redis-cli set maintenance_mode true ex 3600
- Scale instance (AWS):
# Stop instance and wait until it is fully stopped
aws ec2 stop-instances --instance-ids i-xxxxx
aws ec2 wait instance-stopped --instance-ids i-xxxxx
# Modify instance type
aws ec2 modify-instance-attribute --instance-id i-xxxxx \
  --instance-type Value=c5.4xlarge
# Start instance
aws ec2 start-instances --instance-ids i-xxxxx
- Verify and adjust:
# Update resource limits
docker compose down
# Edit compose.yml with new limits
docker compose up -d
- Add worker nodes:
# Deploy to new node
scp -r . newnode:/opt/rendiff/
ssh newnode "cd /opt/rendiff && docker compose up -d worker-cpu"
- Scale services:
# API servers
docker compose up -d --scale api=6
# CPU workers
docker compose up -d --scale worker-cpu=10
# GPU workers (if available)
docker compose up -d --scale worker-gpu=4
- Update load balancer:
# Confirm Traefik is healthy after the new backends come up
docker compose exec traefik traefik healthcheck
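The scale targets above are fixed numbers. A rough way to size workers from the current backlog is to divide the total queued work by what one worker can process in the time you are willing to wait. This is a back-of-the-envelope sketch, not the project's sizing method: `workers_needed` and its parameters are hypothetical, and it assumes jobs are roughly uniform in duration.

```python
import math


def workers_needed(queue_depth, avg_job_seconds, drain_seconds, concurrency=1):
    """Estimate how many workers drain queue_depth jobs within drain_seconds."""
    if drain_seconds <= 0 or concurrency <= 0:
        raise ValueError("drain_seconds and concurrency must be positive")
    if queue_depth <= 0:
        return 0
    total_work = queue_depth * avg_job_seconds          # seconds of processing queued
    capacity_per_worker = drain_seconds * concurrency   # seconds one worker can absorb
    return math.ceil(total_work / capacity_per_worker)
```

The queue depth can come from `redis-cli llen` and the average duration from the processing-metrics query; the result is a starting point for `--scale worker-cpu=N`, to be capped by the CPU actually available.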
- Immediate response:
# Identify compromised key
docker compose exec postgres psql -U rendiff_user -d rendiff -c "
SELECT api_key_hash, last_used_at, request_count
FROM api_keys
WHERE last_used_at > NOW() - INTERVAL '1 hour'
ORDER BY request_count DESC;"
# Revoke key (quote the placeholder so the shell does not treat < and > as redirections)
./scripts/manage-api-keys.sh revoke "<key-hash>"
- Investigate:
# Check access logs
docker compose logs api | grep "<key-hash>" > suspicious-activity.log
# Check for data exfiltration
docker compose exec postgres psql -U rendiff_user -d rendiff -c "
SELECT COUNT(*), SUM(output_size)
FROM jobs
WHERE api_key = '<key-hash>'
AND created_at > NOW() - INTERVAL '24 hours';"
- Remediate:
# Rotate all keys for affected user
./scripts/manage-api-keys.sh rotate-user "<user-id>"
# Enable additional monitoring
docker compose exec api redis-cli set "monitor:api_key:<key-hash>" true
- Enable rate limiting:
# Update Traefik rate limits
docker compose exec traefik redis-cli set "ratelimit:global" 100
# Enable DDoS protection mode
docker compose exec api python -c "
from api.config import settings
settings.ENABLE_DDOS_PROTECTION = True
"
- Block malicious IPs:
# Analyze access patterns (--no-log-prefix drops the container name so the first field is the client address)
docker compose logs --no-log-prefix traefik | awk '{print $1}' | sort | uniq -c | sort -rn | head -20
# Block suspicious IPs
iptables -A INPUT -s MALICIOUS_IP -j DROP
- Scale and cache:
# Enable aggressive caching
docker compose exec redis redis-cli config set maxmemory 4gb
# Scale API servers
docker compose up -d --scale api=10
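The access-pattern pipeline (`awk '{print $1}' | sort | uniq -c | sort -rn | head -20`) counts how often each first field appears. The same tally can be sketched in Python for use on saved logs; this assumes, as the shell version does, that the client address is the first whitespace-separated field of each line, which depends on the access log format in use. `top_talkers` is an illustrative name.

```python
from collections import Counter


def top_talkers(lines, limit=20):
    """Return (first_field, count) pairs for the most frequent first fields, highest first."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts.most_common(limit)
```

Keeping the counts in a `Counter` makes it simple to extend the check later, for example by comparing each address against a per-minute threshold before adding a firewall rule.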
# Service health
curl -s localhost:8000/api/v1/health | jq .
# Queue status
docker compose exec redis redis-cli info clients
# Active jobs
docker compose exec worker-cpu celery -A worker.main inspect active
# Database connections
docker compose exec postgres psql -U rendiff_user -d rendiff -c "SELECT count(*) FROM pg_stat_activity;"
# Memory usage
docker stats --no-stream --format "table {{.Container}}\t{{.MemUsage}}"
# Disk usage
df -h | grep -E "Filesystem|storage"
# Network connections
netstat -an | grep ESTABLISHED | wc -l
# Error logs
docker compose logs --since 10m | grep -i error
# Performance metrics
curl -s localhost:9090/metrics | grep -E "http_request_duration|ffmpeg_job_duration"
- On-Call Engineer: Use PagerDuty
- Database Admin: dba-team@company.com
- Infrastructure: infra-team@company.com
- Security Team: security@company.com
- Management Escalation: cto@company.com