Rendiff Operational Runbooks

Service Health Checks
Common Issues and Resolution
Incident Response Procedures
Performance Troubleshooting
Disaster Recovery
Scaling Procedures
Security Incidents

Service Health Checks

🟢 Quick Health Check

# Check all services
curl -s https://api.domain.com/api/v1/health | jq .

# Check specific components
docker compose ps
docker compose exec api curl -s localhost:8000/api/v1/health
docker compose exec postgres pg_isready
docker compose exec redis redis-cli ping

🔍 Deep Health Check

# API response times
curl -w "@curl-format.txt" -o /dev/null -s https://api.domain.com/api/v1/health

# Database connections
docker compose exec postgres psql -U rendiff_user -d rendiff -c \
  "SELECT count(*) FROM pg_stat_activity WHERE datname = 'rendiff';"

# Queue depth
docker compose exec redis redis-cli llen celery

# Worker status
docker compose exec worker-cpu celery -A worker.main inspect active

Common Issues and Resolution

🚨 Issue: High API Response Times

Symptoms:

- P95 latency > 5 seconds
- Timeouts on /convert endpoint
- User complaints about slow processing
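To check the latency symptom against raw timing samples, a nearest-rank percentile is usually enough. The sketch below is illustrative only: the helper names are hypothetical, and it assumes latencies have already been collected in seconds (for example from load-balancer logs).

```python
import math

P95_THRESHOLD_SECONDS = 5.0  # threshold taken from the symptom list


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def latency_breached(samples):
    """True when the P95 of the samples exceeds the 5-second threshold."""
    return percentile(samples, 95) > P95_THRESHOLD_SECONDS
```

Nearest-rank avoids interpolation, so the reported P95 is always a latency that was actually observed.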

Diagnosis:

# Check CPU usage
docker stats --no-stream

# Check database slow queries
docker compose exec postgres psql -U rendiff_user -d rendiff -c \
  "SELECT query, mean_exec_time, calls FROM pg_stat_statements 
   WHERE mean_exec_time > 1000 ORDER BY mean_exec_time DESC LIMIT 10;"

# Check Redis memory
docker compose exec redis redis-cli info memory

Resolution:

Scale API containers:
```
docker compose up -d --scale api=4
```

Clear slow queries:

# Analyze and optimize slow queries
docker compose exec postgres psql -U rendiff_user -d rendiff -c \
  "ANALYZE jobs; REINDEX TABLE jobs;"

Increase connection pool:

# Update DATABASE_POOL_SIZE in .env
DATABASE_POOL_SIZE=40
# Recreate the API container so the new value is applied (restart alone does not re-read .env)
docker compose up -d api

🚨 Issue: Jobs Stuck in Queue

Symptoms:

- Jobs remain in "queued" status
- Queue depth increasing
- No worker activity

Diagnosis:

# Check worker status
docker compose logs --tail=100 worker-cpu | grep ERROR

# Check queue status
docker compose exec redis redis-cli llen high
docker compose exec redis redis-cli llen default
docker compose exec redis redis-cli llen low

# Check worker processes
docker compose exec worker-cpu ps aux | grep celery

Resolution:

Restart workers:

docker compose restart worker-cpu worker-gpu

Scale workers:

docker compose up -d --scale worker-cpu=6

Clear stuck jobs:

# Move stuck jobs back to queue
docker compose exec api python -c "
from datetime import datetime, timedelta
from api.models.job import Job, JobStatus
from api.database import SessionLocal
db = SessionLocal()
# Jobs still marked as processing with no update in the last hour are treated as stuck
stuck_jobs = db.query(Job).filter(
    Job.status == JobStatus.PROCESSING,
    Job.updated_at < datetime.now() - timedelta(hours=1)
).all()
for job in stuck_jobs:
    job.status = JobStatus.QUEUED
db.commit()
db.close()
"

🚨 Issue: Storage Full

Symptoms:

- "No space left on device" errors
- Jobs failing during output write
- Upload failures

Diagnosis:

# Check disk usage
df -h /storage

# Find large files
du -sh /storage/* | sort -hr | head -20

# Check for orphaned files
find /storage -type f -mtime +7 -name "*.tmp" -ls
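When the orphan check needs to run from application code instead of the shell, the same idea can be sketched in Python. This is an approximation, not a drop-in replacement: `find -mtime +7` rounds file ages down to whole days, while this hypothetical `stale_files` helper uses an exact cutoff in seconds.

```python
import time
from pathlib import Path


def stale_files(root, pattern="*.tmp", min_age_days=7, now=None):
    """Return files under root matching pattern whose mtime is older than min_age_days."""
    now = time.time() if now is None else now
    cutoff = now - min_age_days * 86400  # seconds per day
    return sorted(
        path
        for path in Path(root).rglob(pattern)
        if path.is_file() and path.stat().st_mtime < cutoff
    )
```

Calling `stale_files("/storage")` lists candidates without deleting anything, which keeps the review step separate from cleanup.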

Resolution:

Clean temporary files:

# Remove old temporary files
find /storage/tmp -type f -mtime +1 -delete

# Clean orphaned job files
docker compose exec api python scripts/cleanup-storage.py

Archive old files to S3:

# Archive video outputs to S3, skipping files whose names start with the current year and month (YYYYMM)
aws s3 sync /storage/output/ s3://archive-bucket/output/ \
  --exclude "*" --include "*.mp4" --include "*.webm" \
  --exclude "$(date +%Y%m)*"

Expand storage:

# Resize volume (AWS)
aws ec2 modify-volume --volume-id vol-xxx --size 500

# Resize filesystem
sudo resize2fs /dev/xvdf

Incident Response Procedures

📋 Severity Levels

| Level | Response Time | Examples |
|-------|---------------|----------|
| SEV1 | 15 minutes | Complete outage, data loss |
| SEV2 | 30 minutes | Degraded performance, partial outage |
| SEV3 | 2 hours | Minor issues, single component failure |
| SEV4 | Next business day | Cosmetic issues, documentation |
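For alerting or paging tooling, the response times in the table can be turned into concrete acknowledgement deadlines. The following is a minimal sketch with hypothetical names. SEV4 is deliberately left out because "next business day" depends on a team calendar that this runbook does not define.

```python
from datetime import datetime, timedelta

# Response times copied from the severity table; SEV4 is not modeled here.
RESPONSE_TIMES = {
    "SEV1": timedelta(minutes=15),
    "SEV2": timedelta(minutes=30),
    "SEV3": timedelta(hours=2),
}


def response_deadline(severity, detected_at):
    """Return the latest acceptable acknowledgement time for an incident."""
    if severity not in RESPONSE_TIMES:
        raise ValueError(f"no fixed response time for {severity}")
    return detected_at + RESPONSE_TIMES[severity]
```

For example, a SEV1 detected at 12:00 must be acknowledged by 12:15.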

🚨 SEV1: Complete Service Outage

Initial Response (0-15 min):

Acknowledge incident:

# Send initial notification
./scripts/notify-incident.sh SEV1 "FFmpeg API Complete Outage"

Quick diagnostics:

# Check all services
docker compose ps

# Check recent deployments
git log --oneline -10

# Check system resources
free -m
df -h

Immediate mitigation:

# Restart all services
docker compose down
docker compose up -d

# Enable maintenance mode (run redis-cli in the redis container, as in the health checks)
docker compose exec redis redis-cli set maintenance_mode true

Investigation (15-30 min):

Collect logs:

# Aggregate recent logs in a dedicated directory
INCIDENT_DIR=/tmp/incident-$(date +%Y%m%d-%H%M%S)
mkdir -p "$INCIDENT_DIR"
cd "$INCIDENT_DIR"

docker compose logs --since 1h > docker-logs.txt
journalctl --since "1 hour ago" > system-logs.txt

Check metrics:
- Open Grafana dashboard
- Look for anomalies in last 2 hours
- Check error rates and latency

Root cause analysis:

# Check for OOM kills
dmesg | grep -i "killed process"

# Check for disk issues
grep -i "error\|fail" /var/log/syslog

# Database issues
docker compose exec postgres tail -100 /var/log/postgresql/postgresql.log

Recovery (30-60 min):

Restore service:

# If configuration issue, rollback
git checkout HEAD~1 -- compose.yml
docker compose up -d

# If database issue, restore from backup
./scripts/disaster-recovery.sh --mode latest

Verify recovery:

# Run smoke tests
./scripts/smoke-test.sh

# Check metrics
curl -s http://localhost:9090/metrics | grep up

Post-incident:

# Disable maintenance mode
docker compose exec redis redis-cli del maintenance_mode

# Send recovery notification
./scripts/notify-incident.sh RESOLVED "FFmpeg API Service Restored"

📝 Incident Report Template

# Incident Report: [INCIDENT-ID]

**Date:** [DATE]
**Severity:** [SEV1/2/3/4]
**Duration:** [START] - [END]
**Impact:** [# of users affected, % of requests failed]

## Summary
[Brief description of what happened]

## Timeline
- **[TIME]** - Initial detection
- **[TIME]** - Incident acknowledged
- **[TIME]** - Root cause identified
- **[TIME]** - Fix implemented
- **[TIME]** - Service restored

## Root Cause
[Detailed explanation of why this happened]

## Resolution
[What was done to fix the issue]

## Impact
- **Users affected:** [number]
- **Requests failed:** [number]
- **Data loss:** [yes/no]

## Lessons Learned
1. [What went well]
2. [What went poorly]
3. [What was lucky]

## Action Items
- [ ] [Preventive measure 1]
- [ ] [Preventive measure 2]
- [ ] [Process improvement]
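The Duration and Impact fields of the template are simple arithmetic, and computing them the same way for every report keeps incidents comparable. A small sketch, assuming request totals are available from your metrics system (the function name and the returned keys are hypothetical):

```python
from datetime import datetime


def incident_impact(start, end, total_requests, failed_requests):
    """Compute duration in minutes and the percentage of failed requests."""
    duration_minutes = (end - start).total_seconds() / 60
    failed_pct = 100 * failed_requests / total_requests if total_requests else 0.0
    return {
        "duration_minutes": round(duration_minutes, 1),
        "failed_pct": round(failed_pct, 2),
    }
```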

Performance Troubleshooting

🐌 Slow Video Processing

Check processing metrics:

# Average processing time by operation
docker compose exec postgres psql -U rendiff_user -d rendiff -c "
SELECT 
    operations->0->>'type' as operation,
    AVG(EXTRACT(EPOCH FROM (completed_at - started_at))) as avg_seconds,
    COUNT(*) as job_count
FROM jobs 
WHERE status = 'completed' 
    AND completed_at > NOW() - INTERVAL '1 day'
GROUP BY operations->0->>'type'
ORDER BY avg_seconds DESC;"
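If job rows have already been exported (for example into a notebook), the same aggregation can be reproduced outside the database. This sketch assumes each row is an `(operation, started_at, completed_at)` tuple. The function name is hypothetical, and the operation names in any sample data would be placeholders.

```python
from collections import defaultdict


def average_seconds_by_operation(rows):
    """Group rows by operation and return (operation, avg_seconds, job_count), slowest first."""
    totals = defaultdict(lambda: [0.0, 0])  # operation -> [total_seconds, job_count]
    for operation, started_at, completed_at in rows:
        bucket = totals[operation]
        bucket[0] += (completed_at - started_at).total_seconds()
        bucket[1] += 1
    averages = [(op, total / count, count) for op, (total, count) in totals.items()]
    return sorted(averages, key=lambda item: item[1], reverse=True)
```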

Optimize FFmpeg settings:

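# Count CPU cores visible to the worker container
docker compose exec worker-cpu grep -c processor /proc/cpuinfo

# Update WORKER_CONCURRENCY (for example in .env, as with DATABASE_POOL_SIZE)
WORKER_CONCURRENCY=2  # Reduce to give more CPU per job
# Recreate the worker so the new value is applied (restart alone does not re-read .env)
docker compose up -d worker-cpu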
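The trade-off behind `WORKER_CONCURRENCY` is a division: fewer concurrent jobs means more cores per FFmpeg process. A rough sketch with hypothetical helper names, assuming jobs share the container's cores evenly:

```python
def cores_per_job(total_cores, worker_concurrency):
    """Average number of cores available to each concurrently running job."""
    if worker_concurrency < 1:
        raise ValueError("worker_concurrency must be at least 1")
    return total_cores / worker_concurrency


def concurrency_for(total_cores, target_cores_per_job):
    """Largest concurrency that still leaves target_cores_per_job cores for every job."""
    if target_cores_per_job < 1:
        raise ValueError("target_cores_per_job must be at least 1")
    return max(1, total_cores // target_cores_per_job)
```

On an 8-core worker, `WORKER_CONCURRENCY=2` leaves about 4 cores per job, while a concurrency of 4 leaves about 2.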

📊 Database Performance

Check slow queries:

# Enable query logging
docker compose exec postgres psql -U rendiff_user -d rendiff -c \
  "ALTER SYSTEM SET log_min_duration_statement = 1000;"

docker compose exec postgres psql -U rendiff_user -d rendiff -c \
  "SELECT pg_reload_conf();"

# View slow query log
docker compose exec postgres tail -f /var/log/postgresql/postgresql.log | grep duration

Optimize database:

# Update statistics
docker compose exec postgres vacuumdb -U rendiff_user -d rendiff -z

# Reindex tables
docker compose exec postgres reindexdb -U rendiff_user -d rendiff

# Check table sizes
docker compose exec postgres psql -U rendiff_user -d rendiff -c "
SELECT
    schemaname AS table_schema,
    tablename AS table_name,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 10;"

Disaster Recovery

🔥 Complete Database Recovery

Stop application:

docker compose stop api worker-cpu worker-gpu

List available backups:

./scripts/disaster-recovery.sh --mode list

Restore from backup:

# Restore latest
./scripts/disaster-recovery.sh --mode latest

# Restore specific backup
./scripts/disaster-recovery.sh --mode specific \
  --timestamp 20250127_120000
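When restoring to a point before an incident, the backup to pass to `--timestamp` is the newest one taken before the incident began. The sketch below assumes backup identifiers follow the `YYYYMMDD_HHMMSS` pattern seen in the example above. That format is inferred from the example and not confirmed by the script, and the helper names are hypothetical.

```python
from datetime import datetime

BACKUP_TS_FORMAT = "%Y%m%d_%H%M%S"  # assumed from the example 20250127_120000


def parse_backup_timestamp(ts):
    """Parse a backup identifier such as 20250127_120000 into a datetime."""
    return datetime.strptime(ts, BACKUP_TS_FORMAT)


def latest_before(timestamps, incident_time):
    """Return the newest backup identifier taken strictly before incident_time, or None."""
    candidates = [ts for ts in timestamps if parse_backup_timestamp(ts) < incident_time]
    return max(candidates, key=parse_backup_timestamp, default=None)
```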

Verify restoration:

# Check data integrity
docker compose exec postgres psql -U rendiff_user -d rendiff -c \
  "SELECT COUNT(*) FROM jobs;"

# Run application tests
docker compose run --rm api pytest tests/

Resume service:

docker compose up -d api worker-cpu worker-gpu

💾 Point-in-Time Recovery

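# Enable WAL archiving (preventive); archive_mode takes effect only after a PostgreSQL restart
# ALTER SYSTEM cannot run inside a transaction block, so each statement gets its own -c
docker compose exec postgres psql -U postgres \
  -c "ALTER SYSTEM SET wal_level = replica;" \
  -c "ALTER SYSTEM SET archive_mode = on;" \
  -c "ALTER SYSTEM SET archive_command = 'aws s3 cp %p s3://backup-bucket/wal/%f';"

# Take a base backup; PITR restores this backup and then replays archived WAL up to the target time
pg_basebackup -h localhost -D /recovery -U postgres -Fp -Xs -P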

Scaling Procedures

⬆️ Vertical Scaling (Resize)

Plan maintenance window:

# Enable maintenance mode for one hour (the key expires after 3600 seconds)
docker compose exec redis redis-cli set maintenance_mode true ex 3600

Scale instance (AWS):

# Stop instance
aws ec2 stop-instances --instance-ids i-xxxxx

# Modify instance type
aws ec2 modify-instance-attribute --instance-id i-xxxxx \
  --instance-type c5.4xlarge

# Start instance
aws ec2 start-instances --instance-ids i-xxxxx

Verify and adjust:

# Update resource limits
docker compose down
# Edit compose.yml with new limits
docker compose up -d

➡️ Horizontal Scaling

Add worker nodes:

# Deploy to new node
scp -r . newnode:/opt/rendiff/
ssh newnode "cd /opt/rendiff && docker compose up -d worker-cpu"

Scale services:

# API servers
docker compose up -d --scale api=6

# CPU workers
docker compose up -d --scale worker-cpu=10

# GPU workers (if available)
docker compose up -d --scale worker-gpu=4

Update load balancer:

# Confirm Traefik is healthy; if it uses the Docker provider, scaled containers are picked up automatically
docker compose exec traefik traefik healthcheck

Security Incidents

🔐 Suspected API Key Compromise

Immediate response:

# Identify compromised key
docker compose exec postgres psql -U rendiff_user -d rendiff -c "
SELECT api_key_hash, last_used_at, request_count 
FROM api_keys 
WHERE last_used_at > NOW() - INTERVAL '1 hour'
ORDER BY request_count DESC;"

# Revoke key (replace the placeholder; quoted so the shell does not treat < and > as redirections)
./scripts/manage-api-keys.sh revoke "<key-hash>"

Investigate:

# Check access logs (replace the placeholder; quoted so the shell does not treat < and > as redirections)
docker compose logs api | grep "<key-hash>" > suspicious-activity.log

# Check for data exfiltration
docker compose exec postgres psql -U rendiff_user -d rendiff -c "
SELECT COUNT(*), SUM(output_size) 
FROM jobs 
WHERE api_key = '<key-hash>' 
  AND created_at > NOW() - INTERVAL '24 hours';"

Remediate:

# Rotate all keys for affected user
./scripts/manage-api-keys.sh rotate-user <user-id>

# Enable additional monitoring
docker compose exec redis redis-cli set "monitor:api_key:<key-hash>" true

🛡️ DDoS Attack Response

Enable rate limiting:

# Update the global rate limit (stored in Redis, so run redis-cli in the redis container)
docker compose exec redis redis-cli set "ratelimit:global" 100

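# Enable DDoS protection mode
# Changing settings in a one-off python -c process does not affect the running API.
# Assuming settings are read from the environment, set ENABLE_DDOS_PROTECTION=true in .env,
# then recreate the API containers:
docker compose up -d api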

Block malicious IPs:

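# Analyze access patterns (assumes access logs start with the client IP; --no-log-prefix drops the container name)
docker compose logs --no-log-prefix traefik | awk '{print $1}' | sort | uniq -c | sort -rn | head -20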

# Block suspicious IPs
iptables -A INPUT -s MALICIOUS_IP -j DROP
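The per-IP count can also be done in Python when logs have already been collected to a file or need further filtering. This is a sketch under the same assumption as the shell pipeline, namely that each access-log line starts with the client IP. The function name is hypothetical.

```python
from collections import Counter


def top_talkers(log_lines, limit=20):
    """Count requests per client IP (first whitespace-separated field) and return the busiest."""
    counts = Counter(line.split()[0] for line in log_lines if line.strip())
    return counts.most_common(limit)
```

The result is a list of `(ip, request_count)` pairs, which makes it easy to apply a threshold before adding any `iptables` rule.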

Scale and cache:

# Enable aggressive caching
docker compose exec redis redis-cli config set maxmemory 4gb

# Scale API servers
docker compose up -d --scale api=10

Monitoring Commands Reference

# Service health
curl -s localhost:8000/api/v1/health | jq .

# Queue status (depth of the default Celery queue)
docker compose exec redis redis-cli llen celery

# Active jobs
docker compose exec worker-cpu celery -A worker.main inspect active

# Database connections
docker compose exec postgres psql -U rendiff_user -d rendiff -c "SELECT count(*) FROM pg_stat_activity;"

# Memory usage
docker stats --no-stream --format "table {{.Container}}\t{{.MemUsage}}"

# Disk usage
df -h | grep -E "Filesystem|storage"

# Network connections
netstat -an | grep ESTABLISHED | wc -l

# Error logs
docker compose logs --since 10m | grep -i error

# Performance metrics
curl -s localhost:9090/metrics | grep -E "http_request_duration|ffmpeg_job_duration"

Emergency Contacts

On-Call Engineer: Use PagerDuty
Database Admin: dba-team@company.com
Infrastructure: infra-team@company.com
Security Team: security@company.com
Management Escalation: cto@company.com
