
pgtrace operations runbook

Operational guide for pgtrace deployment and maintenance.

Incident workflow

  1. Open the Incident Room (last 15 minutes)
  2. Identify the dominant wait type
  3. If lock waits dominate, open the Lock Graph
  4. Identify the top blocker
  5. Export a case bundle
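
If the UI is unavailable, the top blocker can also be identified directly on the source database with a standard catalog query. This is a generic sketch using PostgreSQL's built-in pg_blocking_pids(); it is not pgtrace-specific:

```sql
-- Find blocked sessions and the PIDs blocking them
SELECT blocked.pid                              AS blocked_pid,
       blocked.query                            AS blocked_query,
       unnest(pg_blocking_pids(blocked.pid))    AS blocking_pid
FROM pg_stat_activity AS blocked
WHERE cardinality(pg_blocking_pids(blocked.pid)) > 0;
```

A blocking_pid that appears for many blocked_pid rows is usually the top blocker.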

Maintenance

Retention

Retention runs automatically via the agent, or can be scheduled with pg_cron.

Default retention:

  • events: 24 hours
  • lock_edges: 24 hours
  • rollup tables: 24 hours
  • repl_samples/wal_samples: 7 days
  • markers: 30 days

Migration from non-partitioned to partitioned schema

  1. Stop the agent
  2. Back up the store DB
  3. Run the optional partitioning migrations
  4. Migrate data (dump/restore or a dual-write approach)
  5. Restart the agent

See store/migrations_optional/partitioned/ for details.

Backup and Recovery

Backup Procedures

Store Database Backup

Full Backup (Recommended Daily)

# Using pg_dump
pg_dump -h localhost -U postgres -d pgtrace_store \
  --format=custom \
  --file=pgtrace_store_$(date +%Y%m%d_%H%M%S).dump

# Using pg_dumpall (for multiple databases)
pg_dumpall -h localhost -U postgres > pgtrace_all_$(date +%Y%m%d_%H%M%S).sql

Continuous Archiving (WAL-based)

For point-in-time recovery, enable PostgreSQL WAL archiving:

# postgresql.conf
wal_level = replica
archive_mode = on
# the "test ! -f" guard prevents overwriting an already-archived segment
archive_command = 'test ! -f /backup/pgtrace/wal/%f && cp %p /backup/pgtrace/wal/%f'

Automated Backup Script

#!/bin/bash
# backup-pgtrace.sh
set -euo pipefail

BACKUP_DIR="/backup/pgtrace"
DATE=$(date +%Y%m%d_%H%M%S)
RETENTION_DAYS=30

mkdir -p "$BACKUP_DIR"

# Full backup
pg_dump -h localhost -U postgres -d pgtrace_store \
  --format=custom \
  --file="$BACKUP_DIR/pgtrace_store_$DATE.dump"

# Compress the new backup
gzip "$BACKUP_DIR/pgtrace_store_$DATE.dump"

# Remove backups older than the retention window
find "$BACKUP_DIR" -name "pgtrace_store_*.dump.gz" -mtime +$RETENTION_DAYS -delete

echo "Backup completed: pgtrace_store_$DATE.dump.gz"

Schedule with cron:

0 2 * * * /usr/local/bin/backup-pgtrace.sh
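
The pruning step in backup-pgtrace.sh can be exercised in isolation to confirm the find -mtime expression behaves as expected. A self-contained sketch using a temporary directory and backdated files (GNU touch/find assumed):

```shell
#!/bin/bash
set -euo pipefail

# Simulate the backup directory with one old and one recent backup
BACKUP_DIR=$(mktemp -d)
RETENTION_DAYS=30
touch -d "40 days ago" "$BACKUP_DIR/pgtrace_store_old.dump.gz"
touch -d "1 day ago"   "$BACKUP_DIR/pgtrace_store_new.dump.gz"

# Same pruning rule as the backup script above
find "$BACKUP_DIR" -name "pgtrace_store_*.dump.gz" -mtime +"$RETENTION_DAYS" -delete

ls "$BACKUP_DIR"   # only pgtrace_store_new.dump.gz remains
rm -rf "$BACKUP_DIR"
```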

Recovery Procedures

Restore from Full Backup

# Stop agent and server
systemctl stop pgtrace-agent
systemctl stop pgtrace-server

# Drop and recreate database (CAUTION: data loss)
dropdb -h localhost -U postgres pgtrace_store
createdb -h localhost -U postgres pgtrace_store

# Restore from backup
pg_restore -h localhost -U postgres -d pgtrace_store \
  --clean \
  --if-exists \
  pgtrace_store_20260113_020000.dump

# Restart services
systemctl start pgtrace-server
systemctl start pgtrace-agent

Point-in-Time Recovery (PITR)

If WAL archiving is enabled:

# Stop PostgreSQL
systemctl stop postgresql

# Restore a base backup taken earlier (e.g. with pg_basebackup -Ft)
# into an empty data directory; adjust the path to your base backup
rm -rf /var/lib/postgresql/data/*
tar -xf /backup/pgtrace/base/base.tar -C /var/lib/postgresql/data

# Configure recovery (PostgreSQL 12+ no longer reads recovery.conf)
cat >> /var/lib/postgresql/data/postgresql.auto.conf <<EOF
restore_command = 'cp /backup/pgtrace/wal/%f %p'
recovery_target_time = '2026-01-13 10:30:00'
EOF
touch /var/lib/postgresql/data/recovery.signal

# Start PostgreSQL (will recover to the target time)
systemctl start postgresql

Backup Verification

Regularly verify backups:

# Test restore to temporary database
createdb -h localhost -U postgres pgtrace_store_test
pg_restore -h localhost -U postgres -d pgtrace_store_test \
  --no-owner --no-privileges \
  pgtrace_store_20260113_020000.dump

# Verify data
psql -h localhost -U postgres -d pgtrace_store_test \
  -c "SELECT COUNT(*) FROM pgtrace.events;"

# Cleanup
dropdb -h localhost -U postgres pgtrace_store_test

Backup Storage

  • Store backups on separate storage (NFS, S3, etc.)
  • Encrypt backups containing sensitive data
  • Test restore procedures regularly
  • Document backup locations and access procedures
  • Keep multiple backup generations (daily, weekly, monthly)

Monitoring

Health Checks

  • Check agent logs: journalctl -u pgtrace-agent -f
  • Monitor server logs: journalctl -u pgtrace-server -f
  • Health endpoint: curl http://localhost:4000/api/v1/health
  • Database size monitoring
  • Agent heartbeat events in database

Metrics

Prometheus metrics endpoint: http://localhost:4000/metrics

Key metrics to monitor:

  • Database connection pool size
  • HTTP request rate and latency
  • Memory usage
  • Process count
  • Rate limit hits

Alerting

Recommended alerts:

  • Agent not sending heartbeats (check for heartbeat events)
  • Database connection failures
  • High error rate in logs
  • Database size approaching limits
  • Rate limit violations
  • High memory usage

Troubleshooting

  • Agent not collecting: Check source DB permissions
  • High store DB load: Check retention is running
  • Missing data: Check backpressure markers, verify agent is running
  • API errors: Check rate limits, authentication, database connections
  • High latency: Check database performance, connection pool size

Capacity Planning

Database Size Estimation

Approximate storage requirements per node:

  • Events: ~1 KB per event; at the default 20k events/sec that is ~20 MB/sec, or ~1.7 TB/day (so ~1.7 TB steady state with 24 h retention)
  • Lock edges: ~200 bytes per edge; volume depends on contention
  • Rollups: ~10 KB per second = ~864 MB/day (negligible by comparison)
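
The event-volume figure can be sanity-checked with shell arithmetic, assuming 1 KiB per event and 86,400 seconds per day:

```shell
# Back-of-envelope check of the events estimate above
EVENTS_PER_SEC=20000
BYTES_PER_EVENT=1024
SECONDS_PER_DAY=86400

BYTES_PER_DAY=$((EVENTS_PER_SEC * BYTES_PER_EVENT * SECONDS_PER_DAY))
echo "$((BYTES_PER_DAY / 1024 ** 3)) GiB/day"   # → 1647 GiB/day, i.e. ~1.7 TB
```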

Recommended:

  • Start with 100GB+ for store database
  • Monitor growth rate
  • Adjust retention periods based on storage capacity
  • Consider partitioning for high-volume deployments

Performance Tuning

Database Tuning

-- Increase work_mem for complex queries
ALTER SYSTEM SET work_mem = '256MB';
SELECT pg_reload_conf();

-- Tune shared_buffers (25% of RAM for a dedicated DB server)
-- Note: shared_buffers requires a server restart; pg_reload_conf() is not enough
ALTER SYSTEM SET shared_buffers = '4GB';
-- then: systemctl restart postgresql

-- Increase maintenance_work_mem for VACUUM
ALTER SYSTEM SET maintenance_work_mem = '1GB';
SELECT pg_reload_conf();

Connection Pooling

Adjust pool size based on concurrent connections:

  • Default: 10 connections
  • High load: 20-50 connections
  • Monitor connection usage

Agent Intervals

Adjust polling intervals based on load:

  • Increase intervals under high load
  • Monitor agent CPU usage
  • Check degraded mode markers