
pgtrace operations runbook

Operational guide for pgtrace deployment and maintenance.

Incident workflow

  1. Open the Incident Room (last 15 minutes)
  2. Identify the dominant wait type
  3. If lock waits dominate, open the Lock Graph
  4. Identify the top blocker
  5. Export a case bundle
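
If the UI is unavailable, the top blocker can also be identified directly on the source database with a standard catalog query. This is a generic sketch using PostgreSQL's built-in pg_blocking_pids(); it is not pgtrace-specific:

```sql
-- Find blocked sessions and the PIDs blocking them
SELECT blocked.pid                              AS blocked_pid,
       blocked.query                            AS blocked_query,
       unnest(pg_blocking_pids(blocked.pid))    AS blocking_pid
FROM pg_stat_activity AS blocked
WHERE cardinality(pg_blocking_pids(blocked.pid)) > 0;
```

A blocking_pid that appears for many blocked_pid rows is usually the top blocker.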

Maintenance

Retention

Retention runs automatically via the agent, or can be scheduled with pg_cron.

Default retention:

  • events: 24 hours
  • lock_edges: 24 hours
  • rollup tables: 24 hours
  • repl_samples/wal_samples: 7 days
  • markers: 30 days

Migration from non-partitioned to partitioned schema

  1. Stop the agent
  2. Back up the store DB
  3. Run the optional partitioning migrations
  4. Migrate data (dump/restore or a dual-write approach)
  5. Restart the agent

See store/migrations_optional/partitioned/ for details.

Backup and Recovery

Backup Procedures

Store Database Backup

Full Backup (Recommended Daily)

# Using pg_dump
pg_dump -h localhost -U postgres -d pgtrace_store \
  --format=custom \
  --file=pgtrace_store_$(date +%Y%m%d_%H%M%S).dump

# Using pg_dumpall (for multiple databases)
pg_dumpall -h localhost -U postgres > pgtrace_all_$(date +%Y%m%d_%H%M%S).sql

Continuous Archiving (WAL-based)

For point-in-time recovery, enable PostgreSQL WAL archiving:

# postgresql.conf
wal_level = replica
archive_mode = on
# the "test ! -f" guard prevents overwriting an already-archived segment
archive_command = 'test ! -f /backup/pgtrace/wal/%f && cp %p /backup/pgtrace/wal/%f'

Automated Backup Script

#!/bin/bash
# backup-pgtrace.sh
set -euo pipefail

BACKUP_DIR="/backup/pgtrace"
DATE=$(date +%Y%m%d_%H%M%S)
RETENTION_DAYS=30

mkdir -p "$BACKUP_DIR"

# Full backup
pg_dump -h localhost -U postgres -d pgtrace_store \
  --format=custom \
  --file="$BACKUP_DIR/pgtrace_store_$DATE.dump"

# Compress the new backup
gzip "$BACKUP_DIR/pgtrace_store_$DATE.dump"

# Remove backups older than the retention window
find "$BACKUP_DIR" -name "pgtrace_store_*.dump.gz" -mtime +$RETENTION_DAYS -delete

echo "Backup completed: pgtrace_store_$DATE.dump.gz"

Schedule with cron:

0 2 * * * /usr/local/bin/backup-pgtrace.sh
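
The pruning step in backup-pgtrace.sh can be exercised in isolation to confirm the find -mtime expression behaves as expected. A self-contained sketch using a temporary directory and backdated files (GNU touch/find assumed):

```shell
#!/bin/bash
set -euo pipefail

# Simulate the backup directory with one old and one recent backup
BACKUP_DIR=$(mktemp -d)
RETENTION_DAYS=30
touch -d "40 days ago" "$BACKUP_DIR/pgtrace_store_old.dump.gz"
touch -d "1 day ago"   "$BACKUP_DIR/pgtrace_store_new.dump.gz"

# Same pruning rule as the backup script above
find "$BACKUP_DIR" -name "pgtrace_store_*.dump.gz" -mtime +"$RETENTION_DAYS" -delete

ls "$BACKUP_DIR"   # only pgtrace_store_new.dump.gz remains
rm -rf "$BACKUP_DIR"
```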

Recovery Procedures

Restore from Full Backup

# Stop agent and server
systemctl stop pgtrace-agent
systemctl stop pgtrace-server

# Drop and recreate database (CAUTION: data loss)
dropdb -h localhost -U postgres pgtrace_store
createdb -h localhost -U postgres pgtrace_store

# Restore from backup
pg_restore -h localhost -U postgres -d pgtrace_store \
  --clean \
  --if-exists \
  pgtrace_store_20260113_020000.dump

# Restart services
systemctl start pgtrace-server
systemctl start pgtrace-agent

Point-in-Time Recovery (PITR)

If WAL archiving is enabled:

# Stop PostgreSQL
systemctl stop postgresql

# Restore a base backup taken earlier (e.g. with pg_basebackup -Ft)
# into an empty data directory; adjust the path to your base backup
rm -rf /var/lib/postgresql/data/*
tar -xf /backup/pgtrace/base/base.tar -C /var/lib/postgresql/data

# Configure recovery (PostgreSQL 12+ no longer reads recovery.conf)
cat >> /var/lib/postgresql/data/postgresql.auto.conf <<EOF
restore_command = 'cp /backup/pgtrace/wal/%f %p'
recovery_target_time = '2026-01-13 10:30:00'
EOF
touch /var/lib/postgresql/data/recovery.signal

# Start PostgreSQL (will recover to the target time)
systemctl start postgresql

Backup Verification

Regularly verify backups:

# Test restore to temporary database
createdb -h localhost -U postgres pgtrace_store_test
pg_restore -h localhost -U postgres -d pgtrace_store_test \
  --no-owner --no-privileges \
  pgtrace_store_20260113_020000.dump

# Verify data
psql -h localhost -U postgres -d pgtrace_store_test \
  -c "SELECT COUNT(*) FROM pgtrace.events;"

# Cleanup
dropdb -h localhost -U postgres pgtrace_store_test

Backup Storage

  • Store backups on separate storage (NFS, S3, etc.)
  • Encrypt backups containing sensitive data
  • Test restore procedures regularly
  • Document backup locations and access procedures
  • Keep multiple backup generations (daily, weekly, monthly)

Monitoring

Health Checks

  • Check agent logs: journalctl -u pgtrace-agent -f
  • Monitor server logs: journalctl -u pgtrace-server -f
  • Health endpoint: curl http://localhost:4000/api/v1/health
  • Database size monitoring
  • Agent heartbeat events in database

Metrics

Prometheus metrics endpoint: http://localhost:4000/metrics

Key metrics to monitor:

  • Database connection pool size
  • HTTP request rate and latency
  • Memory usage
  • Process count
  • Rate limit hits

Alerting

Recommended alerts:

  • Agent not sending heartbeats (check for heartbeat events)
  • Database connection failures
  • High error rate in logs
  • Database size approaching limits
  • Rate limit violations
  • High memory usage

Troubleshooting

  • Agent not collecting: Check source DB permissions
  • High store DB load: Check retention is running
  • Missing data: Check backpressure markers, verify agent is running
  • API errors: Check rate limits, authentication, database connections
  • High latency: Check database performance, connection pool size

Capacity Planning

Database Size Estimation

Approximate storage requirements per node:

  • Events: ~1 KB per event; at the default 20k events/sec that is ~20 MB/sec, or ~1.7 TB/day (so ~1.7 TB steady state with 24 h retention)
  • Lock edges: ~200 bytes per edge; volume depends on contention
  • Rollups: ~10 KB per second = ~864 MB/day (negligible by comparison)
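
The event-volume figure can be sanity-checked with shell arithmetic, assuming 1 KiB per event and 86,400 seconds per day:

```shell
# Back-of-envelope check of the events estimate above
EVENTS_PER_SEC=20000
BYTES_PER_EVENT=1024
SECONDS_PER_DAY=86400

BYTES_PER_DAY=$((EVENTS_PER_SEC * BYTES_PER_EVENT * SECONDS_PER_DAY))
echo "$((BYTES_PER_DAY / 1024 ** 3)) GiB/day"   # → 1647 GiB/day, i.e. ~1.7 TB
```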

Recommended:

  • Start with 100GB+ for store database
  • Monitor growth rate
  • Adjust retention periods based on storage capacity
  • Consider partitioning for high-volume deployments

Performance Tuning

Database Tuning

-- Increase work_mem for complex queries
ALTER SYSTEM SET work_mem = '256MB';
SELECT pg_reload_conf();

-- Tune shared_buffers (25% of RAM for a dedicated DB server)
-- Note: shared_buffers requires a server restart; pg_reload_conf() is not enough
ALTER SYSTEM SET shared_buffers = '4GB';
-- then: systemctl restart postgresql

-- Increase maintenance_work_mem for VACUUM
ALTER SYSTEM SET maintenance_work_mem = '1GB';
SELECT pg_reload_conf();

Connection Pooling

Adjust pool size based on concurrent connections:

  • Default: 10 connections
  • High load: 20-50 connections
  • Monitor connection usage

Agent Intervals

Adjust polling intervals based on load:

  • Increase intervals under high load
  • Monitor agent CPU usage
  • Check degraded mode markers