Operational guide for pgtrace deployment and maintenance.
- Open Incident Room (last 15m)
- Look at dominant wait type
- Open Lock Graph if Lock waits
- Identify top blocker
- Export case bundle
Retention runs automatically via the agent, or can be scheduled with pg_cron.
Default retention:
- events: 24 hours
- lock_edges: 24 hours
- rollup tables: 24 hours
- repl_samples/wal_samples: 7 days
- markers: 30 days
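If the agent is not handling cleanup, the same retention can be scheduled with pg_cron. A minimal sketch, assuming pg_cron is installed in the store database; `cron.schedule(job_name, schedule, command)` is the pg_cron API, but the DELETE body (table and timestamp column names) is an assumption about the pgtrace schema — adjust it to the actual tables:

```sql
-- Hourly cleanup of events older than the 24h default retention.
-- The pgtrace.events table and its "ts" column are assumptions.
SELECT cron.schedule(
  'pgtrace-events-retention',
  '5 * * * *',
  $$DELETE FROM pgtrace.events WHERE ts < now() - interval '24 hours'$$
);
```

Repeat per table with the retention windows listed above (24 hours, 7 days, 30 days).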
- Stop agent
- Backup store DB
- Run optional partitioning migrations
- Migrate data (dump/restore or dual-write approach)
- Restart agent
See store/migrations_optional/partitioned/ for details.
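The steps above can be sketched as a script. This is a hedged sketch, not the project's tooling: the host, database name, and backup path are assumptions, and `DRY_RUN=1` (the default here) only prints each command instead of running it:

```shell
#!/bin/sh
# Dry-run sketch of the stop/backup/migrate/restart cycle above.
DRY_RUN=${DRY_RUN:-1}

run() {
    # Print the command under DRY_RUN=1; execute it otherwise.
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

run systemctl stop pgtrace-agent
run pg_dump -h localhost -U postgres -d pgtrace_store \
    --format=custom --file=/backup/pgtrace_store_pre_upgrade.dump
# Apply the optional partitioning migrations here, then move the data
# (dump/restore or dual-write), per store/migrations_optional/partitioned/.
run systemctl start pgtrace-agent
```

Run once with the default dry-run to review the plan, then rerun with `DRY_RUN=0`.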
Full Backup (Recommended Daily)
# Using pg_dump
pg_dump -h localhost -U postgres -d pgtrace_store \
--format=custom \
--file=pgtrace_store_$(date +%Y%m%d_%H%M%S).dump
# Using pg_dumpall (for multiple databases)
pg_dumpall -h localhost -U postgres > pgtrace_all_$(date +%Y%m%d_%H%M%S).sql
Continuous Archiving (WAL-based)
For point-in-time recovery, enable PostgreSQL WAL archiving:
# postgresql.conf
wal_level = replica
archive_mode = on
archive_command = 'cp %p /backup/pgtrace/wal/%f'
Automated Backup Script
#!/bin/bash
# backup-pgtrace.sh
BACKUP_DIR="/backup/pgtrace"
DATE=$(date +%Y%m%d_%H%M%S)
RETENTION_DAYS=30
mkdir -p "$BACKUP_DIR"
# Full backup
pg_dump -h localhost -U postgres -d pgtrace_store \
--format=custom \
--file="$BACKUP_DIR/pgtrace_store_$DATE.dump"
# Compress the newly created backup
gzip "$BACKUP_DIR/pgtrace_store_$DATE.dump"
# Remove old backups
find "$BACKUP_DIR" -name "pgtrace_store_*.dump.gz" -mtime +$RETENTION_DAYS -delete
echo "Backup completed: pgtrace_store_$DATE.dump.gz"
Schedule with cron:
0 2 * * * /usr/local/bin/backup-pgtrace.sh
Restore from Backup
# Stop agent and server
systemctl stop pgtrace-agent
systemctl stop pgtrace-server
# Drop and recreate database (CAUTION: data loss)
dropdb -h localhost -U postgres pgtrace_store
createdb -h localhost -U postgres pgtrace_store
# Restore from backup
pg_restore -h localhost -U postgres -d pgtrace_store \
--clean \
--if-exists \
pgtrace_store_20260113_020000.dump
# Restart services
systemctl start pgtrace-server
systemctl start pgtrace-agent
If WAL archiving is enabled:
# Stop PostgreSQL
systemctl stop postgresql
# Restore base backup (plain format, into an empty data directory)
pg_basebackup -h source-host -U replication -D /var/lib/postgresql/data -Fp -P
# Configure recovery (PostgreSQL 12+; recovery.conf is no longer read)
cat >> /var/lib/postgresql/data/postgresql.auto.conf <<EOF
restore_command = 'cp /backup/pgtrace/wal/%f %p'
recovery_target_time = '2026-01-13 10:30:00'
EOF
touch /var/lib/postgresql/data/recovery.signal
# Start PostgreSQL (will recover to target time)
systemctl start postgresql
Regularly verify backups:
# Test restore to temporary database
createdb -h localhost -U postgres pgtrace_store_test
pg_restore -h localhost -U postgres -d pgtrace_store_test \
--no-owner --no-privileges \
pgtrace_store_20260113_020000.dump
# Verify data
psql -h localhost -U postgres -d pgtrace_store_test \
-c "SELECT COUNT(*) FROM pgtrace.events;"
# Cleanup
dropdb -h localhost -U postgres pgtrace_store_test
- Store backups on separate storage (NFS, S3, etc.)
- Encrypt backups containing sensitive data
- Test restore procedures regularly
- Document backup locations and access procedures
- Keep multiple backup generations (daily, weekly, monthly)
- Check agent logs: journalctl -u pgtrace-agent -f
- Monitor server logs: journalctl -u pgtrace-server -f
- Health endpoint: curl http://localhost:4000/api/v1/health
- Database size monitoring
- Agent heartbeat events in database
Prometheus metrics endpoint: http://localhost:4000/metrics
Key metrics to monitor:
- Database connection pool size
- HTTP request rate and latency
- Memory usage
- Process count
- Rate limit hits
Recommended alerts:
- Agent not sending heartbeats (check for heartbeat events)
- Database connection failures
- High error rate in logs
- Database size approaching limits
- Rate limit violations
- High memory usage
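A minimal liveness probe can be built on the health endpoint above. A sketch, assuming the endpoint returns HTTP 200 when healthy (the response body format is not assumed):

```shell
#!/bin/sh
# Map an HTTP status code to a health verdict; 200 = healthy.
check_status() {
    if [ "$1" = "200" ]; then
        echo "healthy"
    else
        echo "unhealthy (HTTP $1)"
    fi
}

# Probe the server's health endpoint (from this guide) and report.
STATUS=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:4000/api/v1/health)
check_status "$STATUS"
```

Wire the script into cron or a monitoring agent and alert on any "unhealthy" output.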
- Agent not collecting: Check source DB permissions
- High store DB load: Check retention is running
- Missing data: Check backpressure markers, verify agent is running
- API errors: Check rate limits, authentication, database connections
- High latency: Check database performance, connection pool size
Approximate storage requirements per node:
- Events: ~1KB per event; at the default 20k events/sec that is ~20MB/sec, or ~1.7TB/day (so ~1.7TB on disk with 24h retention)
- Lock edges: ~200 bytes per edge, depends on contention
- Rollups: ~10KB per second = ~864MB/day (negligible)
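The event-volume estimate above can be checked with a bit of shell arithmetic (1KB per event and 20k events/sec are the figures from the list; actual event size varies by workload):

```shell
#!/bin/sh
# Back-of-envelope storage estimate: bytes/day at the quoted event rate.
EVENTS_PER_SEC=20000
BYTES_PER_EVENT=1024          # ~1KB per event
SECONDS_PER_DAY=86400

BYTES_PER_DAY=$((EVENTS_PER_SEC * BYTES_PER_EVENT * SECONDS_PER_DAY))
echo "events/day storage: $((BYTES_PER_DAY / 1024 / 1024 / 1024)) GiB"
# prints: events/day storage: 1647 GiB
```

That is ~1.65 TiB, i.e. roughly 1.77TB in decimal units, consistent with the ~1.7TB/day figure above.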
Recommended:
- Start with 100GB+ for store database
- Monitor growth rate
- Adjust retention periods based on storage capacity
- Consider partitioning for high-volume deployments
Database Tuning
-- Increase work_mem for complex queries
ALTER SYSTEM SET work_mem = '256MB';
SELECT pg_reload_conf();
-- Tune shared_buffers (25% of RAM for dedicated DB server)
ALTER SYSTEM SET shared_buffers = '4GB';
SELECT pg_reload_conf();
-- Increase maintenance_work_mem for VACUUM
ALTER SYSTEM SET maintenance_work_mem = '1GB';
SELECT pg_reload_conf();
Connection Pooling
Adjust pool size based on concurrent connections:
- Default: 10 connections
- High load: 20-50 connections
- Monitor connection usage
Agent Intervals
Adjust polling intervals based on load:
- Increase intervals under high load
- Monitor agent CPU usage
- Check degraded mode markers