Google Cloud infrastructure for Pilot Apps — Terraform, Ansible, and Make-based deployment pipeline.
- Terraform — GCE instances, networking, DNS, service accounts, org policies
- Ansible — VM configuration (OS tools, ZFS, Tailscale, Google Ops Agent)
- Secrets —
pass(1)password store (GPG-backed); entries undergcp-devops/*
- Google Cloud SDK
- Terraform
- Ansible
pass(1)initialized with a GPG key (store entries undergcp-devops/*)
# Authenticate with Google Cloud
make google-auth
# Plan infrastructure changes (prod — default)
make terraform-plan
# Plan infrastructure changes (dev)
make terraform-plan google_project=lab5-mailpilot-dev1
# Full deployment (terraform + ansible); prod requires `confirm=prd1`
make deploy confirm=prd1PostgreSQL data is protected by two complementary backup layers:
| Layer | Method | Schedule | Retention | Storage |
|---|---|---|---|---|
| ZFS Snapshot | Sanoid auto-snapshot | Hourly | 12h / 7d / 4w | Local disk |
| Disk Snapshot | GCE persistent disk snapshot | Daily 02:00 UTC | 14 days | GCP snapshots |
Sanoid manages automatic ZFS snapshots for the PostgreSQL data dataset:
data/postgresql/data— PostgreSQL data files (recordsize=8K)
Retention: 12 hourly, 7 daily, 4 weekly. Managed by sanoid.timer.
The mailpilot-1-pgsql data disk has a GCE snapshot policy (data-disk-daily-snapshots) that takes a daily snapshot at 02:00 UTC, retained for 14 days. Snapshots persist even if the disk is deleted.
All recovery commands run on the GCE instance. Connect with make gce-ssh.
Best for: accidental data loss within the last few hours/days. Fastest recovery method.
# List available snapshots
sudo zfs list -t snapshot -r data/postgresql
# Stop PostgreSQL
sudo systemctl stop postgresql
# Roll back the data dataset
sudo zfs rollback -r data/postgresql/data@autosnap_2026-02-16_12:00:00_hourly
# Start PostgreSQL
sudo systemctl start postgresqlTo recover individual files without a full rollback, access the .zfs/snapshot directory:
# Browse snapshot contents (no mount needed)
ls /var/lib/postgresql/18/main/.zfs/snapshot/
sudo cp /var/lib/postgresql/18/main/.zfs/snapshot/autosnap_2026-02-16_12:00:00_hourly/path/to/file /tmp/Best for: disk failure or full infrastructure rebuild.
# From your local machine — list available snapshots
gcloud compute snapshots list --project=lab5-mailpilot-prd1 \
--filter="sourceDisk:mailpilot-1-pgsql" --sort-by=~creationTimestamp
# Create a new disk from snapshot
gcloud compute disks create mailpilot-1-pgsql \
--project=lab5-mailpilot-prd1 \
--zone=us-east5-b \
--source-snapshot=<SNAPSHOT_NAME> \
--type=pd-balanced
# Re-run Terraform and Ansible to rebuild the instance with the restored disk
make terraform-apply
make gce-configure# List ZFS snapshots and verify Sanoid is running
sudo zfs list -t snapshot -r data/postgresql
systemctl status sanoid.timer
# GCE disk snapshots (from local machine)
gcloud compute snapshots list --project=lab5-mailpilot-prd1 \
--filter="sourceDisk:mailpilot-1-pgsql"Run a full ZFS rollback restore drill on dev (lab5-mailpilot-dev1) at least once per quarter. Untested backups are unreliable backups — a quarterly drill confirms the snapshot pipeline still produces a recoverable PostgreSQL state end-to-end.
Procedure:
- Connect:
make gce-ssh google_project=lab5-mailpilot-dev1 - Pick a recent non-critical snapshot from
sudo zfs list -t snapshot -r data/postgresql - Walk Scenario 1 above (stop PostgreSQL →
zfs rollback→ start PostgreSQL) - Confirm the service comes back healthy and recent rows are visible
Record the drill date and outcome in the team runbook. If two quarters pass without a successful drill, treat the backup chain as unverified and escalate before relying on it.