- Introduction
- Environment Setup
- Database Initialization
- Prepare pgstream Configuration
- Preparing Snapshot Data
- Run pgstream
- Verify Snapshot
- Troubleshooting
- Summary
This tutorial showcases the use of pgstream to snapshot data from a PostgreSQL database. For this tutorial, we'll use a PostgreSQL target. You will need:
- A source PostgreSQL database
- A target PostgreSQL database
- pgstream (see installation instructions for more details)
The first step is to start the two PostgreSQL databases that will be used as source and target for the snapshot. The pgstream repository provides a docker installation that will be used for the purposes of this tutorial, but can be replaced by any available PostgreSQL servers.
To start the docker provided PostgreSQL servers, run the following command:
```sh
# Start two PostgreSQL databases using Docker.
# The source database will run on port 5432, and the target database will run on port 7654.
docker-compose -f build/docker/docker-compose.yml --profile pg2pg up
```

This will start two PostgreSQL databases on ports 5432 and 7654.
Normally, we would need to initialise pgstream on the source database. The initialisation step creates the pgstream schema in the configured Postgres database, along with the tables/functions/triggers required for DDL tracking, and it also creates the replication slot. However, this is only required if we're going to be using the replication slot. Since we're using a PostgreSQL target for the snapshot, pgstream supports using pg_dump/pg_restore for the schema snapshot, which removes the need to keep any pgstream state in the source PostgreSQL database.
In order to run pgstream, we need to provide the configuration required to run the PostgreSQL to PostgreSQL snapshot. First, we configure the listener module that will produce the snapshot of the source PostgreSQL database. This requires the PostgreSQL database URL, which will be the one from the docker PostgreSQL server we started and set up in the previous steps.
```sh
# URL of the source PostgreSQL database. This is where the snapshot will be taken from.
PGSTREAM_POSTGRES_LISTENER_URL="postgres://postgres:postgres@localhost:5432?sslmode=disable"
```

The snapshot listener needs to be configured to indicate the tables/schemas that need to be snapshotted. If the tables are not schema qualified, the public schema will be assumed. Wildcards are supported. For example, `test_schema.*` will snapshot all tables in the test_schema schema, and `test` will snapshot the public.test table.
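The qualification and wildcard rules described above can be sketched as follows (hypothetical helper functions for illustration only, not part of pgstream):

```python
import fnmatch

def qualify(table: str) -> str:
    """Apply the rule above: unqualified table names are assumed to be in public."""
    return table if "." in table else f"public.{table}"

def matches(pattern: str, qualified_table: str) -> bool:
    """Wildcard matching on the schema-qualified table name."""
    return fnmatch.fnmatchcase(qualified_table, qualify(pattern))
```

So `test` resolves to `public.test`, while `test_schema.*` matches any table in test_schema.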
```sh
# Tables to snapshot. Use "*" to snapshot all tables in the public schema.
PGSTREAM_POSTGRES_SNAPSHOT_TABLES="*"
```

To exclude some tables, e.g., to snapshot all the tables in the public schema except for test_table and test_table_2, `PGSTREAM_POSTGRES_SNAPSHOT_EXCLUDED_TABLES` can be used.
```sh
PGSTREAM_POSTGRES_SNAPSHOT_TABLES="public.*"
PGSTREAM_POSTGRES_SNAPSHOT_EXCLUDED_TABLES="public.test_table public.test_table_2"
```

Further configuration can be provided to optimize the performance of the snapshot process, mostly focusing on concurrency.
```sh
# Number of tables being snapshotted in parallel for a given schema. Defaults to 4.
PGSTREAM_POSTGRES_SNAPSHOT_SCHEMA_WORKERS=4
# Number of concurrent workers that will be used per table by the snapshotting process. Defaults to 4.
PGSTREAM_POSTGRES_SNAPSHOT_TABLE_WORKERS=4
# Size of the batches the table workers from PGSTREAM_POSTGRES_SNAPSHOT_TABLE_WORKERS will read concurrently. Defaults to 80MiB.
PGSTREAM_POSTGRES_SNAPSHOT_BATCH_BYTES=83886080
# Number of schemas that will be processed in parallel by the snapshotting process. Defaults to 1.
PGSTREAM_POSTGRES_SNAPSHOT_WORKERS=1
```

The snapshot listener can also be configured to record and update the status of snapshot requests in a dedicated snapshot_requests table under the pgstream schema. This allows a given snapshot to be performed only once, by keeping track of what's already been completed. All that's needed is the URL of the database where the table should be created. For this tutorial, we'll use the source database.
```sh
PGSTREAM_POSTGRES_SNAPSHOT_STORE_URL="postgres://postgres:postgres@localhost:5432?sslmode=disable"
```

With the listener side ready, the next step is to configure the processor. Since we want the snapshot to reach a PostgreSQL database, we will set the PostgreSQL writer configuration variables. The only required value is the URL of the target database, where the snapshotted schema/data from the source database will be streamed. We use the URL of the docker PostgreSQL database we started earlier (note the port is the only difference between the source and the target PostgreSQL databases).
```sh
PGSTREAM_POSTGRES_WRITER_TARGET_URL="postgres://postgres:postgres@localhost:7654?sslmode=disable"
```

If we need to disable triggers on the target database during the snapshot process (i.e., to avoid foreign key constraint violations), we can do so by setting the following variable:
```sh
PGSTREAM_POSTGRES_WRITER_DISABLE_TRIGGERS=true
```

For further granularity, we can also configure the action that should be taken when an insert has a conflict.
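As an illustration of what the three conflict actions amount to, here is roughly how each one could map to SQL (hypothetical `insert_sql` helper; pgstream's actual generated SQL may differ):

```python
def insert_sql(table, columns, action="error", conflict_target="id"):
    """Hypothetical illustration of the on-conflict options, not pgstream's code."""
    placeholders = ", ".join("%s" for _ in columns)
    stmt = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    if action == "nothing":
        # Conflicting rows are silently skipped.
        stmt += " ON CONFLICT DO NOTHING"
    elif action == "update":
        # Conflicting rows are overwritten with the incoming values.
        updates = ", ".join(f"{c} = EXCLUDED.{c}" for c in columns if c != conflict_target)
        stmt += f" ON CONFLICT ({conflict_target}) DO UPDATE SET {updates}"
    # "error" (the default) adds no clause: a conflicting insert raises an error.
    return stmt
```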
```sh
# Insert on conflict action. Options are update, nothing or error (error is the default behaviour).
PGSTREAM_POSTGRES_WRITER_ON_CONFLICT_ACTION=nothing
```

The PostgreSQL writer uses batching under the hood to reduce the number of IO calls to the target database and improve performance. The batch size and send timeout can both be configured to better fit different traffic patterns. The writer will send a batch when the timeout or the batch size is reached, whichever happens first.
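The size-or-timeout flush behaviour described above can be sketched as follows (a minimal illustration with hypothetical names, not pgstream's implementation):

```python
import time

class Batcher:
    """Buffers messages and flushes when the batch is full or the timeout elapses."""

    def __init__(self, max_size=25, timeout_s=5.0, send=print):
        self.max_size = max_size
        self.timeout_s = timeout_s
        self.send = send          # callback invoked with each completed batch
        self.buf = []
        self.deadline = None

    def add(self, msg):
        if not self.buf:
            # First message in the batch starts the timeout clock.
            self.deadline = time.monotonic() + self.timeout_s
        self.buf.append(msg)
        if len(self.buf) >= self.max_size:
            self.flush()          # size limit reached first

    def maybe_flush(self):
        # Called periodically: flushes if the timeout was reached first.
        if self.buf and time.monotonic() >= self.deadline:
            self.flush()

    def flush(self):
        if not self.buf:
            return
        self.send(self.buf)
        self.buf = []
        self.deadline = None
```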
```sh
# Number of DML queries that will be batched and sent together in a given transaction. It defaults to 100.
PGSTREAM_POSTGRES_WRITER_BATCH_SIZE=25
# Max delay between batch sending. The batches will be sent every 5s by default.
PGSTREAM_POSTGRES_WRITER_BATCH_TIMEOUT=5s
```

The full configuration for this tutorial can be put into a snapshot2pg_tutorial.env file to be used in the next step. An equivalent snapshot2pg_tutorial.yaml configuration can be found below the environment one, and can be used interchangeably.
```sh
# Listener config
PGSTREAM_POSTGRES_LISTENER_URL="postgres://postgres:postgres@localhost:5432?sslmode=disable"
PGSTREAM_POSTGRES_SNAPSHOT_TABLES="*"
PGSTREAM_POSTGRES_SNAPSHOT_SCHEMA_WORKERS=4
PGSTREAM_POSTGRES_SNAPSHOT_TABLE_WORKERS=4
PGSTREAM_POSTGRES_SNAPSHOT_BATCH_BYTES=83886080
PGSTREAM_POSTGRES_SNAPSHOT_WORKERS=1
PGSTREAM_POSTGRES_SNAPSHOT_STORE_URL="postgres://postgres:postgres@localhost:5432?sslmode=disable"
# Processor config
PGSTREAM_POSTGRES_WRITER_TARGET_URL="postgres://postgres:postgres@localhost:7654?sslmode=disable"
PGSTREAM_POSTGRES_WRITER_BATCH_SIZE=25
PGSTREAM_POSTGRES_WRITER_BATCH_TIMEOUT=5s
PGSTREAM_POSTGRES_WRITER_DISABLE_TRIGGERS=true
PGSTREAM_POSTGRES_WRITER_ON_CONFLICT_ACTION=nothing
```

```yaml
source:
  postgres:
    url: "postgres://postgres:postgres@localhost:5432?sslmode=disable"
    mode: snapshot # options are replication, snapshot or snapshot_and_replication
    snapshot: # when mode is snapshot or snapshot_and_replication
      mode: full # options are full, schema or data
      tables: ["*"] # tables to snapshot, can be a list of table names or a pattern
      recorder:
        repeatable_snapshots: true # whether to repeat snapshots that have already been taken
        postgres_url: "postgres://postgres:postgres@localhost:5432?sslmode=disable" # URL of the database where the snapshot status is recorded
      snapshot_workers: 4 # number of schemas to be snapshotted in parallel
      data: # when mode is full or data
        schema_workers: 4 # number of schema tables to be snapshotted in parallel
        table_workers: 4 # number of workers to snapshot a table in parallel
        batch_bytes: 83886080 # bytes to read per batch (defaults to 80MiB)
      schema: # when mode is full or schema
        pgdump_pgrestore:
          clean_target_db: false # whether to clean the target database before restoring
target:
  postgres:
    url: "postgres://postgres:postgres@localhost:7654?sslmode=disable"
    batch:
      timeout: 5000 # batch timeout in milliseconds
      size: 25 # number of messages in a batch
    disable_triggers: true # whether to disable triggers on the target database
    on_conflict_action: "nothing" # options are update, nothing or error
```

Now we can connect to the source database and create a table and populate it with some data that we'll want to snapshot.
```sh
psql postgresql://postgres:postgres@localhost:5432/postgres
```

```sql
CREATE TABLE test(id SERIAL PRIMARY KEY, name TEXT);
INSERT INTO test(name) VALUES('alice'),('bob'),('charlie');
```

With the configuration ready, we can now run pgstream. In this case we set the log level to trace to provide more context for debugging and have more visibility into what pgstream is doing under the hood. Once the snapshot finishes, the process will stop.
```sh
# with the environment configuration
pgstream snapshot -c snapshot2pg_tutorial.env --log-level trace
# with the yaml configuration
pgstream snapshot -c snapshot2pg_tutorial.yaml --log-level trace
# with the CLI flags and relying on defaults
pgstream snapshot --postgres-url "postgres://postgres:postgres@localhost:5432?sslmode=disable" --target postgres --target-url "postgres://postgres:postgres@localhost:7654?sslmode=disable" --tables "*" --log-level trace
```

If we connect to the target database, we should now see the test table created and populated with the data from the snapshot.
```sh
psql postgresql://postgres:postgres@localhost:7654/postgres
```

```
postgres@localhost:postgres> \d+ test
+--------+---------+-----------+----------+--------------+-------------+
| Column | Type    | Modifiers | Storage  | Stats target | Description |
|--------+---------+-----------+----------+--------------+-------------|
| id     | integer |  not null | plain    | <null>       | <null>      |
| name   | text    |           | extended | <null>       | <null>      |
+--------+---------+-----------+----------+--------------+-------------+
Indexes:
    "test_pkey" PRIMARY KEY, btree (id)
Has OIDs: no

postgres@localhost:postgres> SELECT * FROM test;
+----+---------+
| id | name    |
|----+---------|
| 1  | alice   |
| 2  | bob     |
| 3  | charlie |
+----+---------+
```

We can also check the status of the snapshot by querying the pgstream.snapshot_requests table on the source database. It should contain the details of the snapshot that has been requested.
```sh
psql postgresql://postgres:postgres@localhost:5432/postgres
```

```
postgres@localhost:postgres> SELECT * FROM pgstream.snapshot_requests;
+--------+-------------+-------------+-------------------------------+-------------------------------+-----------+--------+
| req_id | schema_name | table_names | created_at                    | updated_at                    | status    | errors |
|--------+-------------+-------------+-------------------------------+-------------------------------+-----------+--------|
| 1      | public      | ['*']       | 2025-03-18 11:27:21.683361+00 | 2025-03-18 11:27:21.843207+00 | completed | <null> |
+--------+-------------+-------------+-------------------------------+-------------------------------+-----------+--------+
```

If the snapshot has completed with errors, it will be retried on the next run of pgstream.
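The repeat-and-retry behaviour described above can be sketched as follows (hypothetical helper for illustration; the real check lives in pgstream's snapshot recorder):

```python
def should_snapshot(status):
    """Decide whether a schema needs (re)snapshotting, based on the recorded
    snapshot_requests row for it. `status` is None if no request was recorded."""
    if status is None:
        return True                      # never snapshotted before
    if status["status"] != "completed":
        return True                      # in progress or interrupted: retried on next run
    return status["errors"] is not None  # completed with errors: retried on next run
```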
Here are some common issues you might encounter while following this tutorial and how to resolve them:
- Cause: The PostgreSQL database is not running or the connection URL is incorrect.
- Solution:
- Ensure the Docker containers for the source and target databases are running.
- Verify the database URLs in the configuration file (snapshot2pg_tutorial.env).
- Test the connection using psql: `psql postgresql://postgres:postgres@localhost:5432/postgres`
- Cause: There may be issues with the configuration or permissions.
- Solution:
- Check the pgstream logs for detailed error messages: `pgstream run -c snapshot2pg_tutorial.env --log-level trace`
- Ensure the source database user has the necessary permissions to read the schema and data.
- Check the pgstream.snapshot_requests table for error details: `SELECT * FROM pgstream.snapshot_requests;`
- Cause: The snapshot process did not complete successfully or the target database URL is incorrect.
- Solution:
- Verify the target database URL in the configuration file.
- Check the pgstream logs to confirm the snapshot process completed without errors.
- Check the pgstream.snapshot_requests table for error details: `SELECT * FROM pgstream.snapshot_requests;`
- Query the target database to ensure the data was replicated: `SELECT * FROM test;`
- Cause: The configuration file contains invalid or missing values.
- Solution:
- Double-check the snapshot2pg_tutorial.env file for typos or missing variables.
- Refer to the pgstream configuration documentation for details on required variables.
- Run the pgstream status command to validate the configuration is correct.
- Cause: The database user does not have sufficient privileges.
- Solution:
- Ensure the user has the necessary permissions to create tables, replication slots, and perform snapshots.
- Grant the required privileges: `GRANT ALL PRIVILEGES ON DATABASE postgres TO postgres;`
If you encounter issues not listed here, consult the pgstream documentation or open an issue on the project's GitHub repository.
In this tutorial, we successfully configured pgstream to snapshot data from a source PostgreSQL database to a target PostgreSQL database. We covered the following steps:
- Set up the source and target PostgreSQL databases using Docker.
- Configured the pgstream listener and processor for snapshotting.
- Created a sample table in the source database and populated it with data.
- Ran pgstream to perform the snapshot.
- Verified that the data was successfully replicated to the target database.
This process demonstrates how pgstream can be used to efficiently snapshot data between PostgreSQL databases. For more advanced use cases, such as continuous replication or applying transformations, refer to the other pgstream tutorials.