@@ -7,7 +7,7 @@ also find us in `#tools-data-diff` in the [Locally Optimistic Slack.][slack]**
 **data-diff** is a command-line tool and Python library to efficiently diff
 rows across two different databases.
 
-* ⇄ Verifies across [many different databases][dbs] (e.g. Postgres -> Snowflake)
+* ⇄ Verifies across [many different databases][dbs] (e.g. PostgreSQL -> Snowflake)
 * 🔍 Outputs [diff of rows](#example-command-and-output) in detail
 * 🚨 Simple CLI/API to create monitoring and alerts
 * 🔥 Verify 25M+ rows in <10s, and 1B+ rows in ~5min.
@@ -28,7 +28,7 @@ comparing every row.
 
 **†:** The implementation for downloading all rows that `data-diff` and
 `count(*)` are compared to is not optimal. It is a single Python multi-threaded
-process. The performance is fairly driver-specific, e.g. Postgres' performs 10x
+process. The performance is fairly driver-specific, e.g. the PostgreSQL driver performs 10x
 better than MySQL's.
 
 ## Table of Contents
@@ -45,7 +45,7 @@ better than MySQL.
 ## Common use-cases
 
 * **Verify data migrations.** Verify that all data was copied when doing a
-  critical data migration. For example, migrating from Heroku Postgres to Amazon RDS.
+  critical data migration. For example, migrating from Heroku PostgreSQL to Amazon RDS.
 * **Verifying data pipelines.** Moving data from a relational database to a
   warehouse/data lake with Fivetran, Airbyte, Debezium, or some other pipeline.
 * **Alerting and maintaining data integrity SLOs.** You can create and monitor
@@ -63,13 +63,13 @@ better than MySQL.
 
 ## Example Command and Output
 
-Below we run a comparison with the CLI for 25M rows in Postgres where the
+Below we run a comparison with the CLI for 25M rows in PostgreSQL where the
 right-hand table is missing a single row with `id=12500048`:
 
 ```
 $ data-diff \
-  postgres://postgres:password@localhost/postgres rating \
-  postgres://postgres:password@localhost/postgres rating_del1 \
+  postgresql://user:password@localhost/database rating \
+  postgresql://user:password@localhost/database rating_del1 \
   --bisection-threshold 100000 \ # for readability, try default first
   --bisection-factor 6 \ # for readability, try default first
   --update-column timestamp \
@@ -111,7 +111,7 @@ $ data-diff \
 
 | Database   | Connection string                                                                  | Status |
 |------------|------------------------------------------------------------------------------------|--------|
-| Postgres   | `postgres://user:password@hostname:5432/database`                                   | 💚     |
+| PostgreSQL | `postgresql://user:password@hostname:5432/database`                                 | 💚     |
 | MySQL      | `mysql://user:password@hostname:3306/database`                                      | 💚     |
 | Snowflake  | `snowflake://user:password@account/database/SCHEMA?warehouse=WAREHOUSE&role=role`   | 💚     |
 | Oracle     | `oracle://username:password@hostname/database`                                      | 💛     |
@@ -140,9 +140,28 @@ Requires Python 3.7+ with pip.
 
 ```pip install data-diff```
 
-or when you need extras like mysql and postgres
+## Install drivers
 
-```pip install "data-diff[mysql,pgsql]"```
+To connect to a database, its driver needs to be installed as a Python library.
+
+While you may install the drivers manually, the easiest way is to install them together with data-diff:
+
+- `pip install 'data-diff[mysql]'`
+
+- `pip install 'data-diff[postgresql]'`
+
+- `pip install 'data-diff[snowflake]'`
+
+- `pip install 'data-diff[presto]'`
+
+- `pip install 'data-diff[oracle]'`
+
+- For BigQuery, see: https://pypi.org/project/google-cloud-bigquery/
+
+
+Users can also install several drivers at once:
+
+```pip install 'data-diff[mysql,postgresql,snowflake]'```
 
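A quick way to sanity-check an install is to try importing the corresponding driver module. The module names in the sketch below are assumptions about what each extra pulls in (e.g. `psycopg2` for PostgreSQL, `snowflake.connector` for Snowflake) and may differ between releases:

```python
# Hedged post-install check: the driver module names are assumptions and may
# not match what a given data-diff release actually depends on.
import importlib

import data_diff  # fails if the base package itself is missing

for module in ("psycopg2", "mysql.connector", "snowflake.connector"):
    try:
        importlib.import_module(module)
        print(f"{module}: available")
    except ImportError:
        print(f"{module}: not installed")
```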
 # How to use
 
@@ -185,7 +204,7 @@ logging.basicConfig(level=logging.INFO)
 
 from data_diff import connect_to_table, diff_tables
 
-table1 = connect_to_table("postgres:///", "table_name", "id")
+table1 = connect_to_table("postgresql:///", "table_name", "id")
 table2 = connect_to_table("mysql:///", "table_name", "id")
 
 for different_row in diff_tables(table1, table2):
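As a complementary, hedged sketch built only from the two calls used above, `connect_to_table()` and `diff_tables()` can be combined end to end to count differing rows. The connection strings and table names are placeholders borrowed from examples elsewhere in this README, and the exact shape of the yielded diff records may vary by version:

```python
# Minimal end-to-end sketch using connect_to_table(uri, table_name, key_column)
# and diff_tables(a, b). URIs and table names are illustrative placeholders.
from data_diff import connect_to_table, diff_tables

source = connect_to_table("postgresql://user:password@localhost/database", "rating", "id")
target = connect_to_table("mysql://user:password@hostname:3306/database", "rating", "id")

differences = list(diff_tables(source, target))
print(f"{len(differences)} differing rows")
```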
@@ -201,11 +220,11 @@ In this section we'll be doing a walk-through of exactly how **data-diff**
 works, and how to tune `--bisection-factor` and `--bisection-threshold`.
 
 Let's consider a scenario with an `orders` table with 1M rows. Fivetran is
-replicating it contionously from Postgres to Snowflake:
+replicating it continuously from PostgreSQL to Snowflake:
 
 ```
 ┌─────────────┐                    ┌─────────────┐
-│  Postgres   │                    │  Snowflake  │
+│  PostgreSQL │                    │  Snowflake  │
 ├─────────────┤                    ├─────────────┤
 │             │                    │             │
 │             │                    │             │
@@ -233,7 +252,7 @@ of the table. Then it splits the table into `--bisection-factor=10` segments of
 
 ```
 ┌──────────────────────┐               ┌──────────────────────┐
-│       Postgres       │               │      Snowflake       │
+│      PostgreSQL      │               │      Snowflake       │
 ├──────────────────────┤               ├──────────────────────┤
 │      id=1..100k      │               │      id=1..100k      │
 ├──────────────────────┤               ├──────────────────────┤
@@ -281,7 +300,7 @@ are the same except `id=100k..200k`:
 
 ```
 ┌──────────────────────┐               ┌──────────────────────┐
-│       Postgres       │               │      Snowflake       │
+│      PostgreSQL      │               │      Snowflake       │
 ├──────────────────────┤               ├──────────────────────┤
 │    checksum=0102     │               │    checksum=0102     │
 ├──────────────────────┤   mismatch!   ├──────────────────────┤
@@ -306,7 +325,7 @@ and compare them in memory in **data-diff**.
 
 ```
 ┌──────────────────────┐               ┌──────────────────────┐
-│       Postgres       │               │      Snowflake       │
+│      PostgreSQL      │               │      Snowflake       │
 ├──────────────────────┤               ├──────────────────────┤
 │    id=100k..110k     │               │    id=100k..110k     │
 ├──────────────────────┤               ├──────────────────────┤
@@ -337,7 +356,7 @@ If you pass `--stats` you'll see e.g. what % of rows were different.
   queries.
 * Consider increasing the number of simultaneous threads executing
   queries per database with `--threads`. For databases that limit concurrency
-  per query, e.g. Postgres/MySQL, this can improve performance dramatically.
+  per query, e.g. PostgreSQL/MySQL, this can improve performance dramatically.
 * If you are only interested in _whether_ something changed, pass `--limit 1`.
   This can be useful if changes are very rare. This is often faster than doing a
   `count(*)`, for the reason mentioned above.
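To make the walkthrough above concrete, below is a rough, hedged Python sketch of the segment-and-checksum idea: split the key range into `--bisection-factor` segments, compare per-segment checksums, and recurse into mismatching segments until they fall under `--bisection-threshold`, at which point rows are downloaded and compared directly. This is an illustration of the technique, not data-diff's implementation; `fetch_rows` is a hypothetical helper and the MD5-over-rows checksum is a stand-in.

```python
# Illustrative sketch of checksum-based bisection, not data-diff's real code.
# fetch_rows(db, lo, hi) is a hypothetical helper returning rows (as tuples,
# sorted by key) for keys in [lo, hi).
import hashlib

BISECTION_FACTOR = 10        # how many segments to split a range into
BISECTION_THRESHOLD = 10000  # below this size, download and compare rows directly

def checksum(rows):
    h = hashlib.md5()
    for row in rows:
        h.update(repr(row).encode())
    return h.hexdigest()

def diff_range(db1, db2, lo, hi, fetch_rows):
    """Yield ('-', row) / ('+', row) for rows that differ in the key range [lo, hi)."""
    size = hi - lo
    if size <= BISECTION_THRESHOLD:
        # Small enough: pull the rows from both sides and diff them in memory.
        rows1 = set(fetch_rows(db1, lo, hi))
        rows2 = set(fetch_rows(db2, lo, hi))
        yield from (("-", r) for r in rows1 - rows2)
        yield from (("+", r) for r in rows2 - rows1)
        return
    step = max(1, size // BISECTION_FACTOR)
    for seg_lo in range(lo, hi, step):
        seg_hi = min(seg_lo + step, hi)
        # In the real tool the per-segment checksums are computed inside each
        # database; here we checksum fetched rows to keep the sketch self-contained.
        if checksum(fetch_rows(db1, seg_lo, seg_hi)) != checksum(fetch_rows(db2, seg_lo, seg_hi)):
            yield from diff_range(db1, db2, seg_lo, seg_hi, fetch_rows)
```

In the real tool only checksums cross the network until a segment is small enough to download, which is what keeps the comparison cheap for mostly-identical tables.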
@@ -419,7 +438,7 @@ Now you can insert it into the testing database(s):
 ```shell-session
 # It's optional to seed more than one to run data-diff(1) against.
 $ poetry run preql -f dev/prepare_db.pql mysql://mysql:Password1@127.0.0.1:3306/mysql
-$ poetry run preql -f dev/prepare_db.pql postgres://postgres:Password1@127.0.0.1:5432/postgres
+$ poetry run preql -f dev/prepare_db.pql postgresql://postgres:Password1@127.0.0.1:5432/postgres
 
 # Cloud databases
 $ poetry run preql -f dev/prepare_db.pql snowflake://<uri>
@@ -430,7 +449,7 @@ $ poetry run preql -f dev/prepare_db.pql bigquery:///<project>
 **5. Run **data-diff** against seeded database**
 
 ```bash
-poetry run python3 -m data_diff postgres://postgres:Password1@localhost/postgres rating postgres://postgres:Password1@localhost/postgres rating_del1 --verbose
+poetry run python3 -m data_diff postgresql://postgres:Password1@localhost/postgres rating postgresql://postgres:Password1@localhost/postgres rating_del1 --verbose
 ```
 
 # License