This repository was archived by the owner on May 17, 2024. It is now read-only.

Commit 474de40

Merge pull request #76 from datafold/import_errors
Better errors for missing imports
2 parents b88972a + be727d9 commit 474de40

10 files changed: +98 -45 lines changed


README.md

Lines changed: 37 additions & 18 deletions
@@ -7,7 +7,7 @@ also find us in `#tools-data-diff` in the [Locally Optimistic Slack.][slack]**
 **data-diff** is a command-line tool and Python library to efficiently diff
 rows across two different databases.
 
-* ⇄ Verifies across [many different databases][dbs] (e.g. Postgres -> Snowflake)
+* ⇄ Verifies across [many different databases][dbs] (e.g. PostgreSQL -> Snowflake)
 * 🔍 Outputs [diff of rows](#example-command-and-output) in detail
 * 🚨 Simple CLI/API to create monitoring and alerts
 * 🔥 Verify 25M+ rows in <10s, and 1B+ rows in ~5min.
@@ -28,7 +28,7 @@ comparing every row.
 
 **†:** The implementation for downloading all rows that `data-diff` and
 `count(*)` is compared to is not optimal. It is a single Python multi-threaded
-process. The performance is fairly driver-specific, e.g. Postgres' performs 10x
+process. The performance is fairly driver-specific, e.g. PostgreSQL's performs 10x
 better than MySQL.
 
 ## Table of Contents
@@ -45,7 +45,7 @@ better than MySQL.
 ## Common use-cases
 
 * **Verify data migrations.** Verify that all data was copied when doing a
-critical data migration. For example, migrating from Heroku Postgres to Amazon RDS.
+critical data migration. For example, migrating from Heroku PostgreSQL to Amazon RDS.
 * **Verifying data pipelines.** Moving data from a relational database to a
 warehouse/data lake with Fivetran, Airbyte, Debezium, or some other pipeline.
 * **Alerting and maintaining data integrity SLOs.** You can create and monitor
@@ -63,13 +63,13 @@ better than MySQL.
 
 ## Example Command and Output
 
-Below we run a comparison with the CLI for 25M rows in Postgres where the
+Below we run a comparison with the CLI for 25M rows in PostgreSQL where the
 right-hand table is missing single row with `id=12500048`:
 
 ```
 $ data-diff \
-postgres://postgres:password@localhost/postgres rating \
-postgres://postgres:password@localhost/postgres rating_del1 \
+postgresql://user:password@localhost/database rating \
+postgresql://user:password@localhost/database rating_del1 \
 --bisection-threshold 100000 \ # for readability, try default first
 --bisection-factor 6 \ # for readability, try default first
 --update-column timestamp \
@@ -111,7 +111,7 @@ $ data-diff \
 
 | Database | Connection string | Status |
 |---------------|-----------------------------------------------------------------------------------------|--------|
-| Postgres | `postgres://user:password@hostname:5432/database` | 💚 |
+| PostgreSQL | `postgresql://user:password@hostname:5432/database` | 💚 |
 | MySQL | `mysql://user:password@hostname:5432/database` | 💚 |
 | Snowflake | `snowflake://user:password@account/database/SCHEMA?warehouse=WAREHOUSE&role=role` | 💚 |
 | Oracle | `oracle://username:password@hostname/database` | 💛 |
@@ -140,9 +140,28 @@ Requires Python 3.7+ with pip.
 
 ```pip install data-diff```
 
-or when you need extras like mysql and postgres
+## Install drivers
 
-```pip install "data-diff[mysql,pgsql]"```
+To connect to a database, we need to have its driver installed, in the form of a Python library.
+
+While you may install them manually, we offer an easy way to install them along with data-diff:
+
+- `pip install 'data-diff[mysql]'`
+
+- `pip install 'data-diff[postgresql]'`
+
+- `pip install 'data-diff[snowflake]'`
+
+- `pip install 'data-diff[presto]'`
+
+- `pip install 'data-diff[oracle]'`
+
+- For BigQuery, see: https://pypi.org/project/google-cloud-bigquery/
+
+
+Users can also install several drivers at once:
+
+```pip install 'data-diff[mysql,postgresql,snowflake]'```
 
 # How to use
 
@@ -185,7 +204,7 @@ logging.basicConfig(level=logging.INFO)
 
 from data_diff import connect_to_table, diff_tables
 
-table1 = connect_to_table("postgres:///", "table_name", "id")
+table1 = connect_to_table("postgresql:///", "table_name", "id")
 table2 = connect_to_table("mysql:///", "table_name", "id")
 
 for different_row in diff_tables(table1, table2):
@@ -201,11 +220,11 @@ In this section we'll be doing a walk-through of exactly how **data-diff**
 works, and how to tune `--bisection-factor` and `--bisection-threshold`.
 
 Let's consider a scenario with an `orders` table with 1M rows. Fivetran is
-replicating it contionously from Postgres to Snowflake:
+replicating it contionously from PostgreSQL to Snowflake:
 
 ```
 ┌─────────────┐ ┌─────────────┐
-Postgres │ │ Snowflake │
+PostgreSQL │ │ Snowflake │
 ├─────────────┤ ├─────────────┤
 │ │ │ │
 │ │ │ │
@@ -233,7 +252,7 @@ of the table. Then it splits the table into `--bisection-factor=10` segments of
 
 ```
 ┌──────────────────────┐ ┌──────────────────────┐
-Postgres │ │ Snowflake │
+PostgreSQL │ │ Snowflake │
 ├──────────────────────┤ ├──────────────────────┤
 │ id=1..100k │ │ id=1..100k │
 ├──────────────────────┤ ├──────────────────────┤
@@ -281,7 +300,7 @@ are the same except `id=100k..200k`:
 
 ```
 ┌──────────────────────┐ ┌──────────────────────┐
-Postgres │ │ Snowflake │
+PostgreSQL │ │ Snowflake │
 ├──────────────────────┤ ├──────────────────────┤
 │ checksum=0102 │ │ checksum=0102 │
 ├──────────────────────┤ mismatch! ├──────────────────────┤
@@ -306,7 +325,7 @@ and compare them in memory in **data-diff**.
 
 ```
 ┌──────────────────────┐ ┌──────────────────────┐
-Postgres │ │ Snowflake │
+PostgreSQL │ │ Snowflake │
 ├──────────────────────┤ ├──────────────────────┤
 │ id=100k..110k │ │ id=100k..110k │
 ├──────────────────────┤ ├──────────────────────┤
@@ -337,7 +356,7 @@ If you pass `--stats` you'll see e.g. what % of rows were different.
 queries.
 * Consider increasing the number of simultaneous threads executing
 queries per database with `--threads`. For databases that limit concurrency
-per query, e.g. Postgres/MySQL, this can improve performance dramatically.
+per query, e.g. PostgreSQL/MySQL, this can improve performance dramatically.
 * If you are only interested in _whether_ something changed, pass `--limit 1`.
 This can be useful if changes are very rare. This is often faster than doing a
 `count(*)`, for the reason mentioned above.
@@ -419,7 +438,7 @@ Now you can insert it into the testing database(s):
 ```shell-session
 # It's optional to seed more than one to run data-diff(1) against.
 $ poetry run preql -f dev/prepare_db.pql mysql://mysql:Password1@127.0.0.1:3306/mysql
-$ poetry run preql -f dev/prepare_db.pql postgres://postgres:Password1@127.0.0.1:5432/postgres
+$ poetry run preql -f dev/prepare_db.pql postgresql://postgres:Password1@127.0.0.1:5432/postgres
 
 # Cloud databases
 $ poetry run preql -f dev/prepare_db.pql snowflake://<uri>
@@ -430,7 +449,7 @@ $ poetry run preql -f dev/prepare_db.pql bigquery:///<project>
 **5. Run **data-diff** against seeded database**
 
 ```bash
-poetry run python3 -m data_diff postgres://postgres:Password1@localhost/postgres rating postgres://postgres:Password1@localhost/postgres rating_del1 --verbose
+poetry run python3 -m data_diff postgresql://postgres:Password1@localhost/postgres rating postgresql://postgres:Password1@localhost/postgres rating_del1 --verbose
 ```
 
 # License

data_diff/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -56,7 +56,7 @@ def diff_tables(
     """Efficiently finds the diff between table1 and table2.
 
     Example:
-        >>> table1 = connect_to_table('postgres:///', 'Rating', 'id')
+        >>> table1 = connect_to_table('postgresql:///', 'Rating', 'id')
         >>> list(diff_tables(table1, table1))
         []
 
data_diff/database.py

Lines changed: 47 additions & 14 deletions
@@ -1,5 +1,5 @@
 import math
-from functools import lru_cache
+from functools import lru_cache, wraps
 from itertools import zip_longest
 import re
 from abc import ABC, abstractmethod
@@ -23,20 +23,40 @@ def parse_table_name(t):
     return tuple(t.split("."))
 
 
-def import_postgres():
+def import_helper(package: str = None, text=""):
+    def dec(f):
+        @wraps(f)
+        def _inner():
+            try:
+                return f()
+            except ModuleNotFoundError as e:
+                s = text
+                if package:
+                    s += f"You can install it using 'pip install data-diff[{package}]'."
+                raise ModuleNotFoundError(f"{e}\n\n{s}\n")
+
+        return _inner
+
+    return dec
+
+
+@import_helper("postgresql")
+def import_postgresql():
     import psycopg2
     import psycopg2.extras
 
     psycopg2.extensions.set_wait_callback(psycopg2.extras.wait_select)
     return psycopg2
 
 
+@import_helper("mysql")
 def import_mysql():
     import mysql.connector
 
     return mysql.connector
 
 
+@import_helper("snowflake")
 def import_snowflake():
     import snowflake.connector
 
@@ -55,12 +75,20 @@ def import_oracle():
     return cx_Oracle
 
 
+@import_helper("presto")
 def import_presto():
     import prestodb
 
     return prestodb
 
 
+@import_helper(text="Please install BigQuery and configure your google-cloud access.")
+def import_bigquery():
+    from google.cloud import bigquery
+
+    return bigquery
+
+
 class ConnectError(Exception):
     pass
 
@@ -344,7 +372,6 @@ def _normalize_table_path(self, path: DbPath) -> DbPath:
 
         return path
 
-
     def parse_table_name(self, name: str) -> DbPath:
         return parse_table_name(name)
 
@@ -356,19 +383,25 @@ class ThreadedDatabase(Database):
     """
 
     def __init__(self, thread_count=1):
+        self._init_error = None
         self._queue = ThreadPoolExecutor(thread_count, initializer=self.set_conn)
         self.thread_local = threading.local()
 
     def set_conn(self):
         assert not hasattr(self.thread_local, "conn")
-        self.thread_local.conn = self.create_connection()
+        try:
+            self.thread_local.conn = self.create_connection()
+        except ModuleNotFoundError as e:
+            self._init_error = e
 
     def _query(self, sql_code: str):
         r = self._queue.submit(self._query_in_worker, sql_code)
         return r.result()
 
     def _query_in_worker(self, sql_code: str):
         "This method runs in a worker thread"
+        if self._init_error:
+            raise self._init_error
         return _query_conn(self.thread_local.conn, sql_code)
 
     def close(self):
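
Review note: the `ThreadedDatabase` change stores a `ModuleNotFoundError` raised inside the pool initializer and re-raises it from the worker. With `concurrent.futures.ThreadPoolExecutor`, an exception that escapes the initializer marks the pool as broken, so the caller would see a generic broken-pool error rather than the install hint. The sketch below reproduces the pattern in isolation. `LazyPool`, `connect`, `query`, and `missing_driver` are hypothetical names used only for illustration and are not part of data-diff.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class LazyPool:
    """Hypothetical stand-in for ThreadedDatabase (illustrative only)."""

    def __init__(self, connect, thread_count=1):
        self._connect = connect
        self._init_error = None
        self._local = threading.local()
        self._pool = ThreadPoolExecutor(thread_count, initializer=self._set_conn)

    def _set_conn(self):
        # Runs once in each worker thread. An exception escaping here would
        # break the pool, so the error is stored instead of propagated.
        try:
            self._local.conn = self._connect()
        except ModuleNotFoundError as e:
            self._init_error = e

    def _query_in_worker(self, sql):
        # Re-raise the stored error so the caller sees the original message.
        if self._init_error:
            raise self._init_error
        return self._local.conn(sql)

    def query(self, sql):
        return self._pool.submit(self._query_in_worker, sql).result()

    def close(self):
        self._pool.shutdown()


def missing_driver():
    # Simulates a driver import that fails inside create_connection().
    raise ModuleNotFoundError("No module named 'some_driver'")


pool = LazyPool(missing_driver)
try:
    pool.query("SELECT 1")
    message = None
except ModuleNotFoundError as e:
    message = str(e)
finally:
    pool.close()

print(message)
```

The first query surfaces the stored error through `Future.result()`, which is the behaviour the added `_init_error` attribute appears to aim for.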
@@ -394,7 +427,7 @@ def close(self):
 TIMESTAMP_PRECISION_POS = 20 # len("2022-06-03 12:24:35.") == 20
 
 
-class Postgres(ThreadedDatabase):
+class PostgreSQL(ThreadedDatabase):
     DATETIME_TYPES = {
         "timestamp with time zone": TimestampTZ,
         "timestamp without time zone": Timestamp,
@@ -418,16 +451,16 @@ def __init__(self, host, port, user, password, *, database, thread_count, **kw):
         super().__init__(thread_count=thread_count)
 
     def _convert_db_precision_to_digits(self, p: int) -> int:
-        # Subtracting 2 due to wierd precision issues in Postgres
+        # Subtracting 2 due to wierd precision issues in PostgreSQL
         return super()._convert_db_precision_to_digits(p) - 2
 
     def create_connection(self):
-        postgres = import_postgres()
+        pg = import_postgresql()
         try:
-            c = postgres.connect(**self.args)
+            c = pg.connect(**self.args)
             # c.cursor().execute("SET TIME ZONE 'UTC'")
             return c
-        except postgres.OperationalError as e:
+        except pg.OperationalError as e:
             raise ConnectError(*e.args) from e
 
     def quote(self, s: str):
@@ -689,9 +722,9 @@ def _parse_type(
         return UnknownColType(type_repr)
 
 
-class Redshift(Postgres):
+class Redshift(PostgreSQL):
     NUMERIC_TYPES = {
-        **Postgres.NUMERIC_TYPES,
+        **PostgreSQL.NUMERIC_TYPES,
         "double": Float,
         "real": Float,
     }
@@ -774,7 +807,7 @@ class BigQuery(Database):
     ROUNDS_ON_PREC_LOSS = False # Technically BigQuery doesn't allow implicit rounding or truncation
 
     def __init__(self, project, *, dataset, **kw):
-        from google.cloud import bigquery
+        bigquery = import_bigquery()
 
         self._client = bigquery.Client(project, **kw)
         self.project = project
@@ -972,7 +1005,7 @@ def match_path(self, dsn):
 
 
 MATCH_URI_PATH = {
-    "postgres": MatchUriPath(Postgres, ["database?"], help_str="postgres://<user>:<pass>@<host>/<database>"),
+    "postgresql": MatchUriPath(PostgreSQL, ["database?"], help_str="postgresql://<user>:<pass>@<host>/<database>"),
     "mysql": MatchUriPath(MySQL, ["database?"], help_str="mysql://<user>:<pass>@<host>/<database>"),
     "oracle": MatchUriPath(Oracle, ["database?"], help_str="oracle://<user>:<pass>@<host>/<database>"),
     "mssql": MatchUriPath(MsSQL, ["database?"], help_str="mssql://<user>:<pass>@<host>/<database>"),
@@ -1001,7 +1034,7 @@ def connect_to_uri(db_uri: str, thread_count: Optional[int] = 1) -> Database:
     Note: For non-cloud databases, a low thread-pool size may be a performance bottleneck.
 
     Supported schemes:
-    - postgres
+    - postgresql
     - mysql
     - mssql
     - oracle
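
Review note: the `import_helper` decorator added in this file wraps each driver import and appends an install hint to the `ModuleNotFoundError` message. The standalone sketch below exercises that behaviour. It departs from the diff in two small ways that are assumptions of this sketch, not part of the commit: the `package` annotation is written as `Optional[str]`, and the new exception is chained with `from e`. The module name `data_diff_example_missing_driver` is hypothetical and chosen so that the import fails.

```python
from functools import wraps
from typing import Optional


def import_helper(package: Optional[str] = None, text: str = ""):
    """Decorate a zero-argument import function to add an install hint."""

    def dec(f):
        @wraps(f)
        def _inner():
            try:
                return f()
            except ModuleNotFoundError as e:
                s = text
                if package:
                    s += f"You can install it using 'pip install data-diff[{package}]'."
                # Chaining with "from e" is a variation on the diff (assumption).
                raise ModuleNotFoundError(f"{e}\n\n{s}\n") from e

        return _inner

    return dec


@import_helper("postgresql")
def import_fake_driver():
    # Hypothetical module name, chosen so that the import fails.
    import data_diff_example_missing_driver

    return data_diff_example_missing_driver


try:
    import_fake_driver()
    message = ""
except ModuleNotFoundError as e:
    message = str(e)

print(message)
```

Because of `functools.wraps`, the decorated function keeps its original `__name__`, which keeps tracebacks and introspection readable.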

docs/index.rst

Lines changed: 4 additions & 4 deletions
@@ -11,7 +11,7 @@ Introduction
 **Data-diff** is a command-line tool and Python library to efficiently diff
 rows across two different databases.
 
-⇄ Verifies across many different databases (e.g. *Postgres* -> *Snowflake*) !
+⇄ Verifies across many different databases (e.g. *PostgreSQL* -> *Snowflake*) !
 
 🔍 Outputs diff of rows in detail
 
@@ -32,11 +32,11 @@ Requires Python 3.7+ with pip.
 
     pip install data-diff
 
-or when you need extras like mysql and postgres:
+or when you need extras like mysql and postgresql:
 
 ::
 
-    pip install "data-diff[mysql,pgsql]"
+    pip install "data-diff[mysql,postgresql]"
 
 
 How to use from Python
@@ -50,7 +50,7 @@ How to use from Python
 
     from data_diff import connect_to_table, diff_tables
 
-    table1 = connect_to_table("postgres:///", "table_name", "id")
+    table1 = connect_to_table("postgresql:///", "table_name", "id")
     table2 = connect_to_table("mysql:///", "table_name", "id")
 
     for sign, columns in diff_tables(table1, table2):
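
Review note: this commit renames the URI scheme key in `MATCH_URI_PATH` from `postgres` to `postgresql`, so callers with stored `postgres://` connection strings may need to update them. Whether the old scheme is still accepted elsewhere is not shown in this diff. A minimal sketch of a migration helper follows. `RENAMED_SCHEMES` and `upgrade_uri` are hypothetical names and are not part of data-diff.

```python
# Hypothetical mapping; only the postgres -> postgresql rename comes from this commit.
RENAMED_SCHEMES = {"postgres": "postgresql"}


def upgrade_uri(db_uri: str) -> str:
    """Rewrite a renamed scheme prefix and leave everything else untouched."""
    scheme, sep, rest = db_uri.partition("://")
    if not sep:
        # Not a scheme://... string; return it unchanged.
        return db_uri
    return RENAMED_SCHEMES.get(scheme, scheme) + sep + rest


print(upgrade_uri("postgres://user:password@localhost/database"))
print(upgrade_uri("mysql://user:password@hostname/database"))
```

Plain string partitioning is used here instead of `urllib.parse` so that short forms such as `postgres:///` keep their slashes exactly as written.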

poetry.lock

Lines changed: 2 additions & 2 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.
