This repository was archived by the owner on May 17, 2024. It is now read-only.

Commit ffa03c6

Merge branch 'master' into feat/stats-for-dbt
2 parents 63fff1c + be5256c commit ffa03c6

File tree

11 files changed (+131 −53 lines)

README.md

Lines changed: 60 additions & 47 deletions
```diff
@@ -1,25 +1,65 @@
-<p align="left">
+<p align="center">
   <a href="https://datafold.com/"><img alt="Datafold" src="https://user-images.githubusercontent.com/1799931/196497110-d3de1113-a97f-4322-b531-026d859b867a.png" width="30%" /></a>
 </p>
 
-<h1 align="left">
-data-diff: compare datasets fast, within or across SQL databases
-</h1>
+<h2 align="center">
+data-diff: Compare datasets fast, within or across SQL databases
 
+![data-diff-logo](docs/data-diff-logo.png)
+</h2>
 <br>
 
+# Use Cases
+
+## Data Migration & Replication Testing
+Compare source to target and check for discrepancies when moving data between systems:
+- Migrating to a new data warehouse (e.g., Oracle > Snowflake)
+- Converting SQL to a new transformation framework (e.g., stored procedures > dbt)
+- Continuously replicating data from an OLTP DB to OLAP DWH (e.g., MySQL > Redshift)
+
+
+## Data Development Testing
+Test SQL code and preview changes by comparing development/staging environment data to production:
+1. Make a change to some SQL code
+2. Run the SQL code to create a new dataset
+3. Compare the dataset with its production version or another iteration
+
+<p align="left">
+  <img alt="dbt" src="https://seeklogo.com/images/D/dbt-logo-E4B0ED72A2-seeklogo.com.png" width="10%" />
+</p>
+
+<details>
+<summary> data-diff integrates with dbt Core to seamlessly compare local development to production datasets
+
+</summary>
+
+![data-development-testing](docs/development_testing.png)
+
+</details>
+
+> [dbt Cloud users should check out Datafold's out-of-the-box deployment testing integration](https://www.datafold.com/data-deployment-testing)
+
+:eyes: **Watch [4-min demo video](https://www.loom.com/share/ad3df969ba6b4298939efb2fbcc14cde)**
+
+**[Get started with data-diff & dbt](https://docs.datafold.com/development_testing/open_source)**
+
+Also available in a [VS Code Extension](https://marketplace.visualstudio.com/items?itemName=Datafold.datafold-vscode)
+
+Reach out on the dbt Slack in [#tools-datafold](https://getdbt.slack.com/archives/C03D25A92UU) for advice and support
+
 # How it works
 
 When comparing the data, `data-diff` utilizes the resources of the underlying databases as much as possible. It has two primary modes of comparison:
 
-## joindiff
+## `joindiff`
 - Recommended for comparing data within the same database
 - Uses the outer join operation to diff the rows as efficiently as possible within the same database
 - Fully relies on the underlying database engine for computation
 - Requires both datasets to be queryable with a single SQL query
 - Time complexity approximates JOIN operation and is largely independent of the number of differences in the dataset
 
-## hashdiff
+## `hashdiff`
 - Recommended for comparing datasets across different databases
 - Can also be helpful in diffing very large tables with few expected differences within the same database
 - Employs a divide-and-conquer algorithm based on hashing and binary search
@@ -52,59 +92,32 @@ data-diff \
 Check out [documentation](https://docs.datafold.com/reference/open_source/cli) for the full command reference.
 
 
-# Use cases
-
-## Data Migration & Replication Testing
-Compare source to target and check for discrepancies when moving data between systems:
-- Migrating to a new data warehouse (e.g., Oracle > Snowflake)
-- Converting SQL to a new transformation framework (e.g., stored procedures > dbt)
-- Continuously replicating data from an OLTP DB to OLAP DWH (e.g., MySQL > Redshift)
-
-
-## Data Development Testing
-Test SQL code and preview changes by comparing development/staging environment data to production:
-1. Make a change to some SQL code
-2. Run the SQL code to create a new dataset
-3. Compare the dataset with its production version or another iteration
-
-<p align="left">
-  <img alt="dbt" src="https://seeklogo.com/images/D/dbt-logo-E4B0ED72A2-seeklogo.com.png" width="10%" />
-</p>
-
-`data-diff` integrates with dbt Core and dbt Cloud to seamlessly compare local development to production datasets.
-
-:eyes: **Watch [4-min demo video](https://www.loom.com/share/ad3df969ba6b4298939efb2fbcc14cde)**
-
-**[Get started with data-diff & dbt](https://docs.datafold.com/development_testing/open_source)**
-
-Reach out on the dbt Slack in [#tools-datafold](https://getdbt.slack.com/archives/C03D25A92UU) for advice and support
-
 # Supported databases
 
 
 | Database      | Status | Connection string |
 |---------------|--------|-------------------|
-| PostgreSQL >=10 | 💚 | `postgresql://<user>:<password>@<host>:5432/<database>` |
-| MySQL | 💚 | `mysql://<user>:<password>@<hostname>:5432/<database>` |
-| Snowflake | 💚 | `"snowflake://<user>[:<password>]@<account>/<database>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<role>[&authenticator=externalbrowser]"` |
-| BigQuery | 💚 | `bigquery://<project>/<dataset>` |
-| Redshift | 💚 | `redshift://<username>:<password>@<hostname>:5439/<database>` |
-| Oracle | 💛 | `oracle://<username>:<password>@<hostname>/database` |
-| Presto | 💛 | `presto://<username>:<password>@<hostname>:8080/<database>` |
-| Databricks | 💛 | `databricks://<http_path>:<access_token>@<server_hostname>/<catalog>/<schema>` |
-| Trino | 💛 | `trino://<username>:<password>@<hostname>:8080/<database>` |
-| Clickhouse | 💛 | `clickhouse://<username>:<password>@<hostname>:9000/<database>` |
-| Vertica | 💛 | `vertica://<username>:<password>@<hostname>:5433/<database>` |
-| DuckDB | 💛 | |
+| PostgreSQL >=10 | 🟢 | `postgresql://<user>:<password>@<host>:5432/<database>` |
+| MySQL | 🟢 | `mysql://<user>:<password>@<hostname>:5432/<database>` |
+| Snowflake | 🟢 | `"snowflake://<user>[:<password>]@<account>/<database>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<role>[&authenticator=externalbrowser]"` |
+| BigQuery | 🟢 | `bigquery://<project>/<dataset>` |
+| Redshift | 🟢 | `redshift://<username>:<password>@<hostname>:5439/<database>` |
+| Oracle | 🟡 | `oracle://<username>:<password>@<hostname>/database` |
+| Presto | 🟡 | `presto://<username>:<password>@<hostname>:8080/<database>` |
+| Databricks | 🟡 | `databricks://<http_path>:<access_token>@<server_hostname>/<catalog>/<schema>` |
+| Trino | 🟡 | `trino://<username>:<password>@<hostname>:8080/<database>` |
+| Clickhouse | 🟡 | `clickhouse://<username>:<password>@<hostname>:9000/<database>` |
+| Vertica | 🟡 | `vertica://<username>:<password>@<hostname>:5433/<database>` |
+| DuckDB | 🟡 | |
 | ElasticSearch | 📝 | |
 | Planetscale | 📝 | |
 | Pinot | 📝 | |
 | Druid | 📝 | |
 | Kafka | 📝 | |
 | SQLite | 📝 | |
 
-* 💚: Implemented and thoroughly tested.
-* 💛: Implemented, but not thoroughly tested yet.
+* 🟢: Implemented and thoroughly tested.
+* 🟡: Implemented, but not thoroughly tested yet.
 * ⏳: Implementation in progress.
 * 📝: Implementation planned. Contributions welcome.
```
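The `hashdiff` mode described in the README bullets above lends itself to a compact illustration. The following is a minimal in-memory sketch of a hashdiff-style divide-and-conquer comparison, not data-diff's actual implementation: in practice each checksum is computed server-side with a SQL aggregate on each database, and only mismatching segments are fetched.

```python
import hashlib

def checksum(rows):
    """Order-insensitive aggregate checksum of (key, value) rows."""
    total = 0
    for row in rows:
        digest = hashlib.md5(repr(row).encode()).hexdigest()
        total = (total + int(digest, 16)) % (2**128)
    return total

def hashdiff(rows_a, rows_b, lo, hi, min_size=4):
    """Recursively narrow differing key ranges by comparing segment checksums."""
    seg_a = [r for r in rows_a if lo <= r[0] < hi]
    seg_b = [r for r in rows_b if lo <= r[0] < hi]
    if checksum(seg_a) == checksum(seg_b):
        return []  # segments match; skip without fetching individual rows
    if hi - lo <= min_size:
        # Segment is small enough: download both sides and diff row by row
        return sorted(set(seg_a) ^ set(seg_b))
    mid = (lo + hi) // 2
    return hashdiff(rows_a, rows_b, lo, mid, min_size) + \
        hashdiff(rows_a, rows_b, mid, hi, min_size)
```

Because matching segments are pruned after a single checksum comparison, the work grows with the number of differences rather than the table size, which is why the README recommends this mode for large, mostly-identical tables.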

data_diff/config.py

Lines changed: 11 additions & 0 deletions
```diff
@@ -4,6 +4,12 @@
 import toml
 
 
+_ARRAY_FIELDS = (
+    "key_columns",
+    "columns",
+)
+
+
 class ConfigParseError(Exception):
     pass
 
@@ -38,6 +44,11 @@ def _apply_config(config: Dict[str, Any], run_name: str, kw: Dict[str, Any]):
     for index in "12":
         run_args[index] = {attr: kw.pop(f"{attr}{index}") for attr in ("database", "table")}
 
+    # Convert array fields to tuples: toml decodes arrays as lists,
+    # but the TableSegment object requires tuples.
+    for field in _ARRAY_FIELDS:
+        if isinstance(run_args.get(field), list):
+            run_args[field] = tuple(run_args[field])
+
     # Process databases + tables
     for index in "12":
         try:
```
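The conversion added above is small but easy to get wrong in isolation. A self-contained sketch of the same list-to-tuple normalization (the `normalize` helper name is illustrative; the commit does this inline in `_apply_config`):

```python
# TOML arrays decode into Python lists, but downstream code (TableSegment)
# expects tuples, so tuple-typed fields are converted after parsing.
_ARRAY_FIELDS = ("key_columns", "columns")

def normalize(run_args: dict) -> dict:
    """Convert known array fields from list to tuple, in place."""
    for field in _ARRAY_FIELDS:
        if isinstance(run_args.get(field), list):
            run_args[field] = tuple(run_args[field])
    return run_args

# A toml parser decodes `key_columns = ["id"]` into {"key_columns": ["id"]}:
args = normalize({"key_columns": ["id"], "columns": ["name", "age"], "threads": 2})
```

Non-list values (including fields already passed as tuples) are left untouched by the `isinstance` guard.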

data_diff/sqeleton/databases/postgresql.py

Lines changed: 5 additions & 0 deletions
```diff
@@ -1,3 +1,4 @@
+from typing import List
 from ..abcs.database_types import (
     DbPath,
     JSON,
@@ -92,6 +93,10 @@ def quote(self, s: str):
     def to_string(self, s: str):
        return f"{s}::varchar"
 
+    def concat(self, items: List[str]) -> str:
+        joined_exprs = " || ".join(items)
+        return f"({joined_exprs})"
+
     def _convert_db_precision_to_digits(self, p: int) -> int:
         # Subtracting 2 due to weird precision issues in PostgreSQL
         return super()._convert_db_precision_to_digits(p) - 2
```
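The new `concat` method maps a list of SQL expressions onto PostgreSQL's `||` string-concatenation operator. A standalone sketch of the same logic, outside the dialect class:

```python
from typing import List

def concat(items: List[str]) -> str:
    """Join SQL expressions with PostgreSQL's || concatenation operator,
    parenthesized so the result composes safely inside larger expressions."""
    joined_exprs = " || ".join(items)
    return f"({joined_exprs})"

# e.g. building one expression from several columns (illustrative names):
expr = concat(["first_name", "last_name"])
```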

data_diff/tracking.py

Lines changed: 3 additions & 2 deletions
```diff
@@ -42,7 +42,7 @@ def bool_ask_for_email() -> bool:
     """
     Checks the .datadiff.toml profile file for the asked_for_email key
 
-    Returns False immediately if --no-tracking
+    Returns False immediately if --no-tracking or not in an interactive terminal
 
     If found, return False (already asked for email)
 
@@ -51,7 +51,8 @@ def bool_ask_for_email() -> bool:
     Returns:
         bool: decision on whether to prompt the user for their email
     """
-    if g_tracking_enabled:
+    console = get_console()
+    if g_tracking_enabled and console.is_interactive:
         profile = _load_profile()
 
         if "asked_for_email" not in profile:
```
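The guard added above prevents the email prompt from blocking non-interactive runs. A minimal sketch of the same idea using only the standard library's `isatty()` in place of the project's `get_console().is_interactive` (the `should_prompt_for_email` name is hypothetical):

```python
import sys

def should_prompt_for_email(tracking_enabled: bool) -> bool:
    # Prompt only when tracking is enabled AND stdout is attached to a real
    # terminal, so CI jobs and piped invocations never hang waiting on input().
    return tracking_enabled and sys.stdout.isatty()
```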

data_diff/version.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -1 +1 @@
-__version__ = "0.8.1"
+__version__ = "0.8.3"
```

docs/data-diff-logo.png

40.8 KB
Binary file added.

docs/development_testing.png

69.7 KB
Binary file added.

poetry.lock

Lines changed: 1 addition & 1 deletion
(Generated lockfile; diff not rendered.)

pyproject.toml

Lines changed: 2 additions & 2 deletions
```diff
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "data-diff"
-version = "0.8.1"
+version = "0.8.3"
 description = "Command-line tool and Python library to efficiently diff rows across two different databases."
 authors = ["Datafold <data-diff@datafold.com>"]
 license = "MIT"
@@ -36,7 +36,7 @@ cryptography = {version="*", optional=true}
 trino = {version="^0.314.0", optional=true}
 presto-python-client = {version="*", optional=true}
 clickhouse-driver = {version="*", optional=true}
-duckdb = {version="^0.7.0", optional=true}
+duckdb = {version="*", optional=true}
 dbt-artifacts-parser = {version="^0.4.0"}
 dbt-core = {version="^1.0.0"}
 keyring = "*"
```

tests/test_config.py

Lines changed: 4 additions & 0 deletions
```diff
@@ -15,6 +15,8 @@ def test_basic(self):
 
             [run.default]
             update_column = "timestamp"
+            key_columns = ["id"]
+            columns = ["name", "age"]
             verbose = true
             threads = 2
@@ -39,6 +41,8 @@ def test_basic(self):
         assert res["table2"] == "rating_del1"
         assert res["threads1"] == 11
         assert res["threads2"] == 22
+        assert res["key_columns"] == ("id",)
+        assert res["columns"] == ("name", "age")
 
         res = apply_config_from_string(config, "pg_pg", {"update_column": "foo", "table2": "bar"})
         assert res["update_column"] == "foo"
```
