
Commit 7c9e4ad

dtsongclaude and Claude Opus 4.6 authored
feat: Docker Compose local test infrastructure with seed data (#24)
* fix: validate column types in set operations and fix PostgreSQL timestamp edge cases

  Add type-class validation to TableOp.schema and TableOp.type so that unions,
  intersects, and minus operations reject mismatched column types early with a
  clear QueryBuilderError instead of silently producing incorrect results. (#5)

  Fix PostgreSQL timestamp normalization: use timestamptz(6) cast for TimestampTZ
  columns to preserve timezone info during bounds comparison, and replace the
  hardcoded TIMESTAMP_PRECISION_POS with a length()-based calculation to
  correctly pad years with >4 digits. (#12)

* feat: add Docker Compose local test infrastructure with seed data

  Add SQL seed data (PostgreSQL + MySQL) with ~1000 rows and deliberate diffs
  for showcasing data-diff. Add default connection strings for all
  docker-compose databases, add profiles to keep the default stack lightweight
  (PG + MySQL only), and add a Makefile for developer ergonomics.

* fix: address code review findings from PR #24

  Critical:
  - Fix _add_padding double-truncation regression for the rounding branch
    (split into _truncate_and_pad and _zero_pad for correct behavior)

  Important:
  - Fix non-rounding timestamp path to use timestamptz cast for TimestampTZ
  - Add None guard to TableOp.type to avoid misleading errors
  - Use QueryBuilderError consistently for schema length mismatch
  - Revert Presto/Trino/Vertica conn defaults to None (CI doesn't test them)
  - Remove unused Presto/Trino from CI docker compose command
  - Add comprehensive tests for all timestamp paths and edge cases

* fix: address PR review findings for robustness and developer ergonomics

  - Revert ClickHouse default conn string to None so `make test` skips
    ClickHouse when the container isn't running; set the URI explicitly in CI
  - Add None-schema guard in TableOp.schema with a clear error message
  - Return None (not an optimistic type) when one side of TableOp.type is unknown
  - Fix Makefile comment to accurately reflect PG + MySQL (not ClickHouse)
  - Add comment explaining why Join.schema skips cross-table type validation
  - Add tests for TableOp.type mismatch and matching branches

* fix: add CI comment for profile flag and normalize conn string defaults

  - Add comment explaining why --profile full is needed (ClickHouse is
    profile-gated; only explicitly named services start)
  - Add `or None` to Databricks and MsSQL conn strings to handle empty env vars
    consistently with all other optional databases

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
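The timestamp fix above rounds a value by adding half of the smallest unit at the target precision before truncating. A standalone sketch of that interval computation (mirroring the `interval = format(...)` expression in `normalize_timestamp`; the helper name here is hypothetical):

```python
def half_unit_interval(precision: int) -> str:
    """Seconds string for half of 10**-precision, as passed to `interval '...'`.

    Same expression as the normalize_timestamp code:
    format((0.5 * (10 ** (-precision))), f".{precision + 1}f")
    """
    return format(0.5 * (10 ** (-precision)), f".{precision + 1}f")

# precision 0 -> half a second; precision 6 -> half a microsecond
print(half_unit_interval(0))  # 0.5
print(half_unit_interval(3))  # 0.0005
print(half_unit_interval(6))  # 0.0000005
```

Adding this interval and then cutting the string at the target precision implements round-half-up without ever exceeding the clamped `max_timestamp` bound.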
1 parent b5045c4 commit 7c9e4ad

9 files changed

Lines changed: 334 additions & 27 deletions

File tree

.github/workflows/ci.yml

Lines changed: 2 additions & 1 deletion
@@ -42,7 +42,8 @@ jobs:
         run: uv tool run ty check --python-version 3.10

       - name: Build the stack
-        run: docker compose up -d --wait mysql postgres presto trino clickhouse
+        # --profile full unlocks profile-gated clickhouse; only named services start
+        run: docker compose --profile full up -d --wait mysql postgres clickhouse

       - name: Run tests
         env:

Makefile

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
+.PHONY: up up-full down test-unit test demo
+
+## Start PostgreSQL + MySQL (lightweight, fast startup)
+up:
+	docker compose up -d --wait postgres mysql
+
+## Start all services including ClickHouse, Presto, Trino, Vertica
+up-full:
+	docker compose --profile full up -d --wait
+
+## Stop all services and remove volumes
+down:
+	docker compose --profile full down -v
+
+## Run unit tests (no database required)
+test-unit:
+	uv run pytest tests/test_query.py tests/test_utils.py -x
+
+## Run full test suite against PG + MySQL (starts containers if needed)
+## To also test Presto/Trino/Vertica, run `make up-full` first and set:
+##   export DATADIFF_PRESTO_URI="presto://test@localhost:8080/memory/default"
+##   export DATADIFF_TRINO_URI="trino://test@localhost:8081/memory/default"
+##   export DATADIFF_VERTICA_URI="vertica://vertica:Password1@localhost:5433/vertica"
+test: up
+	uv run pytest tests/ \
+		-o addopts="--timeout=300 --tb=short" \
+		--ignore=tests/test_database_types.py \
+		--ignore=tests/test_dbt_config_validators.py \
+		--ignore=tests/test_main.py
+
+## Run data-diff against seed data to showcase diffing
+demo: up
+	@echo "=== PostgreSQL: ratings_source vs ratings_target ==="
+	uv run python -m data_diff \
+		postgresql://postgres:Password1@localhost/postgres \
+		ratings_source ratings_target \
+		--key-columns id \
+		--columns rating
+	@echo ""
+	@echo "=== MySQL: ratings_source vs ratings_target ==="
+	uv run python -m data_diff \
+		mysql://mysql:Password1@localhost/mysql \
+		ratings_source ratings_target \
+		--key-columns id \
+		--columns rating
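For scripting the same comparison outside Make, the demo recipe's CLI invocation can be assembled programmatically. A hedged sketch (the helper name and the subprocess wrapper are not part of the commit; running it requires `make up` to have started the databases):

```python
import subprocess

def demo_argv(conn_string: str) -> list[str]:
    """Build the argv that `make demo` runs for one database (hypothetical helper)."""
    return [
        "uv", "run", "python", "-m", "data_diff",
        conn_string,
        "ratings_source", "ratings_target",
        "--key-columns", "id",
        "--columns", "rating",
    ]

pg_cmd = demo_argv("postgresql://postgres:Password1@localhost/postgres")
# subprocess.run(pg_cmd, check=True)  # only works once the postgres container is up
```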

data_diff/databases/postgresql.py

Lines changed: 17 additions & 14 deletions
@@ -25,7 +25,6 @@
     CHECKSUM_HEXDIGITS,
     CHECKSUM_OFFSET,
     MD5_HEXDIGITS,
-    TIMESTAMP_PRECISION_POS,
     BaseDialect,
     ConnectError,
     ThreadedDatabase,
@@ -115,8 +114,14 @@ def md5_as_hex(self, s: str) -> str:
         return f"md5({s})"

     def normalize_timestamp(self, value: str, coltype: TemporalType) -> str:
-        def _add_padding(coltype: TemporalType, timestamp6: str):
-            return f"RPAD(LEFT({timestamp6}, {TIMESTAMP_PRECISION_POS + coltype.precision}), {TIMESTAMP_PRECISION_POS + 6}, '0')"
+        def _truncate_and_pad(coltype: TemporalType, timestamp6: str):
+            """Truncate a 6-digit-precision timestamp to target precision, then zero-pad back to 6 digits."""
+            truncated = f"LEFT({timestamp6}, length({timestamp6}) - (6 - {coltype.precision}))"
+            return f"RPAD({truncated}, length({timestamp6}), '0')"
+
+        def _zero_pad(coltype: TemporalType, already_truncated: str):
+            """Zero-pad an already-truncated timestamp back to 6 fractional digits."""
+            return f"RPAD({already_truncated}, length({already_truncated}) + (6 - {coltype.precision}), '0')"

         try:
             is_date = coltype.is_date
@@ -141,30 +146,28 @@ def _add_padding(coltype: TemporalType, timestamp6: str):
             null_case_end = "END"

             # 294277 or 4714 BC would be out of range, make sure we can't round to that
-            # TODO test timezones for overflow?
             max_timestamp = "294276-12-31 23:59:59.0000"
             min_timestamp = "4713-01-01 00:00:00.00 BC"
-            timestamp = f"least('{max_timestamp}'::timestamp(6), {value}::timestamp(6))"
-            timestamp = f"greatest('{min_timestamp}'::timestamp(6), {timestamp})"
+            ts_type = "timestamptz(6)" if isinstance(coltype, TimestampTZ) else "timestamp(6)"
+            timestamp = f"least('{max_timestamp}'::{ts_type}, {value}::{ts_type})"
+            timestamp = f"greatest('{min_timestamp}'::{ts_type}, {timestamp})"

             interval = format((0.5 * (10 ** (-coltype.precision))), f".{coltype.precision + 1}f")

             rounded_timestamp = (
-                f"left(to_char(least('{max_timestamp}'::timestamp, {timestamp})"
+                f"left(to_char(least('{max_timestamp}'::{ts_type}, {timestamp})"
                 f"+ interval '{interval}', 'YYYY-mm-dd HH24:MI:SS.US'),"
-                f"length(to_char(least('{max_timestamp}'::timestamp, {timestamp})"
+                f"length(to_char(least('{max_timestamp}'::{ts_type}, {timestamp})"
                 f"+ interval '{interval}', 'YYYY-mm-dd HH24:MI:SS.US')) - (6-{coltype.precision}))"
             )

-            padded = _add_padding(coltype, rounded_timestamp)
+            padded = _zero_pad(coltype, rounded_timestamp)
             return f"{null_case_begin} {padded} {null_case_end}"

-        # TODO years with > 4 digits not padded correctly
-        # current w/ precision 6: 294276-12-31 23:59:59.0000
-        # should be 294276-12-31 23:59:59.000000
         else:
-            rounded_timestamp = f"to_char({value}::timestamp(6), 'YYYY-mm-dd HH24:MI:SS.US')"
-            padded = _add_padding(coltype, rounded_timestamp)
+            ts_type = "timestamptz(6)" if isinstance(coltype, TimestampTZ) else "timestamp(6)"
+            rounded_timestamp = f"to_char({value}::{ts_type}, 'YYYY-mm-dd HH24:MI:SS.US')"
+            padded = _truncate_and_pad(coltype, rounded_timestamp)
             return padded

     def normalize_number(self, value: str, coltype: FractionalType) -> str:
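To see why the review split `_add_padding` into two helpers, here is a pure-Python analogue of the SQL `LEFT`/`RPAD`/`length` string operations (illustrative only; the real code emits SQL for the database to execute). The rounding branch produces a string already cut to the target precision, so truncating it again, as the old `_add_padding` effectively did, would drop real digits:

```python
def truncate_and_pad(ts6: str, precision: int) -> str:
    """For a timestamp string with 6 fractional digits: drop trailing digits
    down to `precision`, then zero-pad back to 6 (mirrors _truncate_and_pad)."""
    truncated = ts6[: len(ts6) - (6 - precision)]   # LEFT(ts6, length - (6 - p))
    return truncated.ljust(len(ts6), "0")           # RPAD back to original length

def zero_pad(already_truncated: str, precision: int) -> str:
    """For a string already cut to `precision` digits: only pad back to 6
    (mirrors _zero_pad, used after the rounding expression has truncated)."""
    return already_truncated.ljust(len(already_truncated) + (6 - precision), "0")

full = "2025-01-01 12:00:00.123456"    # 6 fractional digits
print(truncate_and_pad(full, 3))       # 2025-01-01 12:00:00.123000

rounded = "2025-01-01 12:00:00.123"    # rounding branch output: already precision 3
print(zero_pad(rounded, 3))            # 2025-01-01 12:00:00.123000
```

Using length-relative arithmetic instead of a fixed `TIMESTAMP_PRECISION_POS` offset is also what makes years with more than 4 digits (e.g. the 294276 clamp value) pad correctly.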

data_diff/queries/ast_classes.py

Lines changed: 18 additions & 4 deletions
@@ -483,7 +483,8 @@ class Join(ExprNode, ITable, Root):
     def schema(self) -> Schema:
         if not self.columns:
             raise ValueError("Join must specify columns explicitly (SELECT * not yet implemented).")
-        s = self.source_tables[0].schema  # TODO validate types match between both tables
+        # No cross-table type validation needed: join combines columns from both tables rather than unioning rows
+        s = self.source_tables[0].schema
         return type(s)({c.name: c.type for c in self.columns})

     def on(self, *exprs) -> Self:
@@ -553,15 +554,28 @@ class TableOp(ExprNode, ITable, Root):

     @property
     def type(self):
-        # TODO ensure types of both tables are compatible
-        return self.table1.type
+        t1 = self.table1.type
+        t2 = self.table2.type
+        if t1 is None or t2 is None:
+            return None
+        if type(t1) is not type(t2):
+            raise QueryBuilderError(f"Type mismatch in {self.op}: got {type(t1).__name__} and {type(t2).__name__}")
+        return t1

     @property
     def schema(self) -> Schema:
         s1 = self.table1.schema
         s2 = self.table2.schema
+        if s1 is None or s2 is None:
+            raise QueryBuilderError(f"Cannot validate {self.op}: one or both tables have no schema defined")
         if len(s1) != len(s2):
-            raise ValueError(f"TableOp requires tables with matching schema lengths, got {len(s1)} and {len(s2)}.")
+            raise QueryBuilderError(f"Schema length mismatch in {self.op}: got {len(s1)} and {len(s2)} columns")
+        for (name1, type1), (name2, type2) in zip(s1.items(), s2.items()):
+            if type(type1) is not type(type2):
+                raise QueryBuilderError(
+                    f"Type mismatch in {self.op}: column {name1!r} is {type(type1).__name__} "
+                    f"but column {name2!r} is {type(type2).__name__}"
+                )
         return s1
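A minimal self-contained sketch of the compatibility rule TableOp now enforces, using hypothetical stand-in column-type classes (the real code compares the type objects stored in table schemas): compare type *classes*, return None when either side is unknown rather than guessing, and fail loudly on a mismatch.

```python
class QueryBuilderError(Exception):
    pass

# Stand-ins for real column-type classes
class Integer: ...
class Text: ...

def combined_type(t1, t2, op="UNION"):
    """Resolve the result type of a set operation (sketch of TableOp.type)."""
    if t1 is None or t2 is None:
        return None  # one side unknown -> don't return an optimistic type
    if type(t1) is not type(t2):
        raise QueryBuilderError(
            f"Type mismatch in {op}: got {type(t1).__name__} and {type(t2).__name__}"
        )
    return t1

assert combined_type(Integer(), None) is None
assert isinstance(combined_type(Integer(), Integer()), Integer)
try:
    combined_type(Integer(), Text())
except QueryBuilderError as e:
    print(e)  # Type mismatch in UNION: got Integer and Text
```

Comparing `type(t1) is not type(t2)` (exact class identity) is stricter than `isinstance`, which matches the intent of rejecting unions of, say, an integer column with a text column before the database produces silently wrong results.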

dev/seed/mysql/01_seed.sql

Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
+-- Seed data for demonstrating data-diff capabilities.
+-- Auto-executed by MySQL on first container startup.
+
+CREATE TABLE ratings_source (
+    id INT PRIMARY KEY,
+    user_id INT NOT NULL,
+    movie_id INT NOT NULL,
+    rating DECIMAL(2,1) NOT NULL,
+    created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP
+);
+
+CREATE TABLE ratings_target (
+    id INT PRIMARY KEY,
+    user_id INT NOT NULL,
+    movie_id INT NOT NULL,
+    rating DECIMAL(2,1) NOT NULL,
+    created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP
+);
+
+-- Populate source with 1000 rows via stored procedure (MySQL lacks generate_series)
+DELIMITER //
+CREATE PROCEDURE seed_ratings()
+BEGIN
+    DECLARE i INT DEFAULT 1;
+    WHILE i <= 1000 DO
+        INSERT INTO ratings_source (id, user_id, movie_id, rating, created_at)
+        VALUES (
+            i,
+            1 + (i % 200),
+            1 + (i % 50),
+            1 + (i % 5),
+            DATE_ADD('2025-01-01', INTERVAL i MINUTE)
+        );
+        SET i = i + 1;
+    END WHILE;
+END //
+DELIMITER ;
+
+CALL seed_ratings();
+DROP PROCEDURE seed_ratings;
+
+-- Copy all rows into target
+INSERT INTO ratings_target SELECT * FROM ratings_source;
+
+-- Introduce diffs:
+-- 5 deleted rows (IDs 10-14 missing from target)
+DELETE FROM ratings_target WHERE id BETWEEN 10 AND 14;
+
+-- 5 extra rows in target only (IDs 1001-1005)
+INSERT INTO ratings_target (id, user_id, movie_id, rating, created_at) VALUES
+    (1001, 201, 51, 4.0, '2025-06-01 00:00:00'),
+    (1002, 202, 52, 3.0, '2025-06-02 00:00:00'),
+    (1003, 203, 53, 5.0, '2025-06-03 00:00:00'),
+    (1004, 204, 54, 2.0, '2025-06-04 00:00:00'),
+    (1005, 205, 55, 1.0, '2025-06-05 00:00:00');
+
+-- 10 updated ratings (IDs 100-109 have different ratings in target)
+UPDATE ratings_target SET rating = rating + 0.5 WHERE id BETWEEN 100 AND 109 AND rating < 5.0;
+UPDATE ratings_target SET rating = 1.0 WHERE id BETWEEN 100 AND 109 AND rating >= 5.0;
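A quick sanity model of the seeded diffs in plain Python (not part of the commit): starting from 1000 identical rows, the script deletes 5, inserts 5, and updates 10, so data-diff should report 20 differing primary keys in total.

```python
# Model each table as {id: rating}, matching the seed's rating = 1 + (i % 5)
source = {i: float(1 + (i % 5)) for i in range(1, 1001)}
target = dict(source)

for i in range(10, 15):       # 5 deleted rows (IDs 10-14)
    del target[i]
for i in range(1001, 1006):   # 5 extra rows in target only (IDs 1001-1005)
    target[i] = 1.0
for i in range(100, 110):     # 10 updated ratings (IDs 100-109)
    target[i] = target[i] + 0.5 if target[i] < 5.0 else 1.0

only_in_source = source.keys() - target.keys()
only_in_target = target.keys() - source.keys()
changed = {k for k in source.keys() & target.keys() if source[k] != target[k]}
print(len(only_in_source), len(only_in_target), len(changed))  # 5 5 10
```

The two UPDATE statements together guarantee every row in 100-109 changes: ratings below 5.0 move up by 0.5, and the ones already at 5.0 drop to 1.0, so none is accidentally left equal to its source value.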

dev/seed/postgres/01_seed.sql

Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
+-- Seed data for demonstrating data-diff capabilities.
+-- Auto-executed by PostgreSQL on first container startup.
+
+CREATE TABLE ratings_source (
+    id INTEGER PRIMARY KEY,
+    user_id INTEGER NOT NULL,
+    movie_id INTEGER NOT NULL,
+    rating NUMERIC(2,1) NOT NULL,
+    created_at TIMESTAMP NOT NULL DEFAULT now()
+);
+
+CREATE TABLE ratings_target (
+    id INTEGER PRIMARY KEY,
+    user_id INTEGER NOT NULL,
+    movie_id INTEGER NOT NULL,
+    rating NUMERIC(2,1) NOT NULL,
+    created_at TIMESTAMP NOT NULL DEFAULT now()
+);
+
+-- Populate source with 1000 rows
+INSERT INTO ratings_source (id, user_id, movie_id, rating, created_at)
+SELECT
+    g AS id,
+    1 + (g % 200) AS user_id,
+    1 + (g % 50) AS movie_id,
+    (1 + (g % 5))::NUMERIC(2,1) AS rating,
+    '2025-01-01'::TIMESTAMP + (g || ' minutes')::INTERVAL AS created_at
+FROM generate_series(1, 1000) AS g;
+
+-- Copy all rows into target
+INSERT INTO ratings_target SELECT * FROM ratings_source;
+
+-- Introduce diffs:
+-- 5 deleted rows (IDs 10-14 missing from target)
+DELETE FROM ratings_target WHERE id BETWEEN 10 AND 14;
+
+-- 5 extra rows in target only (IDs 1001-1005)
+INSERT INTO ratings_target (id, user_id, movie_id, rating, created_at) VALUES
+    (1001, 201, 51, 4.0, '2025-06-01 00:00:00'),
+    (1002, 202, 52, 3.0, '2025-06-02 00:00:00'),
+    (1003, 203, 53, 5.0, '2025-06-03 00:00:00'),
+    (1004, 204, 54, 2.0, '2025-06-04 00:00:00'),
+    (1005, 205, 55, 1.0, '2025-06-05 00:00:00');
+
+-- 10 updated ratings (IDs 100-109 have different ratings in target)
+UPDATE ratings_target SET rating = rating + 0.5 WHERE id BETWEEN 100 AND 109 AND rating < 5.0;
+UPDATE ratings_target SET rating = 1.0 WHERE id BETWEEN 100 AND 109 AND rating >= 5.0;
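The PostgreSQL seed uses `generate_series`; an equivalent row generator in Python (illustrative, not part of the commit) is useful for cross-checking that the MySQL stored-procedure loop produces the identical source table:

```python
from datetime import datetime, timedelta

def seed_rows(n: int = 1000):
    """Yield (id, user_id, movie_id, rating, created_at) matching the SQL seed."""
    base = datetime(2025, 1, 1)
    for g in range(1, n + 1):
        yield (
            g,                        # id
            1 + (g % 200),            # user_id cycles 1..200
            1 + (g % 50),             # movie_id cycles 1..50
            float(1 + (g % 5)),       # rating cycles 1.0..5.0
            base + timedelta(minutes=g),
        )

rows = list(seed_rows())
print(len(rows), rows[0][:4], rows[-1][:4])
```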

docker-compose.yml

Lines changed: 6 additions & 0 deletions
@@ -12,6 +12,7 @@ services:
     restart: always
     volumes:
       - postgresql-data:/var/lib/postgresql/data:delegated
+      - ./dev/seed/postgres:/docker-entrypoint-initdb.d:ro
     ports:
       - '5432:5432'
     expose:
@@ -42,6 +43,7 @@ services:
     restart: always
     volumes:
       - mysql-data:/var/lib/mysql:delegated
+      - ./dev/seed/mysql:/docker-entrypoint-initdb.d:ro
     user: mysql
     ports:
       - '3306:3306'
@@ -61,6 +63,7 @@ services:
   clickhouse:
     container_name: dd-clickhouse
     image: clickhouse/clickhouse-server:24.3
+    profiles: [full]
    restart: always
     volumes:
       - clickhouse-data:/var/lib/clickhouse:delegated
@@ -88,6 +91,7 @@ services:

   # prestodb.dbapi.connect(host="127.0.0.1", user="presto").cursor().execute('SELECT * FROM system.runtime.nodes')
   presto:
+    profiles: [full]
     container_name: dd-presto
     build:
       context: ./dev
@@ -101,6 +105,7 @@ services:
       - local

   trino:
+    profiles: [full]
     container_name: dd-trino
     image: 'trinodb/trino:439'
     hostname: trino
@@ -118,6 +123,7 @@ services:

   vertica:
     container_name: dd-vertica
+    profiles: [full]
     image: vertica/vertica-ce:24.1.0-0
     restart: always
     volumes:

tests/common.py

Lines changed: 4 additions & 6 deletions
@@ -27,14 +27,12 @@
 TEST_BIGQUERY_CONN_STRING: str = os.environ.get("DATADIFF_BIGQUERY_URI") or None
 TEST_REDSHIFT_CONN_STRING: str = os.environ.get("DATADIFF_REDSHIFT_URI") or None
 TEST_ORACLE_CONN_STRING: str = None
-TEST_DATABRICKS_CONN_STRING: str = os.environ.get("DATADIFF_DATABRICKS_URI")
+TEST_DATABRICKS_CONN_STRING: str = os.environ.get("DATADIFF_DATABRICKS_URI") or None
 TEST_TRINO_CONN_STRING: str = os.environ.get("DATADIFF_TRINO_URI") or None
-# clickhouse uri for provided docker - "clickhouse://clickhouse:Password1@localhost:9000/clickhouse"
-TEST_CLICKHOUSE_CONN_STRING: str = os.environ.get("DATADIFF_CLICKHOUSE_URI")
-# vertica uri provided for docker - "vertica://vertica:Password1@localhost:5433/vertica"
-TEST_VERTICA_CONN_STRING: str = os.environ.get("DATADIFF_VERTICA_URI")
+TEST_CLICKHOUSE_CONN_STRING: str = os.environ.get("DATADIFF_CLICKHOUSE_URI") or None
+TEST_VERTICA_CONN_STRING: str = os.environ.get("DATADIFF_VERTICA_URI") or None
 TEST_DUCKDB_CONN_STRING: str = "duckdb://main:@:memory:"
-TEST_MSSQL_CONN_STRING: str = os.environ.get("DATADIFF_MSSQL_URI")
+TEST_MSSQL_CONN_STRING: str = os.environ.get("DATADIFF_MSSQL_URI") or None


 DEFAULT_N_SAMPLES = 50
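Why `os.environ.get(...) or None` rather than plain `.get()`: an env var that is set but empty (as CI setups sometimes export) should behave exactly like an unset one, so every optional database falls back to None uniformly. A small demonstration (the variable name here is hypothetical):

```python
import os

os.environ["DATADIFF_DEMO_URI"] = ""                       # set but empty
plain = os.environ.get("DATADIFF_DEMO_URI")                # -> "" (a string, not None)
normalized = os.environ.get("DATADIFF_DEMO_URI") or None   # -> None

print(repr(plain), repr(normalized))  # '' None
```

Code that checks `if TEST_MSSQL_CONN_STRING is None: skip(...)` only works reliably with the normalized form, which is why the review applied `or None` to every optional connection string.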
