This repository was archived by the owner on May 17, 2024. It is now read-only.

Commit fb1d421

Added new guide for implementing a database driver
1 parent 2bf0354 commit fb1d421

2 files changed: 187 additions, 9 deletions

docs/index.rst

Lines changed: 8 additions & 9 deletions
@@ -4,6 +4,7 @@
    :hidden:
 
    python-api
+   new-database-driver-guide
 
 Introduction
 ------------
@@ -65,14 +66,12 @@ How to use from Python
 Resources
 ---------
 
-- Git: `<https://github.com/datafold/data-diff>`_
-
-- Reference
-
-    - :doc:`python-api`
-
-- Tutorials
-
-    - TODO
+- Source code (git): `<https://github.com/datafold/data-diff>`_
+- API Reference
+    - :doc:`python-api`
+- Guides
+    - :doc:`new-database-driver-guide`
+- Tutorials
+    - TODO
 
 

docs/new-database-driver-guide.rst

Lines changed: 179 additions & 0 deletions
@@ -0,0 +1,179 @@
How to implement a new database driver for data-diff
====================================================

First, read through the CONTRIBUTING.md document.

Make sure data-diff is set up for development, and that all the tests pass. Try to set it up for at least MySQL and PostgreSQL.

Look at the existing database drivers for examples and inspiration.


1. Add dependencies to ``pyproject.toml``
-----------------------------------------

Most new drivers require a third-party library in order to connect to the database.

These dependencies should be specified in the ``pyproject.toml`` file, in ``[tool.poetry.extras]``. Example:

::

    [tool.poetry.extras]
    postgresql = ["psycopg2"]

Then, users can install the dependencies needed for your database driver with ``pip install 'data-diff[postgresql]'``.

This way, data-diff can support a wide variety of drivers without requiring users to install libraries that they won't use.


2. Implement database module
----------------------------

New database modules belong in the ``data_diff/databases`` directory.

Import on demand
~~~~~~~~~~~~~~~~~

Database drivers should not import any third-party library at the module level.

Instead, the library should be imported and initialized within a function. Example:

::

    from .base import import_helper

    @import_helper("postgresql")
    def import_postgresql():
        import psycopg2
        import psycopg2.extras

        psycopg2.extensions.set_wait_callback(psycopg2.extras.wait_select)
        return psycopg2

We use the ``import_helper()`` decorator to provide a uniform and informative error when the library is missing. The string argument should be the name of the extra, as written in ``pyproject.toml`` (``postgresql`` in this example).

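A decorator of this kind could be sketched as follows. This is a minimal illustration, not data-diff's actual ``import_helper()``: the name ``import_helper_sketch``, the wording of the error, the extra name ``mydb``, and the deliberately missing module ``nonexistent_driver_lib`` are all assumptions made for the example.

```python
from functools import wraps


def import_helper_sketch(extra_name):
    """Hypothetical decorator: turn a failed lazy import into a helpful error."""

    def decorator(import_func):
        @wraps(import_func)
        def wrapper():
            try:
                return import_func()
            except ModuleNotFoundError as e:
                # Point the user at the extra that provides the missing library.
                hint = f"You can install it using: pip install 'data-diff[{extra_name}]'"
                raise ModuleNotFoundError(f"{e}\n\n{hint}") from e

        return wrapper

    return decorator


@import_helper_sketch("mydb")
def import_mydb():
    # Imported only when this function is called, not when the module is loaded.
    import nonexistent_driver_lib  # hypothetical third-party library

    return nonexistent_driver_lib
```

Because the import happens inside ``import_mydb()``, loading the driver module never fails; the error, with its installation hint, appears only when a user actually tries to connect to that database.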

Choosing a base class, based on threading model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can choose to inherit from either ``base.Database`` or ``base.ThreadedDatabase``.

Usually, databases with cursor-based connections, like MySQL or PostgreSQL, only allow one thread per connection. In order to support multithreading, we implement them by inheriting from ``ThreadedDatabase``, which holds a pool of worker threads and creates a new connection per thread.

Usually, cloud databases, such as Snowflake and BigQuery, open a new connection per request, and support simultaneous queries from any number of threads. In other words, they already support multithreading, so we can implement them by inheriting directly from ``Database``.


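The pool-of-threads model can be sketched with the standard library alone. The classes below are hypothetical stand-ins, not data-diff's ``ThreadedDatabase``; ``sqlite3`` is used only so that the example runs without external dependencies.

```python
import sqlite3
import threading
from concurrent.futures import ThreadPoolExecutor


class ThreadedDatabaseSketch:
    """Hypothetical base class: a pool of worker threads, one connection each."""

    def __init__(self, thread_count=2):
        self._local = threading.local()
        # The initializer runs once inside every worker thread of the pool.
        self._pool = ThreadPoolExecutor(thread_count, initializer=self._init_thread)

    def create_connection(self):
        raise NotImplementedError

    def _init_thread(self):
        # Each worker thread stores its own private connection.
        self._local.conn = self.create_connection()

    def _query(self, sql_code):
        # Run the query on a worker thread and wait for its result.
        return self._pool.submit(self._query_in_worker, sql_code).result()

    def _query_in_worker(self, sql_code):
        cursor = self._local.conn.cursor()
        cursor.execute(sql_code)
        return cursor.fetchall()

    def close(self):
        # A real implementation would also close each thread's connection.
        self._pool.shutdown()


class SqliteSketch(ThreadedDatabaseSketch):
    def create_connection(self):
        return sqlite3.connect(":memory:")


db = SqliteSketch(thread_count=2)
rows = db._query("SELECT 1 + 1")
db.close()
```

In this layout a subclass only has to say how to open a connection; the base class decides when and on which thread that happens.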

:meth:`_query()`
~~~~~~~~~~~~~~~~~~

All queries to the database pass through ``_query()``. It takes SQL code, and returns a list of rows. Here is its signature:

::

    def _query(self, sql_code: str) -> list: ...

For standard cursor connections, it's sufficient to implement it with a call to ``base._query_conn()``, like:

::

    return _query_conn(self._conn, sql_code)


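For a sense of what such a helper involves, here is a minimal sketch against a generic DB-API connection. The function ``query_conn_sketch`` is hypothetical and is not the code of ``base._query_conn()``; ``sqlite3`` stands in for a real driver.

```python
import sqlite3


def query_conn_sketch(conn, sql_code):
    """Hypothetical helper: run SQL on a DB-API connection and return the rows as a list."""
    cursor = conn.cursor()
    try:
        cursor.execute(sql_code)
        # cursor.description is None for statements that produce no result set.
        if cursor.description is None:
            return []
        return list(cursor.fetchall())
    finally:
        cursor.close()


conn = sqlite3.connect(":memory:")
rows = query_conn_sketch(conn, "SELECT 1 AS a, 'x' AS b")
```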

:meth:`select_table_schema()` / :meth:`query_table_schema()`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If your database does not have an ``information_schema.columns`` table, or if its structure is unusual, you may have to implement your own ``select_table_schema()`` function. It returns the query needed to fetch column information as a list of tuples, where each tuple is ``column_name, data_type, datetime_precision, numeric_precision, numeric_scale``.

If such a query isn't possible, you may have to implement ``query_table_schema()`` yourself, which extracts this information from the database and returns it in the proper form.

If the information returned from ``query_table_schema()`` requires slow or error-prone post-processing, you may delay that post-processing by overriding ``_process_table_schema()`` and implementing it there. The method ``_process_table_schema()`` only gets called for the columns that will be diffed.

Documentation:

- :meth:`data_diff.databases.database_types.AbstractDatabase.select_table_schema`

- :meth:`data_diff.databases.database_types.AbstractDatabase.query_table_schema`

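As an illustration of the expected query shape, a schema query for a database that does expose ``information_schema.columns`` might be built like this. The function name, its ``schema`` and ``table`` parameters, and the example table are assumptions for the sketch; the real method's signature is described in the documentation linked above.

```python
def select_table_schema_sketch(schema, table):
    """Hypothetical example: build the column-information query for one table."""
    return (
        "SELECT column_name, data_type, datetime_precision, numeric_precision, numeric_scale "
        "FROM information_schema.columns "
        f"WHERE table_schema = '{schema}' AND table_name = '{table}'"
    )


query = select_table_schema_sketch("public", "ratings")
```

The order of the selected columns matters, since each returned row is read as the five-element tuple described above.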

:data:`TYPE_CLASSES`
~~~~~~~~~~~~~~~~~~~~~~

Each database class must have a ``TYPE_CLASSES`` dictionary, which maps each data-type string, as returned by querying the table schema, to the appropriate data-diff type class, i.e. a subclass of ``database_types.ColType``.

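The shape of such a dictionary can be sketched as below. The classes defined here are placeholders written for this example, not the real classes from ``database_types``, and the type strings are only typical of what a PostgreSQL-like schema query might return.

```python
class ColType:
    """Placeholder for database_types.ColType."""


class Integer(ColType):
    pass


class Float(ColType):
    pass


class Timestamp(ColType):
    pass


class MyDatabaseSketch:
    # Keys are data-type strings as reported by the table schema query;
    # values are the type classes that describe how to treat such columns.
    TYPE_CLASSES = {
        "integer": Integer,
        "bigint": Integer,
        "double precision": Float,
        "timestamp without time zone": Timestamp,
    }


col_cls = MyDatabaseSketch.TYPE_CLASSES["bigint"]
```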

:data:`ROUNDS_ON_PREC_LOSS`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When a datetime or timestamp value is written to a database, the database may lower its precision to match the target column type.

Some databases lower the precision of timestamp/datetime values by truncating them, and some by rounding them.

``ROUNDS_ON_PREC_LOSS`` should be ``True`` if this database rounds, or ``False`` if it truncates.

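The difference between the two behaviours is easiest to see on a value close to the next second. The helper below is a hypothetical illustration written for this guide; it is not part of data-diff and only mimics what a database would do internally.

```python
from datetime import datetime, timedelta


def lower_precision(ts, precision, rounds):
    """Reduce ts to `precision` fractional-second digits (0 to 6)."""
    step = 10 ** (6 - precision)  # microseconds per unit at the target precision
    remainder = ts.microsecond % step
    truncated = ts - timedelta(microseconds=remainder)
    if rounds and remainder * 2 >= step:
        # Round half up to the next unit, possibly carrying into the seconds.
        return truncated + timedelta(microseconds=step)
    return truncated


ts = datetime(2022, 1, 1, 12, 0, 0, 999_999)
truncated = lower_precision(ts, 3, rounds=False)  # 12:00:00.999
rounded = lower_precision(ts, 3, rounds=True)     # 12:00:01
```

The same source value ends up a full millisecond apart depending on the behaviour, which is why the diffing logic has to know which one each database uses.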

:meth:`__init__`, :meth:`create_connection()`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The options for the database connection will be given to the ``__init__()`` method as keyword arguments.

If you inherit from ``Database``, your ``__init__()`` method may create the database connection.

If you inherit from ``ThreadedDatabase``, you should instead create the connection in the ``create_connection()`` method.


:meth:`close()`
~~~~~~~~~~~~~~~~

If you inherit from ``Database``, you will need to implement this method to close the connection yourself.

If you inherit from ``ThreadedDatabase``, you don't have to implement this method.

Docs:

- :meth:`data_diff.databases.database_types.AbstractDatabase.close`

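Putting the ``Database`` case together, a driver that owns a single connection might look roughly like this. ``SimpleDatabaseSketch`` and its ``database`` option are hypothetical, it does not inherit from the real base class, and ``sqlite3`` is used only to keep the sketch runnable.

```python
import sqlite3


class SimpleDatabaseSketch:
    """Hypothetical driver in the style of a ``Database`` subclass."""

    def __init__(self, database=":memory:"):
        # Connection options arrive as keyword arguments;
        # the connection is created right here in __init__().
        self._conn = sqlite3.connect(database)

    def _query(self, sql_code):
        return self._conn.execute(sql_code).fetchall()

    def close(self):
        # This class opened the connection, so it is responsible for closing it.
        self._conn.close()


db = SimpleDatabaseSketch(database=":memory:")
rows = db._query("SELECT 40 + 2")
db.close()
```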

:meth:`quote()`, :meth:`to_string()`, :meth:`normalize_number()`, :meth:`normalize_timestamp()`, :meth:`md5_to_int()`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

These methods are used when creating queries.

They accept an SQL code fragment, and return a new code fragment representing the appropriate computation.

For more information, read their docs:

- :meth:`data_diff.databases.database_types.AbstractDatabase.quote`

- :meth:`data_diff.databases.database_types.AbstractDatabase.to_string`

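The "fragment in, fragment out" idea can be illustrated with two of these methods. The class below is a hypothetical sketch using PostgreSQL-flavoured syntax (double-quoted identifiers and a ``::varchar`` cast); other databases will need different SQL.

```python
class SqlFragmentsSketch:
    """Hypothetical examples of methods that transform SQL code fragments."""

    def quote(self, s):
        # Wrap an identifier in the database's identifier quotes.
        return f'"{s}"'

    def to_string(self, s):
        # Wrap an SQL expression in a cast to a string type.
        return f"{s}::varchar"


db = SqlFragmentsSketch()
fragment = db.to_string(db.quote("id"))
```

No method here talks to the database; each one only builds text, so the results can be composed freely into larger queries.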

3. Add tests
--------------

Add your new database to the ``DATABASE_TYPES`` dict in ``tests/test_database_types.py``.

The key is the class itself, and the value is a dict of ``{category: [type1, type2, ...]}``.

Categories supported are: ``int``, ``datetime``, ``float``, and ``uuid``.

Example:

::

    DATABASE_TYPES = {
        # ... other databases ...
        db.PostgreSQL: {
            "int": ["int", "bigint"],
            "datetime": [
                "timestamp(6) without time zone",
                "timestamp(3) without time zone",
                "timestamp(0) without time zone",
                "timestamp with time zone",
            ],
            # ... other categories ...
        },
    }


Then run the tests and make sure your database driver is being tested.

You can run the tests with ``unittest``.

To save time, we recommend running them with ``unittest-parallel``.

When debugging, we recommend using the ``-f`` flag to stop on the first error or failure. Also, use the ``-k`` flag to run only the individual test that you're trying to fix.

4. Create a pull request
------------------------

Open a pull request on GitHub, and we'll take it from there!

