|
| 1 | +How to implement a new database driver for data-diff |
| 2 | +==================================================== |
| 3 | + |
| 4 | +First, read through the CONTRIBUTING.md document. |
| 5 | + |
| 6 | +Make sure data-diff is set up for development, and that all the tests pass (try to set it up for at least MySQL and PostgreSQL). |
| 7 | + |
| 8 | +Look at the other database drivers for examples and inspiration. |
| 9 | + |
| 10 | + |
| 11 | +1. Add dependencies to ``pyproject.toml`` |
| 12 | +----------------------------------------- |
| 13 | + |
| 14 | +Most new drivers will require a third-party library in order to connect to the database. |
| 15 | + |
| 16 | +These dependencies should be specified in the ``pyproject.toml`` file, in ``[tool.poetry.extras]``. Example: |
| 17 | + |
| 18 | +:: |
| 19 | + |
| 20 | + [tool.poetry.extras] |
| 21 | + postgresql = ["psycopg2"] |
| 22 | + |
| 23 | +Then, users can install the dependencies needed for your database driver with ``pip install 'data-diff[postgresql]'``. |
| 24 | + |
| 25 | +This way, data-diff can support a wide variety of drivers, without requiring our users to install libraries that they won't use. |
| 26 | + |
| 27 | +2. Implement database module |
| 28 | +---------------------------- |
| 29 | + |
| 30 | +New database modules belong in the ``data_diff/databases`` directory. |
| 31 | + |
| 32 | +Import on demand |
| 33 | +~~~~~~~~~~~~~~~~~ |
| 34 | + |
| 35 | +Database drivers should not import any third-party library at the module level. |
| 36 | + |
| 37 | +Instead, they should be imported and initialized within a function. Example: |
| 38 | + |
| 39 | +:: |
| 40 | + |
| 41 | +    from .base import import_helper |
| 42 | + |
| 43 | +    @import_helper("postgresql") |
| 44 | +    def import_postgresql(): |
| 45 | +        import psycopg2 |
| 46 | +        import psycopg2.extras |
| 47 | + |
| 48 | +        psycopg2.extensions.set_wait_callback(psycopg2.extras.wait_select) |
| 49 | +        return psycopg2 |
| 50 | + |
| 51 | +We use the ``import_helper()`` decorator to provide a uniform and informative error. The string argument should be the name of the package, as written in ``pyproject.toml``. |
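To make the idea concrete, here is a minimal sketch of how a decorator of this kind could work. It is not data-diff's actual implementation: ``import_helper_sketch``, the ``exampledb`` extra, and the module name are all hypothetical, and the exact wording of the real error message is an assumption.

```python
import functools
import importlib


def import_helper_sketch(extra_name: str):
    """Hypothetical stand-in for ``import_helper()``; not data-diff's actual code."""

    def decorator(import_func):
        @functools.wraps(import_func)
        def wrapper():
            try:
                return import_func()
            except ModuleNotFoundError as e:
                # Point the user at the extra that provides the missing library.
                raise ModuleNotFoundError(
                    f"{e}\n\nYou can install it using 'pip install data-diff[{extra_name}]'."
                ) from e

        return wrapper

    return decorator


@import_helper_sketch("exampledb")
def import_exampledb():
    # The import only runs when the driver is actually used.
    return importlib.import_module("exampledb_client_that_is_not_installed")


try:
    import_exampledb()
except ModuleNotFoundError as e:
    message = str(e)
```

Because the import happens inside the function, merely loading the driver module never fails; the informative error appears only when a user tries to connect without the extra installed.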
| 52 | + |
| 53 | +Choosing a base class, based on threading model |
| 54 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 55 | + |
| 56 | +You can choose to inherit from either ``base.Database`` or ``base.ThreadedDatabase``. |
| 57 | + |
| 58 | +Usually, databases with cursor-based connections, like MySQL or PostgreSQL, only allow one thread per connection. In order to support multithreading, we implement them by inheriting from ``ThreadedDatabase``, which holds a pool of worker threads, and creates a new connection per thread. |
| 59 | + |
| 60 | +Usually, cloud databases, such as Snowflake and BigQuery, open a new connection per request, and support simultaneous queries from any number of threads. In other words, they already support multithreading, so we can implement them by inheriting directly from ``Database``. |
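The "pool of worker threads, one connection per thread" pattern can be sketched as follows. This is a hypothetical illustration that uses ``sqlite3`` as a stand-in driver; ``ThreadedPoolSketch`` and its helper methods are assumptions made for the example and do not reproduce the real ``ThreadedDatabase`` class.

```python
import sqlite3
import threading
from concurrent.futures import ThreadPoolExecutor


class ThreadedPoolSketch:
    """Hypothetical illustration: one connection per worker thread."""

    def __init__(self, thread_count: int = 2):
        self._local = threading.local()
        self._pool = ThreadPoolExecutor(thread_count, initializer=self._init_thread)

    def _init_thread(self):
        # Runs once in each worker thread, so no connection is ever shared.
        self._local.conn = self.create_connection()

    def create_connection(self):
        # Stand-in for a real driver call; subclasses would override this.
        return sqlite3.connect(":memory:")

    def _query_in_worker(self, sql_code: str) -> list:
        # Executes on the worker's own thread-local connection.
        return self._local.conn.execute(sql_code).fetchall()

    def query(self, sql_code: str) -> list:
        # Hand the query to the pool and wait for the rows.
        return self._pool.submit(self._query_in_worker, sql_code).result()

    def close(self):
        self._pool.shutdown()


db = ThreadedPoolSketch()
rows = db.query("SELECT 1 + 1")
db.close()
```

A driver whose client library is already thread-safe would not need any of this machinery, which is why such drivers can inherit from ``Database`` directly.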
| 61 | + |
| 62 | + |
| 63 | +:meth:`_query()` |
| 64 | +~~~~~~~~~~~~~~~~~~ |
| 65 | + |
| 66 | +All queries to the database pass through ``_query()``. It takes SQL code, and returns a list of rows. Here is its signature: |
| 67 | + |
| 68 | +:: |
| 69 | + |
| 70 | + def _query(self, sql_code: str) -> list: ... |
| 71 | + |
| 72 | +For standard cursor connections, it's sufficient to implement it with a call to ``base._query_conn()``, like:: |
| 73 | + |
| 74 | + |
| 75 | +    return _query_conn(self._conn, sql_code) |
| 76 | + |
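For readers unfamiliar with cursor-based connections, the sketch below shows roughly what a helper of this kind does with a DB-API connection. ``query_conn_sketch`` is a hypothetical name, ``sqlite3`` is used only as a stand-in driver, and the real ``_query_conn()`` may differ in its details.

```python
import sqlite3


def query_conn_sketch(conn, sql_code: str) -> list:
    """Hypothetical helper: run SQL on a DB-API connection and return the rows."""
    cursor = conn.cursor()
    cursor.execute(sql_code)
    # Statements that produce no result set leave ``description`` as None.
    if cursor.description is not None:
        return cursor.fetchall()
    return []


conn = sqlite3.connect(":memory:")
created = query_conn_sketch(conn, "CREATE TABLE t (id INTEGER)")
rows = query_conn_sketch(conn, "SELECT 1, 'a'")
conn.close()
```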
| 77 | + |
| 78 | +:meth:`select_table_schema()` / :meth:`query_table_schema()` |
| 79 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 80 | + |
| 81 | +If your database does not have an ``information_schema.columns`` table, or if its structure is unusual, you may have to implement your own ``select_table_schema()`` function. It returns the query that fetches column information as a list of tuples, where each tuple is ``(column_name, data_type, datetime_precision, numeric_precision, numeric_scale)``. |
| 82 | + |
| 83 | +If such a query isn't possible, you may have to implement ``query_table_schema()`` yourself, which extracts this information from the database, and returns it in the proper form. |
| 84 | + |
| 85 | +If the information returned from ``query_table_schema()`` requires slow or error-prone post-processing, you may delay that post-processing by overriding ``_process_table_schema()`` and implementing it there. The method ``_process_table_schema()`` only gets called for the columns that will be diffed. |
| 86 | + |
| 87 | +Documentation: |
| 88 | + |
| 89 | +- :meth:`data_diff.databases.database_types.AbstractDatabase.select_table_schema` |
| 90 | + |
| 91 | +- :meth:`data_diff.databases.database_types.AbstractDatabase.query_table_schema` |
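As an illustration of the expected row shape, the sketch below builds a schema query for SQLite, which has no ``information_schema.columns`` table. It is only a sketch: ``select_table_schema_sketch`` is a hypothetical free function that takes a plain, trusted table name, whereas the real method's signature is documented in the references above. SQLite does not report precision or scale, so those fields are returned as ``NULL``.

```python
import sqlite3


def select_table_schema_sketch(table_name: str) -> str:
    """Hypothetical: return SQL yielding
    (column_name, data_type, datetime_precision, numeric_precision, numeric_scale)."""
    # The table name is interpolated directly, so it must come from a trusted source.
    return (
        "SELECT name, type, NULL, NULL, NULL "
        f"FROM pragma_table_info('{table_name}')"
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (id INTEGER, created_at TIMESTAMP)")
schema_rows = conn.execute(select_table_schema_sketch("ratings")).fetchall()
conn.close()
```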
| 92 | + |
| 93 | +:data:`TYPE_CLASSES` |
| 94 | +~~~~~~~~~~~~~~~~~~~~~~ |
| 95 | + |
| 96 | +Each database class must have a ``TYPE_CLASSES`` dictionary, which maps between the string data-type, as returned by querying the table schema, into the appropriate data-diff type class, i.e. a subclass of ``database_types.ColType``. |
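A minimal sketch of such a mapping is shown below. The classes here are local stand-ins defined only for this example; a real driver would use the type classes from ``database_types``, and the type strings must match whatever your database's schema query actually returns.

```python
class ColType:
    """Stand-in base class for this sketch (not imported from data-diff)."""


class Integer(ColType):
    pass


class Float(ColType):
    pass


class Timestamp(ColType):
    pass


# Keys are the type strings reported by the database's schema query.
TYPE_CLASSES_SKETCH = {
    "int": Integer,
    "bigint": Integer,
    "double precision": Float,
    "timestamp without time zone": Timestamp,
}


def lookup_type_class(type_repr: str):
    # Returns None for types the driver does not know how to diff.
    return TYPE_CLASSES_SKETCH.get(type_repr)
```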
| 97 | + |
| 98 | +:data:`ROUNDS_ON_PREC_LOSS` |
| 99 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 100 | + |
| 101 | +When providing a datetime or a timestamp to a database, the database may lower its precision to correspond with the target column type. |
| 102 | + |
| 103 | +Some databases will lower precision of timestamp/datetime values by truncating them, and some by rounding them. |
| 104 | + |
| 105 | +``ROUNDS_ON_PREC_LOSS`` should be True if this database rounds, or False if it truncates. |
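The difference between the two behaviors can be seen in a small, database-independent sketch. The helper names are hypothetical, and the rounding shown is round-half-up; a particular database may treat exact ties differently.

```python
from datetime import datetime, timedelta


def truncate_to_precision(ts: datetime, precision: int) -> datetime:
    # Number of microseconds per unit at the target precision (0 to 6 digits).
    step = 10 ** (6 - precision)
    return ts.replace(microsecond=ts.microsecond // step * step)


def round_to_precision(ts: datetime, precision: int) -> datetime:
    step = 10 ** (6 - precision)
    # Add half a step, then truncate; timedelta carries over into seconds if needed.
    return truncate_to_precision(ts + timedelta(microseconds=step // 2), precision)


ts = datetime(2022, 1, 1, 12, 0, 0, 123999)
truncated = truncate_to_precision(ts, 3)  # what a truncating database would store
rounded = round_to_precision(ts, 3)  # what a rounding database would store
```

If the flag is set incorrectly, the same source value can normalize to different strings on the two sides of a diff, which would show up as a spurious mismatch.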
| 106 | + |
| 107 | +:meth:`__init__`, :meth:`create_connection()` |
| 108 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 109 | + |
| 110 | +The options for the database connection will be given to the ``__init__()`` method as keywords. |
| 111 | + |
| 112 | +If you inherit from ``Database``, your ``__init__()`` method may create the database connection. |
| 113 | + |
| 114 | +If you inherit from ``ThreadedDatabase``, you should instead create the connection in the ``create_connection()`` method. |
| 115 | + |
| 116 | +:meth:`close()` |
| 117 | +~~~~~~~~~~~~~~~~ |
| 118 | + |
| 119 | +If you inherit from ``Database``, you will need to implement this method to close the connection yourself. |
| 120 | + |
| 121 | +If you inherit from ``ThreadedDatabase``, you don't have to implement this method. |
| 122 | + |
| 123 | +Docs: |
| 124 | + |
| 125 | +- :meth:`data_diff.databases.database_types.AbstractDatabase.close` |
| 126 | + |
| 127 | +:meth:`quote()`, :meth:`to_string()`, :meth:`normalize_number()`, :meth:`normalize_timestamp()`, :meth:`md5_to_int()` |
| 128 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 129 | + |
| 130 | +These methods are used when creating queries. |
| 131 | + |
| 132 | +They accept an SQL code fragment, and return a new code fragment representing the appropriate computation. |
| 133 | + |
| 134 | +For more information, read their docs: |
| 135 | + |
| 136 | +- :meth:`data_diff.databases.database_types.AbstractDatabase.quote` |
| 137 | + |
| 138 | +- :meth:`data_diff.databases.database_types.AbstractDatabase.to_string` |
| 139 | + |
| 140 | +3. Add tests |
| 141 | +-------------- |
| 142 | + |
| 143 | +Add your new database to the ``DATABASE_TYPES`` dict in ``tests/test_database_types.py``. |
| 144 | + |
| 145 | +The key is the class itself, and the value is a dict of ``{category: [type1, type2, ...]}``. |
| 146 | + |
| 147 | +Categories supported are: ``int``, ``datetime``, ``float``, and ``uuid``. |
| 148 | + |
| 149 | +Example: |
| 150 | + |
| 151 | +:: |
| 152 | + |
| 153 | +    DATABASE_TYPES = { |
| 154 | +        ... |
| 155 | +        db.PostgreSQL: { |
| 156 | +            "int": [ "int", "bigint" ], |
| 157 | +            "datetime": [ |
| 158 | +                "timestamp(6) without time zone", |
| 159 | +                "timestamp(3) without time zone", |
| 160 | +                "timestamp(0) without time zone", |
| 161 | +                "timestamp with time zone", |
| 162 | +            ], |
| 163 | +            ... |
| 164 | +        }, |
| 165 | + |
| 166 | + |
| 167 | +Then run the tests and make sure your database driver is being tested. |
| 168 | + |
| 169 | +You can run the tests with ``unittest``. |
| 170 | + |
| 171 | +To save time, we recommend running them with ``unittest-parallel``. |
| 172 | + |
| 173 | +When debugging, we recommend using the ``-f`` flag, to stop on error. Also, use the ``-k`` flag to run only the individual test that you're trying to fix. |
| 174 | + |
| 175 | +4. Create Pull-Request |
| 176 | +----------------------- |
| 177 | + |
| 178 | +Open a pull request on GitHub, and we'll take it from there! |
| 179 | + |