Bug
When running datafaker create-data against an MS-SQL destination, the following error is raised:
('42000', "[42000] [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]
Table 'mimic100_synthetic.person' does not have the identity property.
Cannot perform SET operation. (8106) (SQLExecDirectW)")
[SQL: SET IDENTITY_INSERT mimic100_synthetic.person ON]
Datafaker halts before inserting any rows.
Root cause
There are two systems in conflict:
-
remove_mssql_identity DDL hook (create.py:46–57) strips IDENTITY from every CREATE TABLE statement on MS-SQL, so the physical column is a plain INTEGER with no identity property.
-
SQLAlchemy's insert path sees Column(Integer, primary_key=True) with autoincrement='auto' (the SQLAlchemy default). When an explicit PK value is supplied, SQLAlchemy wraps the INSERT with SET IDENTITY_INSERT <table> ON. SQL Server rejects this with error 8106 because the column has no identity property — it was stripped in step 1.
The result is that the table and the metadata disagree about whether the column is an identity column, so either the table creation or the insert path will always be wrong.
Options considered
Option A — Let the database generate PKs ✅ (chosen)
- Remove
remove_mssql_identity so tables are created with IDENTITY(1,1).
- Stop supplying explicit PK values in generated
df.py files. The DB generates them, and SQLAlchemy's return_defaults() captures the generated value for story generators.
Pros:
- PKs are guaranteed unique even across concurrent datafaker runs.
- No
SELECT MAX() query on first insert.
- Simpler generated
df.py — no increment() boilerplate per identity column.
- Works correctly when datafaker is run multiple times against a non-empty destination.
Cons:
- Generated PK values are not deterministic — cannot reproduce the exact same synthetic dataset.
- Existing hand-edited
df.py files must be updated to remove increment() calls.
Option B — Suppress autoincrement in MetaData, keep increment()
- Add
autoincrement=False to Column(...) in dict_to_column (serialize_metadata.py). SQLAlchemy stops emitting SET IDENTITY_INSERT; increment() in df.py supplies explicit values as before.
Pros:
- Smallest possible code change (one keyword argument).
- Deterministic/reproducible PK values.
- No need to regenerate or edit
df.py files.
Cons:
- Does not give automatic generation — every
df.py must still wire up increment() per identity PK.
- In-memory accumulator is not concurrency-safe: two parallel datafaker processes both query
SELECT MAX() and start from the same value → PK collision.
Option C — Keep IDENTITY on tables, keep increment() (minimal surgical)
- Remove
remove_mssql_identity so tables have IDENTITY. Keep increment() in df.py to supply explicit values. SET IDENTITY_INSERT ON/OFF now works because the column actually has IDENTITY.
Pros:
- Smallest change to
create.py (delete ~10 lines). No changes to df.py or serialize_metadata.py.
Cons:
- Two systems manage the same counter. The DB's IDENTITY sequence falls behind explicit inserts; if anything ever inserts without
IDENTITY_INSERT ON, the DB-generated value collides with a previously-inserted one.
SET IDENTITY_INSERT is one-table-at-a-time per session; story generators touching multiple identity-PK tables in one pass hit this limit.
Fix (Option A)
The following files are changed in the associated PR:
datafaker/create.py: Remove remove_mssql_identity hook — MS-SQL tables now get IDENTITY(1,1).
datafaker/make.py: _get_default_generator returns None for single-column integer PKs with no FK; _get_generator_for_table excludes them from both nonnull_columns and row_gens. The existing comment ("we presume that primary keys are populated automatically") already documented this intent.
examples/omop-mssql/df.py: Remove manual increment() call for person_id.
tests/test_create_mssql.py: Invert TestMSSQLRemoveIdentity → TestMSSQLIdentityPresent; add insert-without-PK test.
tests/test_make.py: Add TestGetDefaultGenerator unit tests.
Bug
When running
datafaker create-dataagainst an MS-SQL destination, the following error is raised:Datafaker halts before inserting any rows.
Root cause
There are two systems in conflict:
remove_mssql_identityDDL hook (create.py:46–57) stripsIDENTITYfrom everyCREATE TABLEstatement on MS-SQL, so the physical column is a plainINTEGERwith no identity property.SQLAlchemy's insert path sees
Column(Integer, primary_key=True)withautoincrement='auto'(the SQLAlchemy default). When an explicit PK value is supplied, SQLAlchemy wraps the INSERT withSET IDENTITY_INSERT <table> ON. SQL Server rejects this with error 8106 because the column has no identity property — it was stripped in step 1.The result is that the table and the metadata disagree about whether the column is an identity column, so either the table creation or the insert path will always be wrong.
Options considered
Option A — Let the database generate PKs ✅ (chosen)
remove_mssql_identityso tables are created withIDENTITY(1,1).df.pyfiles. The DB generates them, and SQLAlchemy'sreturn_defaults()captures the generated value for story generators.Pros:
SELECT MAX()query on first insert.df.py— noincrement()boilerplate per identity column.Cons:
df.pyfiles must be updated to removeincrement()calls.Option B — Suppress autoincrement in MetaData, keep
increment()autoincrement=FalsetoColumn(...)indict_to_column(serialize_metadata.py). SQLAlchemy stops emittingSET IDENTITY_INSERT;increment()indf.pysupplies explicit values as before.Pros:
df.pyfiles.Cons:
df.pymust still wire upincrement()per identity PK.SELECT MAX()and start from the same value → PK collision.Option C — Keep IDENTITY on tables, keep
increment()(minimal surgical)remove_mssql_identityso tables haveIDENTITY. Keepincrement()indf.pyto supply explicit values.SET IDENTITY_INSERT ON/OFFnow works because the column actually has IDENTITY.Pros:
create.py(delete ~10 lines). No changes todf.pyorserialize_metadata.py.Cons:
IDENTITY_INSERT ON, the DB-generated value collides with a previously-inserted one.SET IDENTITY_INSERTis one-table-at-a-time per session; story generators touching multiple identity-PK tables in one pass hit this limit.Fix (Option A)
The following files are changed in the associated PR:
datafaker/create.py: Removeremove_mssql_identityhook — MS-SQL tables now getIDENTITY(1,1).datafaker/make.py:_get_default_generatorreturnsNonefor single-column integer PKs with no FK;_get_generator_for_tableexcludes them from bothnonnull_columnsandrow_gens. The existing comment ("we presume that primary keys are populated automatically") already documented this intent.examples/omop-mssql/df.py: Remove manualincrement()call forperson_id.tests/test_create_mssql.py: InvertTestMSSQLRemoveIdentity→TestMSSQLIdentityPresent; add insert-without-PK test.tests/test_make.py: AddTestGetDefaultGeneratorunit tests.