Skip to content

configure-generators and make-stats fail on MS-SQL: PostgreSQL-specific SQL functions used in source queries #105

@myyong

Description

@myyong

Bug

Running datafaker configure-generators against an MS-SQL source raises:

('42000', "[42000] [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]
'EXTRACT' is not a recognized built-in function name. (195)")
[SQL: SELECT MIN(CAST(EXTRACT(YEAR FROM birth_datetime) AS INT)) AS start,
      MAX(CAST(EXTRACT(YEAR FROM birth_datetime) AS INT)) AS end FROM person]

A full audit of raw SQL in the codebase found three MS-SQL incompatibilities in the commands that query the source database.


Affected locations

1. EXTRACT(YEAR FROM ...)datafaker/generators/mimesis.py:189–195

Triggered by configure-generators.

extract_year = f"CAST(EXTRACT(YEAR FROM {column.name}) AS INT)"
max_year = f"MAX({extract_year})"
min_year = f"MIN({extract_year})"
# ...
result = connection.execute(text(
    f"SELECT {min_year} AS start, {max_year} AS end FROM {column.table.name}"
))

MS-SQL does not support EXTRACT(); it uses YEAR(col) or DATEPART(year, col).

2. EXTRACT(YEAR FROM ...)datafaker/make.py:318–321

Triggered by create-generators and make-stats for Date/DateTime columns.

_YEAR_SUMMARY_QUERY = (
    "SELECT MIN(y) AS start, MAX(y) AS end FROM "
    "(SELECT EXTRACT(YEAR FROM {column}) AS y FROM {table}) AS years"
)

3. STDDEV()datafaker/generators/base.py:325–327

Triggered by configure-generators for numeric columns used in bucket generation.

text(
    f"SELECT AVG({column_name}) AS mean,"
    f" STDDEV({column_name}) AS stddev,"
    f" COUNT({column_name}) AS count FROM {table_name}"
)

MS-SQL spells this STDEV() (no trailing D), not STDDEV().


Already fixed

RANDOM()NEWID() was fixed in providers.py (commit 84a209f) using a connection.dialect.name == "mssql" check — the same pattern applies here.


Options for fixes

Option A — Dialect check with raw SQL (follows existing pattern)

Mirror the NEWID() fix: check connection.dialect.name or engine.dialect.name at the call site and substitute the MS-SQL equivalent.

# EXTRACT fix
if dialect_name == "mssql":
    year_expr = f"YEAR({col})"
else:
    year_expr = f"EXTRACT(YEAR FROM {col})"

# STDDEV fix
if dialect_name == "mssql":
    stddev_fn = "STDEV"
else:
    stddev_fn = "STDDEV"

Pros: Minimal change; consistent with NEWID() pattern already in the codebase.
Cons: Must enumerate dialects; each new dialect may need adding.

Option B — SQLAlchemy expression API (most portable)

Replace raw SQL strings with SQLAlchemy expressions that compile correctly per dialect:

  • sqlalchemy.extract('year', col)EXTRACT(YEAR FROM col) on PostgreSQL/DuckDB, DATEPART(year, col) on MS-SQL
  • sqlalchemy.func.stddev(col)STDDEV(col) on PostgreSQL, STDEV(col) on MS-SQL (SQLAlchemy maps this automatically)

Requires changing text() calls to expression-based queries. Both locations already have access to a Column object.

Pros: Automatically correct for all current and future SQLAlchemy-supported dialects.
Cons: More refactoring; _YEAR_SUMMARY_QUERY can no longer be a plain string constant.

Affected files summary

File Line Issue
datafaker/generators/mimesis.py 189–195 EXTRACT(YEAR FROM ...)
datafaker/make.py 318–321 EXTRACT(YEAR FROM ...)
datafaker/generators/base.py 325–327 STDDEV() vs STDEV()

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions