Skip to content

configure-generators fails on MS-SQL: RANDOM() and LIMIT not supported in ChoiceGeneratorFactory #106

@myyong

Description

@myyong

Bug

Running datafaker configure-generators against an MS-SQL source database fails
for any column that is assessed as a choice distribution. ChoiceGeneratorFactory.get_generators()
executes raw SQL containing RANDOM() and LIMIT, neither of which MS-SQL supports.

Expected error (RANDOM):

('42000', "[42000] [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]
'RANDOM' is not a recognized built-in function name.")
[SQL: SELECT v, COUNT(v) AS f FROM (SELECT "col" as v FROM "table"
      ORDER BY RANDOM() LIMIT 500) AS _inner GROUP BY v ORDER BY f DESC]

Expected error (LIMIT):

('42000', "Incorrect syntax near 'LIMIT'.")
[SQL: SELECT "col" AS v, COUNT("col") AS f FROM "table"
      GROUP BY v ORDER BY f DESC LIMIT 501]

Affected locations

1. ChoiceGeneratorFactory.get_generators()datafaker/generators/choice.py:307–313

Distinct-value count query, uses LIMIT:

text(
    f'SELECT "{column_name}" AS v, COUNT("{column_name}")'
    f' AS f FROM "{table_name}" GROUP BY v'
    f" ORDER BY f DESC LIMIT {MAXIMUM_CHOICES + 1}"
)

MS-SQL has no LIMIT clause; the equivalent is SELECT TOP n.

2. ChoiceGeneratorFactory.get_generators()datafaker/generators/choice.py:352–359

Random-sample query, uses RANDOM() and LIMIT:

text(
    f"SELECT v, COUNT(v) AS f FROM"
    f' (SELECT "{column_name}" as v FROM "{table_name}"'
    f" ORDER BY RANDOM() LIMIT {self.SAMPLE_COUNT})"
    f" AS _inner GROUP BY v ORDER BY f DESC"
)

MS-SQL issues:

  • RANDOM() → must use NEWID() for per-row randomness in ORDER BY
  • LIMIT n → must use TOP n before the column list

3. Stored query strings in ChoiceGenerator.__init__()choice.py:91–95, 117–121

These query strings are stored on the generator and returned via custom_queries(),
which writes them into src-stats.yaml. They are later executed by make-stats
against the source database, so they must also be dialect-correct.

  • Line 94: ORDER BY RANDOM() LIMIT {sample_count} in a subquery
  • Line 120: same pattern, with an additional suppress filter

4. ORDER BY in subqueries without TOPchoice.py:108, 121

MS-SQL disallows ORDER BY inside a subquery (derived table) unless TOP or
OFFSET…FETCH is also present. The suppress-only paths (suppress_count > 0, sample_count = None) produce subqueries with ORDER BY count DESCbut noLIMIT/TOP`,
which MS-SQL will reject.

# line 108 — ORDER BY inside subquery, no TOP
f" GROUP BY value ORDER BY count DESC) AS _inner"
f" WHERE {suppress_count} < count"

5. Schema-missing FROM clause in Buckets queries — generators/base.py

Buckets.make_buckets() and Buckets.__init__() used table(table_name) (a schema-less
SQLAlchemy TableClause) and raw text() SQL, producing unqualified table names such as:

SELECT avg(ethnicity_concept_id) AS mean, stdev(ethnicity_concept_id) AS stddev,
       count(ethnicity_concept_id) AS count FROM person

MS-SQL rejects this with "Invalid object name 'person'." (error 42S02) when person
lives under a non-default schema. Callers in continuous.py and mimesis.py passed
only column.table.name (a plain string), discarding the schema.

Additionally, Buckets.__init__() used GROUP BY b (grouping by alias), which MS-SQL
does not support.


Already fixed in this codebase

  • RANDOM()NEWID() in providers.py using connection.dialect.name == "mssql" check
  • EXTRACT(YEAR FROM …)DATEPART(year, …) in generators/mimesis.py using SQLAlchemy extract()
  • STDDEV()STDEV() in generators/base.py using func.stdev / func.stddev

Options for fixes

Option A — Dialect check with raw SQL (follows existing providers.py pattern)

Check engine.dialect.name == "mssql" in ChoiceGeneratorFactory.get_generators() and
build separate SQL strings. Pass the dialect to ChoiceGenerator.__init__() so it can
store the correct _query string for src-stats.yaml.

MS-SQL equivalents:

-- Distinct-value query
SELECT TOP 501 "col" AS v, COUNT("col") AS f FROM "table" GROUP BY v ORDER BY f DESC

-- Random-sample query
SELECT v, COUNT(v) AS f FROM
  (SELECT TOP 500 "col" AS v FROM "table" ORDER BY NEWID()) AS _inner
GROUP BY v ORDER BY f DESC

-- Suppress-only subquery (drop the unnecessary ORDER BY)
SELECT value FROM
  (SELECT col AS value, COUNT(col) AS count FROM table
   WHERE col IS NOT NULL GROUP BY value) AS _inner
WHERE suppress_count < count

Pros: Minimal change; consistent with the NEWID() pattern already in providers.py.
Stored query strings are straightforward dialect-specific strings.
Cons: Must enumerate dialects explicitly; each new dialect may require additions.
ChoiceGenerator.__init__() signature changes (dialect parameter added).

Option B — SQLAlchemy expression API (most portable)

Use select(), .limit(), func.newid()/func.random(), and .subquery(). SQLAlchemy's
.limit(n) automatically compiles to SELECT TOP n … on MS-SQL and … LIMIT n on
PostgreSQL/DuckDB — so LIMIT vs TOP is handled for free. RANDOM() vs NEWID() still
requires a dialect check (same as providers.py).

For the stored _query strings: compile the SQLAlchemy expression against the
engine's dialect at construct time (the same technique used in mimesis.py after the
recent fix for _min_year/_max_year). The factory passes engine (or engine.dialect)
to ChoiceGenerator.__init__().

# Live query (get_generators)
random_fn = func.newid() if engine.dialect.name == "mssql" else func.random()
inner = (
    select(literal_column(f'"{column_name}"').label("v"))
    .select_from(table(table_name))
    .order_by(random_fn)
    .limit(self.SAMPLE_COUNT)
    .subquery("_inner")
)
stmt = (
    select(inner.c.v, func.count(inner.c.v).label("f"))
    .select_from(inner)
    .group_by(inner.c.v)
    .order_by(desc("f"))
)

# Stored query string for src-stats.yaml
stored_sql = str(stmt.compile(dialect=engine.dialect, compile_kwargs={"literal_binds": True}))

Pros: LIMIT/TOP handled automatically by SQLAlchemy; no need to enumerate dialects
for that part. Consistent with mimesis.py and base.py after recent fixes.
Cons: More significant refactor; ChoiceGenerator constructor changes; compiled SQL
in src-stats.yaml is still dialect-locked at the time configure-generators is run.


Affected files summary

File Lines Issue
datafaker/generators/choice.py 307–313 LIMIT without TOP
datafaker/generators/choice.py 352–359 RANDOM() + LIMIT
datafaker/generators/choice.py 91–95, 117–121 stored _query with RANDOM() LIMIT
datafaker/generators/choice.py 108, 121 ORDER BY in subquery without TOP
datafaker/generators/base.py Buckets.__init__, make_buckets schema-missing FROM, GROUP BY alias
datafaker/generators/continuous.py 163 caller passes bare table name
datafaker/generators/mimesis.py 320–324 caller passes bare table name

Fixes applied (Option B)

7c0add6 — ChoiceGeneratorFactory: RANDOM()/LIMIT → SQLAlchemy expression API

  • Replaced raw text() queries in ChoiceGeneratorFactory.get_generators() with SQLAlchemy select() expressions.
  • Added _choice_stmt() module-level helper that builds dialect-correct SELECT expressions: .limit(n) compiles to TOP n on MS-SQL and LIMIT n elsewhere; func.newid() / func.random() selected per dialect.
  • Added dialect parameter to ChoiceGenerator.__init__() to compile the stored _query string at construction time against the correct dialect.
  • MS-SQL suppress-only path no longer emits ORDER BY inside a subquery without TOP.
  • Live queries in get_generators() use src_table = column.table (the actual SQLAlchemy Table object) to preserve schema qualification in the FROM clause.

2bcea2b — ChoiceGeneratorFactory: schema-qualified table in live queries

  • Fixed the two live queries in get_generators() to use column.table (with schema) rather than table(table_name) (schema-less TableClause).
  • Added test test_schema_qualified_table_appears_in_from to verify schema appears in both queries on MS-SQL and PostgreSQL.

41b96f6 — Buckets: schema-missing FROM clause and GROUP BY alias

  • Buckets.make_buckets() and Buckets.__init__() now accept an optional src_table parameter (a SQLAlchemy Table object). When provided, it is used in select_from() instead of the schema-less table(table_name).
  • Buckets.__init__() converted from raw text() SQL to a SQLAlchemy expression: func.floor(...) with group_by(floor_expr) (grouping by expression, not alias — avoids MS-SQL GROUP BY alias rejection).
  • Callers in continuous.py and mimesis.py updated to pass src_table=column.table.
  • Added TestBucketsSchemaQualified with 3 tests verifying schema appears in the FROM clause.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions