Bug
Running datafaker configure-generators against an MS-SQL source database fails
for any column that is assessed as a choice distribution. ChoiceGeneratorFactory.get_generators()
executes raw SQL containing RANDOM() and LIMIT, neither of which MS-SQL supports.
Expected error (RANDOM):
('42000', "[42000] [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]
'RANDOM' is not a recognized built-in function name.")
[SQL: SELECT v, COUNT(v) AS f FROM (SELECT "col" as v FROM "table"
ORDER BY RANDOM() LIMIT 500) AS _inner GROUP BY v ORDER BY f DESC]
Expected error (LIMIT):
('42000', "Incorrect syntax near 'LIMIT'.")
[SQL: SELECT "col" AS v, COUNT("col") AS f FROM "table"
GROUP BY v ORDER BY f DESC LIMIT 501]
Affected locations
1. ChoiceGeneratorFactory.get_generators() — datafaker/generators/choice.py:307–313
Distinct-value count query, uses LIMIT:
text(
f'SELECT "{column_name}" AS v, COUNT("{column_name}")'
f' AS f FROM "{table_name}" GROUP BY v'
f" ORDER BY f DESC LIMIT {MAXIMUM_CHOICES + 1}"
)
MS-SQL has no LIMIT clause; the equivalent is SELECT TOP n.
2. ChoiceGeneratorFactory.get_generators() — datafaker/generators/choice.py:352–359
Random-sample query, uses RANDOM() and LIMIT:
text(
f"SELECT v, COUNT(v) AS f FROM"
f' (SELECT "{column_name}" as v FROM "{table_name}"'
f" ORDER BY RANDOM() LIMIT {self.SAMPLE_COUNT})"
f" AS _inner GROUP BY v ORDER BY f DESC"
)
MS-SQL issues:
- RANDOM() → must use NEWID() for per-row randomness in ORDER BY (T-SQL's RAND() is evaluated once per query, so it cannot shuffle rows)
- LIMIT n → must use TOP n before the column list
3. Stored query strings in ChoiceGenerator.__init__() — choice.py:91–95, 117–121
These query strings are stored on the generator and returned via custom_queries(),
which writes them into src-stats.yaml. They are later executed by make-stats
against the source database, so they must also be dialect-correct.
- Line 94: ORDER BY RANDOM() LIMIT {sample_count} in a subquery
- Line 120: same pattern, with an additional suppress filter
4. ORDER BY in subqueries without TOP — choice.py:108, 121
MS-SQL disallows ORDER BY inside a subquery (derived table) unless TOP or
OFFSET…FETCH is also present. The suppress-only paths (suppress_count > 0,
sample_count = None) produce subqueries with ORDER BY count DESC but no LIMIT/TOP,
which MS-SQL will reject.
# line 108 — ORDER BY inside subquery, no TOP
f" GROUP BY value ORDER BY count DESC) AS _inner"
f" WHERE {suppress_count} < count"
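The inner ORDER BY in the suppress-only path has no effect on which values pass the outer filter, so dropping it should be safe. The sketch below checks this against an in-memory SQLite database using only the standard library. The helper name suppress_only_sql, the person table and the gender column are illustrative assumptions, not code from choice.py, and the inner query groups by the column expression rather than the alias so that the same string shape would also be acceptable to MS-SQL.

```python
import sqlite3


def suppress_only_sql(table_name, column_name, suppress_count, inner_order_by):
    """Build the suppress-only query, optionally with the inner ORDER BY.

    Hypothetical helper for illustration; not part of datafaker.
    """
    order_clause = " ORDER BY count DESC" if inner_order_by else ""
    return (
        f'SELECT value FROM (SELECT "{column_name}" AS value,'
        f' COUNT("{column_name}") AS count FROM "{table_name}"'
        f' WHERE "{column_name}" IS NOT NULL'
        f' GROUP BY "{column_name}"{order_clause}) AS _inner'
        f" WHERE {suppress_count} < count"
    )


conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "person" ("gender" TEXT)')
rows = [("F",)] * 5 + [("M",)] * 4 + [("X",)] * 1 + [(None,)] * 2
conn.executemany('INSERT INTO "person" VALUES (?)', rows)

# Values occurring more than once survive; "X" (one row) is suppressed.
with_order = {
    r[0] for r in conn.execute(suppress_only_sql("person", "gender", 1, True))
}
without_order = {
    r[0] for r in conn.execute(suppress_only_sql("person", "gender", 1, False))
}
print(with_order == without_order)
```

Both variants select the same set of values, which is why the MS-SQL form can omit the ORDER BY rather than adding a TOP clause.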
5. Schema-missing FROM clause in Buckets queries — generators/base.py
Buckets.make_buckets() and Buckets.__init__() used table(table_name) (a schema-less
SQLAlchemy TableClause) and raw text() SQL, producing unqualified table names such as:
SELECT avg(ethnicity_concept_id) AS mean, stdev(ethnicity_concept_id) AS stddev,
count(ethnicity_concept_id) AS count FROM person
MS-SQL rejects this with "Invalid object name 'person'." (error 42S02) when person
lives under a non-default schema. Callers in continuous.py and mimesis.py passed
only column.table.name (a plain string), discarding the schema.
Additionally, Buckets.__init__() used GROUP BY b (grouping by alias), which MS-SQL
does not support.
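The difference between a schema-less TableClause and a real Table object can be seen by compiling both without a database connection. This is a minimal sketch: the schema name "cdm", the bucket width of 10 and the variable names are assumptions made for illustration, and the statements are not the ones Buckets actually builds.

```python
from sqlalchemy import Column, Integer, MetaData, Table, column, func, select, table

# Hypothetical table living under a non-default schema named "cdm".
person = Table(
    "person", MetaData(), Column("ethnicity_concept_id", Integer), schema="cdm"
)
col = person.c.ethnicity_concept_id

# Passing only the bare names loses the schema.
bare = select(func.count(column(col.name)).label("n")).select_from(
    table(col.table.name)
)

# Passing the Table object keeps the schema in the FROM clause.
qualified = select(func.count(col).label("n")).select_from(col.table)

# Grouping by the expression object (not its label) avoids GROUP BY <alias>.
bucket = func.floor(col / 10)
grouped = (
    select(bucket.label("b"), func.count().label("f"))
    .select_from(col.table)
    .group_by(bucket)
)

bare_sql, qualified_sql, grouped_sql = str(bare), str(qualified), str(grouped)
print(bare_sql)
print(qualified_sql)
print(grouped_sql)
```

The first statement renders an unqualified FROM person, while the other two render FROM cdm.person, and the last one repeats the floor(...) expression in its GROUP BY clause.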
Already fixed in this codebase
- RANDOM() → NEWID() in providers.py using connection.dialect.name == "mssql" check
- EXTRACT(YEAR FROM …) → DATEPART(year, …) in generators/mimesis.py using SQLAlchemy extract()
- STDDEV() → STDEV() in generators/base.py using func.stdev / func.stddev
Options for fixes
Option A — Dialect check with raw SQL (follows existing providers.py pattern)
Check engine.dialect.name == "mssql" in ChoiceGeneratorFactory.get_generators() and
build separate SQL strings. Pass the dialect to ChoiceGenerator.__init__() so it can
store the correct _query string for src-stats.yaml.
MS-SQL equivalents:
-- Distinct-value query (group by the column, since MS-SQL rejects GROUP BY on an alias)
SELECT TOP 501 "col" AS v, COUNT("col") AS f FROM "table" GROUP BY "col" ORDER BY f DESC
-- Random-sample query
SELECT v, COUNT(v) AS f FROM
(SELECT TOP 500 "col" AS v FROM "table" ORDER BY NEWID()) AS _inner
GROUP BY v ORDER BY f DESC
-- Suppress-only subquery (drop the unnecessary ORDER BY)
SELECT value FROM
(SELECT col AS value, COUNT(col) AS count FROM table
WHERE col IS NOT NULL GROUP BY col) AS _inner
WHERE suppress_count < count
Pros: Minimal change; consistent with the NEWID() pattern already in providers.py.
Stored query strings are straightforward dialect-specific strings.
Cons: Must enumerate dialects explicitly; each new dialect may require additions.
ChoiceGenerator.__init__() signature changes (dialect parameter added).
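Option A could look roughly like the following sketch, which builds both live queries from a dialect name. The function name build_choice_queries and its parameters are hypothetical. The MS-SQL branch groups by the column expression instead of the alias, because MS-SQL does not accept grouping by alias.

```python
def build_choice_queries(
    dialect_name, table_name, column_name, max_choices, sample_count
):
    """Return dialect-specific SQL strings for the two choice queries.

    Hypothetical helper illustrating Option A; not datafaker's actual API.
    """
    col = f'"{column_name}"'
    tbl = f'"{table_name}"'
    if dialect_name == "mssql":
        distinct = (
            f"SELECT TOP {max_choices + 1} {col} AS v, COUNT({col}) AS f"
            f" FROM {tbl} GROUP BY {col} ORDER BY f DESC"
        )
        sample = (
            f"SELECT v, COUNT(v) AS f FROM"
            f" (SELECT TOP {sample_count} {col} AS v FROM {tbl}"
            f" ORDER BY NEWID()) AS _inner GROUP BY v ORDER BY f DESC"
        )
    else:
        distinct = (
            f"SELECT {col} AS v, COUNT({col}) AS f"
            f" FROM {tbl} GROUP BY v ORDER BY f DESC LIMIT {max_choices + 1}"
        )
        sample = (
            f"SELECT v, COUNT(v) AS f FROM"
            f" (SELECT {col} AS v FROM {tbl}"
            f" ORDER BY RANDOM() LIMIT {sample_count}) AS _inner"
            f" GROUP BY v ORDER BY f DESC"
        )
    return {"distinct": distinct, "sample": sample}


mssql_queries = build_choice_queries("mssql", "table", "col", 500, 500)
other_queries = build_choice_queries("postgresql", "table", "col", 500, 500)
print(mssql_queries["distinct"])
print(other_queries["distinct"])
```

Every additional dialect with its own syntax would need another branch here, which is the main cost listed under Cons.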
Option B — SQLAlchemy expression API (most portable)
Use select(), .limit(), func.newid()/func.random(), and .subquery(). SQLAlchemy's
.limit(n) automatically compiles to SELECT TOP n … on MS-SQL and … LIMIT n on
PostgreSQL/DuckDB — so LIMIT vs TOP is handled for free. RANDOM() vs NEWID() still
requires a dialect check (same as providers.py).
For the stored _query strings: compile the SQLAlchemy expression against the
engine's dialect at construct time (the same technique used in mimesis.py after the
recent fix for _min_year/_max_year). The factory passes engine (or engine.dialect)
to ChoiceGenerator.__init__().
from sqlalchemy import desc, func, literal_column, select, table

# Live query (get_generators)
random_fn = func.newid() if engine.dialect.name == "mssql" else func.random()
inner = (
select(literal_column(f'"{column_name}"').label("v"))
.select_from(table(table_name))
.order_by(random_fn)
.limit(self.SAMPLE_COUNT)
.subquery("_inner")
)
stmt = (
select(inner.c.v, func.count(inner.c.v).label("f"))
.select_from(inner)
.group_by(inner.c.v)
.order_by(desc("f"))
)
# Stored query string for src-stats.yaml, compiled for the source dialect
stored_sql = str(
stmt.compile(dialect=engine.dialect, compile_kwargs={"literal_binds": True})
)
Pros: LIMIT/TOP handled automatically by SQLAlchemy; no need to enumerate dialects
for that part. Consistent with mimesis.py and base.py after recent fixes.
Cons: More significant refactor; ChoiceGenerator constructor changes; compiled SQL
in src-stats.yaml is still dialect-locked at the time configure-generators is run.
Affected files summary
| File | Lines | Issue |
| --- | --- | --- |
| datafaker/generators/choice.py | 307–313 | LIMIT without TOP |
| datafaker/generators/choice.py | 352–359 | RANDOM() + LIMIT |
| datafaker/generators/choice.py | 91–95, 117–121 | stored _query with RANDOM() LIMIT |
| datafaker/generators/choice.py | 108, 121 | ORDER BY in subquery without TOP |
| datafaker/generators/base.py | Buckets.__init__, make_buckets | schema-missing FROM, GROUP BY alias |
| datafaker/generators/continuous.py | 163 | caller passes bare table name |
| datafaker/generators/mimesis.py | 320–324 | caller passes bare table name |
Fixes applied (Option B)
7c0add6 — ChoiceGeneratorFactory: RANDOM()/LIMIT → SQLAlchemy expression API
- Replaced raw text() queries in ChoiceGeneratorFactory.get_generators() with SQLAlchemy select() expressions.
- Added _choice_stmt() module-level helper that builds dialect-correct SELECT expressions: .limit(n) compiles to TOP n on MS-SQL and LIMIT n elsewhere; func.newid() / func.random() selected per dialect.
- Added dialect parameter to ChoiceGenerator.__init__() to compile the stored _query string at construction time against the correct dialect.
- MS-SQL suppress-only path no longer emits ORDER BY inside a subquery without TOP.
- Live queries in get_generators() use src_table = column.table (the actual SQLAlchemy Table object) to preserve schema qualification in the FROM clause.
2bcea2b — ChoiceGeneratorFactory: schema-qualified table in live queries
- Fixed the two live queries in get_generators() to use column.table (with schema) rather than table(table_name) (schema-less TableClause).
- Added test test_schema_qualified_table_appears_in_from to verify schema appears in both queries on MS-SQL and PostgreSQL.
41b96f6 — Buckets: schema-missing FROM clause and GROUP BY alias
- Buckets.make_buckets() and Buckets.__init__() now accept an optional src_table parameter (a SQLAlchemy Table object). When provided, it is used in select_from() instead of the schema-less table(table_name).
- Buckets.__init__() converted from raw text() SQL to a SQLAlchemy expression: func.floor(...) with group_by(floor_expr), which groups by the expression rather than the alias and so avoids the MS-SQL GROUP BY alias rejection.
- Callers in continuous.py and mimesis.py updated to pass src_table=column.table.
- Added TestBucketsSchemaQualified with 3 tests verifying schema appears in the FROM clause.