Bug
After earlier fixes to choice.py, providers.py, mimesis.py, and base.py,
configure-generators (and related interactive commands) still emits SQL containing
RANDOM() and LIMIT when run against an MS-SQL source database.
Example observed SQL:
SELECT ethnicity_concept_id FROM person WHERE ethnicity_concept_id IS NOT NULL ORDER BY RANDOM() LIMIT 5
MS-SQL rejects both:
RANDOM() — must use NEWID() for per-row randomness in ORDER BY
ORDER BY … LIMIT n — must use SELECT TOP n … ORDER BY … (TOP precedes the column list)
Affected locations
| File |
Lines |
Method |
Used by |
datafaker/generators/continuous.py |
570–571 |
CovariateQuery._inner_query() |
statistical covariance query in configure-generators |
datafaker/interactive/missingness.py |
17 |
MissingnessType.SAMPLED_QUERY constant |
stored SQL written to src-stats.yaml |
datafaker/interactive/base.py |
422 |
DbCmd.do_peek() |
interactive peek command |
datafaker/interactive/table.py |
487 |
TableEntry.print_column_data() |
interactive data command |
datafaker/interactive/table.py |
501 |
TableEntry.print_row_data() |
interactive peek / data |
datafaker/interactive/generators.py |
763 |
ColumnEntry._get_column_data() |
display sample values during generator config |
Detail
1. CovariateQuery._inner_query() — continuous.py:570–571
Builds a sampled subquery as a raw string:
return (
f"(SELECT * FROM {self.table}{where} ORDER BY RANDOM()"
f" LIMIT {self._sample_count}) AS _sampled"
)
MS-SQL equivalent: (SELECT TOP {n} * FROM {table}{where} ORDER BY NEWID()) AS _sampled
2. MissingnessType.SAMPLED_QUERY — missingness.py:17
Class-level template string:
SAMPLED_QUERY = (
"SELECT COUNT(*) AS row_count, {result_names} FROM "
"(SELECT {column_is_nulls} FROM {table} ORDER BY RANDOM() LIMIT {count})"
" AS __t GROUP BY {result_names}"
)
This string is stored in src-stats.yaml and later executed by make-stats.
MS-SQL equivalent uses SELECT TOP {count} … ORDER BY NEWID().
3–6. Interactive shell — base.py:422, table.py:487,501, generators.py:763
All four use sqlalchemy.text() with raw f-string SQL:
sqlalchemy.text(
f"SELECT {cols} FROM {table_name} WHERE … ORDER BY RANDOM() LIMIT {n}"
)
All have access to self.sync_engine and can check self.sync_engine.dialect.name.
Proposed fix
Interactive shell (base.py, table.py, generators.py): Replace sqlalchemy.text() f-strings with the SQLAlchemy expression API. Check engine.dialect.name == "mssql" to choose func.newid() vs func.random(); use .limit(n) which SQLAlchemy compiles to TOP n on MS-SQL and LIMIT n elsewhere automatically.
CovariateQuery (continuous.py): Add dialect_name: str = "" parameter to CovariateQuery.__init__(); branch in _inner_query() to produce the correct form. Pass engine.dialect.name at the one instantiation site in get_generators() (line 631).
MissingnessType (missingness.py): Add dialect_name: str = "" parameter to the sampled_query() classmethod; return the MS-SQL form when appropriate. Update the do_sampled() caller to pass self.sync_engine.dialect.name.
Already fixed in this codebase
RANDOM() → NEWID() in providers.py
EXTRACT(YEAR FROM …) → DATEPART(year, …) in generators/mimesis.py
STDDEV() → STDEV() in generators/base.py
RANDOM() + LIMIT → NEWID() + TOP in generators/choice.py (live queries and stored _query strings)
- Schema-missing
FROM clause in Buckets queries (generators/base.py)
Bug
After earlier fixes to
choice.py,providers.py,mimesis.py, andbase.py,configure-generators(and related interactive commands) still emits SQL containingRANDOM()andLIMITwhen run against an MS-SQL source database.Example observed SQL:
MS-SQL rejects both:
RANDOM()— must useNEWID()for per-row randomness inORDER BYORDER BY … LIMIT n— must useSELECT TOP n … ORDER BY …(TOP precedes the column list)Affected locations
datafaker/generators/continuous.pyCovariateQuery._inner_query()configure-generatorsdatafaker/interactive/missingness.pyMissingnessType.SAMPLED_QUERYconstantsrc-stats.yamldatafaker/interactive/base.pyDbCmd.do_peek()peekcommanddatafaker/interactive/table.pyTableEntry.print_column_data()datacommanddatafaker/interactive/table.pyTableEntry.print_row_data()peek/datadatafaker/interactive/generators.pyColumnEntry._get_column_data()Detail
1.
CovariateQuery._inner_query()—continuous.py:570–571Builds a sampled subquery as a raw string:
MS-SQL equivalent:
(SELECT TOP {n} * FROM {table}{where} ORDER BY NEWID()) AS _sampled2.
MissingnessType.SAMPLED_QUERY—missingness.py:17Class-level template string:
This string is stored in
src-stats.yamland later executed bymake-stats.MS-SQL equivalent uses
SELECT TOP {count} … ORDER BY NEWID().3–6. Interactive shell —
base.py:422,table.py:487,501,generators.py:763All four use
sqlalchemy.text()with raw f-string SQL:All have access to
self.sync_engineand can checkself.sync_engine.dialect.name.Proposed fix
Interactive shell (base.py, table.py, generators.py): Replace
sqlalchemy.text()f-strings with the SQLAlchemy expression API. Checkengine.dialect.name == "mssql"to choosefunc.newid()vsfunc.random(); use.limit(n)which SQLAlchemy compiles toTOP non MS-SQL andLIMIT nelsewhere automatically.CovariateQuery (continuous.py): Add
dialect_name: str = ""parameter toCovariateQuery.__init__(); branch in_inner_query()to produce the correct form. Passengine.dialect.nameat the one instantiation site inget_generators()(line 631).MissingnessType (missingness.py): Add
dialect_name: str = ""parameter to thesampled_query()classmethod; return the MS-SQL form when appropriate. Update thedo_sampled()caller to passself.sync_engine.dialect.name.Already fixed in this codebase
RANDOM()→NEWID()inproviders.pyEXTRACT(YEAR FROM …)→DATEPART(year, …)ingenerators/mimesis.pySTDDEV()→STDEV()ingenerators/base.pyRANDOM()+LIMIT→NEWID()+TOPingenerators/choice.py(live queries and stored_querystrings)FROMclause inBucketsqueries (generators/base.py)