Skip to content

configure-generators fails on MS-SQL: remaining RANDOM() and LIMIT occurrences in interactive shell and generators #107

@myyong

Description

@myyong

Bug

After earlier fixes to choice.py, providers.py, mimesis.py, and base.py,
configure-generators (and related interactive commands) still emits SQL containing
RANDOM() and LIMIT when run against an MS-SQL source database.

Example observed SQL:

SELECT ethnicity_concept_id FROM person WHERE ethnicity_concept_id IS NOT NULL ORDER BY RANDOM() LIMIT 5

MS-SQL rejects both:

  • RANDOM() — must use NEWID() for per-row randomness in ORDER BY
  • ORDER BY … LIMIT n — must use SELECT TOP n … ORDER BY … (TOP precedes the column list)

Affected locations

File Lines Method Used by
datafaker/generators/continuous.py 570–571 CovariateQuery._inner_query() statistical covariance query in configure-generators
datafaker/interactive/missingness.py 17 MissingnessType.SAMPLED_QUERY constant stored SQL written to src-stats.yaml
datafaker/interactive/base.py 422 DbCmd.do_peek() interactive peek command
datafaker/interactive/table.py 487 TableEntry.print_column_data() interactive data command
datafaker/interactive/table.py 501 TableEntry.print_row_data() interactive peek / data
datafaker/interactive/generators.py 763 ColumnEntry._get_column_data() display sample values during generator config

Detail

1. CovariateQuery._inner_query()continuous.py:570–571

Builds a sampled subquery as a raw string:

return (
    f"(SELECT * FROM {self.table}{where} ORDER BY RANDOM()"
    f" LIMIT {self._sample_count}) AS _sampled"
)

MS-SQL equivalent: (SELECT TOP {n} * FROM {table}{where} ORDER BY NEWID()) AS _sampled

2. MissingnessType.SAMPLED_QUERYmissingness.py:17

Class-level template string:

SAMPLED_QUERY = (
    "SELECT COUNT(*) AS row_count, {result_names} FROM "
    "(SELECT {column_is_nulls} FROM {table} ORDER BY RANDOM() LIMIT {count})"
    " AS __t GROUP BY {result_names}"
)

This string is stored in src-stats.yaml and later executed by make-stats.
MS-SQL equivalent uses SELECT TOP {count} … ORDER BY NEWID().

3–6. Interactive shell — base.py:422, table.py:487,501, generators.py:763

All four use sqlalchemy.text() with raw f-string SQL:

sqlalchemy.text(
    f"SELECT {cols} FROM {table_name} WHERE … ORDER BY RANDOM() LIMIT {n}"
)

All have access to self.sync_engine and can check self.sync_engine.dialect.name.


Proposed fix

Interactive shell (base.py, table.py, generators.py): Replace sqlalchemy.text() f-strings with the SQLAlchemy expression API. Check engine.dialect.name == "mssql" to choose func.newid() vs func.random(); use .limit(n) which SQLAlchemy compiles to TOP n on MS-SQL and LIMIT n elsewhere automatically.

CovariateQuery (continuous.py): Add dialect_name: str = "" parameter to CovariateQuery.__init__(); branch in _inner_query() to produce the correct form. Pass engine.dialect.name at the one instantiation site in get_generators() (line 631).

MissingnessType (missingness.py): Add dialect_name: str = "" parameter to the sampled_query() classmethod; return the MS-SQL form when appropriate. Update the do_sampled() caller to pass self.sync_engine.dialect.name.


Already fixed in this codebase

  • RANDOM()NEWID() in providers.py
  • EXTRACT(YEAR FROM …)DATEPART(year, …) in generators/mimesis.py
  • STDDEV()STDEV() in generators/base.py
  • RANDOM() + LIMITNEWID() + TOP in generators/choice.py (live queries and stored _query strings)
  • Schema-missing FROM clause in Buckets queries (generators/base.py)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions