Skip to content

Bug: column_value provider uses random() which is not recognised by MS-SQL #102

@myyong

Description

@myyong

Problem

ColumnValueProvider.column_value() in datafaker/providers.py:27–36 selects a random row using:

query = select(orm_class).order_by(functions.random()).limit(1)

SQLAlchemy's functions.random() compiles to random() for all dialects. MS-SQL does not have a random() function and raises an error at runtime:

sqlalchemy.exc.ProgrammingError: ('42000', "[42000] [Microsoft][ODBC Driver 18
for SQL Server][SQL Server]'random' is not a recognized built-in function name.")

The correct MS-SQL equivalents are RAND() (returns a scalar float, usable in ORDER BY) or NEWID() (generates a random GUID, commonly used for random row ordering).

Root cause

functions.random() is not dialect-aware — SQLAlchemy emits random() unconditionally regardless of the connected engine. PostgreSQL and DuckDB both expose a random() function, so this works on those engines, but fails on MS-SQL.

Impact

Any df.py that calls column_value against an MS-SQL source database will fail. This includes OMOP-style schemas where many FK columns are populated via column_value lookups against vocabulary tables (e.g. concept).

Proposed fix

Detect the dialect at runtime inside column_value and substitute the appropriate random expression:

@staticmethod
def column_value(
    db_connection: Connection, orm_class: Any, column_name: str
) -> Any:
    """Return a random value from the column specified."""
    dialect = db_connection.dialect.name
    if dialect == "mssql":
        random_fn = func.newid()
    else:
        random_fn = functions.random()
    query = select(orm_class).order_by(random_fn).limit(1)
    random_row = db_connection.execute(query).first()

    if random_row:
        return getattr(random_row, column_name)
    return None

NEWID() is preferred over RAND() for MS-SQL row ordering because RAND() returns the same value for every row in a single query execution, giving a non-random sort. NEWID() produces a unique value per row and is the standard MS-SQL idiom for ORDER BY random().

Reference

The existing dialect-specific hooks in datafaker/create.py (e.g. remove_mssql_identity, remove_mssql_on_delete_cascade) show the established pattern for handling dialect differences in this codebase.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions