configure-generators fails on MS-SQL: RANDOM() and LIMIT not supported in ChoiceGeneratorFactory

### Bug

Running `datafaker configure-generators` against an MS-SQL source database fails
for any column that is assessed as a choice distribution. `ChoiceGeneratorFactory.get_generators()`
executes raw SQL containing `RANDOM()` and `LIMIT`, neither of which MS-SQL supports.

Expected error (RANDOM):
```
('42000', "[42000] [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]
'RANDOM' is not a recognized built-in function name.")
[SQL: SELECT v, COUNT(v) AS f FROM (SELECT "col" as v FROM "table"
      ORDER BY RANDOM() LIMIT 500) AS _inner GROUP BY v ORDER BY f DESC]
```

Expected error (LIMIT):
```
('42000', "Incorrect syntax near 'LIMIT'.")
[SQL: SELECT "col" AS v, COUNT("col") AS f FROM "table"
      GROUP BY v ORDER BY f DESC LIMIT 501]
```

---

### Affected locations

#### 1. `ChoiceGeneratorFactory.get_generators()` — `datafaker/generators/choice.py:307–313`
Distinct-value count query, uses `LIMIT`:
```python
text(
    f'SELECT "{column_name}" AS v, COUNT("{column_name}")'
    f' AS f FROM "{table_name}" GROUP BY v'
    f" ORDER BY f DESC LIMIT {MAXIMUM_CHOICES + 1}"
)
```
MS-SQL has no `LIMIT` clause; the equivalent is `SELECT TOP n`.

#### 2. `ChoiceGeneratorFactory.get_generators()` — `datafaker/generators/choice.py:352–359`
Random-sample query, uses `RANDOM()` and `LIMIT`:
```python
text(
    f"SELECT v, COUNT(v) AS f FROM"
    f' (SELECT "{column_name}" as v FROM "{table_name}"'
    f" ORDER BY RANDOM() LIMIT {self.SAMPLE_COUNT})"
    f" AS _inner GROUP BY v ORDER BY f DESC"
)
```
MS-SQL issues:
- `RANDOM()` → must use `NEWID()` for per-row randomness in `ORDER BY`
- `LIMIT n` → must use `TOP n` before the column list

#### 3. Stored query strings in `ChoiceGenerator.__init__()` — `choice.py:91–95, 117–121`
These query strings are stored on the generator and returned via `custom_queries()`,
which writes them into `src-stats.yaml`. They are later executed by `make-stats`
against the source database, so they must also be dialect-correct.

- Line 94: `ORDER BY RANDOM() LIMIT {sample_count}` in a subquery
- Line 120: same pattern, with an additional suppress filter

#### 4. `ORDER BY` in subqueries without `TOP` — `choice.py:108, 121`
MS-SQL disallows `ORDER BY` inside a subquery (derived table) unless `TOP` or
`OFFSET…FETCH` is also present. The suppress-only paths (`suppress_count > 0,
`sample_count = None`) produce subqueries with `ORDER BY count DESC` but no `LIMIT`/`TOP`,
which MS-SQL will reject.

```python
# line 108 — ORDER BY inside subquery, no TOP
f" GROUP BY value ORDER BY count DESC) AS _inner"
f" WHERE {suppress_count} < count"
```

#### 5. Schema-missing FROM clause in `Buckets` queries — `generators/base.py`
`Buckets.make_buckets()` and `Buckets.__init__()` used `table(table_name)` (a schema-less
SQLAlchemy `TableClause`) and raw `text()` SQL, producing unqualified table names such as:
```sql
SELECT avg(ethnicity_concept_id) AS mean, stdev(ethnicity_concept_id) AS stddev,
       count(ethnicity_concept_id) AS count FROM person
```
MS-SQL rejects this with `"Invalid object name 'person'."` (error 42S02) when `person`
lives under a non-default schema. Callers in `continuous.py` and `mimesis.py` passed
only `column.table.name` (a plain string), discarding the schema.

Additionally, `Buckets.__init__()` used `GROUP BY b` (grouping by alias), which MS-SQL
does not support.

---

### Already fixed in this codebase
- `RANDOM()` → `NEWID()` in `providers.py` using `connection.dialect.name == "mssql"` check
- `EXTRACT(YEAR FROM …)` → `DATEPART(year, …)` in `generators/mimesis.py` using SQLAlchemy `extract()`
- `STDDEV()` → `STDEV()` in `generators/base.py` using `func.stdev` / `func.stddev`

---

### Options for fixes

#### Option A — Dialect check with raw SQL (follows existing `providers.py` pattern)
Check `engine.dialect.name == "mssql"` in `ChoiceGeneratorFactory.get_generators()` and
build separate SQL strings. Pass the dialect to `ChoiceGenerator.__init__()` so it can
store the correct `_query` string for `src-stats.yaml`.

MS-SQL equivalents:
```sql
-- Distinct-value query
SELECT TOP 501 "col" AS v, COUNT("col") AS f FROM "table" GROUP BY v ORDER BY f DESC

-- Random-sample query
SELECT v, COUNT(v) AS f FROM
  (SELECT TOP 500 "col" AS v FROM "table" ORDER BY NEWID()) AS _inner
GROUP BY v ORDER BY f DESC

-- Suppress-only subquery (drop the unnecessary ORDER BY)
SELECT value FROM
  (SELECT col AS value, COUNT(col) AS count FROM table
   WHERE col IS NOT NULL GROUP BY value) AS _inner
WHERE suppress_count < count
```

**Pros:** Minimal change; consistent with the `NEWID()` pattern already in `providers.py`.
Stored query strings are straightforward dialect-specific strings.  
**Cons:** Must enumerate dialects explicitly; each new dialect may require additions.
`ChoiceGenerator.__init__()` signature changes (dialect parameter added).

#### Option B — SQLAlchemy expression API (most portable)
Use `select()`, `.limit()`, `func.newid()`/`func.random()`, and `.subquery()`. SQLAlchemy's
`.limit(n)` automatically compiles to `SELECT TOP n …` on MS-SQL and `… LIMIT n` on
PostgreSQL/DuckDB — so `LIMIT` vs `TOP` is handled for free. `RANDOM()` vs `NEWID()` still
requires a dialect check (same as `providers.py`).

For the stored `_query` strings: compile the SQLAlchemy expression against the
engine's dialect at construct time (the same technique used in `mimesis.py` after the
recent fix for `_min_year`/`_max_year`). The factory passes `engine` (or `engine.dialect`)
to `ChoiceGenerator.__init__()`.

```python
# Live query (get_generators)
random_fn = func.newid() if engine.dialect.name == "mssql" else func.random()
inner = (
    select(literal_column(f'"{column_name}"').label("v"))
    .select_from(table(table_name))
    .order_by(random_fn)
    .limit(self.SAMPLE_COUNT)
    .subquery("_inner")
)
stmt = (
    select(inner.c.v, func.count(inner.c.v).label("f"))
    .select_from(inner)
    .group_by(inner.c.v)
    .order_by(desc("f"))
)

# Stored query string for src-stats.yaml
stored_sql = str(stmt.compile(dialect=engine.dialect, compile_kwargs={"literal_binds": True}))
```

**Pros:** `LIMIT`/`TOP` handled automatically by SQLAlchemy; no need to enumerate dialects
for that part. Consistent with `mimesis.py` and `base.py` after recent fixes.  
**Cons:** More significant refactor; `ChoiceGenerator` constructor changes; compiled SQL
in `src-stats.yaml` is still dialect-locked at the time `configure-generators` is run.

---

### Affected files summary

| File | Lines | Issue |
|---|---|---|
| `datafaker/generators/choice.py` | 307–313 | `LIMIT` without `TOP` |
| `datafaker/generators/choice.py` | 352–359 | `RANDOM()` + `LIMIT` |
| `datafaker/generators/choice.py` | 91–95, 117–121 | stored `_query` with `RANDOM() LIMIT` |
| `datafaker/generators/choice.py` | 108, 121 | `ORDER BY` in subquery without `TOP` |
| `datafaker/generators/base.py` | `Buckets.__init__`, `make_buckets` | schema-missing `FROM`, `GROUP BY` alias |
| `datafaker/generators/continuous.py` | 163 | caller passes bare table name |
| `datafaker/generators/mimesis.py` | 320–324 | caller passes bare table name |

---

### Fixes applied (Option B)

**[7c0add6](https://github.com/SAFEHR-data/datafaker/commit/7c0add6a0dffb3d8fc89270298c8759c962f2997) — ChoiceGeneratorFactory: RANDOM()/LIMIT → SQLAlchemy expression API**

- Replaced raw `text()` queries in `ChoiceGeneratorFactory.get_generators()` with SQLAlchemy `select()` expressions.
- Added `_choice_stmt()` module-level helper that builds dialect-correct SELECT expressions: `.limit(n)` compiles to `TOP n` on MS-SQL and `LIMIT n` elsewhere; `func.newid()` / `func.random()` selected per dialect.
- Added `dialect` parameter to `ChoiceGenerator.__init__()` to compile the stored `_query` string at construction time against the correct dialect.
- MS-SQL suppress-only path no longer emits `ORDER BY` inside a subquery without `TOP`.
- Live queries in `get_generators()` use `src_table = column.table` (the actual SQLAlchemy `Table` object) to preserve schema qualification in the FROM clause.

**[2bcea2b](https://github.com/SAFEHR-data/datafaker/commit/2bcea2b4a050f608cac4fe71f210a947f33090e1) — ChoiceGeneratorFactory: schema-qualified table in live queries**

- Fixed the two live queries in `get_generators()` to use `column.table` (with schema) rather than `table(table_name)` (schema-less `TableClause`).
- Added test `test_schema_qualified_table_appears_in_from` to verify schema appears in both queries on MS-SQL and PostgreSQL.

**[41b96f6](https://github.com/SAFEHR-data/datafaker/commit/41b96f676be7677e33f180d80d58bb19fdc95e92) — Buckets: schema-missing FROM clause and GROUP BY alias**

- `Buckets.make_buckets()` and `Buckets.__init__()` now accept an optional `src_table` parameter (a SQLAlchemy `Table` object). When provided, it is used in `select_from()` instead of the schema-less `table(table_name)`.
- `Buckets.__init__()` converted from raw `text()` SQL to a SQLAlchemy expression: `func.floor(...)` with `group_by(floor_expr)` (grouping by expression, not alias — avoids MS-SQL `GROUP BY` alias rejection).
- Callers in `continuous.py` and `mimesis.py` updated to pass `src_table=column.table`.
- Added `TestBucketsSchemaQualified` with 3 tests verifying schema appears in the FROM clause.

File	Lines	Issue
`datafaker/generators/choice.py`	307–313	`LIMIT` without `TOP`
`datafaker/generators/choice.py`	352–359	`RANDOM()` + `LIMIT`
`datafaker/generators/choice.py`	91–95, 117–121	stored `_query` with `RANDOM() LIMIT`
`datafaker/generators/choice.py`	108, 121	`ORDER BY` in subquery without `TOP`
`datafaker/generators/base.py`	`Buckets.__init__`, `make_buckets`	schema-missing `FROM`, `GROUP BY` alias
`datafaker/generators/continuous.py`	163	caller passes bare table name
`datafaker/generators/mimesis.py`	320–324	caller passes bare table name

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

configure-generators fails on MS-SQL: RANDOM() and LIMIT not supported in ChoiceGeneratorFactory #106

Bug

Affected locations

1. `ChoiceGeneratorFactory.get_generators()` — `datafaker/generators/choice.py:307–313`

2. `ChoiceGeneratorFactory.get_generators()` — `datafaker/generators/choice.py:352–359`

3. Stored query strings in `ChoiceGenerator.init()` — `choice.py:91–95, 117–121`

4. `ORDER BY` in subqueries without `TOP` — `choice.py:108, 121`

5. Schema-missing FROM clause in `Buckets` queries — `generators/base.py`

Already fixed in this codebase

Options for fixes

Option A — Dialect check with raw SQL (follows existing `providers.py` pattern)

Option B — SQLAlchemy expression API (most portable)

Affected files summary

Fixes applied (Option B)

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

configure-generators fails on MS-SQL: RANDOM() and LIMIT not supported in ChoiceGeneratorFactory #106

Description

Bug

Affected locations

1. ChoiceGeneratorFactory.get_generators() — datafaker/generators/choice.py:307–313

2. ChoiceGeneratorFactory.get_generators() — datafaker/generators/choice.py:352–359

3. Stored query strings in ChoiceGenerator.__init__() — choice.py:91–95, 117–121

4. ORDER BY in subqueries without TOP — choice.py:108, 121

5. Schema-missing FROM clause in Buckets queries — generators/base.py

Already fixed in this codebase

Options for fixes

Option A — Dialect check with raw SQL (follows existing providers.py pattern)

Option B — SQLAlchemy expression API (most portable)

Affected files summary

Fixes applied (Option B)

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

1. `ChoiceGeneratorFactory.get_generators()` — `datafaker/generators/choice.py:307–313`

2. `ChoiceGeneratorFactory.get_generators()` — `datafaker/generators/choice.py:352–359`

3. Stored query strings in `ChoiceGenerator.init()` — `choice.py:91–95, 117–121`

4. `ORDER BY` in subqueries without `TOP` — `choice.py:108, 121`

5. Schema-missing FROM clause in `Buckets` queries — `generators/base.py`

Option A — Dialect check with raw SQL (follows existing `providers.py` pattern)