hyperdb-mcp: QUALIFY unsupported (42601); APPROX_COUNT_DISTINCT no speedup on string keys — surfaced by 299M-row benchmark

## Summary

Follow-up from the same NYC TLC Yellow Taxi exercise as StefanSteiner/hyper-api-rust#1, this time a head-to-head benchmark of `hyperdb-mcp` vs DuckDB over the full normalized dataset (299,214,146 rows, 34 Parquet files, Jan + Jul × 2009–2025). hyperdb performed *very* well — it won 8 of 10 compute queries against DuckDB's own pre-loaded native table, by 1.25–3.9× — but the suite surfaced one dialect blocker and one perf papercut worth filing.

Environment: `hyperdb-mcp` (this repo), driven from Claude Code. Dialect is Salesforce Data Cloud SQL / Tableau Hyper. Data was warm in the daemon (one `bench` table). Timings are engine-side `stats.elapsed_ms`.

---

### 1. (Dialect gap) `QUALIFY` is not supported — `42601 syntax error`

A standard "top-N per group" query using `QUALIFY` on a window function fails:

```sql
WITH z AS (
  SELECT data_year, "PULocationID", COUNT(*) trips
  FROM bench WHERE "PULocationID" IS NOT NULL
  GROUP BY data_year, "PULocationID"
)
SELECT data_year, "PULocationID", trips,
       ROW_NUMBER() OVER (PARTITION BY data_year ORDER BY trips DESC) rnk
FROM z
QUALIFY rnk <= 5;
-- server error (42601): ERROR: syntax error: got identifier, expected end-of-file
```

The documented subquery fallback works fine:

```sql
SELECT * FROM (
  WITH z AS (...) SELECT ..., ROW_NUMBER() OVER (...) rnk FROM z
) s
WHERE rnk <= 5;
```

**Impact:** Minor functionally (the fallback is mechanical), but `QUALIFY` is a very common idiom and is supported by DuckDB, Snowflake, BigQuery, and Databricks. The MCP server instructions / dialect card advertise extensive window-function support (row_number, rank, modified_rank, IGNORE NULLS, frame modes, etc.) but do not flag `QUALIFY` as absent, so a user reasonably expects it to work.

**Suggestion:** Support `QUALIFY`, or explicitly note its absence (and the subquery-wrap workaround) in the dialect card alongside the window-function list.

---

### 2. (Perf) `APPROX_COUNT_DISTINCT` gives almost no speedup when the argument is a constructed string

Exact vs approximate distinct over a 299M-row concatenated key were essentially the same:

```sql
SELECT COUNT(DISTINCT "PULocationID" || '-' || "DOLocationID") FROM bench;
-- ~2.31 s (median of 3)
SELECT APPROX_COUNT_DISTINCT("PULocationID" || '-' || "DOLocationID") FROM bench;
-- ~2.10 s (median of 3) — only ~9% faster
```

For comparison, on DuckDB the approximate path is dramatically cheaper than exact. Here the cost is dominated by **materializing the `BIGINT || '-' || BIGINT` string for every row** before the distinct/sketch step, so swapping exact for approximate barely moves the needle. Hashing the integer pair directly (e.g. `"PULocationID" * 1000 + "DOLocationID"`) sidesteps it.

**Impact:** Low — this is a "shape your key as a number, not a string" lesson more than a bug. Filing because `APPROX_COUNT_DISTINCT` being ~equal to exact is a surprising result that suggests the string concat, not the cardinality estimation, dominates; worth a docs note or a look at whether string construction can be lazier under approximate aggregation.

---

## Notes / context (not issues — positive findings)

- hyperdb beat DuckDB's pre-loaded native table on Q2/Q3/Q4/Q5/Q6/Q7/Q9/Q10 (global agg, low- and high-card GROUP BY, FILTER aggregates, PERCENTILE_CONT, window rank, EXTRACT grouping, GROUPING SETS) by 1.25–3.9×. GROUPING SETS (3.9×) and FILTER aggregates (2.6×) were the standouts.
- The only losses were Q1 (`COUNT(*)`, ~28 ms vs ~8 ms — trivial fixed overhead, not throughput) and Q8 exact distinct (the string-key issue above).
- `PERCENTILE_CONT ... WITHIN GROUP`, `FILTER (WHERE ...)`, `GROUPING SETS`, `APPROX_COUNT_DISTINCT`, and all window functions worked as documented.
- Build path: 34 `load_files` at concurrency 8 into distinct per-file tables, then one `CREATE TABLE bench AS ... UNION ALL ...` (= 61 s), was stable. (The unsafe pattern is concurrent *appends to a single table* — avoided here.)
- Mixed-case TLC column names (`VendorID`, `PULocationID`, `DOLocationID`, `RatecodeID`) require double-quoting (`42703` otherwise) — expected for a PostgreSQL-family dialect, noted for completeness.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

hyperdb-mcp: QUALIFY unsupported (42601); APPROX_COUNT_DISTINCT no speedup on string keys — surfaced by 299M-row benchmark #164

Summary

1. (Dialect gap) `QUALIFY` is not supported — `42601 syntax error`

2. (Perf) `APPROX_COUNT_DISTINCT` gives almost no speedup when the argument is a constructed string

Notes / context (not issues — positive findings)

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

hyperdb-mcp: QUALIFY unsupported (42601); APPROX_COUNT_DISTINCT no speedup on string keys — surfaced by 299M-row benchmark #164

Description

Summary

1. (Dialect gap) QUALIFY is not supported — 42601 syntax error

2. (Perf) APPROX_COUNT_DISTINCT gives almost no speedup when the argument is a constructed string

Notes / context (not issues — positive findings)

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

1. (Dialect gap) `QUALIFY` is not supported — `42601 syntax error`

2. (Perf) `APPROX_COUNT_DISTINCT` gives almost no speedup when the argument is a constructed string