Skip to content

hyperdb-mcp: QUALIFY unsupported (42601); APPROX_COUNT_DISTINCT no speedup on string keys — surfaced by 299M-row benchmark #164

Description

@StefanSteiner

Summary

Follow-up from the same NYC TLC Yellow Taxi exercise as StefanSteiner#1, this time a head-to-head benchmark of hyperdb-mcp vs DuckDB over the full normalized dataset (299,214,146 rows, 34 Parquet files, Jan + Jul × 2009–2025). hyperdb performed very well — it won 8 of 10 compute queries against DuckDB's own pre-loaded native table, by 1.25–3.9× — but the suite surfaced one dialect blocker and one perf papercut worth filing.

Environment: hyperdb-mcp (this repo), driven from Claude Code. Dialect is Salesforce Data Cloud SQL / Tableau Hyper. Data was warm in the daemon (one bench table). Timings are engine-side stats.elapsed_ms.


1. (Dialect gap) QUALIFY is not supported — 42601 syntax error

A standard "top-N per group" query using QUALIFY on a window function fails:

WITH z AS (
  SELECT data_year, "PULocationID", COUNT(*) trips
  FROM bench WHERE "PULocationID" IS NOT NULL
  GROUP BY data_year, "PULocationID"
)
SELECT data_year, "PULocationID", trips,
       ROW_NUMBER() OVER (PARTITION BY data_year ORDER BY trips DESC) rnk
FROM z
QUALIFY rnk <= 5;
-- server error (42601): ERROR: syntax error: got identifier, expected end-of-file

The documented subquery fallback works fine:

SELECT * FROM (
  WITH z AS (...) SELECT ..., ROW_NUMBER() OVER (...) rnk FROM z
) s
WHERE rnk <= 5;

Impact: Minor functionally (the fallback is mechanical), but QUALIFY is a very common idiom and is supported by DuckDB, Snowflake, BigQuery, and Databricks. The MCP server instructions / dialect card advertise extensive window-function support (row_number, rank, modified_rank, IGNORE NULLS, frame modes, etc.) but do not flag QUALIFY as absent, so a user reasonably expects it to work.

Suggestion: Support QUALIFY, or explicitly note its absence (and the subquery-wrap workaround) in the dialect card alongside the window-function list.


2. (Perf) APPROX_COUNT_DISTINCT gives almost no speedup when the argument is a constructed string

Exact vs approximate distinct over a 299M-row concatenated key were essentially the same:

SELECT COUNT(DISTINCT "PULocationID" || '-' || "DOLocationID") FROM bench;
-- ~2.31 s (median of 3)
SELECT APPROX_COUNT_DISTINCT("PULocationID" || '-' || "DOLocationID") FROM bench;
-- ~2.10 s (median of 3) — only ~9% faster

For comparison, on DuckDB the approximate path is dramatically cheaper than exact. Here the cost is dominated by materializing the BIGINT || '-' || BIGINT string for every row before the distinct/sketch step, so swapping exact for approximate barely moves the needle. Hashing the integer pair directly (e.g. "PULocationID" * 1000 + "DOLocationID") sidesteps it.

Impact: Low — this is a "shape your key as a number, not a string" lesson more than a bug. Filing because APPROX_COUNT_DISTINCT being ~equal to exact is a surprising result that suggests the string concat, not the cardinality estimation, dominates; worth a docs note or a look at whether string construction can be lazier under approximate aggregation.


Notes / context (not issues — positive findings)

  • hyperdb beat DuckDB's pre-loaded native table on Q2/Q3/Q4/Q5/Q6/Q7/Q9/Q10 (global agg, low- and high-card GROUP BY, FILTER aggregates, PERCENTILE_CONT, window rank, EXTRACT grouping, GROUPING SETS) by 1.25–3.9×. GROUPING SETS (3.9×) and FILTER aggregates (2.6×) were the standouts.
  • The only losses were Q1 (COUNT(*), ~28 ms vs ~8 ms — trivial fixed overhead, not throughput) and Q8 exact distinct (the string-key issue above).
  • PERCENTILE_CONT ... WITHIN GROUP, FILTER (WHERE ...), GROUPING SETS, APPROX_COUNT_DISTINCT, and all window functions worked as documented.
  • Build path: 34 load_files at concurrency 8 into distinct per-file tables, then one CREATE TABLE bench AS ... UNION ALL ... (= 61 s), was stable. (The unsafe pattern is concurrent appends to a single table — avoided here.)
  • Mixed-case TLC column names (VendorID, PULocationID, DOLocationID, RatecodeID) require double-quoting (42703 otherwise) — expected for a PostgreSQL-family dialect, noted for completeness.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions