Skip to content

hyperdb-mcp: Parquet read failures (dictionary encoding, NullType) + numeric serialization/format gaps #163

Description

@StefanSteiner

Summary

While using the hyperdb-mcp server to normalize and analyze a large public dataset (NYC TLC Yellow Taxi, ~143M rows across 17 yearly Parquet files), I hit four distinct issues. Two are functional blockers for Parquet interop; two are usability/correctness papercuts in the query path. Filing as one issue since they were all surfaced by the same end-to-end exercise; happy to split into separate issues if preferred.

Environment: hyperdb-mcp (this repo), driven from Claude Code. Dialect is Salesforce Data Cloud SQL / Tableau Hyper. Comparison reads were against locally-written Parquet files.


1. (Blocker) Parquet read fails on dictionary-encoded files — XX001 "Error while reading parquet file"

load_file / query_file / external() fail with XX001 Error while reading parquet file on Parquet written by:

  • Apache Arrow ≥ 8.0.0 (the writer version embedded in several NYC TLC files, created_by arrow 8.0.0 / 14 / 16), and
  • all DuckDB-written Parquet (tested DuckDB 1.5.4).

Bisected to dictionary encoding: PLAIN-encoded data (including hand-generated test files) reads fine; dictionary-encoded data fails. Reproduced independent of compression codec (snappy / zstd / gzip / uncompressed) and Parquet version (v1 / v2). Re-exporting a previously-readable file through DuckDB makes it unreadable.

Impact: This breaks the most natural interop path — "another tool writes Parquet, hyperdb reads it." It forced a CSV bridge (DuckDB → CSV → load_file) for the whole pipeline. Hyper's own exported Parquet round-trips fine, so the writer is OK; the reader is the issue.

Repro sketch:

duckdb -c "COPY (SELECT 1 AS a, 'x' AS b) TO '/tmp/dict.parquet' (FORMAT parquet);"
# then via MCP:
load_file(table="t", path="/tmp/dict.parquet")  -> XX001

2. (Blocker) Physical NullType columns are rejected — 42804, and a schema override does not rescue it

Several NYC TLC files store always-null columns (congestion_surcharge, airport_fee) as physical Parquet NullType. load_file fails with 42804. Supplying a schema override mapping those columns to DOUBLE PRECISION does not help — the read fails before the override is applied.

Impact: A common real-world Parquet shape (optional columns that happen to be entirely null in a partition) is unreadable, and the documented escape hatch (schema override) doesn't cover it. Worked around by re-typing the columns in DuckDB before the CSV bridge.

Suggestion: Apply schema overrides early enough to coerce NullType physical columns to the target type, or special-case NullType → emit all-NULL of the inferred/overridden type.


3. (Correctness) NUMERIC scale lost in JSON result serialization

query results round numeric values to whole numbers in the JSON payload. SELECT ROUND(AVG(fare_amount), 2) returned 10.0 where the true value was 9.50; ROUND(x,2) of 14.46 came back 14.0.

Workaround that works: CAST(expr AS NUMERIC(8,2)) serializes correctly (9.50, 14.46). So the issue is specifically in how un-cast / default-scale numeric (e.g. the result of ROUND(double, n)) is encoded to JSON — trailing scale is dropped or the value is rounded to integer.

Impact: Silent precision loss in returned data. Easy to miss because the SQL looks correct.


4. (Usability) to_char() unsupported for numeric/timestamp — 42601

to_char(AVG(x), 'FM990.00') and to_char(ts, ...) fail with 42601 unsupported data types in call to 'to_char', despite the dialect being advertised as PostgreSQL-compatible. to_char is the standard PostgreSQL formatting function and a natural reach for fixed-decimal/locale formatting (and the obvious fix for #3 from a user's perspective).

Impact: Minor, but combined with #3 it leaves no ergonomic in-SQL way to format numbers; users must CAST to NUMERIC(p,s).


Notes

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions