Summary
While using the hyperdb-mcp server to normalize and analyze a large public dataset (NYC TLC Yellow Taxi, ~143M rows across 17 yearly Parquet files), I hit four distinct issues. Two are functional blockers for Parquet interop; two are usability/correctness papercuts in the query path. Filing as one issue since they were all surfaced by the same end-to-end exercise; happy to split into separate issues if preferred.
Environment: hyperdb-mcp (this repo), driven from Claude Code. Dialect is Salesforce Data Cloud SQL / Tableau Hyper. Comparison reads were against locally-written Parquet files.
1. (Blocker) Parquet read fails on dictionary-encoded files — XX001 "Error while reading parquet file"
load_file / query_file / external() fail with XX001 Error while reading parquet file on Parquet written by:
- Apache Arrow ≥ 8.0.0 (the writer version embedded in several NYC TLC files,
created_by arrow 8.0.0 / 14 / 16), and
- all DuckDB-written Parquet (tested DuckDB 1.5.4).
Bisected to dictionary encoding: PLAIN-encoded data (including hand-generated test files) reads fine; dictionary-encoded data fails. Reproduced independent of compression codec (snappy / zstd / gzip / uncompressed) and Parquet version (v1 / v2). Re-exporting a previously-readable file through DuckDB makes it unreadable.
Impact: This breaks the most natural interop path — "another tool writes Parquet, hyperdb reads it." It forced a CSV bridge (DuckDB → CSV → load_file) for the whole pipeline. Hyper's own exported Parquet round-trips fine, so the writer is OK; the reader is the issue.
Repro sketch:
duckdb -c "COPY (SELECT 1 AS a, 'x' AS b) TO '/tmp/dict.parquet' (FORMAT parquet);"
# then via MCP:
load_file(table="t", path="/tmp/dict.parquet") -> XX001
2. (Blocker) Physical NullType columns are rejected — 42804, and a schema override does not rescue it
Several NYC TLC files store always-null columns (congestion_surcharge, airport_fee) as physical Parquet NullType. load_file fails with 42804. Supplying a schema override mapping those columns to DOUBLE PRECISION does not help — the read fails before the override is applied.
Impact: A common real-world Parquet shape (optional columns that happen to be entirely null in a partition) is unreadable, and the documented escape hatch (schema override) doesn't cover it. Worked around by re-typing the columns in DuckDB before the CSV bridge.
Suggestion: Apply schema overrides early enough to coerce NullType physical columns to the target type, or special-case NullType → emit all-NULL of the inferred/overridden type.
3. (Correctness) NUMERIC scale lost in JSON result serialization
query results round numeric values to whole numbers in the JSON payload. SELECT ROUND(AVG(fare_amount), 2) returned 10.0 where the true value was 9.50; ROUND(x,2) of 14.46 came back 14.0.
Workaround that works: CAST(expr AS NUMERIC(8,2)) serializes correctly (9.50, 14.46). So the issue is specifically in how un-cast / default-scale numeric (e.g. the result of ROUND(double, n)) is encoded to JSON — trailing scale is dropped or the value is rounded to integer.
Impact: Silent precision loss in returned data. Easy to miss because the SQL looks correct.
4. (Usability) to_char() unsupported for numeric/timestamp — 42601
to_char(AVG(x), 'FM990.00') and to_char(ts, ...) fail with 42601 unsupported data types in call to 'to_char', despite the dialect being advertised as PostgreSQL-compatible. to_char is the standard PostgreSQL formatting function and a natural reach for fixed-decimal/locale formatting (and the obvious fix for #3 from a user's perspective).
Impact: Minor, but combined with #3 it leaves no ergonomic in-SQL way to format numbers; users must CAST to NUMERIC(p,s).
Notes
Summary
While using the
hyperdb-mcpserver to normalize and analyze a large public dataset (NYC TLC Yellow Taxi, ~143M rows across 17 yearly Parquet files), I hit four distinct issues. Two are functional blockers for Parquet interop; two are usability/correctness papercuts in the query path. Filing as one issue since they were all surfaced by the same end-to-end exercise; happy to split into separate issues if preferred.Environment:
hyperdb-mcp(this repo), driven from Claude Code. Dialect is Salesforce Data Cloud SQL / Tableau Hyper. Comparison reads were against locally-written Parquet files.1. (Blocker) Parquet read fails on dictionary-encoded files —
XX001 "Error while reading parquet file"load_file/query_file/external()fail withXX001 Error while reading parquet fileon Parquet written by:created_byarrow 8.0.0 / 14 / 16), andBisected to dictionary encoding: PLAIN-encoded data (including hand-generated test files) reads fine; dictionary-encoded data fails. Reproduced independent of compression codec (snappy / zstd / gzip / uncompressed) and Parquet version (v1 / v2). Re-exporting a previously-readable file through DuckDB makes it unreadable.
Impact: This breaks the most natural interop path — "another tool writes Parquet, hyperdb reads it." It forced a CSV bridge (DuckDB → CSV →
load_file) for the whole pipeline. Hyper's own exported Parquet round-trips fine, so the writer is OK; the reader is the issue.Repro sketch:
2. (Blocker) Physical
NullTypecolumns are rejected —42804, and aschemaoverride does not rescue itSeveral NYC TLC files store always-null columns (
congestion_surcharge,airport_fee) as physical ParquetNullType.load_filefails with42804. Supplying aschemaoverride mapping those columns toDOUBLE PRECISIONdoes not help — the read fails before the override is applied.Impact: A common real-world Parquet shape (optional columns that happen to be entirely null in a partition) is unreadable, and the documented escape hatch (
schemaoverride) doesn't cover it. Worked around by re-typing the columns in DuckDB before the CSV bridge.Suggestion: Apply
schemaoverrides early enough to coerceNullTypephysical columns to the target type, or special-caseNullType→ emit all-NULL of the inferred/overridden type.3. (Correctness) NUMERIC scale lost in JSON result serialization
queryresults roundnumericvalues to whole numbers in the JSON payload.SELECT ROUND(AVG(fare_amount), 2)returned10.0where the true value was9.50;ROUND(x,2)of14.46came back14.0.Workaround that works:
CAST(expr AS NUMERIC(8,2))serializes correctly (9.50,14.46). So the issue is specifically in how un-cast / default-scalenumeric(e.g. the result ofROUND(double, n)) is encoded to JSON — trailing scale is dropped or the value is rounded to integer.Impact: Silent precision loss in returned data. Easy to miss because the SQL looks correct.
4. (Usability)
to_char()unsupported for numeric/timestamp —42601to_char(AVG(x), 'FM990.00')andto_char(ts, ...)fail with42601 unsupported data types in call to 'to_char', despite the dialect being advertised as PostgreSQL-compatible.to_charis the standard PostgreSQL formatting function and a natural reach for fixed-decimal/locale formatting (and the obvious fix for #3 from a user's perspective).Impact: Minor, but combined with #3 it leaves no ergonomic in-SQL way to format numbers; users must
CASTtoNUMERIC(p,s).Notes