hyperdb-mcp: Parquet read failures (dictionary encoding, NullType) + numeric serialization/format gaps

## Summary

While using the `hyperdb-mcp` server to normalize and analyze a large public dataset (NYC TLC Yellow Taxi, ~143M rows across 17 yearly Parquet files), I hit four distinct issues. Two are functional blockers for Parquet interop; two are usability/correctness papercuts in the query path. Filing as one issue since they were all surfaced by the same end-to-end exercise; happy to split into separate issues if preferred.

Environment: `hyperdb-mcp` (this repo), driven from Claude Code. Dialect is Salesforce Data Cloud SQL / Tableau Hyper. Comparison reads were against locally-written Parquet files.

---

### 1. (Blocker) Parquet read fails on dictionary-encoded files — `XX001 "Error while reading parquet file"`

`load_file` / `query_file` / `external()` fail with `XX001 Error while reading parquet file` on Parquet written by:
- Apache Arrow **≥ 8.0.0** (the writer version embedded in several NYC TLC files, `created_by` arrow 8.0.0 / 14 / 16), and
- **all DuckDB-written Parquet** (tested DuckDB 1.5.4).

Bisected to **dictionary encoding**: PLAIN-encoded data (including hand-generated test files) reads fine; dictionary-encoded data fails. Reproduced independent of compression codec (snappy / zstd / gzip / uncompressed) and Parquet version (v1 / v2). Re-exporting a previously-readable file through DuckDB makes it unreadable.

**Impact:** This breaks the most natural interop path — "another tool writes Parquet, hyperdb reads it." It forced a CSV bridge (DuckDB → CSV → `load_file`) for the whole pipeline. Hyper's *own* exported Parquet round-trips fine, so the writer is OK; the reader is the issue.

**Repro sketch:**
```
duckdb -c "COPY (SELECT 1 AS a, 'x' AS b) TO '/tmp/dict.parquet' (FORMAT parquet);"
# then via MCP:
load_file(table="t", path="/tmp/dict.parquet")  -> XX001
```

---

### 2. (Blocker) Physical `NullType` columns are rejected — `42804`, and a `schema` override does not rescue it

Several NYC TLC files store always-null columns (`congestion_surcharge`, `airport_fee`) as physical Parquet `NullType`. `load_file` fails with `42804`. Supplying a `schema` override mapping those columns to `DOUBLE PRECISION` does **not** help — the read fails *before* the override is applied.

**Impact:** A common real-world Parquet shape (optional columns that happen to be entirely null in a partition) is unreadable, and the documented escape hatch (`schema` override) doesn't cover it. Worked around by re-typing the columns in DuckDB before the CSV bridge.

**Suggestion:** Apply `schema` overrides early enough to coerce `NullType` physical columns to the target type, or special-case `NullType` → emit all-NULL of the inferred/overridden type.

---

### 3. (Correctness) NUMERIC scale lost in JSON result serialization

`query` results round `numeric` values to whole numbers in the JSON payload. `SELECT ROUND(AVG(fare_amount), 2)` returned `10.0` where the true value was `9.50`; `ROUND(x,2)` of `14.46` came back `14.0`.

**Workaround that works:** `CAST(expr AS NUMERIC(8,2))` serializes correctly (`9.50`, `14.46`). So the issue is specifically in how un-cast / default-scale `numeric` (e.g. the result of `ROUND(double, n)`) is encoded to JSON — trailing scale is dropped or the value is rounded to integer.

**Impact:** Silent precision loss in returned data. Easy to miss because the SQL looks correct.

---

### 4. (Usability) `to_char()` unsupported for numeric/timestamp — `42601`

`to_char(AVG(x), 'FM990.00')` and `to_char(ts, ...)` fail with `42601 unsupported data types in call to 'to_char'`, despite the dialect being advertised as PostgreSQL-compatible. `to_char` is the standard PostgreSQL formatting function and a natural reach for fixed-decimal/locale formatting (and the obvious fix for #3 from a user's perspective).

**Impact:** Minor, but combined with #3 it leaves no ergonomic in-SQL way to format numbers; users must `CAST` to `NUMERIC(p,s)`.

---

## Notes

- Findings tableau/hyper-api-rust#163 and StefanSteiner/hyper-api-rust#2 are the high-impact ones — they determine whether hyperdb can participate in a Parquet-native toolchain without a CSV detour.
- #3 is a quiet correctness trap worth prioritizing despite being "just formatting."
- The full exercise (17 files, schema normalization across 3 TLC schema eras, cross-year aggregation) otherwise worked well — ingest throughput was ~2.2M rows/s for CSV and Parquet export was sub-second per file.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

hyperdb-mcp: Parquet read failures (dictionary encoding, NullType) + numeric serialization/format gaps #163

Summary

1. (Blocker) Parquet read fails on dictionary-encoded files — `XX001 "Error while reading parquet file"`

2. (Blocker) Physical `NullType` columns are rejected — `42804`, and a `schema` override does not rescue it

3. (Correctness) NUMERIC scale lost in JSON result serialization

4. (Usability) `to_char()` unsupported for numeric/timestamp — `42601`

Notes

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

hyperdb-mcp: Parquet read failures (dictionary encoding, NullType) + numeric serialization/format gaps #163

Description

Summary

1. (Blocker) Parquet read fails on dictionary-encoded files — XX001 "Error while reading parquet file"

2. (Blocker) Physical NullType columns are rejected — 42804, and a schema override does not rescue it

3. (Correctness) NUMERIC scale lost in JSON result serialization

4. (Usability) to_char() unsupported for numeric/timestamp — 42601

Notes

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

1. (Blocker) Parquet read fails on dictionary-encoded files — `XX001 "Error while reading parquet file"`

2. (Blocker) Physical `NullType` columns are rejected — `42804`, and a `schema` override does not rescue it

4. (Usability) `to_char()` unsupported for numeric/timestamp — `42601`