Summary
Parquet files can contain INT64/UINT64 values that exceed JavaScript's Number.MAX_SAFE_INTEGER (2^53 - 1 = 9007199254740991). These values survive Parquet decoding but get mangled before reaching AG Grid, resulting in potential precision loss or incorrect sorting/filtering.
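The failure mode is easy to reproduce in plain TypeScript: coercing a BigInt just past 2^53 into a Number silently rounds it to the nearest representable float64.

```typescript
// 2^53 - 1 is the largest integer a float64 Number represents exactly.
const big: bigint = 9007199254740993n; // 2^53 + 1

// Coercion to Number rounds to the nearest representable float64.
const coerced: number = Number(big); // off by one: 9007199254740992

console.log(Number.MAX_SAFE_INTEGER);      // 9007199254740991
console.log(coerced === 9007199254740992); // true: precision silently lost
console.log(BigInt(coerced) === big);      // false: round-trip cannot recover the value
```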
AG Grid 35.1 now has native BigInt cell data type support, which we could leverage to handle this correctly end-to-end.
Current Pipeline Behavior
The data flows through these stages:
- Parquet file — INT64/UINT64 types are preserved in the schema
- Python read (`read_utils.py`, `data_loading.py`) — Polars reads as `pl.Int64`/`pl.UInt64`, pandas as `Int64`/`UInt64` nullable types. No precision loss here.
- Serialization (`serialization_utils.py`) — DataFrame serialized to Parquet bytes via fastparquet/PyArrow. INT64 type preserved in the Parquet binary.
- Frontend decode (`resolveDFData.ts`) — hyparquet v1.8.2 decodes INT64 to `BigInt64Array` and UINT64 to `BigUint64Array`. Values are still correct at this point.
- JSON conversion — `toJsonSafe()` converts BigInt values to strings via `.toString()`. This preserves the digits but changes the type.
- AG Grid display — the `IntegerDisplayerA` displayer receives a string, not a number or BigInt. `max_digits` defaults to 12, which is insufficient for large int64 values (up to 19 digits).
Where precision breaks
- Values > 2^53 that get coerced to a JS `Number` at any point silently lose precision (e.g. `9007199254740993` becomes `9007199254740992`)
- The string conversion in `toJsonSafe()` preserves digits but means AG Grid treats the column as text — sorting is lexicographic, not numeric
- The `max_digits: 12` cap in `_dtype_to_displayer()` (`data_loading.py:244-254`) may truncate display formatting for values with more than 12 digits
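The sorting consequence of stringified BigInts is concrete: JavaScript's default string sort compares digit characters left to right, so shorter numbers can order after longer ones.

```typescript
const values: bigint[] = [9n, 9007199254740993n, 100n];

// As strings (what toJsonSafe() currently produces), sorting is lexicographic.
const asStrings = values.map((v) => v.toString()).sort();
console.log(asStrings); // ["100", "9", "9007199254740993"] -- "9" sorts after "100"

// As native BigInts, a numeric comparator gives the correct order.
const asBigints = [...values].sort((a, b) => (a < b ? -1 : a > b ? 1 : 0));
console.log(asBigints); // [9n, 100n, 9007199254740993n]
```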
Relevant Code Paths
Python (type detection → serialization)
- `buckaroo/server/data_loading.py:244-254` — `_dtype_to_displayer()` maps integer dtypes to the `"integer"` displayer with `max_digits: 12`
- `buckaroo/serialization_utils.py:176-215` — `to_parquet()` serialization
- `buckaroo/pluggable_analysis_framework/polars_utils.py:54-57` — `NUMERIC_POLARS_DTYPES` includes `pl.Int64`, `pl.UInt64`
Frontend (decode → display)
- `packages/buckaroo-js-core/src/components/DFViewerParts/resolveDFData.ts:52-69` — `parseParquetRow()` JSON-parses cells after Parquet decode
- `packages/buckaroo-js-core/src/components/DFViewerParts/DFWhole.ts:29-33` — `IntegerDisplayerA` type definition
- `packages/buckaroo-js-core/src/components/DFViewerParts/gridUtils.ts:44-61` — `getCellRendererorFormatter()` applies display formatting
Proposed Fix
1. Preserve BigInt through the decode pipeline
In `resolveDFData.ts`, avoid converting BigInt values to strings. Either pass them through as native BigInts or convert to Number only when the value is exactly representable (`Number.MIN_SAFE_INTEGER <= value <= Number.MAX_SAFE_INTEGER`).
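A minimal sketch of that guard (the helper name `bigintToSafeValue` is hypothetical, not an existing Buckaroo function):

```typescript
// Convert a decoded BigInt to a Number only when it fits in the float64
// safe-integer range; otherwise keep it as a native BigInt.
function bigintToSafeValue(v: bigint): number | bigint {
    if (
        v >= BigInt(Number.MIN_SAFE_INTEGER) &&
        v <= BigInt(Number.MAX_SAFE_INTEGER)
    ) {
        return Number(v); // exact: within +/- (2^53 - 1)
    }
    return v; // too large to represent exactly; preserve as BigInt
}

console.log(bigintToSafeValue(42n));                // 42 (number)
console.log(bigintToSafeValue(9007199254740993n));  // 9007199254740993n (bigint)
```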
2. Use AG Grid 35.1 native BigInt support
For columns with INT64/UINT64 data containing values > 2^53, set `cellDataType: 'bigint'` on the `ColDef`. This gives us correct sorting, filtering, and editing for free.
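A sketch of what such a column definition could look like, assuming AG Grid 35.1's `cellDataType: 'bigint'` behaves as described above; the field name and formatter here are illustrative, and no AG Grid import is shown:

```typescript
// Illustrative ColDef for an int64 column known to exceed 2^53.
// `cellDataType: "bigint"` assumes AG Grid >= 35.1 native BigInt support.
const bigIntColDef = {
    field: "big_int",           // hypothetical column name
    cellDataType: "bigint",     // numeric sort/filter over native BigInt values
    valueFormatter: (params: { value: bigint | null }) =>
        params.value == null ? "" : params.value.toString(),
};

console.log(bigIntColDef.valueFormatter({ value: 9007199254740993n })); // "9007199254740993"
```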
3. Update the integer displayer
- Increase `max_digits` for int64 columns (up to 19 digits for signed, 20 for unsigned)
- Or add a dedicated `"bigint"` displayer type that handles large values
4. Update Python type mapping
In `_dtype_to_displayer()`, detect Int64/UInt64 columns and emit a displayer config that signals the frontend to use BigInt handling.
Integration Test Plan
A test should cover the full round-trip:
```python
# Python side
import polars as pl

# schema_overrides is needed here: 2**64 - 1 overflows the default Int64 inference
df = pl.DataFrame(
    {
        "small_int": [1, 2, 3],
        "big_int": [9007199254740993, 9007199254740994, 9007199254740995],  # > 2^53
        "max_uint64": [0, 2**63 - 1, 2**64 - 1],  # unsigned edge cases
    },
    schema_overrides={"max_uint64": pl.UInt64},
)
# Write to parquet, serialize through Buckaroo pipeline
# Assert values survive serialization
```

```typescript
// Frontend side
// Decode parquet bytes with hyparquet
// Assert BigInt values are not truncated
// Assert typeof value is 'bigint', not 'number'
// Assert AG Grid column receives correct cellDataType
```

Edge cases to test
- Values exactly at `Number.MAX_SAFE_INTEGER` (2^53 - 1) — should work as normal numbers
- Values at `Number.MAX_SAFE_INTEGER + 1` — must not silently round
- Int64 min/max: -9223372036854775808 to 9223372036854775807
- UInt64 max: 18446744073709551615
- Null/missing values in BigInt columns
- Mixed columns where some values fit in Number and some don't
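The Int64/UInt64 boundary values above can be checked on the frontend without any Parquet machinery, since `BigInt64Array`/`BigUint64Array` are exactly what hyparquet produces:

```typescript
// Int64/UInt64 extremes survive typed-array storage exactly.
const int64 = new BigInt64Array([-9223372036854775808n, 9223372036854775807n]);
const uint64 = new BigUint64Array([0n, 18446744073709551615n]);

console.log(int64[0] === -(2n ** 63n));    // true: Int64 min
console.log(int64[1] === 2n ** 63n - 1n);  // true: Int64 max
console.log(uint64[1] === 2n ** 64n - 1n); // true: UInt64 max
console.log(typeof int64[1]);              // "bigint"
```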
Priority
Low — this only affects datasets with integer values > 2^53, which is uncommon in typical data analysis. But it's a correctness issue that could cause silent data corruption for affected users (e.g. database primary keys, blockchain data, large ID systems).