BigInt values from Parquet lose precision in display pipeline #538

@paddymul

Description

Summary

Parquet files can contain INT64/UINT64 values that exceed JavaScript's Number.MAX_SAFE_INTEGER (2^53 - 1 = 9007199254740991). These values survive Parquet decoding but get mangled before reaching AG Grid, resulting in potential precision loss or incorrect sorting/filtering.
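
The failure mode is easy to reproduce. JS Numbers are IEEE-754 doubles, and so are Python floats, so the identical rounding can be demonstrated in Python:

```python
# JS Numbers are IEEE-754 doubles; Python floats behave identically,
# so the same silent rounding can be shown here.
MAX_SAFE_INTEGER = 2**53 - 1            # 9007199254740991

print(int(float(MAX_SAFE_INTEGER)))     # 9007199254740991 — exact
print(int(float(2**53 + 1)))            # 9007199254740992 — silently rounded down
```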

AG Grid 35.1 now has native BigInt cell data type support, which we could leverage to handle this correctly end-to-end.

Current Pipeline Behavior

The data flows through these stages:

  1. Parquet file — INT64/UINT64 types are preserved in the schema
  2. Python read (read_utils.py, data_loading.py) — Polars reads as pl.Int64/pl.UInt64, pandas as Int64/UInt64 nullable types. No precision loss here.
  3. Serialization (serialization_utils.py) — DataFrame serialized to Parquet bytes via fastparquet/PyArrow. INT64 type preserved in the Parquet binary.
  4. Frontend decode (resolveDFData.ts) — hyparquet v1.8.2 decodes INT64 to BigInt64Array and UINT64 to BigUint64Array. Values are still correct at this point.
  5. JSON conversion — toJsonSafe() converts BigInt values to strings via .toString(). This preserves the digits but changes the type.
  6. AG Grid display — The IntegerDisplayerA displayer receives a string, not a number or BigInt. max_digits defaults to 12, which is insufficient for large int64 values (up to 19 digits).

Where precision breaks

  • Values > 2^53 that get coerced to JS Number at any point silently lose precision (e.g. 9007199254740993 becomes 9007199254740992)
  • The string conversion in toJsonSafe() preserves digits but means AG Grid treats the column as text — sorting is lexicographic, not numeric
  • The max_digits: 12 cap in _dtype_to_displayer() (data_loading.py:244-254) may truncate display formatting for values with more than 12 digits
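
The lexicographic-sort problem from the second bullet is easy to see in isolation:

```python
# Once BigInts are stringified, sorting compares characters, not magnitudes:
as_strings = ["9", "10", "100"]
print(sorted(as_strings))            # ['10', '100', '9'] — lexicographic
print(sorted(as_strings, key=int))   # ['9', '10', '100'] — the intended numeric order
```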

Relevant Code Paths

Python (type detection → serialization)

  • buckaroo/server/data_loading.py:244-254 — _dtype_to_displayer() maps integer dtypes to "integer" displayer with max_digits: 12
  • buckaroo/serialization_utils.py:176-215 — to_parquet() serialization
  • buckaroo/pluggable_analysis_framework/polars_utils.py:54-57 — NUMERIC_POLARS_DTYPES includes pl.Int64, pl.UInt64

Frontend (decode → display)

  • packages/buckaroo-js-core/src/components/DFViewerParts/resolveDFData.ts:52-69 — parseParquetRow() JSON-parses cells after Parquet decode
  • packages/buckaroo-js-core/src/components/DFViewerParts/DFWhole.ts:29-33 — IntegerDisplayerA type definition
  • packages/buckaroo-js-core/src/components/DFViewerParts/gridUtils.ts:44-61 — getCellRendererorFormatter() applies display formatting

Proposed Fix

1. Preserve BigInt through the decode pipeline

In resolveDFData.ts, avoid converting BigInt values to strings. Either pass them through as native BigInts or convert to Number only when safe (value <= Number.MAX_SAFE_INTEGER).
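
The safe-when-exact decision can be sketched in Python, using float as the JS Number analogue and int as the BigInt analogue (the helper name is hypothetical, not existing Buckaroo code):

```python
MAX_SAFE_INTEGER = 2**53 - 1  # Number.MAX_SAFE_INTEGER

def decode_int64_cell(value: int):
    """Hypothetical analogue of what resolveDFData.ts should do: downcast
    to float (JS Number) only when the round-trip is exact; otherwise keep
    the arbitrary-precision int (JS BigInt). Never stringify."""
    if -MAX_SAFE_INTEGER <= value <= MAX_SAFE_INTEGER:
        return float(value)   # safe: representable exactly as a double
    return value              # unsafe: preserve full precision

assert decode_int64_cell(3) == 3.0
assert decode_int64_cell(2**53 + 1) == 2**53 + 1  # exact, not rounded
```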

2. Use AG Grid 35.1 native BigInt support

For columns with INT64/UINT64 data containing values > 2^53, set cellDataType: 'bigint' on the ColDef. This gives us correct sorting, filtering, and editing for free.

3. Update the integer displayer

  • Increase max_digits for int64 columns (up to 19 digits for signed, 20 for unsigned)
  • Or add a dedicated "bigint" displayer type that handles large values
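
The digit counts in the first option follow directly from the type ranges:

```python
# Digit widths needed for full int64/uint64 display:
print(len(str(2**63 - 1)))    # 19 — largest signed int64 (9223372036854775807)
print(len(str(2**64 - 1)))    # 20 — largest uint64 (18446744073709551615)
# The sign adds one more character: "-9223372036854775808" is 20 chars wide.
```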

4. Update Python type mapping

In _dtype_to_displayer(), detect Int64/UInt64 columns and emit a displayer config that signals the frontend to use BigInt handling.
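
A minimal sketch of that mapping, keyed on dtype names for self-containment; the displayer config keys follow this issue's description and are assumptions, not Buckaroo's exact current API:

```python
def dtype_to_displayer(dtype_name: str) -> dict:
    """Hypothetical sketch of the _dtype_to_displayer() change."""
    if dtype_name in ("Int64", "UInt64"):
        # Signal the frontend to route this column through BigInt handling.
        return {"displayer": "bigint", "max_digits": 20}
    if dtype_name.startswith(("Int", "UInt")):
        # Narrower integer types always fit in a JS Number.
        return {"displayer": "integer", "max_digits": 12}
    return {"displayer": "obj"}
```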

Integration Test Plan

A test should cover the full round-trip:

# Python side
import io

import polars as pl

df = pl.DataFrame(
    {
        "small_int": [1, 2, 3],
        "big_int": [9007199254740993, 9007199254740994, 9007199254740995],  # > 2^53
        "max_uint64": [0, 2**63 - 1, 2**64 - 1],  # unsigned edge cases
    },
    schema_overrides={"max_uint64": pl.UInt64},  # 2**64 - 1 overflows the inferred Int64
)

# Write to parquet, serialize through Buckaroo pipeline
buf = io.BytesIO()
df.write_parquet(buf)
buf.seek(0)

# Assert values survive serialization
assert pl.read_parquet(buf).equals(df)
// Frontend side
// Decode parquet bytes with hyparquet
// Assert BigInt values are not truncated
// Assert typeof value is 'bigint', not 'number'
// Assert AG Grid column receives correct cellDataType

Edge cases to test

  • Values exactly at Number.MAX_SAFE_INTEGER (2^53 - 1) — should work as normal numbers
  • Values at Number.MAX_SAFE_INTEGER + 1 — must not silently round
  • Int64 min/max: -9223372036854775808 to 9223372036854775807
  • UInt64 max: 18446744073709551615
  • Null/missing values in BigInt columns
  • Mixed columns where some values fit in Number and some don't

Priority

Low — this only affects datasets with integer values > 2^53, which is uncommon in typical data analysis. But it's a correctness issue that could cause silent data corruption for affected users (e.g. database primary keys, blockchain data, large ID systems).
