[parquet] Use MICROS annotation for TIMESTAMP(n<=3) columns (Iceberg v2 compatibility) by q8webmaster · Pull Request #8230 · apache/paimon

q8webmaster · 2026-06-13T22:59:23Z

Problem

Any Paimon table with a TIMESTAMP or TIMESTAMP_WITH_LOCAL_TIME_ZONE column of precision ≤ 3 cannot be queried by Iceberg-aware engines (Athena, Trino, Spark Iceberg reader). The engine rejects every Parquet file for that column with an error such as:

Field ts's type INT64 in parquet file … is incompatible with type
timestamp(6) with time zone defined in table schema

The job writes data successfully and Paimon itself can read the files back, but any reader that enforces the Iceberg v2 Parquet spec rejects them.

Root cause

1. Wrong Parquet annotation

ParquetSchemaConverter.createTimestampWithLogicalType selects the annotation unit based on precision:

| Precision | Annotation emitted | Iceberg v2 valid? |
| --- | --- | --- |
| ≤ 3 | `TIMESTAMP(MILLIS)` | ❌ |
| 4–6 | `TIMESTAMP(MICROS)` | ✅ |
| > 6 | `INT96` | — |

The Iceberg v2 specification requires INT64 MICROS for both timestamp and timestamptz logical types. MILLIS is only permitted under Iceberg v3, which most engines do not yet support.
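
The precision-to-annotation mapping before and after this change can be summarised in a minimal, self-contained sketch. The class, enum, and method names here are hypothetical stand-ins for illustration; the real ParquetSchemaConverter works with Parquet's own logical-type annotations rather than a local enum.

```java
/** Hypothetical stand-in for the unit selection in createTimestampWithLogicalType. */
final class TimestampUnitSketch {

    /** Simplified view of how a timestamp column is physically encoded. */
    enum Encoding { INT64_MILLIS, INT64_MICROS, INT96 }

    /** Mapping before this PR: precision <= 3 is annotated as MILLIS. */
    static Encoding before(int precision) {
        if (precision <= 3) {
            return Encoding.INT64_MILLIS;
        }
        return precision <= 6 ? Encoding.INT64_MICROS : Encoding.INT96;
    }

    /** Mapping after this PR: every INT64 timestamp is annotated as MICROS. */
    static Encoding after(int precision) {
        return precision <= 6 ? Encoding.INT64_MICROS : Encoding.INT96;
    }

    public static void main(String[] args) {
        for (int p = 0; p <= 9; p++) {
            System.out.println("precision " + p + ": " + before(p) + " -> " + after(p));
        }
    }
}
```

Only precisions 0 to 3 change; precisions 4 to 6 and the INT96 branch are the same in both mappings.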

2. Wrong Parquet footer-stats decoding

ParquetSimpleStatsExtractor.toTimestampStats called Timestamp.fromEpochMillis for precision ≤ 3. After the annotation fix, footer stats for those columns are INT64 microseconds, so the correct call is Timestamp.fromMicros. Using fromEpochMillis inflates the decoded bound by 1000×.
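
To illustrate the 1000× inflation, here is a small sketch that decodes the same raw INT64 footer value under both unit assumptions. It uses only java.time; the class and method names are hypothetical and do not mirror Paimon's Timestamp API.

```java
import java.time.Instant;

/** Hypothetical illustration of decoding one raw INT64 statistic under two unit assumptions. */
final class FooterStatsSketch {

    /** Treats the raw value as epoch milliseconds (wrong for MICROS-annotated columns). */
    static Instant decodeAsMillis(long raw) {
        return Instant.ofEpochMilli(raw);
    }

    /** Treats the raw value as epoch microseconds (matches the MICROS annotation). */
    static Instant decodeAsMicros(long raw) {
        long seconds = Math.floorDiv(raw, 1_000_000L);
        long nanos = Math.floorMod(raw, 1_000_000L) * 1_000L;
        return Instant.ofEpochSecond(seconds, nanos);
    }

    public static void main(String[] args) {
        // 2023-11-14T22:13:20Z expressed as epoch microseconds.
        long rawMax = 1_700_000_000_000_000L;
        System.out.println("as micros: " + decodeAsMicros(rawMax));
        System.out.println("as millis: " + decodeAsMillis(rawMax));
    }
}
```

Decoded as milliseconds, the bound lands tens of thousands of years in the future, which is why min/max statistics become useless for pruning.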

3. Wrong lazy dictionary decoding

VectorizedColumnReader has a lazy dictionary fast path for INT64/LongColumnVector: the raw Parquet dictionary is attached to the vector without going through LongTimestampUpdater.longTimestamp(), which normalises on-disk microseconds to the milliseconds that ParquetTimestampVector.getTimestamp expects. The result is timestamps ~1000× too large for any dictionary-encoded page.
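
The difference between the two decoding paths can be sketched with plain arrays. This is a deliberate simplification under stated assumptions: the real reader attaches the dictionary to the column vector and resolves ids on access, and the names below are hypothetical.

```java
/** Hypothetical illustration of eager vs lazy dictionary decoding for timestamp columns. */
final class LazyDictionarySketch {

    /** Eager path: each dictionary id is resolved and normalised from micros to millis. */
    static long[] eagerDecode(long[] dictionaryMicros, int[] ids) {
        long[] millis = new long[ids.length];
        for (int i = 0; i < ids.length; i++) {
            millis[i] = dictionaryMicros[ids[i]] / 1_000L;
        }
        return millis;
    }

    /** Lazy path: raw dictionary entries surface unchanged, so micros are later read as millis. */
    static long[] lazyDecode(long[] dictionaryMicros, int[] ids) {
        long[] raw = new long[ids.length];
        for (int i = 0; i < ids.length; i++) {
            raw[i] = dictionaryMicros[ids[i]];
        }
        return raw;
    }

    public static void main(String[] args) {
        long[] dictionary = {1_700_000_000_123_000L}; // on-disk epoch microseconds
        int[] ids = {0, 0};
        System.out.println("eager: " + eagerDecode(dictionary, ids)[0]);
        System.out.println("lazy:  " + lazyDecode(dictionary, ids)[0]);
    }
}
```

A consumer that expects milliseconds gets the right value only from the eager path; the lazy path hands it a number 1000× too large.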

Fix

- ParquetSchemaConverter: emit MICROS for precision ≤ 3 in createTimestampWithLogicalType.
- ParquetRowDataWriter.TimestampMillsWriter: call value.toMicros() (milliseconds × 1000) so the stored value matches the MICROS annotation.
- ParquetSimpleStatsExtractor: use Timestamp.fromMicros for precision ≤ 3 footer statistics.
- VectorizedColumnReader: exclude precision ≤ 3 timestamp types from lazy dictionary decoding via isLowPrecisionTimestamp helper.

The reader path for existing files (MILLIS annotation → precision=3, MICROS annotation → precision=6) is intentionally left unchanged so that files written by older Paimon versions remain readable.
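
The writer-side change amounts to a unit conversion. A minimal sketch, assuming a timestamp is held as epoch milliseconds plus the nanoseconds within that millisecond; the class and method here are hypothetical and are not Paimon's Timestamp class:

```java
/** Hypothetical illustration of converting a millisecond-based timestamp to epoch microseconds. */
final class TimestampToMicrosSketch {

    /**
     * Converts epoch milliseconds plus a sub-millisecond nanosecond part to epoch microseconds.
     * multiplyExact throws ArithmeticException on overflow instead of wrapping silently.
     */
    static long toMicros(long epochMillis, int nanoOfMillisecond) {
        return Math.multiplyExact(epochMillis, 1_000L) + nanoOfMillisecond / 1_000;
    }

    public static void main(String[] args) {
        // A TIMESTAMP(3) value has no sub-millisecond part, so the result is millis * 1000.
        System.out.println(toMicros(1_700_000_000_123L, 0));
    }
}
```

For precision ≤ 3 the sub-millisecond part is zero, so every stored value is a multiple of 1000 and stays millisecond-granular.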

Backward compatibility

Existing tables whose files carry a MILLIS annotation are still readable. Tables that mix old (MILLIS) and new (MICROS) files should be rebuilt after upgrading, because Iceberg-aware engines enforcing strict annotation checking will still reject the old files.

After the fix, a TIMESTAMP(3) column read back through Paimon's own Parquet reader returns TIMESTAMP(6) (since MICROS maps to precision 6 in the reader). Stored values remain millisecond-granular; only the declared precision widens.

Prior art

PR #8222 fixed a related bug in PostgresRecordParser where a Debezium io.debezium.time.Timestamp (int64 millis) was mapped to BIGINT instead of TIMESTAMP(3). Both affect the same precision ≤ 3 path. The Iceberg manifest decoding fix is in the companion PR #8231.

Changes

- ParquetSchemaConverter.java: emit MICROS for precision ≤ 3
- ParquetRowDataWriter.java: TimestampMillsWriter.writeTimestamp calls value.toMicros()
- ParquetSimpleStatsExtractor.java: use fromMicros for precision ≤ 3 footer stats
- ParquetTimestampVector.java: update class javadoc to reflect MICROS storage
- VectorizedColumnReader.java: exclude precision ≤ 3 timestamps from lazy dictionary decoding; add isLowPrecisionTimestamp helper
- ParquetSchemaConverterTest.java: new test testLowPrecisionTimestampUseMicrosAnnotation; testPaimonParquetSchemaConvert updated for widened round-trip precision

…v2 compatibility)

Paimon emits TIMESTAMP(MILLIS) for precision <= 3 columns. The Iceberg v2 spec requires INT64 MICROS for timestamp/timestamptz; MILLIS is only valid under Iceberg v3. This causes Iceberg-aware engines (Athena, Trino, Spark) to reject Parquet files with a schema compatibility error.

- ParquetSchemaConverter.createTimestampWithLogicalType: emit MICROS for precision <= 3 instead of MILLIS.
- ParquetRowDataWriter.TimestampMillsWriter.writeTimestamp: call value.toMicros() so the stored INT64 matches the MICROS annotation unit.

The reader path (MILLIS -> precision=3, MICROS -> precision=6) is left unchanged so files written by older versions remain readable. Existing tables with precision <= 3 columns should be rebuilt after upgrading.

Tests: testLowPrecisionTimestampUseMicrosAnnotation verifies MICROS annotation for precision 0-3; testPaimonParquetSchemaConvert updated for the widened round-trip precision.

…of micros

ParquetSimpleStatsExtractor.toTimestampStats called fromEpochMillis for precision <= 3, but footer statistics for those columns now contain INT64 microseconds (matching the MICROS annotation). Switch to fromMicros so that Parquet column bounds are decoded correctly.

VectorizedColumnReader has a lazy dictionary fast path for INT64/LongColumnVector: the raw Parquet dictionary is stored on the vector directly, bypassing LongTimestampUpdater.longTimestamp(), which normalises on-disk microseconds to the milliseconds that ParquetTimestampVector.getTimestamp expects. The result is timestamps ~1000x too far in the future for any dictionary-encoded page (triggered when rowGroupSize is large enough to activate dictionary encoding).

Exclude precision <= 3 timestamp types from lazy decoding via a new isLowPrecisionTimestamp helper so the eager path (decodeDictionaryIds) is always taken, applying the correct /1000 normalisation.

…vs epoch_µs

After the MICROS annotation change, ParquetRowDataWriter stores TIMESTAMP(n<=3) values as epoch microseconds. ParquetFilters.convertLiteral was still using getMillisecond() (epoch_ms) for those columns, so the comparison against the new epoch_µs row-group statistics never matched. As a result, WHERE predicates on low-precision timestamp columns filtered out all row groups and returned empty results.

Fix: use toMicros() for all INT64 timestamp precisions (0-6) in ParquetFilters.convertLiteral, matching the storage unit written by the writer. Update ParquetFiltersTest assertions accordingly.
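
The filter regression described in this commit message can be reproduced with a small min/max check. This is a sketch under stated assumptions: the pruning rule is a simplified equality test, and the names are hypothetical rather than taken from ParquetFilters or the Parquet filter API.

```java
/** Hypothetical illustration of row-group pruning with mismatched timestamp units. */
final class RowGroupPruningSketch {

    /** Returns true if a row group with these statistics may contain rows equal to the literal. */
    static boolean mightContainEq(long statMin, long statMax, long literal) {
        return literal >= statMin && literal <= statMax;
    }

    public static void main(String[] args) {
        long valueMillis = 1_700_000_000_123L;
        long valueMicros = valueMillis * 1_000L;

        // After the writer change, row-group statistics are epoch microseconds.
        long statMin = valueMicros;
        long statMax = valueMicros;

        // A literal still expressed in epoch millis falls outside [min, max]: row group dropped.
        System.out.println("millis literal kept: " + mightContainEq(statMin, statMax, valueMillis));
        // A literal converted to epoch micros matches the statistics: row group kept.
        System.out.println("micros literal kept: " + mightContainEq(statMin, statMax, valueMicros));
    }
}
```

Because a millisecond literal is roughly 1000× smaller than any microsecond statistic for the same instant, it falls below the minimum of every row group, which matches the empty-result symptom.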

Q8Webmaster added 3 commits June 14, 2026 02:27

q8webmaster force-pushed the fix/parquet-timestamp-millis-to-micros branch from 6ebb533 to f9315e0 June 14, 2026 00:27

q8webmaster marked this pull request as draft June 14, 2026 15:16

Q8Webmaster added 2 commits June 14, 2026 23:06

[parquet] remove over-long comment that broke Spotless line-length check

5bc2d62

q8webmaster marked this pull request as ready for review June 14, 2026 23:01

q8webmaster mentioned this pull request Jun 15, 2026

[parquet] Add regression test for TIMESTAMP(n<=3) reading MICROS-annotated INT64 files #8238

Draft

q8webmaster wants to merge 5 commits into apache:master from q8webmaster:fix/parquet-timestamp-millis-to-micros
