
feat: Support Mosaic file format in paimon-rust #378

@QuakeWang


Search before asking

  • I searched in the issues and found nothing similar.

Motivation

Apache Paimon already supports the Mosaic file format as a wide-table optimized data format. Mosaic is a columnar-bucket hybrid format that groups columns into deterministic buckets and compresses each bucket independently, reducing read amplification for very wide tables when queries project only a small subset of columns.

Currently, paimon-rust dispatches data file readers by file extension and supports formats such as Parquet, ORC, Avro, Blob, and feature-gated Vortex. However, .mosaic files are not recognized. As a result, Rust clients cannot read Paimon tables whose file.format is set to mosaic, even if those tables were written by Java Paimon or PyPaimon.
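To make the gap concrete, here is a minimal sketch of extension-based dispatch. It is not the actual paimon-rust code: `DataFileFormat`, `format_for_path`, the exact extension strings, and the `mosaic_enabled` parameter are all hypothetical names chosen for illustration. In a real crate the boolean would more likely come from a compile-time check such as `cfg!(feature = "mosaic")`.

```rust
use std::path::Path;

/// Hypothetical set of data file formats; not the real paimon-rust type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DataFileFormat {
    Parquet,
    Orc,
    Avro,
    Blob,
    Mosaic,
}

/// Map a data file path to a format by its extension.
///
/// Returns `None` when the path has no extension, the extension is not
/// recognized, or the extension is `mosaic` but Mosaic support is disabled.
/// `mosaic_enabled` stands in for a Cargo feature check (assumption).
pub fn format_for_path(path: &str, mosaic_enabled: bool) -> Option<DataFileFormat> {
    let ext = Path::new(path).extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "parquet" => Some(DataFileFormat::Parquet),
        "orc" => Some(DataFileFormat::Orc),
        "avro" => Some(DataFileFormat::Avro),
        "blob" => Some(DataFileFormat::Blob),
        // Only recognized when the (hypothetical) feature flag is on.
        "mosaic" if mosaic_enabled => Some(DataFileFormat::Mosaic),
        _ => None,
    }
}
```

With `mosaic_enabled` set to `false`, a path ending in `.mosaic` falls through to the catch-all arm, which mirrors the current behavior described above.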

This matters for compatibility with the broader Paimon ecosystem, especially for wide-table workloads where Mosaic is the recommended format.

Solution

Add feature-gated support for reading Mosaic data files in paimon-rust, with the implementation split into small reviewable steps.

Proposed phases:

  1. Reader foundation

    Add an optional mosaic feature, depend on paimon-mosaic-core behind that feature, and introduce a MosaicFormatReader wired into the existing file-format dispatch for .mosaic files.

    This phase should provide basic Arrow RecordBatch reading and projection support.

  2. Paimon read-path correctness

    Integrate the Mosaic reader with the existing table read semantics, including schema evolution, missing-column null filling, projection order, deletion vectors, and row-range selection.

    This phase should add table-level and mixed-format tests to ensure Mosaic files behave consistently with other supported formats.

  3. Documentation and compatibility polish

    Document the feature flag, supported scope, and current limitations. The initial scope should be read-only and focused on compatibility with Mosaic files written by Java Paimon, PyPaimon, or paimon-mosaic-core.
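As a rough illustration of two of the phase 2 requirements (projection order and missing-column null filling), the sketch below aligns columns read from a file to the order requested by the table read. It uses plain vectors instead of Arrow arrays to stay self-contained. `Column`, `align_to_projection`, and the choice to key columns by an integer field ID are assumptions made for this example, not the paimon-rust API.

```rust
use std::collections::HashMap;

/// A toy nullable column: one optional value per row.
/// Stands in for an Arrow array in this sketch.
pub type Column = Vec<Option<i64>>;

/// Arrange columns read from a data file into the order requested by the
/// projection. A projected field that is absent from the file (for example,
/// a column added by a later schema change) becomes an all-null column.
///
/// `file_columns` is keyed by a hypothetical field ID.
pub fn align_to_projection(
    file_columns: &HashMap<i32, Column>,
    projection: &[i32],
    num_rows: usize,
) -> Vec<Column> {
    projection
        .iter()
        .map(|field_id| match file_columns.get(field_id) {
            // Column exists in the file: emit it at the projected position.
            Some(column) => column.clone(),
            // Column missing from the file: fill with nulls.
            None => vec![None; num_rows],
        })
        .collect()
}
```

The output order follows `projection` rather than the file's physical column order, which is the property the table-level and mixed-format tests would need to check for every format, Mosaic included.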

Out of scope for the initial implementation:

  • Writing .mosaic data files from paimon-rust.
  • Emitting Mosaic row-group statistics into DataFileMeta.value_stats.
  • Making Mosaic the default file format.
  • Implementing Mosaic bloom filter support.
  • Changing the Mosaic storage format.

Follow-up work can add writer support, stats integration, and performance benchmarks once read compatibility is stable.

Anything else?

Relevant context:

  • paimon-rust already has a format abstraction through FormatFileReader and FormatFileWriter.
  • paimon-mosaic-core and paimon-rust both currently use Arrow 58, so a pure Rust integration should avoid the JNI/native library loading issues that exist in the Java integration path.
  • The first implementation should be conservative and feature-gated because Mosaic is still evolving and has format-specific options such as bucket count, ZSTD level, and stats columns.
  • This issue should prioritize ecosystem read compatibility first. Performance optimizations and writer support can be tracked separately after the initial reader is merged.

Willingness to contribute

  • I'm willing to submit a PR!
