Search before asking
Motivation
Apache Paimon already supports the Mosaic file format as a wide-table optimized data format. Mosaic is a columnar-bucket hybrid format that groups columns into deterministic buckets and compresses each bucket independently, reducing read amplification for very wide tables when queries project only a small subset of columns.
Currently, paimon-rust dispatches data file readers by file extension and supports formats such as Parquet, ORC, Avro, Blob, and feature-gated Vortex. However, .mosaic files are not recognized. As a result, Rust clients cannot read Paimon tables whose file.format is set to mosaic, even if those tables were written by Java Paimon or PyPaimon.
This matters for compatibility with the broader Paimon ecosystem, especially for wide-table workloads where Mosaic is the recommended format.
Solution
Add feature-gated support for reading Mosaic data files in paimon-rust, with the implementation split into small reviewable steps.
Proposed phases:
- Reader foundation
  Add an optional mosaic feature, depend on paimon-mosaic-core behind that feature, and introduce a MosaicFormatReader wired into the existing file-format dispatch for .mosaic files.
  This phase should provide basic Arrow RecordBatch reading and projection support.
- Paimon read-path correctness
  Integrate the Mosaic reader with the existing table read semantics, including schema evolution, missing-column null filling, projection order, deletion vectors, and row-range selection.
  This phase should add table-level and mixed-format tests to ensure Mosaic files behave consistently with other supported formats.
- Documentation and compatibility polish
  Document the feature flag, supported scope, and current limitations. The initial scope should be read-only and focused on compatibility with Mosaic files written by Java Paimon, PyPaimon, or paimon-mosaic-core.
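To make the reader-foundation phase concrete, here is a minimal sketch of extension-based dispatch with an optional Mosaic arm. It is not the actual paimon-rust dispatch code: the FileFormat enum, the dispatch_format function, and the error strings are hypothetical names for illustration. The mosaic_enabled parameter stands in for a compile-time check such as cfg!(feature = "mosaic"), so that the sketch compiles outside a Cargo project.

```rust
use std::path::Path;

/// File formats a reader can be dispatched to (illustrative subset, hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum FileFormat {
    Parquet,
    Orc,
    Avro,
    Mosaic,
}

/// Picks a format from the file extension.
///
/// `mosaic_enabled` is an assumption of this sketch: in a real crate the
/// Mosaic arm would more likely be compiled in or out with
/// `#[cfg(feature = "mosaic")]` rather than checked at runtime.
fn dispatch_format(path: &str, mosaic_enabled: bool) -> Result<FileFormat, String> {
    let ext = Path::new(path)
        .extension()
        .and_then(|e| e.to_str())
        .unwrap_or("");
    match ext {
        "parquet" => Ok(FileFormat::Parquet),
        "orc" => Ok(FileFormat::Orc),
        "avro" => Ok(FileFormat::Avro),
        "mosaic" if mosaic_enabled => Ok(FileFormat::Mosaic),
        // Report the disabled feature explicitly instead of a generic
        // "unsupported format" error, so users know how to fix it.
        "mosaic" => Err("mosaic file found, but the `mosaic` feature is disabled".to_string()),
        other => Err(format!("unsupported file format: {other:?}")),
    }
}
```

Returning a distinct error when the feature is disabled is a design suggestion rather than existing behavior. It would let users tell "rebuild with the mosaic feature" apart from "this format is unknown".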
Out of scope for the first phase:
- Writing .mosaic data files from paimon-rust.
- Emitting Mosaic row-group statistics into DataFileMeta.value_stats.
- Making Mosaic the default file format.
- Implementing Mosaic bloom filter support.
- Changing the Mosaic storage format.
Follow-up work can add writer support, stats integration, and performance benchmarks once read compatibility is stable.
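For the read-path correctness phase, the sketch below shows the semantics the table-level tests would need to pin down: output columns follow the projection order, columns missing from an older file are null-filled, and rows are kept only if they fall inside the requested row range and are not marked in the deletion vector. It uses plain Vec<Option<i64>> columns instead of Arrow arrays to stay self-contained. All names (Column, project_with_null_fill, selected_rows, take_rows) are hypothetical. Keying columns by field id is an assumption about how the existing read path resolves schema evolution.

```rust
use std::collections::{HashMap, HashSet};
use std::ops::Range;

/// A nullable column of 64-bit integers; a stand-in for an Arrow array.
type Column = Vec<Option<i64>>;

/// Returns columns in the order of `projection` (field ids). A field that is
/// absent from the file, for example one added by a later schema change, is
/// filled with nulls.
fn project_with_null_fill(
    file_columns: &HashMap<i32, Column>,
    projection: &[i32],
    num_rows: usize,
) -> Vec<Column> {
    projection
        .iter()
        .map(|id| match file_columns.get(id) {
            Some(col) => col.clone(),
            None => vec![None; num_rows],
        })
        .collect()
}

/// Returns the row positions to keep: inside `row_range` (clamped to the
/// file's row count) and not marked as deleted.
fn selected_rows(num_rows: usize, row_range: Range<usize>, deleted: &HashSet<usize>) -> Vec<usize> {
    let end = row_range.end.min(num_rows);
    (row_range.start..end)
        .filter(|pos| !deleted.contains(pos))
        .collect()
}

/// Applies a row selection to every projected column.
fn take_rows(columns: &[Column], rows: &[usize]) -> Vec<Column> {
    columns
        .iter()
        .map(|col| rows.iter().map(|&r| col[r]).collect())
        .collect()
}
```

A mixed-format test could run the same projection, deletion vector, and row range against a Parquet file and a Mosaic file with identical contents and assert that the resulting batches are equal.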
Anything else?
Relevant context:
- paimon-rust already has a format abstraction through FormatFileReader and FormatFileWriter.
- paimon-mosaic-core and paimon-rust both currently use Arrow 58, so a pure Rust integration should avoid the JNI/native library loading issues that exist in the Java integration path.
- The first implementation should be conservative and feature-gated because Mosaic is still evolving and has format-specific options such as bucket count, ZSTD level, and stats columns.
- This issue should prioritize ecosystem read compatibility first. Performance optimizations and writer support can be tracked separately after the initial reader is merged.
Willingness to contribute