feat(puffin): add puffin file reader and writer#624
feat(puffin): add puffin file reader and writer#624zhaoxuan1994 wants to merge 1 commit intoapache:mainfrom
Conversation
zhaoxuan1994
commented
Apr 23, 2026
- PuffinWriter: in-memory writer that builds complete Puffin files
- Add() writes blobs with optional compression
- Finish() serializes footer with JSON metadata
- Tracks BlobMetadata for all written blobs
- PuffinReader: in-memory reader that parses Puffin files
- ReadFileMetadata() parses footer and validates magic bytes
- ReadBlob() reads and decompresses individual blobs
- ReadAll() reads all blobs from metadata
- Expose Compress/Decompress as public API in puffin_format.h
- Register new sources in CMake and Meson build systems
- Add comprehensive tests
There was a problem hiding this comment.
Pull request overview
Adds an in-memory Puffin file writer/reader to the Iceberg C++ Puffin module, exposes compression helpers as public API, wires new sources into the build, and introduces round-trip + Java-compatibility tests.
Changes:
- Introduce
PuffinWriterfor building Puffin files in-memory (blob writing + footer serialization). - Introduce
PuffinReaderfor parsing Puffin files in-memory (footer parsing + blob reading). - Export
Compress/DecompressAPIs and register new sources/tests in CMake + Meson.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| src/iceberg/puffin/puffin_writer.h | Declares the in-memory Puffin writer API. |
| src/iceberg/puffin/puffin_writer.cc | Implements Puffin file serialization (header, blobs, footer). |
| src/iceberg/puffin/puffin_reader.h | Declares the in-memory Puffin reader API. |
| src/iceberg/puffin/puffin_reader.cc | Implements footer parsing, magic validation, and blob reads. |
| src/iceberg/puffin/puffin_format.h | Exposes Compress/Decompress as public APIs. |
| src/iceberg/puffin/puffin_format.cc | Moves flag helpers and keeps compression stubs in the public namespace. |
| src/iceberg/puffin/meson.build | Installs new public headers for Meson builds. |
| src/iceberg/meson.build | Adds new Puffin reader/writer sources to the core library build. |
| src/iceberg/CMakeLists.txt | Adds new Puffin reader/writer sources to the CMake build. |
| src/iceberg/test/puffin_reader_writer_test.cc | Adds comprehensive writer/reader round-trip and Java-compat tests. |
| src/iceberg/test/meson.build | Adds the new Puffin reader/writer test to Meson test target. |
| src/iceberg/test/CMakeLists.txt | Adds the new Puffin reader/writer test to CMake test target. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| /// Parses a Puffin file from an in-memory buffer. Usage: | ||
| /// PuffinReader reader(file_data); | ||
| /// auto metadata = reader.ReadFileMetadata(); | ||
| /// auto blob = reader.ReadBlob(metadata->blobs[0]); |
| auto payload_size = ReadLittleEndian<int32_t>( | ||
| data_.data() + footer_struct_offset + PuffinFormat::kFooterStructPayloadSizeOffset); | ||
|
|
||
| // Calculate total footer size and validate | ||
| int64_t footer_size = PuffinFormat::kFooterStartMagicLength + | ||
| static_cast<int64_t>(payload_size) + | ||
| PuffinFormat::kFooterStructLength; | ||
| auto footer_offset = file_size - footer_size; | ||
| if (footer_offset < 0) { | ||
| return Invalid("Invalid file: footer size {} exceeds file size {}", footer_size, | ||
| file_size); | ||
| } | ||
|
|
||
| // Validate footer start magic | ||
| ICEBERG_RETURN_UNEXPECTED(CheckMagic(data_, footer_offset)); | ||
|
|
||
| // Check flags for footer compression | ||
| std::array<uint8_t, 4> flags{}; | ||
| std::memcpy( | ||
| flags.data(), | ||
| data_.data() + footer_struct_offset + PuffinFormat::kFooterStructFlagsOffset, 4); | ||
|
|
||
| PuffinCompressionCodec footer_compression = PuffinCompressionCodec::kNone; | ||
| if (IsFlagSet(flags, PuffinFlag::kFooterPayloadCompressed)) { | ||
| footer_compression = PuffinFormat::kDefaultFooterCompressionCodec; | ||
| } | ||
|
|
||
| // Extract footer payload | ||
| auto payload_offset = footer_offset + PuffinFormat::kFooterStartMagicLength; | ||
| std::span<const std::byte> payload_span(data_.data() + payload_offset, payload_size); | ||
| ICEBERG_ASSIGN_OR_RAISE(auto payload_bytes, | ||
| Decompress(footer_compression, payload_span)); |
There was a problem hiding this comment.
payload_size is already validated above:
if (payload_size < 0) {
return Invalid("Invalid file: negative payload size {}", payload_size);
}
| if (blob_metadata.offset < 0 || | ||
| blob_metadata.offset + blob_metadata.length > file_size) { | ||
| return Invalid("Invalid blob: offset {} + length {} exceeds file size {}", | ||
| blob_metadata.offset, blob_metadata.length, file_size); | ||
| } | ||
|
|
||
| std::span<const std::byte> raw_data(data_.data() + blob_metadata.offset, | ||
| blob_metadata.length); |
- PuffinWriter: in-memory writer that builds complete Puffin files - Add() writes blobs with optional compression - Finish() serializes footer with JSON metadata - Tracks BlobMetadata for all written blobs - PuffinReader: in-memory reader that parses Puffin files - ReadFileMetadata() parses footer and validates magic bytes - ReadBlob() reads and decompresses individual blobs - ReadAll() reads all blobs from metadata - Expose Compress/Decompress as public API in puffin_format.h - Register new sources in CMake and Meson build systems - Add comprehensive tests including Java binary compatibility
1dd5373 to
f8c15ee
Compare
| std::unreachable(); | ||
| } | ||
|
|
||
| } // namespace |
There was a problem hiding this comment.
is there any reason to move both Result<std::vector<std::byte>> Compress and Result<std::vector<std::byte>> Decompress out of the namespace?
kamcheungting-db
left a comment
There was a problem hiding this comment.
Some minor comments.
Nice code
| } | ||
| auto* begin = reinterpret_cast<const uint8_t*>(data.data() + offset); | ||
| if (!std::equal(PuffinFormat::kMagicV1.begin(), PuffinFormat::kMagicV1.end(), begin)) { | ||
| return Invalid("Invalid file: expected magic at offset {}", offset); |
There was a problem hiding this comment.
please also save the value of incorrect magic number into Invalid for easy debugging.
| auto payload_size = ReadLittleEndian<int32_t>( | ||
| data_.data() + footer_struct_offset + PuffinFormat::kFooterStructPayloadSizeOffset); | ||
|
|
||
| if (payload_size < 0) { |
There was a problem hiding this comment.
should payload_size == 0 also be considered as an error.
| std::memcpy( | ||
| flags.data(), | ||
| data_.data() + footer_struct_offset + PuffinFormat::kFooterStructFlagsOffset, 4); |
There was a problem hiding this comment.
| std::memcpy( | |
| flags.data(), | |
| data_.data() + footer_struct_offset + PuffinFormat::kFooterStructFlagsOffset, 4); | |
| std::memcpy( | |
| flags.data(), | |
| data_.data() + footer_struct_offset + PuffinFormat::kFooterStructFlagsOffset, | |
| 4); |
| auto payload_size = ReadLittleEndian<int32_t>( | ||
| data_.data() + footer_struct_offset + PuffinFormat::kFooterStructPayloadSizeOffset); | ||
|
|
||
| // Calculate total footer size and validate | ||
| int64_t footer_size = PuffinFormat::kFooterStartMagicLength + | ||
| static_cast<int64_t>(payload_size) + | ||
| PuffinFormat::kFooterStructLength; | ||
| auto footer_offset = file_size - footer_size; | ||
| if (footer_offset < 0) { | ||
| return Invalid("Invalid file: footer size {} exceeds file size {}", footer_size, | ||
| file_size); | ||
| } | ||
|
|
||
| // Validate footer start magic | ||
| ICEBERG_RETURN_UNEXPECTED(CheckMagic(data_, footer_offset)); | ||
|
|
||
| // Check flags for footer compression | ||
| std::array<uint8_t, 4> flags{}; | ||
| std::memcpy( | ||
| flags.data(), | ||
| data_.data() + footer_struct_offset + PuffinFormat::kFooterStructFlagsOffset, 4); | ||
|
|
||
| PuffinCompressionCodec footer_compression = PuffinCompressionCodec::kNone; | ||
| if (IsFlagSet(flags, PuffinFlag::kFooterPayloadCompressed)) { | ||
| footer_compression = PuffinFormat::kDefaultFooterCompressionCodec; | ||
| } | ||
|
|
||
| // Extract footer payload | ||
| auto payload_offset = footer_offset + PuffinFormat::kFooterStartMagicLength; | ||
| std::span<const std::byte> payload_span(data_.data() + payload_offset, payload_size); | ||
| ICEBERG_ASSIGN_OR_RAISE(auto payload_bytes, | ||
| Decompress(footer_compression, payload_span)); |
There was a problem hiding this comment.
payload_size is already validated above:
if (payload_size < 0) {
return Invalid("Invalid file: negative payload size {}", payload_size);
}
|
|
||
| // Extract footer payload | ||
| auto payload_offset = footer_offset + PuffinFormat::kFooterStartMagicLength; | ||
| std::span<const std::byte> payload_span(data_.data() + payload_offset, payload_size); |
There was a problem hiding this comment.
| std::span<const std::byte> payload_span(data_.data() + payload_offset, payload_size); | |
| const std::span<const std::byte> payload_span(data_.data() + payload_offset, payload_size); |
| PuffinCompressionCodec default_codec_; | ||
| std::vector<std::byte> buffer_; | ||
| std::vector<BlobMetadata> written_blobs_metadata_; | ||
| bool header_written_ = false; | ||
| bool finished_ = false; | ||
| std::optional<int64_t> footer_size_; |
There was a problem hiding this comment.
please add one line comment for these varaibles
| : default_codec_(default_codec) {} | ||
|
|
||
| void PuffinWriter::WriteHeader() { | ||
| if (header_written_) return; |
There was a problem hiding this comment.
shouldn't we error out here?
It looks like dangerous to call WriteHeader() multiple time.
There was a problem hiding this comment.
| if (header_written_) return; | |
| if (header_written_) { | |
| return Invalid("Header has already been written."); | |
| } |
| }; | ||
|
|
||
| auto footer_json = ToJsonString(file_metadata); | ||
| auto footer_payload = std::span<const std::byte>( |
There was a problem hiding this comment.
| auto footer_payload = std::span<const std::byte>( | |
| const auto footer_payload = std::span<const std::byte>( |
| buffer_.insert(buffer_.end(), reinterpret_cast<const std::byte*>(magic.data()), | ||
| reinterpret_cast<const std::byte*>(magic.data() + magic.size())); |
There was a problem hiding this comment.
| buffer_.insert(buffer_.end(), reinterpret_cast<const std::byte*>(magic.data()), | |
| reinterpret_cast<const std::byte*>(magic.data() + magic.size())); | |
| buffer_.insert(buffer_.end(), | |
| reinterpret_cast<const std::byte*>(magic.data()), | |
| reinterpret_cast<const std::byte*>(magic.data() + magic.size())); |
| /// \param properties File-level properties to include in the footer. | ||
| /// \return The complete Puffin file as a byte vector, or an error. | ||
| Result<std::vector<std::byte>> Finish( | ||
| std::unordered_map<std::string, std::string> properties = {}); |
There was a problem hiding this comment.
do we really need to copy the whole std::unordered_map<std::string, std::string>
There was a problem hiding this comment.
Why do not set the properties while creating the writer?
There was a problem hiding this comment.
BTW, it seems that we need to extend our FileIO abstraction to support puffin I/O. It looks inefficient to write all blobs into a buffer like this.
wgtmac
left a comment
There was a problem hiding this comment.
Sorry for the long delay. I have some concerns about its API design which would be inefficient in the real world use case. I believe the main issue arises from the lack of streaming support of FileIO. I will put up a PR to address this.
| /// | ||
| /// Parses a Puffin file from an in-memory buffer. Usage: | ||
| /// PuffinReader reader(file_data); | ||
| /// auto metadata = reader.ReadFileMetadata(); |
There was a problem hiding this comment.
This might not be the recommended approach unless error handling has been added. Otherwise users may silently ignore the error and call metadata.value() unconsciously.
| class ICEBERG_EXPORT PuffinReader { | ||
| public: | ||
| /// \brief Construct a reader from file data. | ||
| explicit PuffinReader(std::span<const std::byte> data); |
There was a problem hiding this comment.
The main issue of this ctor is that we lose the capability of seeking into a segment of a puffin file, which is the main use case of v3 deletion vector.
| data_.data() + footer_struct_offset + PuffinFormat::kFooterStructFlagsOffset, 4); | ||
|
|
||
| PuffinCompressionCodec footer_compression = PuffinCompressionCodec::kNone; | ||
| if (IsFlagSet(flags, PuffinFlag::kFooterPayloadCompressed)) { |
There was a problem hiding this comment.
Please validate that all unknown/reserved footer flag bits are unset. Java's PuffinReader rejects unknown flags, and the spec says reserved bits should be 0; silently ignoring them may cause C++ to accept future or invalid files and interpret the footer incorrectly.
|
|
||
| /// \brief Writer for Puffin files. | ||
| /// | ||
| /// Builds a complete Puffin file in memory. Usage: |
There was a problem hiding this comment.
Same as my comment in the reader, I think we should remove comment like this.
|
|
||
| /// \brief Add a blob to be written. | ||
| /// \return The BlobMetadata for the written blob, or an error. | ||
| Result<BlobMetadata> Add(const Blob& blob); |
There was a problem hiding this comment.
Do we really need to return a copy of BlobMetadata here? IMO, users usually do not care about it and it pays to do so.
| public: | ||
| /// \brief Construct a writer with the given default compression codec. | ||
| explicit PuffinWriter( | ||
| PuffinCompressionCodec default_codec = PuffinCompressionCodec::kNone); |
There was a problem hiding this comment.
Do we need to introduce a PuffinWriterOptions to hold all possible (but optional) configs in case of expanding the ctor signature in the future?
| /// \param properties File-level properties to include in the footer. | ||
| /// \return The complete Puffin file as a byte vector, or an error. | ||
| Result<std::vector<std::byte>> Finish( | ||
| std::unordered_map<std::string, std::string> properties = {}); |
There was a problem hiding this comment.
Why do not set the properties while creating the writer?
| /// \param properties File-level properties to include in the footer. | ||
| /// \return The complete Puffin file as a byte vector, or an error. | ||
| Result<std::vector<std::byte>> Finish( | ||
| std::unordered_map<std::string, std::string> properties = {}); |
There was a problem hiding this comment.
BTW, it seems that we need to extend our FileIO abstraction to support puffin I/O. It looks inefficient to write all blobs into a buffer like this.