feat(datafusion-ballista) Add Initial Integration for Ballista Datafusion #2613
Open
NoahKusaba wants to merge 7 commits into
Contributor
Author
Still need to put more time into this, but it's in a state where it's ready for feedback.
blackmwk
requested changes
Jun 15, 2026
blackmwk
left a comment
Contributor
Thanks @NoahKusaba for this PR, but I don't think maintaining such a huge integration in this repo is the right direction. Review resources are currently quite limited, and most committers are not familiar with Ballista. I think it would be better to put it in the Ballista repo.
Which issue does this PR close?
What changes are included in this PR?
Adds a new iceberg-ballista crate, which provides a distributed-query driver for Apache Iceberg on top of the distributed DataFusion engine Apache DataFusion Ballista, plus the targeted changes to iceberg-datafusion that make Iceberg's existing plan nodes serializable so they can cross node boundaries.
The core problem it solves
Iceberg's DataFusion integration already produces complete physical read and write plans, but every Iceberg plan node holds live, non-serializable state (Arc, an open Table/FileIO). Ballista ships logical and physical plans to remote schedulers and executors, so those nodes couldn't travel. This branch closes that gap with one consistent idea: serialize a minimal, self-contained recipe (IcebergCatalogConfig + identifiers), then rebuild the live objects on the receiving node.
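The "serialize a recipe, rebuild on arrival" idea can be illustrated with a minimal, std-only sketch. Everything here is an assumption for illustration: `TableRecipe` and `LiveTable` are hypothetical stand-ins (not the PR's `IcebergCatalogConfig` or Iceberg's `Table`), and the length-prefixed byte layout is invented rather than the PR's wire format.

```rust
/// Hypothetical stand-in for the serializable recipe
/// (the PR's IcebergCatalogConfig + identifiers play this role).
#[derive(Debug, Clone, PartialEq)]
pub struct TableRecipe {
    pub catalog_uri: String,
    pub namespace: String,
    pub table: String,
}

/// Hypothetical stand-in for a live, non-serializable table handle.
pub struct LiveTable {
    pub recipe: TableRecipe,
    /// Placeholder for an open client / FileIO; never sent over the wire.
    pub connection: String,
}

impl TableRecipe {
    /// Encode each field as a big-endian u32 length followed by UTF-8 bytes.
    pub fn encode(&self) -> Vec<u8> {
        let mut buf = Vec::new();
        for field in [&self.catalog_uri, &self.namespace, &self.table] {
            buf.extend_from_slice(&(field.len() as u32).to_be_bytes());
            buf.extend_from_slice(field.as_bytes());
        }
        buf
    }

    /// Decode exactly three length-prefixed fields; reject truncated or trailing input.
    pub fn decode(mut bytes: &[u8]) -> Result<Self, String> {
        let mut fields = Vec::with_capacity(3);
        for _ in 0..3 {
            if bytes.len() < 4 {
                return Err("truncated length prefix".to_string());
            }
            let (len_bytes, rest) = bytes.split_at(4);
            let len = u32::from_be_bytes([len_bytes[0], len_bytes[1], len_bytes[2], len_bytes[3]])
                as usize;
            if rest.len() < len {
                return Err("truncated field".to_string());
            }
            let (field, rest) = rest.split_at(len);
            fields.push(String::from_utf8(field.to_vec()).map_err(|e| e.to_string())?);
            bytes = rest;
        }
        if !bytes.is_empty() {
            return Err("trailing bytes after recipe".to_string());
        }
        let table = fields.pop().unwrap();
        let namespace = fields.pop().unwrap();
        let catalog_uri = fields.pop().unwrap();
        Ok(TableRecipe { catalog_uri, namespace, table })
    }

    /// Rebuild the live object on the receiving node from the recipe alone.
    pub fn rebuild(&self) -> LiveTable {
        LiveTable {
            recipe: self.clone(),
            connection: format!("connected:{}", self.catalog_uri),
        }
    }
}
```

The point of the sketch is that only the recipe crosses the wire; the receiving side calls something like `TableRecipe::decode(&bytes)?.rebuild()` to obtain its own live handle.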
- IcebergLogicalCodec: serializes the catalog-backed table provider (config + table ident, plus snapshot/metadata variants) so the scheduler can rebuild it and do physical planning, including INSERT.
- IcebergPhysicalCodec: serializes the four Iceberg execution nodes (IcebergTableScan, IcebergWriteExec, IcebergCommitExec, IcebergMetadataScan) and the PartitionExpr physical expression.
- Tagged-envelope wire framing (TAG_ICEBERG / TAG_DELEGATED): every blob carries a leading tag. Non-Iceberg nodes are delegated to Ballista's own codec, so shuffles, sorts, etc. keep working, and an unknown tag is a hard error. Based on comments from https://github.com/milenkovicm/ballista_delta
- bridge.rs runtime bridge: each executor node needs to build an HTTP client for the Iceberg catalog, which requires an async call, but try_decode on datafusion_proto's PhysicalExtensionCodec is synchronous. The block_on function is a workaround that makes this async call blocking. Each try_decode also performs a load_table catalog round-trip per plan node to resolve the table's current metadata pointer.
- Public API: register_iceberg_codecs(SessionConfig) and register_iceberg_table.
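As a rough illustration of the tagged-envelope framing described above, here is a std-only sketch. The names `TAG_ICEBERG` and `TAG_DELEGATED` come from the description, but their numeric values, the `Envelope` enum, and the `wrap` / `unwrap_envelope` helpers are assumptions made for this example, not the PR's actual code.

```rust
/// Assumed tag values; the PR names these constants but does not show their values here.
pub const TAG_ICEBERG: u8 = 1;
pub const TAG_DELEGATED: u8 = 2;

/// Hypothetical decoded view of a blob: who should handle the payload.
#[derive(Debug, PartialEq)]
pub enum Envelope<'a> {
    /// Payload to be decoded by the Iceberg codec.
    Iceberg(&'a [u8]),
    /// Payload to be handed to the wrapped (e.g. Ballista's own) codec.
    Delegated(&'a [u8]),
}

/// Prefix a payload with its one-byte tag.
pub fn wrap(tag: u8, payload: &[u8]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(payload.len() + 1);
    buf.push(tag);
    buf.extend_from_slice(payload);
    buf
}

/// Split off the leading tag and route the payload; unknown or missing tags are errors.
pub fn unwrap_envelope(bytes: &[u8]) -> Result<Envelope<'_>, String> {
    match bytes.split_first() {
        None => Err("empty blob: missing tag".to_string()),
        Some((&TAG_ICEBERG, payload)) => Ok(Envelope::Iceberg(payload)),
        Some((&TAG_DELEGATED, payload)) => Ok(Envelope::Delegated(payload)),
        Some((tag, _)) => Err(format!("unknown codec tag {}", tag)),
    }
}
```

Failing hard on an unknown tag, rather than falling back to the delegated codec, keeps a corrupted or mismatched blob from being silently misinterpreted as a valid plan node.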
Are these changes tested?
Ballista tests:
Datafusion tests: