The branch parameter was accepted but silently ignored — reads always hit the main branch. This wires it through via scan().use_ref(branch) so branch-based reads work. Also resolves the snapshot_id property to the branch tip and validates that branch and snapshot_id are mutually exclusive. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The snapshot_id property already resolves the branch tip, so pass the resolved ID to scan() directly. Removes the use_ref() call and the branch guard on scan kwargs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
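The resolution described in the commits above can be sketched as follows. This is a minimal illustration assuming a PyIceberg-style `Table.snapshot_by_name()` that returns `None` for an unknown branch; the helper name is hypothetical, not the actual loader code.

```python
# Minimal sketch of branch-to-snapshot resolution (assumed API shape;
# not the actual OpenHouseDataLoader implementation).
def resolve_scan_snapshot_id(table, branch=None, snapshot_id=None):
    """Return the snapshot ID to pass to table.scan()."""
    if branch is not None and snapshot_id is not None:
        # The two selectors are mutually exclusive.
        raise ValueError("Specify either branch or snapshot_id, not both")
    if branch is not None:
        snapshot = table.snapshot_by_name(branch)  # branch tip, or None
        if snapshot is None:
            raise ValueError(f"Branch {branch!r} does not exist")
        return snapshot.snapshot_id
    return snapshot_id  # may be None: scan the current table state
```

With the snapshot resolved up front, the scan call stays uniform, e.g. `table.scan(snapshot_id=...)`, with no `use_ref()` call and no branch-specific scan kwargs.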
Instead of asserting on scan kwargs, set up two parquet files with different data and verify the loader returns data from the branch's snapshot. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Test that different branches return data from their respective snapshots, that a branch with multiple snapshots in its lineage reads from the latest, and that a missing branch raises ValueError. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use side_effect lambdas so snapshot_by_name only returns a snapshot for the specific branch name, returning None for anything else. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
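The `side_effect` pattern described above looks roughly like this; the branch name and snapshot shape are illustrative, not taken from the test file.

```python
from types import SimpleNamespace
from unittest.mock import MagicMock

mock_table = MagicMock()
branch_tip = SimpleNamespace(snapshot_id=42)

# Resolve only the expected branch name; any other name yields None,
# mirroring Table.snapshot_by_name() for a missing branch.
mock_table.snapshot_by_name.side_effect = (
    lambda name: branch_tip if name == "my_branch" else None
)

assert mock_table.snapshot_by_name("my_branch").snapshot_id == 42
assert mock_table.snapshot_by_name("other_branch") is None
```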
…on tests Write to branches via Spark SQL (INSERT INTO table.branch_name) instead of directly manipulating Iceberg metadata files. This matches how users actually create and populate branches. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Docker Compose setup uses Spark 3.1, which does not support branch-qualified writes or ALTER TABLE CREATE BRANCH. Branch behavior is covered by unit tests. Integration tests can be added when the Docker setup moves to Spark 3.5. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
Adds functional Iceberg table branch support to the Python DataLoader by resolving a branch name to a snapshot and reading via that snapshot, with accompanying unit tests.
Changes:
- Rejects using `branch` and `snapshot_id` together in `OpenHouseDataLoader`.
- Resolves `snapshot_id` from `branch` via `Table.snapshot_by_name()` and uses it for scans.
- Extends unit tests to cover branch resolution and branch-based reads; updates the Parquet test helper to support multiple filenames.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| integrations/python/dataloader/src/openhouse/dataloader/data_loader.py | Adds branch/snapshot_id validation and resolves branch to snapshot ID for scanning. |
| integrations/python/dataloader/tests/test_data_loader.py | Adds branch-focused unit tests and updates Parquet test helper to support multiple files. |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous simplification lost the assertion that branch and non-branch reads return different data. Restore two parquet files with snapshot-based routing and assert both cases. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use mocks to verify that branch and non-branch loaders yield splits backed by different files, without needing real parquet data or exercising DataLoaderSplit read logic. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
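A hedged sketch of that mock-based check follows; the method and attribute names (`plan_splits`, `file_path`) are assumptions for illustration, not the loader's real interface.

```python
from unittest.mock import MagicMock

# Stand-in loaders whose splits point at different backing files.
main_loader = MagicMock()
main_loader.plan_splits.return_value = [
    MagicMock(file_path="/warehouse/data/main-00000.parquet")
]
branch_loader = MagicMock()
branch_loader.plan_splits.return_value = [
    MagicMock(file_path="/warehouse/data/branch-00000.parquet")
]

# Assert the routing without reading any real parquet data or
# exercising split read logic.
main_files = {s.file_path for s in main_loader.plan_splits()}
branch_files = {s.file_path for s in branch_loader.plan_splits()}
assert main_files.isdisjoint(branch_files)
```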
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
Summary
This adds table branch support to the data loader.
I have not added e2e integration tests. Branch-qualified writes require Spark 3.5, but the Docker images use 3.1; updating them would be a significant scope increase for this PR, and I did not want to block the feature. The unit tests also exercise the branch logic.
Changes
Testing Done
Added unit tests
Additional Information
No breaking changes. The `branch` parameter already existed on the public API; this change makes it functional.