[DataLoader] Add branch support by robreeves · Pull Request #486 · linkedin/openhouse

robreeves · 2026-03-03T23:45:42Z

Summary

This adds table branch support to the data loader.

I have not added e2e integration tests. Branch writes require Spark 3.5, but the Docker images use Spark 3.1. Upgrading them would significantly increase the scope of this PR, and I did not want to block the feature. The unit tests cover the branch logic.

Changes

Testing Done

Manually tested on local Docker setup. Please include the commands run and their output.
Added new tests for the changes made.
Updated existing tests to reflect the changes made.
No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
Some other form of testing like staging or soak time in production. Please explain.

Added unit tests

Additional Information

Breaking Changes
Deprecations
Large PR broken into smaller PRs, and PR plan linked in the description.

No breaking changes. The branch parameter already existed on the public API — this makes it functional.

The branch parameter was accepted but silently ignored, so reads always hit the main branch. This wires it through via scan().use_ref(branch) so that branch-based reads work. It also resolves the snapshot_id property to the branch tip and validates that branch and snapshot_id are mutually exclusive.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The snapshot_id property already resolves the branch tip, so pass the resolved ID to scan() directly. Removes the use_ref() call and the branch guard on scan kwargs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
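The resolution described in the two commit messages above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `resolve_snapshot_id` is a hypothetical helper, and the only library behavior assumed is PyIceberg's `Table.snapshot_by_name()` (which returns `None` for an unknown ref) and the `snapshot_id` keyword of `Table.scan()`. A `MagicMock` stands in for the table so the sketch runs without a catalog.

```python
from typing import Optional
from unittest.mock import MagicMock


def resolve_snapshot_id(table, branch: Optional[str] = None,
                        snapshot_id: Optional[int] = None) -> Optional[int]:
    """Hypothetical helper: pick the snapshot ID a scan should read."""
    if branch is not None and snapshot_id is not None:
        raise ValueError("branch and snapshot_id are mutually exclusive")
    if branch is None:
        # No branch requested: use the explicit ID, or None for the default scan.
        return snapshot_id
    snapshot = table.snapshot_by_name(branch)
    if snapshot is None:
        raise ValueError(f"Branch {branch!r} not found")
    return snapshot.snapshot_id


# Stand-in table whose only branch, "audit", points at snapshot 42.
table = MagicMock()
table.snapshot_by_name.side_effect = (
    lambda name: MagicMock(snapshot_id=42) if name == "audit" else None
)

resolved = resolve_snapshot_id(table, branch="audit")
table.scan(snapshot_id=resolved)  # the resolved ID goes straight to scan()
```

Passing the resolved ID to `scan()` keeps a single code path for branch reads and explicit snapshot reads, which appears to be the motivation for dropping `use_ref()`.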

Instead of asserting on scan kwargs, set up two parquet files with different data and verify the loader returns data from the branch's snapshot.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Test that different branches return data from their respective snapshots, that a branch with multiple snapshots in its lineage reads from the latest, and that a missing branch raises ValueError.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Use side_effect lambdas so snapshot_by_name only returns a snapshot for the specific branch name, returning None for anything else.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
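As a rough illustration of this mock tightening (a sketch using only `unittest.mock`, with made-up branch names and snapshot IDs rather than the PR's actual test code): a `return_value` makes the mock answer for any name, while a `side_effect` lambda answers only for the exact branch.

```python
from unittest.mock import MagicMock

# Loose mock: every lookup succeeds, so a wrong branch name goes unnoticed.
loose_table = MagicMock()
loose_table.snapshot_by_name.return_value = MagicMock(snapshot_id=7)

# Tight mock: only the exact name "feature" resolves; anything else is None.
branch_snapshot = MagicMock(snapshot_id=7)
tight_table = MagicMock()
tight_table.snapshot_by_name.side_effect = (
    lambda name: branch_snapshot if name == "feature" else None
)

loose_hit = loose_table.snapshot_by_name("no-such-branch")
tight_hit = tight_table.snapshot_by_name("feature")
tight_miss = tight_table.snapshot_by_name("no-such-branch")
```

With the tight mock, a test passes only if the code under test asks for the branch name it was actually given.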

…on tests

Write to branches via Spark SQL (INSERT INTO table.branch_name) instead of directly manipulating Iceberg metadata files. This matches how users actually create and populate branches.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The Docker Compose setup uses Spark 3.1, which does not support branch-qualified writes or ALTER TABLE CREATE BRANCH. Branch behavior is covered by unit tests. Integration tests can be added when the Docker setup moves to Spark 3.5.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
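For reference, the two kinds of statement mentioned here might look like the following. This is only a sketch: the helper functions, table name, and branch name are hypothetical, and it assumes the `branch_<name>` identifier form and the `CREATE BRANCH` DDL described in the Iceberg Spark documentation. It builds the SQL strings without running them, since running them needs a Spark session with Iceberg support.

```python
def create_branch_sql(table: str, branch: str) -> str:
    """DDL that creates a branch on an Iceberg table."""
    return f"ALTER TABLE {table} CREATE BRANCH `{branch}`"


def insert_into_branch_sql(table: str, branch: str, values_sql: str) -> str:
    """Branch-qualified write, addressing the branch as <table>.branch_<name>."""
    return f"INSERT INTO {table}.branch_{branch} VALUES {values_sql}"


# Hypothetical table and branch names, for illustration only.
TABLE = "my_catalog.db.events"
statements = [
    create_branch_sql(TABLE, "audit"),
    insert_into_branch_sql(TABLE, "audit", "(1, 'a')"),
]
# With a Spark session that supports these statements, each one would be
# passed to spark.sql(stmt).
```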

Copilot

Pull request overview

Adds functional Iceberg table branch support to the Python DataLoader by resolving a branch name to a snapshot and reading via that snapshot, with accompanying unit tests.

Changes:

Rejects using branch and snapshot_id together in OpenHouseDataLoader.
Resolves snapshot_id from branch via Table.snapshot_by_name() and uses it for scans.
Extends unit tests to cover branch resolution and branch-based reads; updates the Parquet test helper to support multiple filenames.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File	Description
integrations/python/dataloader/src/openhouse/dataloader/data_loader.py	Adds branch/snapshot_id validation and resolves branch to snapshot ID for scanning.
integrations/python/dataloader/tests/test_data_loader.py	Adds branch-focused unit tests and updates Parquet test helper to support multiple files.


integrations/python/dataloader/src/openhouse/dataloader/data_loader.py

integrations/python/dataloader/tests/test_data_loader.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The previous simplification lost the assertion that branch and non-branch reads return different data. Restore two parquet files with snapshot-based routing and assert both cases.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Use mocks to verify that branch and non-branch loaders yield splits backed by different files, without needing real parquet data or exercising DataLoaderSplit read logic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
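One way such a mock-based check could be set up is sketched below. It is not the PR's test: `make_scan` and `planned_files` are hypothetical helpers, the snapshot ID and file names are invented, and the only shape assumed from PyIceberg is that `scan(...).plan_files()` yields tasks exposing `.file.file_path`. The mocked `scan()` routes on its `snapshot_id` keyword, so the branch and non-branch reads see different files.

```python
from unittest.mock import MagicMock


def make_scan(file_path: str) -> MagicMock:
    """Build a fake scan whose plan_files() yields one task for file_path."""
    task = MagicMock()
    task.file.file_path = file_path
    scan = MagicMock()
    scan.plan_files.return_value = [task]
    return scan


# Hypothetical routing: no snapshot ID means main; ID 2 is the branch tip.
scans_by_snapshot = {None: make_scan("main.parquet"), 2: make_scan("branch.parquet")}

table = MagicMock()
table.scan.side_effect = lambda snapshot_id=None, **kwargs: scans_by_snapshot[snapshot_id]


def planned_files(tbl, snapshot_id=None):
    """Hypothetical stand-in for the loader: list the files a scan would read."""
    return [t.file.file_path for t in tbl.scan(snapshot_id=snapshot_id).plan_files()]


main_files = planned_files(table)
branch_files = planned_files(table, snapshot_id=2)
```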

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.


integrations/python/dataloader/src/openhouse/dataloader/data_loader.py

robreeves changed the title ~~Wire branch support through OpenHouseDataLoader~~ [Feature][DataLoader] Add branch support Mar 3, 2026

robreeves and others added 5 commits March 10, 2026 11:54

Simplify branch scan: resolve via snapshot_id property

a58eb7e


Replace mock-inspection branch test with behavioral test

ae4a66d


Tighten branch test mocks to match on exact branch name

5287a2d


robreeves force-pushed the dl_branch branch from 133dfd9 to 5287a2d March 10, 2026 19:00

robreeves and others added 3 commits March 10, 2026 12:04

Fix ruff line-too-long in branch test mock setup

93724a8

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

robreeves changed the title ~~[Feature][DataLoader] Add branch support~~ [DataLoader] Add branch support Mar 10, 2026

robreeves marked this pull request as ready for review March 10, 2026 23:31

robreeves requested a review from Copilot March 10, 2026 23:31

Copilot started reviewing on behalf of robreeves March 10, 2026 23:31

Copilot AI reviewed Mar 10, 2026

robreeves and others added 5 commits March 10, 2026 16:51

Apply suggestion from @Copilot

25c48ab

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Apply suggestion from @Copilot

63c08ed

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Simplify branch test by reusing _make_real_catalog

5715ecf

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

robreeves requested review from ShreyeshArangath, cbb330, Copilot and sumedhsakdeo March 11, 2026 20:01

Copilot started reviewing on behalf of robreeves March 11, 2026 20:02

Copilot AI reviewed Mar 11, 2026

integrations/python/dataloader/src/openhouse/dataloader/data_loader.py

robreeves wants to merge 13 commits into linkedin:main from robreeves:dl_branch


2 participants
