Skip to content

[DataLoader] Add branch support#486

Open
robreeves wants to merge 13 commits intolinkedin:mainfrom
robreeves:dl_branch
Open

[DataLoader] Add branch support#486
robreeves wants to merge 13 commits intolinkedin:mainfrom
robreeves:dl_branch

Conversation

@robreeves
Copy link
Collaborator

@robreeves robreeves commented Mar 3, 2026

Summary

This adds table branch support to data loader.

I have not added e2e integration tests. Branched writes requires Spark 3.5, but the images use 3.1. Updating this is a significant scope increase for this PR and I did not want to block the feature. The unit tests also exercise the branch logic.

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

Testing Done

  • Manually Tested on local docker setup. Please include commands ran, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

Added unit tests

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

No breaking changes. The branch parameter already existed on the public API — this makes it functional.

@robreeves robreeves changed the title Wire branch support through OpenHouseDataLoader [Feature][DataLoader] Add branch support Mar 3, 2026
robreeves and others added 5 commits March 10, 2026 11:54
The branch parameter was accepted but silently ignored — reads always
hit the main branch. This wires it through via scan().use_ref(branch)
so branch-based reads work. Also resolves the snapshot_id property to
the branch tip and validates that branch and snapshot_id are mutually
exclusive.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The snapshot_id property already resolves the branch tip, so pass the
resolved ID to scan() directly. Removes the use_ref() call and the
branch guard on scan kwargs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of asserting on scan kwargs, set up two parquet files with
different data and verify the loader returns data from the branch's
snapshot.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Test that different branches return data from their respective
snapshots, that a branch with multiple snapshots in its lineage
reads from the latest, and that a missing branch raises ValueError.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use side_effect lambdas so snapshot_by_name only returns a snapshot
for the specific branch name, returning None for anything else.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
robreeves and others added 3 commits March 10, 2026 12:04
…on tests

Write to branches via Spark SQL (INSERT INTO table.branch_name) instead of
directly manipulating Iceberg metadata files. This matches how users actually
create and populate branches.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Docker Compose setup uses Spark 3.1, which does not support
branch-qualified writes or ALTER TABLE CREATE BRANCH. Branch behavior
is covered by unit tests. Integration tests can be added when the
Docker setup moves to Spark 3.5.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@robreeves robreeves changed the title [Feature][DataLoader] Add branch support [DataLoader] Add branch support Mar 10, 2026
@robreeves robreeves marked this pull request as ready for review March 10, 2026 23:31
@robreeves robreeves requested a review from Copilot March 10, 2026 23:31
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds functional Iceberg table branch support to the Python DataLoader by resolving a branch name to a snapshot and reading via that snapshot, with accompanying unit tests.

Changes:

  • Rejects using branch and snapshot_id together in OpenHouseDataLoader.
  • Resolves snapshot_id from branch via Table.snapshot_by_name() and uses it for scans.
  • Extends unit tests to cover branch resolution and branch-based reads; updates the Parquet test helper to support multiple filenames.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
integrations/python/dataloader/src/openhouse/dataloader/data_loader.py Adds branch/snapshot_id validation and resolves branch to snapshot ID for scanning.
integrations/python/dataloader/tests/test_data_loader.py Adds branch-focused unit tests and updates Parquet test helper to support multiple files.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

robreeves and others added 5 commits March 10, 2026 16:51
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous simplification lost the assertion that branch and non-branch
reads return different data. Restore two parquet files with snapshot-based
routing and assert both cases.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use mocks to verify that branch and non-branch loaders yield splits
backed by different files, without needing real parquet data or
exercising DataLoaderSplit read logic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants