feat(sql): copy through blob v2 columns in Lance writes #626
Open
geruh wants to merge 1 commit into
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Blob v2 columns read as descriptor structs, but writes expect BINARY. Before this, INSERT INTO tgt SELECT blob FROM src would hit the write validator and error out.

This PR adds a resolution rule that rewrites direct blob selects into copy tokens. The write path resolves them via takeBlobs and actually copies the bytes.

What works
Direct blob copies, including joins (both sides, self-join, fan-out):
Plain single-table copy:
CTAS / RTAS (picks up file_format_version = '2.2' when the query has blob v2 columns):
Also:
writeTo().append(), .overwrite(), path writes, cross-catalog copies when source creds differ.

Filters, limits, windows are fine too, as long as each blob column is a direct ref or alias (SELECT data, SELECT data AS x). Not CASE, not GROUP BY, not UNION, not ORDER BY data.

Logic
This PR adds 2 new rules:
LanceBlobV2CopyThroughRule (resolution): finds blob v2 columns in the write query, swaps them for LanceBlobV2CopyRef tokens bound to _rowaddr + source URI. On joins, each side gets its own token tied to that relation's row address.

LanceBlobSourceContextRule (optimizer): stashes source dataset creds/version on write options so tasks can reopen the source for takeBlobs. CTAS gets context injected in the copy rule since the optimizer can't see the CTAS query.

Copy-through stands down when the same URI is read twice with different identities (time travel in a subquery, etc.). Otherwise the per-URI context map lies about which snapshot you're copying from.
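The stand-down condition above can be modeled in a few lines. This is only an illustrative sketch in Python, not the code in this PR: SourceIdentity, build_context_map, the creds_key field, and the example URIs are all hypothetical names made up for the illustration. It assumes a read's identity is its URI plus a version and a credentials key, and that the context map holds exactly one identity per URI.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class SourceIdentity:
    """Hypothetical identity of one read of a source dataset."""
    uri: str
    version: Optional[int]  # None means "latest"; a pinned version models time travel
    creds_key: str          # stand-in for whatever credentials the read used


def build_context_map(
    reads: Iterable[SourceIdentity],
) -> Optional[Dict[str, SourceIdentity]]:
    """Return one identity per URI, or None if copy-through should stand down."""
    by_uri: Dict[str, SourceIdentity] = {}
    for read in reads:
        # Keep the first identity seen for this URI, then compare later reads to it.
        seen = by_uri.setdefault(read.uri, read)
        if seen != read:
            # Same URI, different snapshot or creds: a per-URI map cannot say
            # which one a given copy token refers to, so give up.
            return None
    return by_uri


# Illustrative inputs only; the URIs are made up.
latest = SourceIdentity("file:///tmp/src.lance", None, "default")
pinned = SourceIdentity("file:///tmp/src.lance", 3, "default")
other = SourceIdentity("file:///tmp/other.lance", None, "default")

self_join_context = build_context_map([latest, latest])    # one entry, safe
two_sources_context = build_context_map([latest, other])   # two entries, safe
time_travel_context = build_context_map([latest, pinned])  # None, stand down
```

Keying the map by URI alone keeps the write options small, which is why a conflicting second read has to disable the copy path rather than be disambiguated.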