feat(sql): copy through blob v2 columns in Lance writes #626

Open: geruh wants to merge 1 commit into lance-format:main from geruh:late

geruh (Collaborator) commented Jun 13, 2026

Blob v2 columns read as descriptor structs but writes expect BINARY. Before this, INSERT INTO tgt SELECT blob FROM src would hit the write validator and error out.

This PR adds a resolution rule that rewrites direct blob selects into copy tokens. The write path resolves them via takeBlobs and actually copies the bytes.

What works

Direct blob copies, including joins (both sides, self-join, fan-out):

INSERT INTO out SELECT a.id, a.data, b.data
FROM docs_a a JOIN docs_b b ON a.id = b.id;

Plain single-table copy:

INSERT INTO documents_copy SELECT id, content FROM documents;

CTAS / RTAS (picks up file_format_version = '2.2' when the query has blob v2 columns):

CREATE TABLE documents_copy USING lance
AS SELECT id, content FROM documents;

Also: writeTo().append(), .overwrite(), path writes, cross-catalog copies when source creds differ.
Filters, limits, and windows are fine too, as long as each blob column is a direct reference or an alias (SELECT data, SELECT data AS x). Blob columns that go through CASE, GROUP BY, UNION, or ORDER BY data are not supported.
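
To make the "direct ref or alias" condition concrete, here is a minimal Python sketch of that kind of check over a toy expression tree. The `Col`, `Alias`, and `Call` classes and the `is_direct_blob_ref` helper are hypothetical names for illustration only; they are not the PR's actual Catalyst code.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Col:
    """A plain column reference, e.g. `data`."""
    name: str


@dataclass(frozen=True)
class Alias:
    """A renamed expression, e.g. `data AS x`."""
    child: "Expr"
    name: str


@dataclass(frozen=True)
class Call:
    """Any computed expression (CASE, function call, ...)."""
    fn: str
    args: tuple


Expr = Union[Col, Alias, Call]


def is_direct_blob_ref(expr: Expr, blob_cols: set) -> bool:
    """True if expr is a blob column, optionally wrapped in aliases only."""
    # Peel off any number of aliases; anything else disqualifies the column.
    while isinstance(expr, Alias):
        expr = expr.child
    return isinstance(expr, Col) and expr.name in blob_cols


blobs = {"data"}
direct = is_direct_blob_ref(Col("data"), blobs)
aliased = is_direct_blob_ref(Alias(Col("data"), "x"), blobs)
computed = is_direct_blob_ref(Call("case_when", (Col("data"),)), blobs)
```

In this toy model `SELECT data` and `SELECT data AS x` qualify, while a blob column wrapped in any computed expression does not.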

Logic

This PR adds 2 new rules:

  • LanceBlobV2CopyThroughRule (resolution): finds blob v2 columns in the write query, swaps them for LanceBlobV2CopyRef tokens bound to _rowaddr + source URI. On joins, each side gets its own token tied to that relation's row address.
  • LanceBlobSourceContextRule (optimizer): stashes source dataset creds/version on write options so tasks can reopen the source for takeBlobs. CTAS gets its context injected in the copy rule, since the optimizer can't see the CTAS query.
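
The per-relation token binding described in the first rule can be sketched as below. This is a simplified Python model under stated assumptions: `CopyRef`, `rewrite_projection`, and the example URIs are hypothetical stand-ins, not the real LanceBlobV2CopyRef class or rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CopyRef:
    """Hypothetical stand-in for a copy token: where to fetch the bytes."""
    source_uri: str    # dataset the blob lives in
    rowaddr_col: str   # row-address column of the relation it came from
    column: str        # blob column name in the source


def rewrite_projection(projection, relations, blob_cols):
    """Swap blob column refs for CopyRef tokens; leave other columns as-is.

    projection: list of (relation_alias, column_name)
    relations:  dict relation_alias -> source URI
    blob_cols:  set of (source URI, column_name) known to be blob v2
    """
    out = []
    for rel, col in projection:
        uri = relations[rel]
        if (uri, col) in blob_cols:
            # Each side of a join gets a token tied to its own row address.
            out.append(CopyRef(uri, f"{rel}._rowaddr", col))
        else:
            out.append((rel, col))
    return out


# Made-up URIs for the docs_a / docs_b join from the example above.
relations = {"a": "file:///tmp/docs_a.lance", "b": "file:///tmp/docs_b.lance"}
blob_cols = {("file:///tmp/docs_a.lance", "data"),
             ("file:///tmp/docs_b.lance", "data")}
rewritten = rewrite_projection(
    [("a", "id"), ("a", "data"), ("b", "data")], relations, blob_cols
)
```

In a self-join both aliases map to the same URI, but the tokens still differ in `rowaddr_col`, which keeps fan-out rows pointing at the right source row.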

The copy rule stands down when the same URI is read twice with different identities (time travel in a subquery, etc.). Otherwise the per-URI context map would lie about which snapshot you're copying from.
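
A rough sketch of that guard, assuming each read can be summarized as a (URI, identity) pair where the identity is something like a dataset version. `conflicting_uris` and `should_copy_through` are illustrative names, not functions from this PR.

```python
from collections import defaultdict


def conflicting_uris(reads):
    """Return the URIs that are read under more than one identity.

    reads: iterable of (uri, identity) pairs; identity is any hashable
    summary of how the source was opened (e.g. a version number).
    """
    seen = defaultdict(set)
    for uri, identity in reads:
        seen[uri].add(identity)
    return {uri for uri, ids in seen.items() if len(ids) > 1}


def should_copy_through(reads):
    """Copy-through is only safe if every URI maps to a single identity."""
    return not conflicting_uris(reads)


# Same table read at its latest version and, in a subquery, at version 3.
mixed = [("file:///tmp/docs.lance", "latest"), ("file:///tmp/docs.lance", 3)]
# A self-join at a single version is fine.
consistent = [("file:///tmp/docs.lance", 7), ("file:///tmp/docs.lance", 7)]
```

A map keyed only by URI can hold one context per dataset, so two identities for one URI would force one of them to be silently dropped; bailing out avoids that.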

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
github-actions bot added the enhancement (New feature or request) label Jun 13, 2026