Scope check
Due diligence
What problem does this solve?
Currently, RubyLLM only supports text-based embeddings across all providers. However, Google's VertexAI offers multimodal embedding capabilities through its `multimodalembedding` model, which can generate embeddings for:
- Images (for visual search, similarity matching)
- Videos (for video content analysis)
- Combined text + image/video (for rich semantic understanding)
Users who need to:
- Build visual search systems
- Compare image/video similarity
- Create multimodal RAG (Retrieval-Augmented Generation) systems
- Generate embeddings for mixed media content
...are currently unable to leverage these capabilities through RubyLLM.
Proposed solution
Extend the existing `Embedding.embed` API to accept optional `image:` and `video:` parameters:
```ruby
# Text + Image
RubyLLM.embed(
  "A red sports car",
  image: File.read('car.jpg'),
  model: 'multimodalembedding',
  provider: :vertexai
)

# Video with GCS URI
RubyLLM.embed(
  "Product demo video",
  video: 'gs://my-bucket/demo.mp4',
  model: 'multimodalembedding',
  provider: :vertexai
)

# Image-only (no text required)
RubyLLM.embed(
  image: image_data,
  model: 'multimodalembedding',
  provider: :vertexai
)

# Text only
RubyLLM.embed(
  "A blue sports car",
  model: 'multimodalembedding',
  provider: :vertexai
)
```
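Once text and image embeddings live in the same vector space (as the `multimodalembedding` model provides), cross-modal search reduces to comparing vectors. A minimal sketch of that comparison, assuming the call above returns plain arrays of floats of equal dimensionality:

```ruby
# Cosine similarity between two embedding vectors.
# Higher values mean the text and the image are semantically closer.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  dot / (Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x }))
end
```

For visual search, a caller would embed the text query once and rank a corpus of pre-computed image embeddings by this score.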
Implementation approach:
- Add `image:` and `video:` parameters to the `Provider#embed` method
- Implement multimodal payload rendering in the `VertexAI::Embeddings` module:
  - Support base64-encoded image data
  - Support video as base64 or GCS URIs (`gs://...`)
  - Handle optional text for pure image/video embeddings
- Standardize the `render_embedding_payload` signature across all providers
- Return structured embeddings: `{ text: [...], image: [...], video: [...] }`
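The payload-rendering step above could look roughly like this. This is a hedged sketch, not the actual implementation: the module/method names mirror the proposal, and the instance field names (`text`, `image`, `video`, `bytesBase64Encoded`, `gcsUri`) follow VertexAI's multimodal embedding request format as I understand it:

```ruby
require 'base64'

module VertexAI
  module Embeddings
    module_function

    # Builds the request body for one multimodal embedding instance.
    # At least one of text/image/video must be supplied.
    def render_embedding_payload(text: nil, image: nil, video: nil)
      instance = {}
      instance[:text]  = text              if text
      instance[:image] = media_part(image) if image
      instance[:video] = media_part(video) if video
      raise ArgumentError, 'text, image, or video required' if instance.empty?
      { instances: [instance] }
    end

    # GCS URIs pass through untouched; raw bytes are base64-encoded.
    def media_part(data)
      if data.is_a?(String) && data.start_with?('gs://')
        { gcsUri: data }
      else
        { bytesBase64Encoded: Base64.strict_encode64(data) }
      end
    end
  end
end
```

Keeping the GCS-vs-bytes branching in one helper means images and videos are handled uniformly, and the standardized `render_embedding_payload` signature stays trivial for text-only providers to implement.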
Why this belongs in RubyLLM
- Feature parity: VertexAI already supports this; RubyLLM should expose it
- Unified API: users expect all provider features through one interface
- Real demand: visual search, RAG with images, and content moderation all need this

I have a working implementation ready to submit as a PR if there's interest!
Thanks for the review 😊