fix(volcengine): support multimodal embedding models#4998
Conversation
Volcengine multimodal embedding models (e.g. doubao-embedding-vision-251215) require POST /api/v3/embeddings/multimodal and reject the standard /api/v3/embeddings path with a 400 InvalidParameter error:
"the requested model doubao-embedding-vision-251215 does not support this api"
Detect these models by model-name keywords (case-insensitive: contains "embedding" AND contains "vision" or "multimodal"). Both UpstreamModelName and OriginModelName are checked so model_mapping aliases still work. Pure-text embedding models continue to use the standard endpoint.
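The keyword rule described in this commit message can be sketched as follows. This is a minimal illustration, not the code from the PR: the name isVolcengineMultimodalEmbedding comes from the PR description, but its body here is reconstructed from the prose, and embeddingsPath is a hypothetical helper standing in for the branch inside GetRequestURL.

```go
package main

import (
	"fmt"
	"strings"
)

// isVolcengineMultimodalEmbedding reports whether a model name matches the
// keyword rule from the commit message: it must contain "embedding" and
// either "vision" or "multimodal", compared case-insensitively.
// The body is an assumption reconstructed from the description.
func isVolcengineMultimodalEmbedding(modelName string) bool {
	lower := strings.ToLower(modelName)
	if !strings.Contains(lower, "embedding") {
		return false
	}
	return strings.Contains(lower, "vision") || strings.Contains(lower, "multimodal")
}

// embeddingsPath is a hypothetical helper. It checks both the upstream
// (mapped) name and the origin (requested) name, so an alias on either side
// of model_mapping still selects the multimodal endpoint.
func embeddingsPath(upstreamModelName, originModelName string) string {
	if isVolcengineMultimodalEmbedding(upstreamModelName) ||
		isVolcengineMultimodalEmbedding(originModelName) {
		return "/api/v3/embeddings/multimodal"
	}
	return "/api/v3/embeddings"
}

func main() {
	// An alias that maps to a vision model still takes the multimodal path.
	fmt.Println(embeddingsPath("doubao-embedding-vision-251215", "my-alias"))
	// A pure-text model stays on the standard path.
	fmt.Println(embeddingsPath("doubao-embedding-text-240715", "doubao-embedding-text-240715"))
}
```

Requiring "embedding" in addition to "vision" or "multimodal" keeps chat models with "vision" in their name from matching the rule.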
…ding channels
The follow-up fix for the multimodal endpoint routing exposed a second bug
in the channel test path:
controller/channel-test.go always emits Input=["hello world"] for any
embedding model, but Volcengine's /api/v3/embeddings/multimodal endpoint
rejects flat string arrays with:
"we could not parse the JSON body of your request"
So the dashboard "Test channel" button still failed for vision embedding
models even after routing was corrected.
Detect vision/multimodal embedding models by name keyword (case-insensitive)
and emit the typed-parts shape that matches the multimodal endpoint:
{"input": [{"type": "text", "text": "hello world"}]}
Plain text embedding models (Doubao-embedding, doubao-embedding-text-*,
m3e, bge-*, OpenAI text-embedding-3-*, etc.) keep using the legacy string
array shape.
Walkthrough
Fixes Volcengine multimodal embedding models by adding detection logic for vision/multimodal model names and routing requests to the dedicated multimodal endpoint with the correct typed-parts payload format.
Changes
Volcengine multimodal embedding support
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@controller/channel-test.go`:
- Around line 761-778: The multimodal typed-parts payload is being applied for
any model name containing "vision" or "multimodal" (variable model / lower) but
must be limited to Volcengine channel only; update the conditional that builds
the dto.EmbeddingRequest to also check the channel variable (e.g., channel ==
"volcengine" or whatever canonical channel constant is used in this file) so the
typed-parts input is emitted only for Volcengine multimodal models and not for
other providers.
📒 Files selected for processing (2)
controller/channel-test.go
relay/channel/volcengine/adaptor.go
// Volcengine multimodal embedding endpoint (/api/v3/embeddings/multimodal)
// requires input to be an array of typed parts (e.g. [{type:"text",text:"..."}]),
// not a plain string array. Sending the standard `["hello world"]` shape
// against vision/multimodal models triggers a 400 from the upstream
// "we could not parse the JSON body of your request", so emit the
// multimodal-compatible shape when the model name signals it.
lower := strings.ToLower(model)
if strings.Contains(lower, "vision") || strings.Contains(lower, "multimodal") {
	return &dto.EmbeddingRequest{
		Model: model,
		Input: []any{
			map[string]any{
				"type": "text",
				"text": "hello world",
			},
		},
	}
}
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Search for embedding model definitions that contain vision/multimodal keywords
# across different channel adaptors to assess cross-provider impact
rg -nC3 --type=go 'ModelList.*=.*\[' relay/channel/ | \
rg -i 'embedding.*(vision|multimodal)|(vision|multimodal).*embedding'
Repository: QuantumNous/new-api
Length of output: 45
🌐 Web query:
Do OpenAI, Anthropic, or other major LLM providers support typed-parts input format [{type:"text", text:"..."}] for multimodal embedding requests?
💡 Result:
As of May 20, 2026, support for the specific typed-parts input format [{type: "text", text: "..."},...] varies significantly among major LLM providers [1][2][3].
Google Gemini API
The Gemini API uses a native multimodal structure that supports the requested format for embedding requests. Specifically, the Gemini Embedding 2 model accepts a content structure consisting of parts [3][4]. Users can pass multiple input types (text, images, etc.) within the parts array of a Content object to generate an aggregated embedding [3]: { "model": "models/gemini-embedding-2", "content": { "parts": [ {"text": "Your text here"}, {"inlineData": {"mimeType": "image/jpeg", "data": "..."}} ] } }
OpenAI API
OpenAI's official Embeddings API does not support a multimodal typed-parts format [1][5]. The input parameter for the embeddings endpoint is strictly limited to a string, an array of strings, or an array of token integers [1]. While OpenAI supports multimodal inputs (like images and text) in their Chat Completions API (using the same typed-parts format you referenced), this format is not compatible with their dedicated embedding endpoints [1][5][6].
Anthropic API
Anthropic's Claude API does not currently offer a standalone multimodal embedding endpoint that supports a typed-parts input format [2]. While Anthropic's Messages API uses the [{type: "text", text: "..."}, {type: "image",...}] format for chat interactions [7][6], it does not provide native support for embedding requests in this format [2][8]. Users seeking multimodal embeddings typically rely on third-party models or specialized services like Voyage AI, which also use standard input arrays rather than the typed-parts block format for their embedding endpoints [2].
Citations:
- 1: https://developers.openai.com/api/docs/api-reference/embeddings
- 2: https://console.anthropic.com/docs/en/build-with-claude/embeddings
- 3: https://ai.google.dev/gemini-api/docs/embeddings
- 4: https://pub.towardsai.net/google-just-killed-five-embedding-models-with-one-heres-what-actually-happens-4cda368ae871
- 5: https://developers.openai.com/api/docs/guides/embeddings
- 6: https://platform.claude.com/docs/en/build-with-claude/vision
- 7: https://tessl.io/registry/tessl/pypi-anthropic/files/docs/common-tasks/multimodal-input.md
- 8: Feature Request: Support embeddings as a native input modality in the Messages API anthropics/anthropic-sdk-python#1351
🏁 Script executed:
# Find where buildTestRequest is defined and called
rg -n 'buildTestRequest' --type=go
Repository: QuantumNous/new-api
Length of output: 308
🏁 Script executed:
# Search for embedding model definitions across all channel implementations
fd --type=f --extension=go 'relay/channel' | head -20
Repository: QuantumNous/new-api
Length of output: 421
🏁 Script executed:
# Check specifically for Volcengine embedding model definition
rg -n 'embedding' relay/channel/ --type=go -i | grep -i volcengine
Repository: QuantumNous/new-api
Length of output: 1811
🏁 Script executed:
# See what embedding models are defined per channel type
rg -n 'embedding' relay/channel/ --type=go -i -B2 -A2 | head -100
Repository: QuantumNous/new-api
Length of output: 7110
🏁 Script executed:
# Look at the buildTestRequest function to understand the full context
sed -n '682,810p' controller/channel-test.go
Repository: QuantumNous/new-api
Length of output: 4068
🏁 Script executed:
# Check if channel.Type is used elsewhere in this file
rg -n 'channel.Type' controller/channel-test.go | head -20
Repository: QuantumNous/new-api
Length of output: 874
🏁 Script executed:
# Verify what constant.ChannelTypeVolcEngine is
rg -n 'ChannelTypeVolcEngine' --type=go | head -10
Repository: QuantumNous/new-api
Length of output: 1062
🏁 Script executed:
# Check if channel parameter is available and used in buildTestRequest
sed -n '682,750p' controller/channel-test.go | grep -E '(channel|Channel)'
Repository: QuantumNous/new-api
Length of output: 173
🏁 Script executed:
# Search for any other embedding models with vision/multimodal in model constants
rg -n "vision|multimodal" relay/channel/ --type=go -i
Repository: QuantumNous/new-api
Length of output: 2363
🏁 Script executed:
# Double-check: are there any other channels with embedding capability and vision/multimodal models?
rg -n 'embedding' relay/channel/ --type=go -l | xargs -I {} sh -c 'echo "=== {} ===" && rg -i "vision|multimodal" {}'
Repository: QuantumNous/new-api
Length of output: 2263
🏁 Script executed:
# Check if Gemini has embedding support and what format it uses
rg -A10 'ConvertEmbeddingRequest' relay/channel/gemini/ --type=go
Repository: QuantumNous/new-api
Length of output: 872
🏁 Script executed:
# Verify what models are in Gemini constants
rg -n 'embedding' relay/channel/gemini/ --type=go -i
Repository: QuantumNous/new-api
Length of output: 3397
🏁 Script executed:
# Check Gemini's ConvertEmbeddingRequest to see how it handles the input format
sed -n '196,260p' relay/channel/gemini/adaptor.go
Repository: QuantumNous/new-api
Length of output: 2395
🏁 Script executed:
# Verify if there are any Gemini multimodal embedding models already defined
rg -n 'embedding' relay/channel/gemini/constant.go --type=go -B2 -A2
Repository: QuantumNous/new-api
Length of output: 346
Scope the multimodal typed-parts format to Volcengine only.
This logic emits the typed-parts input shape [{type:"text", text:"hello world"}] for ANY embedding model whose name contains "vision" or "multimodal", regardless of channel type. While this format is correct for Volcengine's /api/v3/embeddings/multimodal endpoint, it is Volcengine-specific and would break other providers if they add multimodal embedding models in the future. For example, Gemini has embedding models and uses a completely different input structure ({content: {parts: [{text: "..."}]}}); applying the Volcengine format to a future Gemini multimodal embedding model would cause failures.
The channel parameter is available in this function and is already used elsewhere in the file. Add a channel-type check to limit this behavior to Volcengine:
Suggested fix
lower := strings.ToLower(model)
- if strings.Contains(lower, "vision") || strings.Contains(lower, "multimodal") {
+ // Volcengine multimodal embeddings require typed-parts input format
+ if channel != nil && channel.Type == constant.ChannelTypeVolcEngine &&
+ (strings.Contains(lower, "vision") || strings.Contains(lower, "multimodal")) {
return &dto.EmbeddingRequest{
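The channel-scoped guard proposed in this review can be sketched in isolation as follows. Everything here is illustrative: channel and channelTypeVolcEngine are hypothetical stand-ins for the project's channel model and constant.ChannelTypeVolcEngine, the numeric value is an arbitrary placeholder, and needsTypedPartsInput is a made-up helper name.

```go
package main

import (
	"fmt"
	"strings"
)

// channelTypeVolcEngine is an arbitrary placeholder value for this sketch;
// the real value is defined by constant.ChannelTypeVolcEngine in the project.
const channelTypeVolcEngine = 1

// channel is a hypothetical, minimal stand-in for the project's channel model.
type channel struct {
	Type int
}

// needsTypedPartsInput returns true only when the channel is a Volcengine
// channel AND the model name signals a vision/multimodal model. A nil
// channel never qualifies, so other providers keep the string-array shape.
func needsTypedPartsInput(ch *channel, model string) bool {
	if ch == nil || ch.Type != channelTypeVolcEngine {
		return false
	}
	lower := strings.ToLower(model)
	return strings.Contains(lower, "vision") || strings.Contains(lower, "multimodal")
}

func main() {
	volc := &channel{Type: channelTypeVolcEngine}
	other := &channel{Type: channelTypeVolcEngine + 1}

	fmt.Println(needsTypedPartsInput(volc, "doubao-embedding-vision-251215"))
	fmt.Println(needsTypedPartsInput(other, "some-multimodal-embedding"))
	fmt.Println(needsTypedPartsInput(nil, "doubao-embedding-vision-251215"))
}
```

Checking the channel first means a provider that later ships a model with "vision" or "multimodal" in its name is unaffected unless it is explicitly opted in.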
Related Issue
Closes #4997
Summary
This PR makes Volcengine multimodal embedding models (e.g. doubao-embedding-vision-251215) work end-to-end in new-api by fixing two independent bugs.
Changes
1. Route vision/multimodal models to dedicated endpoint
relay/channel/volcengine/adaptor.go:
- GetRequestURL for RelayModeEmbeddings now routes vision/multimodal models to /api/v3/embeddings/multimodal and keeps text-only models on /embeddings.
- isVolcengineMultimodalEmbedding(modelName) does the keyword match (case-insensitive: must contain embedding AND vision/multimodal).
- UpstreamModelName and OriginModelName are checked so model_mapping aliases keep working.
2. Use multimodal input shape in dashboard "Test channel" button
controller/channel-test.go:
- Input=["hello world"] was rejected by the multimodal endpoint with "we could not parse the JSON body of your request".
- Now emits Input=[{type:"text", text:"hello world"}] for vision/multimodal models, matching the endpoint contract.
- Plain-text embedding models (Doubao-embedding, text-embedding-3-*, m3e, bge-*, etc.) keep the legacy string-array shape.
Behavior Matrix
- Doubao-embedding (text): /embeddings ✅
- doubao-embedding-text-240715: /embeddings ✅
- doubao-embedding-large-text-240915: /embeddings ✅
- doubao-embedding-vision-241215: /embeddings ❌ 400; /embeddings/multimodal ✅
- doubao-embedding-vision-251215: /embeddings ❌ 400; /embeddings/multimodal ✅
- *-multimodal* (future): /embeddings ❌; /embeddings/multimodal ✅
Tests
Validated locally with a real Volcengine multimodal endpoint:
Plain-text embedding channels (Doubao-embedding, OpenAI text-embedding-3-small, m3e) verified unchanged.
Compatibility
model_mapping aliases (both OriginModelName and UpstreamModelName are checked)