fix(volcengine): support multimodal embedding models#4998

Open
rangerTao wants to merge 2 commits into
QuantumNous:mainfrom
rangerTao:pr/volcengine-multimodal-embedding

@rangerTao rangerTao commented May 20, 2026

Related Issue

Closes #4997

Summary

This PR makes Volcengine multimodal embedding models (e.g. doubao-embedding-vision-251215) work end-to-end in new-api by fixing two independent bugs.

Changes

1. Route vision/multimodal models to dedicated endpoint

relay/channel/volcengine/adaptor.go:

  • GetRequestURL for RelayModeEmbeddings now routes vision/multimodal models to /api/v3/embeddings/multimodal and keeps text-only models on /embeddings.
  • New helper isVolcengineMultimodalEmbedding(modelName) does the keyword match (case-insensitive: must contain embedding AND vision/multimodal).
  • Both UpstreamModelName and OriginModelName are checked so model_mapping aliases keep working.
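The routing rule described above can be sketched as follows. This is a minimal illustration rather than the PR's actual code: `embeddingPath` is a hypothetical stand-in for the `RelayModeEmbeddings` branch of `GetRequestURL`, and the helper body is reconstructed from the prose description (contains `embedding` AND `vision` or `multimodal`, case-insensitive).

```go
package main

import (
	"fmt"
	"strings"
)

// isVolcengineMultimodalEmbedding reports whether a model name looks like a
// Volcengine multimodal embedding model: it must contain "embedding" and
// either "vision" or "multimodal", compared case-insensitively.
func isVolcengineMultimodalEmbedding(modelName string) bool {
	lower := strings.ToLower(modelName)
	if !strings.Contains(lower, "embedding") {
		return false
	}
	return strings.Contains(lower, "vision") || strings.Contains(lower, "multimodal")
}

// embeddingPath is a hypothetical helper (not from the PR). It checks both
// the upstream and the origin model name so that a model_mapping alias on
// either side still selects the multimodal endpoint.
func embeddingPath(upstreamModel, originModel string) string {
	if isVolcengineMultimodalEmbedding(upstreamModel) ||
		isVolcengineMultimodalEmbedding(originModel) {
		return "/api/v3/embeddings/multimodal"
	}
	return "/api/v3/embeddings"
}

func main() {
	// The alias "my-alias" maps to a vision model upstream.
	fmt.Println(embeddingPath("doubao-embedding-vision-251215", "my-alias"))
	// A text-only model stays on the standard endpoint.
	fmt.Println(embeddingPath("doubao-embedding-text-240715", "doubao-embedding-text-240715"))
}
```

Checking both names means an alias whose mapped upstream name is opaque (for example an endpoint ID) still routes correctly as long as the name the user requested carries the keywords.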

2. Use multimodal input shape in dashboard "Test channel" button

controller/channel-test.go:

  • Hard-coded Input=["hello world"] was rejected by the multimodal endpoint with "we could not parse the JSON body of your request".
  • Now emits Input=[{type:"text", text:"hello world"}] for vision/multimodal models, matching the endpoint contract.
  • Plain-text embedding models (Doubao-embedding, text-embedding-3-*, m3e, bge-*, etc.) keep the legacy string-array shape.
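To make the two payload shapes concrete, here is a small sketch of the test-request selection. The `embeddingRequest` type and `buildTestEmbeddingRequest` function are hypothetical stand-ins for `dto.EmbeddingRequest` and the embedding branch of `buildTestRequest`; only the keyword check and the two input shapes come from the description above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// embeddingRequest is a simplified, hypothetical stand-in for
// dto.EmbeddingRequest; only the two fields relevant here are modeled.
type embeddingRequest struct {
	Model string `json:"model"`
	Input any    `json:"input"`
}

// buildTestEmbeddingRequest picks the input shape from the model name:
// typed parts for vision/multimodal models, a plain string array otherwise.
func buildTestEmbeddingRequest(model string) embeddingRequest {
	lower := strings.ToLower(model)
	if strings.Contains(lower, "vision") || strings.Contains(lower, "multimodal") {
		return embeddingRequest{
			Model: model,
			Input: []any{map[string]any{"type": "text", "text": "hello world"}},
		}
	}
	return embeddingRequest{Model: model, Input: []string{"hello world"}}
}

// mustJSON marshals v and panics on error; acceptable for a demo.
func mustJSON(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	fmt.Println(mustJSON(buildTestEmbeddingRequest("doubao-embedding-vision-251215")))
	fmt.Println(mustJSON(buildTestEmbeddingRequest("m3e")))
}
```

Note that `encoding/json` sorts map keys, so the typed part serializes with `text` before `type`; key order carries no meaning in a JSON object.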

Behavior Matrix

| Model name | Path before | Test btn before | Path after | Test btn after |
|---|---|---|---|---|
| Doubao-embedding (text) | /embeddings | | (unchanged) | (unchanged) |
| doubao-embedding-text-240715 | /embeddings | | (unchanged) | (unchanged) |
| doubao-embedding-large-text-240915 | /embeddings | | (unchanged) | (unchanged) |
| doubao-embedding-vision-241215 | /embeddings ❌ 400 | ❌ 400 | /embeddings/multimodal | ✅ multimodal input |
| doubao-embedding-vision-251215 | /embeddings ❌ 400 | ❌ 400 | /embeddings/multimodal | ✅ multimodal input |
| *-multimodal* (future) | /embeddings | | /embeddings/multimodal | |

Tests

Validated locally with a real Volcengine multimodal endpoint:

# Direct call: 200 + full embedding vector
curl http://newapi/v1/embeddings \
  -H "Authorization: Bearer <token>" -H "Content-Type: application/json" \
  -d '{"model":"doubao-embedding-vision-251215","input":[{"type":"text","text":"测试"}]}'

# Dashboard "Test channel" button: ✅ 200

Plain-text embedding channels (Doubao-embedding, OpenAI text-embedding-3-small, m3e) verified unchanged.
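For anyone who wants to reproduce the direct call from Go instead of curl, a rough equivalent might look like the following. It only constructs the request and does not send it; `newEmbeddingRequest` is a hypothetical helper, and the base URL and token are the same placeholders used in the curl command above.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// newEmbeddingRequest builds a POST to <baseURL>/v1/embeddings carrying a
// single typed text part, mirroring the curl command. Hypothetical helper.
func newEmbeddingRequest(baseURL, token, model, text string) (*http.Request, error) {
	payload := map[string]any{
		"model": model,
		"input": []map[string]string{{"type": "text", "text": text}},
	}
	body, err := json.Marshal(payload)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/v1/embeddings", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := newEmbeddingRequest("http://newapi", "<token>", "doubao-embedding-vision-251215", "测试")
	if err != nil {
		panic(err)
	}
	// Pass req to http.DefaultClient.Do to send it against a real deployment.
	fmt.Println(req.Method, req.URL.String())
}
```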

Compatibility

  • ✅ No breaking changes for existing text embedding channels
  • ✅ No DB migration required — detection is by model name only
  • ✅ Works with model_mapping aliases (both OriginModelName and UpstreamModelName are checked)
  • ✅ "Test channel" button auto-adapts based on model name

Summary by CodeRabbit

Release Notes

  • New Features

    • Added automatic detection and enhanced support for multimodal and vision-capable embedding models.
  • Bug Fixes

    • Improved request handling and endpoint routing for specialized embedding model types to ensure proper processing.

Review Change Stack

taoliang added 2 commits May 20, 2026 20:14
Volcengine multimodal embedding models (e.g. doubao-embedding-vision-251215)
require POST /api/v3/embeddings/multimodal and reject the standard
/api/v3/embeddings path with a 400 InvalidParameter error:

  "the requested model doubao-embedding-vision-251215 does not support
   this api"

Detect by model-name keywords (case-insensitive: contains "embedding" AND
contains "vision" or "multimodal"). Both UpstreamModelName and
OriginModelName are checked so model_mapping aliases still work. Pure-text
embedding models continue to use the standard endpoint.
…ding channels

The follow-up fix for the multimodal endpoint routing exposed a second bug
in the channel test path:

  controller/channel-test.go always emits Input=["hello world"] for any
  embedding model, but Volcengine's /api/v3/embeddings/multimodal endpoint
  rejects flat string arrays with:
    "we could not parse the JSON body of your request"

So the dashboard "Test channel" button still failed for vision embedding
models even after routing was corrected.

Detect vision/multimodal embedding models by name keyword (case-insensitive)
and emit the typed-parts shape that matches the multimodal endpoint:

  {"input": [{"type": "text", "text": "hello world"}]}

Plain text embedding models (Doubao-embedding, doubao-embedding-text-*,
m3e, bge-*, OpenAI text-embedding-3-*, etc.) keep using the legacy string
array shape.

coderabbitai Bot commented May 20, 2026

Walkthrough

Fixes Volcengine multimodal embedding models by adding detection logic for vision/multimodal model names and routing requests to the dedicated multimodal endpoint with the correct typed-parts payload format.

Changes

Volcengine multimodal embedding support

| Layer / File(s) | Summary |
|---|---|
| Multimodal model detection helper: relay/channel/volcengine/adaptor.go | isVolcengineMultimodalEmbedding helper detects model names containing both embedding and (vision or multimodal) using case-insensitive matching. |
| Adaptor endpoint routing for multimodal embeddings: relay/channel/volcengine/adaptor.go | GetRequestURL checks the upstream model name and routes multimodal embedding requests to /api/v3/embeddings/multimodal instead of the standard embeddings path. |
| Test request payload format for multimodal models: controller/channel-test.go | buildTestRequest detects multimodal models in the embedding path and emits EmbeddingRequest with Input formatted as typed parts instead of plain string array. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

  • Calcium-Ion

Poem

🐰 A rabbit hops through Volcengine's dream,
Where vision models split the stream—
Multimodal magic, typed with care,
Embedding wisdom, floating fair! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix(volcengine): support multimodal embedding models' accurately and concisely summarizes the primary change: adding support for Volcengine multimodal embedding models. |
| Linked Issues check | ✅ Passed | The PR fully addresses both bugs from issue #4997: routing multimodal models to the correct endpoint in adaptor.go and emitting correct input shape in channel-test.go. |
| Out of Scope Changes check | ✅ Passed | All changes are directly related to fixing the two identified bugs in issue #4997 with no unrelated modifications detected. |


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@controller/channel-test.go`:
- Around line 761-778: The multimodal typed-parts payload is being applied for
any model name containing "vision" or "multimodal" (variable model / lower) but
must be limited to Volcengine channel only; update the conditional that builds
the dto.EmbeddingRequest to also check the channel variable (e.g., channel ==
"volcengine" or whatever canonical channel constant is used in this file) so the
typed-parts input is emitted only for Volcengine multimodal models and not for
other providers.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fca0f41a-296f-45db-bf61-7f348fe5adda

📥 Commits

Reviewing files that changed from the base of the PR and between 20d3e73 and bbbd8d8.

📒 Files selected for processing (2)
  • controller/channel-test.go
  • relay/channel/volcengine/adaptor.go

Comment on lines +761 to +778
	// Volcengine multimodal embedding endpoint (/api/v3/embeddings/multimodal)
	// requires input to be an array of typed parts (e.g. [{type:"text",text:"..."}]),
	// not a plain string array. Sending the standard `["hello world"]` shape
	// against vision/multimodal models triggers a 400 from the upstream:
	// "we could not parse the JSON body of your request". So emit the
	// multimodal-compatible shape when the model name signals it.
	lower := strings.ToLower(model)
	if strings.Contains(lower, "vision") || strings.Contains(lower, "multimodal") {
		return &dto.EmbeddingRequest{
			Model: model,
			Input: []any{
				map[string]any{
					"type": "text",
					"text": "hello world",
				},
			},
		}
	}

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for embedding model definitions that contain vision/multimodal keywords
# across different channel adaptors to assess cross-provider impact

rg -nC3 --type=go 'ModelList.*=.*\[' relay/channel/ | \
  rg -i 'embedding.*(vision|multimodal)|(vision|multimodal).*embedding'

Repository: QuantumNous/new-api

Length of output: 45


🌐 Web query:

Do OpenAI, Anthropic, or other major LLM providers support typed-parts input format [{type:"text", text:"..."}] for multimodal embedding requests?

💡 Result:

As of May 20, 2026, support for the specific typed-parts input format [{type: "text", text: "..."},...] varies significantly among major LLM providers [1][2][3].

Google Gemini API

The Gemini API uses a native multimodal structure that supports the requested format for embedding requests. Specifically, the Gemini Embedding 2 model accepts a content structure consisting of parts [3][4]. Users can pass multiple input types (text, images, etc.) within the parts array of a Content object to generate an aggregated embedding [3]:

  { "model": "models/gemini-embedding-2", "content": { "parts": [ {"text": "Your text here"}, {"inlineData": {"mimeType": "image/jpeg", "data": "..."}} ] } }

OpenAI API

OpenAI's official Embeddings API does not support a multimodal typed-parts format [1][5]. The input parameter for the embeddings endpoint is strictly limited to a string, an array of strings, or an array of token integers [1]. While OpenAI supports multimodal inputs (like images and text) in their Chat Completions API (using the same typed-parts format you referenced), this format is not compatible with their dedicated embedding endpoints [1][5][6].

Anthropic API

Anthropic's Claude API does not currently offer a standalone multimodal embedding endpoint that supports a typed-parts input format [2]. While Anthropic's Messages API uses the [{type: "text", text: "..."}, {type: "image",...}] format for chat interactions [7][6], it does not provide native support for embedding requests in this format [2][8]. Users seeking multimodal embeddings typically rely on third-party models or specialized services like Voyage AI, which also use standard input arrays rather than the typed-parts block format for their embedding endpoints [2].

Citations:


🏁 Script executed:

# Find where buildTestRequest is defined and called
rg -n 'buildTestRequest' --type=go

Repository: QuantumNous/new-api

Length of output: 308


🏁 Script executed:

# Search for embedding model definitions across all channel implementations
fd --type=f --extension=go 'relay/channel' | head -20

Repository: QuantumNous/new-api

Length of output: 421


🏁 Script executed:

# Check specifically for Volcengine embedding model definition
rg -n 'embedding' relay/channel/ --type=go -i | grep -i volcengine

Repository: QuantumNous/new-api

Length of output: 1811


🏁 Script executed:

# See what embedding models are defined per channel type
rg -n 'embedding' relay/channel/ --type=go -i -B2 -A2 | head -100

Repository: QuantumNous/new-api

Length of output: 7110


🏁 Script executed:

# Look at the buildTestRequest function to understand the full context
sed -n '682,810p' controller/channel-test.go

Repository: QuantumNous/new-api

Length of output: 4068


🏁 Script executed:

# Check if channel.Type is used elsewhere in this file
rg -n 'channel.Type' controller/channel-test.go | head -20

Repository: QuantumNous/new-api

Length of output: 874


🏁 Script executed:

# Verify what constant.ChannelTypeVolcEngine is
rg -n 'ChannelTypeVolcEngine' --type=go | head -10

Repository: QuantumNous/new-api

Length of output: 1062


🏁 Script executed:

# Check if channel parameter is available and used in buildTestRequest
sed -n '682,750p' controller/channel-test.go | grep -E '(channel|Channel)'

Repository: QuantumNous/new-api

Length of output: 173


🏁 Script executed:

# Search for any other embedding models with vision/multimodal in model constants
rg -n "vision|multimodal" relay/channel/ --type=go -i

Repository: QuantumNous/new-api

Length of output: 2363


🏁 Script executed:

# Double-check: are there any other channels with embedding capability and vision/multimodal models?
rg -n 'embedding' relay/channel/ --type=go -l | xargs -I {} sh -c 'echo "=== {} ===" && rg -i "vision|multimodal" {}'

Repository: QuantumNous/new-api

Length of output: 2263


🏁 Script executed:

# Check if Gemini has embedding support and what format it uses
rg -A10 'ConvertEmbeddingRequest' relay/channel/gemini/ --type=go

Repository: QuantumNous/new-api

Length of output: 872


🏁 Script executed:

# Verify what models are in Gemini constants
rg -n 'embedding' relay/channel/gemini/ --type=go -i

Repository: QuantumNous/new-api

Length of output: 3397


🏁 Script executed:

# Check Gemini's ConvertEmbeddingRequest to see how it handles the input format
sed -n '196,260p' relay/channel/gemini/adaptor.go

Repository: QuantumNous/new-api

Length of output: 2395


🏁 Script executed:

# Verify if there are any Gemini multimodal embedding models already defined
rg -n 'embedding' relay/channel/gemini/constant.go --type=go -B2 -A2

Repository: QuantumNous/new-api

Length of output: 346


Scope the multimodal typed-parts format to Volcengine only.

This logic emits the typed-parts input shape [{type:"text", text:"hello world"}] for ANY embedding model whose name contains "vision" or "multimodal", regardless of channel type. While this format is correct for Volcengine's /api/v3/embeddings/multimodal endpoint, it is Volcengine-specific and would break other providers if they add multimodal embedding models in the future. For example, Gemini has embedding models and uses a completely different input structure ({content: {parts: [{text: "..."}]}}); applying the Volcengine format to a future Gemini multimodal embedding model would cause failures.

The channel parameter is available in this function and is already used elsewhere in the file. Add a channel-type check to limit this behavior to Volcengine:

Suggested fix
 	lower := strings.ToLower(model)
-	if strings.Contains(lower, "vision") || strings.Contains(lower, "multimodal") {
+	// Volcengine multimodal embeddings require typed-parts input format
+	if channel != nil && channel.Type == constant.ChannelTypeVolcEngine &&
+		(strings.Contains(lower, "vision") || strings.Contains(lower, "multimodal")) {
 		return &dto.EmbeddingRequest{
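
As a standalone illustration of the scoped condition, the check could be factored into a small predicate along these lines. The `channel` struct and the `channelTypeVolcEngine` value below are placeholders invented for this sketch; the real code would use the repository's channel model and `constant.ChannelTypeVolcEngine`, whose numeric value is not shown here.

```go
package main

import (
	"fmt"
	"strings"
)

// Placeholder channel types for this sketch only; the real constant is
// constant.ChannelTypeVolcEngine and its actual value is not assumed here.
const (
	channelTypeOther = iota
	channelTypeVolcEngine
)

// channel is a hypothetical stand-in for the repository's channel model.
type channel struct {
	Type int
}

// needsTypedParts reports whether the test request should use the typed-parts
// input shape: only for Volcengine channels, and only when the model name
// contains "vision" or "multimodal" (case-insensitive).
func needsTypedParts(ch *channel, model string) bool {
	if ch == nil || ch.Type != channelTypeVolcEngine {
		return false
	}
	lower := strings.ToLower(model)
	return strings.Contains(lower, "vision") || strings.Contains(lower, "multimodal")
}

func main() {
	volc := &channel{Type: channelTypeVolcEngine}
	other := &channel{Type: channelTypeOther}
	fmt.Println(needsTypedParts(volc, "doubao-embedding-vision-251215"))
	fmt.Println(needsTypedParts(other, "some-multimodal-embedding"))
}
```

Putting the nil and channel-type checks first keeps the name-based heuristic from ever applying to other providers, which is the point of the suggestion.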


Development

Successfully merging this pull request may close these issues.

[BUG] Volcengine vision/multimodal embedding models broken: wrong endpoint + wrong test payload
