18 changes: 18 additions & 0 deletions controller/channel-test.go
@@ -758,6 +758,24 @@ func buildTestRequest(model string, endpointType string, channel *model.Channel,
if strings.Contains(strings.ToLower(model), "embedding") ||
	strings.HasPrefix(model, "m3e") ||
	strings.Contains(model, "bge-") {
	// Volcengine multimodal embedding endpoint (/api/v3/embeddings/multimodal)
	// requires input to be an array of typed parts (e.g. [{type:"text",text:"..."}]),
	// not a plain string array. Sending the standard `["hello world"]` shape
	// against vision/multimodal models triggers a 400 from the upstream
	// ("we could not parse the JSON body of your request"), so emit the
	// multimodal-compatible shape when the model name signals it.
	lower := strings.ToLower(model)
	if strings.Contains(lower, "vision") || strings.Contains(lower, "multimodal") {
		return &dto.EmbeddingRequest{
			Model: model,
			Input: []any{
				map[string]any{
					"type": "text",
					"text": "hello world",
				},
			},
		}
	}
Comment on lines +761 to +778
Contributor

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for embedding model definitions that contain vision/multimodal keywords
# across different channel adaptors to assess cross-provider impact

rg -nC3 --type=go 'ModelList.*=.*\[' relay/channel/ | \
  rg -i 'embedding.*(vision|multimodal)|(vision|multimodal).*embedding'

Repository: QuantumNous/new-api

Length of output: 45


🌐 Web query:

Do OpenAI, Anthropic, or other major LLM providers support typed-parts input format [{type:"text", text:"..."}] for multimodal embedding requests?

💡 Result:

As of May 20, 2026, support for the specific typed-parts input format [{type: "text", text: "..."},...] varies significantly among major LLM providers [1][2][3].

Google Gemini API: The Gemini API uses a native multimodal, parts-based structure for embedding requests, though not identical to the requested format: each part is keyed by its content kind (e.g. `text`, `inlineData`) rather than carrying a `type` field. Specifically, the Gemini Embedding 2 model accepts a content structure consisting of parts [3][4]. Users can pass multiple input types (text, images, etc.) within the parts array of a Content object to generate an aggregated embedding [3]:
{ "model": "models/gemini-embedding-2", "content": { "parts": [ {"text": "Your text here"}, {"inlineData": {"mimeType": "image/jpeg", "data": "..."}} ] } }

OpenAI API: OpenAI's official Embeddings API does not support a multimodal typed-parts format [1][5]. The input parameter for the embeddings endpoint is strictly limited to a string, an array of strings, or an array of token integers [1]. While OpenAI supports multimodal inputs (like images and text) in their Chat Completions API (using the same typed-parts format you referenced), this format is not compatible with their dedicated embedding endpoints [1][5][6].

Anthropic API: Anthropic's Claude API does not currently offer a standalone multimodal embedding endpoint that supports a typed-parts input format [2]. While Anthropic's Messages API uses the [{type: "text", text: "..."}, {type: "image",...}] format for chat interactions [7][6], it does not provide native support for embedding requests in this format [2][8]. Users seeking multimodal embeddings typically rely on third-party models or specialized services like Voyage AI, which also use standard input arrays rather than the typed-parts block format for their embedding endpoints [2].
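For a side-by-side view, the sketch below marshals the three input shapes discussed here with Go's `encoding/json`. The struct types, the placeholder model name `example-embedding-model`, and the text-only Gemini payload are illustrative assumptions rather than any provider's full schema; the Volcengine typed-parts shape follows the comment in the diff, and the Gemini shape follows the query result.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// plainEmbeddingRequest is the OpenAI-style shape: input is a string array.
type plainEmbeddingRequest struct {
	Model string   `json:"model"`
	Input []string `json:"input"`
}

// typedPart is one element of the typed-parts array that, per the PR,
// Volcengine's /api/v3/embeddings/multimodal endpoint expects.
type typedPart struct {
	Type string `json:"type"`
	Text string `json:"text"`
}

type multimodalEmbeddingRequest struct {
	Model string      `json:"model"`
	Input []typedPart `json:"input"`
}

// Gemini-style {content: {parts: [...]}} shape, text-only (assumption:
// only the fields needed for this comparison are modeled).
type geminiPart struct {
	Text string `json:"text"`
}

type geminiContent struct {
	Parts []geminiPart `json:"parts"`
}

type geminiEmbeddingRequest struct {
	Model   string        `json:"model"`
	Content geminiContent `json:"content"`
}

// mustJSON marshals v and panics on error; fine for a demo.
func mustJSON(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	fmt.Println(mustJSON(plainEmbeddingRequest{
		Model: "example-embedding-model",
		Input: []string{"hello world"},
	}))
	fmt.Println(mustJSON(multimodalEmbeddingRequest{
		Model: "doubao-embedding-vision-241215",
		Input: []typedPart{{Type: "text", Text: "hello world"}},
	}))
	fmt.Println(mustJSON(geminiEmbeddingRequest{
		Model:   "models/gemini-embedding-2",
		Content: geminiContent{Parts: []geminiPart{{Text: "hello world"}}},
	}))
}
```

The Volcengine and Gemini payloads both wrap text in a part object, but only the Volcengine one carries a `type` discriminator, which is why one shape cannot be reused for the other.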

Citations:


🏁 Script executed:

# Find where buildTestRequest is defined and called
rg -n 'buildTestRequest' --type=go


Length of output: 308


🏁 Script executed:

# Search for embedding model definitions across all channel implementations
fd --type=f --extension=go 'relay/channel' | head -20


Length of output: 421


🏁 Script executed:

# Check specifically for Volcengine embedding model definition
rg -n 'embedding' relay/channel/ --type=go -i | grep -i volcengine


Length of output: 1811


🏁 Script executed:

# See what embedding models are defined per channel type
rg -n 'embedding' relay/channel/ --type=go -i -B2 -A2 | head -100


Length of output: 7110


🏁 Script executed:

# Look at the buildTestRequest function to understand the full context
sed -n '682,810p' controller/channel-test.go


Length of output: 4068


🏁 Script executed:

# Check if channel.Type is used elsewhere in this file
rg -n 'channel.Type' controller/channel-test.go | head -20


Length of output: 874


🏁 Script executed:

# Verify what constant.ChannelTypeVolcEngine is
rg -n 'ChannelTypeVolcEngine' --type=go | head -10


Length of output: 1062


🏁 Script executed:

# Check if channel parameter is available and used in buildTestRequest
sed -n '682,750p' controller/channel-test.go | grep -E '(channel|Channel)'


Length of output: 173


🏁 Script executed:

# Search for any other embedding models with vision/multimodal in model constants
rg -n "vision|multimodal" relay/channel/ --type=go -i


Length of output: 2363


🏁 Script executed:

# Double-check: are there any other channels with embedding capability and vision/multimodal models?
rg -n 'embedding' relay/channel/ --type=go -l | xargs -I {} sh -c 'echo "=== {} ===" && rg -i "vision|multimodal" {}'


Length of output: 2263


🏁 Script executed:

# Check if Gemini has embedding support and what format it uses
rg -A10 'ConvertEmbeddingRequest' relay/channel/gemini/ --type=go


Length of output: 872


🏁 Script executed:

# Verify what models are in Gemini constants
rg -n 'embedding' relay/channel/gemini/ --type=go -i


Length of output: 3397


🏁 Script executed:

# Check Gemini's ConvertEmbeddingRequest to see how it handles the input format
sed -n '196,260p' relay/channel/gemini/adaptor.go


Length of output: 2395


🏁 Script executed:

# Verify if there are any Gemini multimodal embedding models already defined
rg -n 'embedding' relay/channel/gemini/constant.go --type=go -B2 -A2


Length of output: 346


Scope the multimodal typed-parts format to Volcengine only.

This logic emits the typed-parts input shape `[{type:"text", text:"hello world"}]` for ANY embedding model whose name contains "vision" or "multimodal", regardless of channel type. While this format is correct for Volcengine's `/api/v3/embeddings/multimodal` endpoint, it is Volcengine-specific and would break other providers if they add multimodal embedding models in the future. For example, Gemini has embedding models and uses a completely different input structure (`{content: {parts: [{text: "..."}]}}`); applying the Volcengine format to a future Gemini multimodal embedding model would cause failures.

The channel parameter is available in this function and is already used elsewhere in the file. Add a channel-type check to limit this behavior to Volcengine:

Suggested fix
 	lower := strings.ToLower(model)
-	if strings.Contains(lower, "vision") || strings.Contains(lower, "multimodal") {
+	// Volcengine multimodal embeddings require typed-parts input format
+	if channel != nil && channel.Type == constant.ChannelTypeVolcEngine &&
+		(strings.Contains(lower, "vision") || strings.Contains(lower, "multimodal")) {
 		return &dto.EmbeddingRequest{
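A minimal, self-contained sketch of the scoped condition, assuming a stand-in `channel` struct and a hypothetical `channelTypeVolcEngine` constant in place of the project's `model.Channel` and `constant.ChannelTypeVolcEngine` (whose real value is not shown here). It keeps the nil guard from the suggested fix and checks the channel type before looking at the model name.

```go
package main

import (
	"fmt"
	"strings"
)

// channelTypeVolcEngine is a hypothetical stand-in for
// constant.ChannelTypeVolcEngine; the numeric value is arbitrary.
const channelTypeVolcEngine = 1000

// channel is a hypothetical stand-in for model.Channel, reduced to the
// one field the condition reads.
type channel struct {
	Type int
}

// usesVolcengineTypedParts mirrors the suggested condition: only a non-nil
// Volcengine channel with a "vision"/"multimodal" model name qualifies.
func usesVolcengineTypedParts(ch *channel, model string) bool {
	if ch == nil || ch.Type != channelTypeVolcEngine {
		return false
	}
	lower := strings.ToLower(model)
	return strings.Contains(lower, "vision") || strings.Contains(lower, "multimodal")
}

func main() {
	volc := &channel{Type: channelTypeVolcEngine}
	other := &channel{Type: 1} // hypothetical non-Volcengine channel type

	fmt.Println(usesVolcengineTypedParts(volc, "doubao-embedding-vision-241215")) // true
	fmt.Println(usesVolcengineTypedParts(other, "some-multimodal-embedding"))    // false
	fmt.Println(usesVolcengineTypedParts(nil, "doubao-embedding-vision-241215")) // false
}
```

Checking the channel first means a future non-Volcengine multimodal embedding model falls through to the standard `["hello world"]` test input instead of inheriting the Volcengine shape.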
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@controller/channel-test.go` around lines 761 - 778, The multimodal
typed-parts payload is being applied for any model name containing "vision" or
"multimodal" (variable model / lower) but must be limited to Volcengine channel
only; update the conditional that builds the dto.EmbeddingRequest to also check
the channel variable (a *model.Channel, so guard against nil and compare
channel.Type with constant.ChannelTypeVolcEngine) so the typed-parts input is
emitted only for Volcengine multimodal models and not for other providers.

	// Return the EmbeddingRequest
	return &dto.EmbeddingRequest{
		Model: model,
28 changes: 28 additions & 0 deletions relay/channel/volcengine/adaptor.go
@@ -263,6 +263,15 @@ func (a *Adaptor) GetRequestURL(info *relaycommon.RelayInfo) (string, error) {
	}
	return fmt.Sprintf("%s/api/v3/chat/completions", baseUrl), nil
case constant.RelayModeEmbeddings:
	// Volcengine multimodal embedding models (e.g. doubao-embedding-vision-*)
	// require the dedicated /api/v3/embeddings/multimodal endpoint and reject
	// the standard /embeddings path with a 400 InvalidParameter error.
	// Heuristic: route by model name keywords. Both the request model and the
	// upstream model name are checked so model_mapping aliases still work.
	if isVolcengineMultimodalEmbedding(info.UpstreamModelName) ||
		isVolcengineMultimodalEmbedding(info.OriginModelName) {
		return fmt.Sprintf("%s/api/v3/embeddings/multimodal", baseUrl), nil
	}
	return fmt.Sprintf("%s/api/v3/embeddings", baseUrl), nil
// Doubao image-to-image also goes through the generations endpoint: https://www.volcengine.com/docs/82379/1824121
case constant.RelayModeImagesGenerations, constant.RelayModeImagesEdits:
@@ -400,3 +409,22 @@ func (a *Adaptor) GetModelList() []string {
func (a *Adaptor) GetChannelName() string {
	return ChannelName
}

// isVolcengineMultimodalEmbedding reports whether the given model name targets
// Volcengine's multimodal (image+text) embedding endpoint, which requires the
// `/api/v3/embeddings/multimodal` path instead of the standard `/embeddings`.
//
// The detection is keyword-based on the model name. We accept both the original
// "vision" series (e.g. doubao-embedding-vision-241215, doubao-embedding-vision-251215)
// and any future "multimodal" naming. Matching is case-insensitive so user-supplied
// model_mapping aliases work even with mixed case.
func isVolcengineMultimodalEmbedding(modelName string) bool {
	if modelName == "" {
		return false
	}
	lower := strings.ToLower(modelName)
	if !strings.Contains(lower, "embedding") {
		return false
	}
	return strings.Contains(lower, "vision") || strings.Contains(lower, "multimodal")
}
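The routing rule can be exercised on its own. The sketch below copies `isVolcengineMultimodalEmbedding` from the diff and wraps the `RelayModeEmbeddings` branch of `GetRequestURL` in a hypothetical helper, `embeddingsURL`, that takes the base URL and both model names as plain strings instead of a `RelayInfo`; the base URL `https://example.invalid` is a placeholder.

```go
package main

import (
	"fmt"
	"strings"
)

// isVolcengineMultimodalEmbedding is copied from the diff: the name must
// contain "embedding" plus "vision" or "multimodal", case-insensitively.
func isVolcengineMultimodalEmbedding(modelName string) bool {
	if modelName == "" {
		return false
	}
	lower := strings.ToLower(modelName)
	if !strings.Contains(lower, "embedding") {
		return false
	}
	return strings.Contains(lower, "vision") || strings.Contains(lower, "multimodal")
}

// embeddingsURL is a hypothetical standalone version of the
// RelayModeEmbeddings branch: either model name can select the
// multimodal endpoint.
func embeddingsURL(baseURL, upstreamModel, originModel string) string {
	if isVolcengineMultimodalEmbedding(upstreamModel) ||
		isVolcengineMultimodalEmbedding(originModel) {
		return fmt.Sprintf("%s/api/v3/embeddings/multimodal", baseURL)
	}
	return fmt.Sprintf("%s/api/v3/embeddings", baseURL)
}

func main() {
	base := "https://example.invalid"
	// Upstream name carries the keywords directly.
	fmt.Println(embeddingsURL(base, "doubao-embedding-vision-241215", "doubao-embedding-vision-241215"))
	// model_mapping alias: only the upstream name matches.
	fmt.Println(embeddingsURL(base, "doubao-embedding-vision-241215", "my-image-embedder"))
	// Text-only embedding model stays on the standard path.
	fmt.Println(embeddingsURL(base, "doubao-embedding", "doubao-embedding"))
	// "vision" without "embedding" is not treated as a multimodal embedding model.
	fmt.Println(isVolcengineMultimodalEmbedding("doubao-vision-pro"))
}
```

Because the check ORs the two names, an alias is enough to select the multimodal path even when only one name carries the keywords; by the same `||`, an alias such as `my-vision-embedding` mapped onto a text-only upstream model would also be routed to `/api/v3/embeddings/multimodal`.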