Preserve multimodal MCP tool-call results through model provider adapters #607

@eric-tramel

Description

Priority Level

Medium

Task Summary

DataDesigner should preserve multimodal content returned from MCP tool calls and pass it back to the model in a provider-appropriate shape. Today, MCP tool results are flattened into strings before they re-enter the generation loop, so image content returned by a tool is either dropped before the model can see it or serialized into JSON/text that the model cannot interpret as an image.

This matters for VLM-backed generation flows where an MCP server returns an image, screenshot, rendered page, chart, document page, or other visual result and the model should inspect it on the next turn without any user-side special handling.

Technical Details & Implementation Plan

The current path collapses tool results too early:

  • MCPToolResult.content is typed as str in packages/data-designer-engine/src/data_designer/engine/mcp/registry.py.
  • MCPIOService._call_tool_async() calls _serialize_tool_result_content() in packages/data-designer-engine/src/data_designer/engine/mcp/io.py, which converts lists of MCP content blocks into text/JSON strings.
  • MCPFacade._execute_tool_calls_from_canonical() wraps the result with ChatMessage.as_tool(content=result.content, ...) in packages/data-designer-engine/src/data_designer/engine/mcp/facade.py.
  • ChatMessage.content already supports str | list[dict[str, Any]], but ChatMessage.as_tool() only accepts str in packages/data-designer-engine/src/data_designer/engine/models/utils.py.

Proposed shape changes:

  1. Change MCPToolResult.content to support typed content:
content: str | list[dict[str, Any]]
  2. Replace the MCP result serializer with a coercer that preserves MCP content blocks:
  • MCP TextContent(type="text", text=...) -> {"type": "text", "text": ...}
  • MCP ImageContent(type="image", data=..., mimeType=...) -> OpenAI-style canonical image block:
{
    "type": "image_url",
    "image_url": {
        "url": f"data:{mime_type};base64,{data}",
    },
}
  • Structured/non-visual content can continue to degrade to text or JSON text, but image blocks should never be stringified.
  • Preserve ordering across mixed tool results such as [text, image, text, image].
  3. Update ChatMessage.as_tool() to accept str | list[dict[str, Any]] so the generation trace can carry multimodal tool results unchanged.

  4. Keep the model facade/generation loop generic. It should append the assistant tool-call message and the corresponding tool result message as it does today, but the tool result content may now be a list of content blocks.

  5. Add tests at the MCP and facade boundaries:

  • MCP ImageContent becomes image_url data URI metadata, not JSON text.
  • Mixed text/image tool results preserve order.
  • MCPFacade.process_completion_response() returns a role="tool" ChatMessage whose content is a block list when the tool returned multimodal content.
  • Existing text-only behavior remains unchanged.

Provider API Handling

Provider adapters should lower the canonical internal block list at the edge, instead of forcing MCP/generation code to know provider-specific message grammars.

OpenAI-compatible Chat Completions:

  • DataDesigner currently sends raw HTTP chat-completions payloads, and VLM endpoints already accept image_url blocks in messages.
  • For this path, preserve the OpenAI-style block list on the role="tool" message and pass it through to compatible VLM backends.
  • Do not stringify image data. If a strict official OpenAI Chat Completions target rejects image parts on tool messages, handle that as a provider/adapter compatibility issue rather than weakening the generic MCP result representation. A future Responses API path can map tool outputs to function_call_output.output with input_image blocks.
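For this path, the role="tool" message simply carries the block list unchanged. A sketch of the intended wire shape (the tool_call_id and payload values are illustrative, not taken from the codebase):

```python
# Illustrative role="tool" message as sent to an OpenAI-compatible VLM
# backend: the canonical block list is passed through unchanged.
tool_message = {
    "role": "tool",
    "tool_call_id": "call_abc123",  # hypothetical id from the assistant turn
    "content": [
        {"type": "text", "text": "Screenshot of the rendered page:"},
        {
            "type": "image_url",
            # Truncated base64 payload, for illustration only.
            "image_url": {"url": "data:image/png;base64,iVBORw0KGgo..."},
        },
    ],
}
```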

Anthropic Messages:

  • The Anthropic adapter already has translate_tool_result_content() and translate_image_url_block() support for non-string tool result content in packages/data-designer-engine/src/data_designer/engine/models/clients/adapters/anthropic_translation.py.
  • Once MCP tool results preserve image_url blocks, Anthropic can translate them into native tool_result.content image blocks with base64/url source metadata.
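That translation is mechanical. A hedged sketch (the function name is hypothetical; the output follows the Anthropic Messages API image-block shape, which accepts both base64 and URL sources):

```python
from typing import Any


def image_url_block_to_anthropic(block: dict[str, Any]) -> dict[str, Any]:
    """Translate a canonical image_url block into an Anthropic image block
    suitable for tool_result.content."""
    url = block["image_url"]["url"]
    if url.startswith("data:"):
        # "data:image/png;base64,AAAA..." -> media type + raw base64 payload
        header, data = url.split(",", 1)
        media_type = header[len("data:"):].split(";", 1)[0]
        return {
            "type": "image",
            "source": {"type": "base64", "media_type": media_type, "data": data},
        }
    # Plain https URL: Anthropic also accepts a URL source.
    return {"type": "image", "source": {"type": "url", "url": url}}
```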

Future provider adapters:

  • Keep image_url data URI as the internal canonical representation.
  • Convert at the provider boundary into the provider's native image/document/tool-result block format.
  • If a provider has no valid multimodal tool-result representation, fail clearly or use an explicit provider-level fallback. Do not silently convert images to opaque text.

Investigation / Context

This follows MCP's standard CallToolResult.content model, where tool results are a list of content blocks and ImageContent carries base64 data plus mimeType.

LiteLLM has already hit this same issue. A prior bug reported that image content in tool messages was not passed through to Anthropic: BerriAI/litellm#6953. Current LiteLLM converts Chat Completions-style role="tool" messages containing image_url blocks into Anthropic tool_result content containing native image/document blocks: https://github.com/BerriAI/litellm/blob/e75c7a312a7a3bf9a19904557c485ac820f09d24/litellm/litellm_core_utils/prompt_templates/factory.py#L1678-L1837. LiteLLM also has regression tests for image content inside tool results: https://github.com/BerriAI/litellm/blob/e75c7a312a7a3bf9a19904557c485ac820f09d24/tests/test_litellm/litellm_core_utils/prompt_templates/test_litellm_core_utils_prompt_templates_factory.py#L1414-L1577.

OpenAI's strict SDK types currently define Chat Completions tool message content as text-only parts, while user messages support image parts. However, OpenAI-compatible VLM APIs and adapter libraries commonly use image_url content blocks as the interchange shape. Since DataDesigner already targets OpenAI-compatible VLM chat completions, the engine should preserve the multimodal content generically and let provider adapters decide the exact wire format.

Agent Plan / Findings

Recommended implementation order:

  1. Introduce a content-block-preserving MCP result coercer in engine/mcp/io.py.
  2. Widen MCPToolResult.content and ChatMessage.as_tool() types.
  3. Update MCPFacade to pass tool result content through without string assumptions.
  4. Add focused MCP/facade tests for text-only, image-only, and mixed text/image results.
  5. Add provider translation tests for Anthropic tool results and OpenAI-compatible passthrough.
  6. Document that MCP image results require a VLM-capable model/provider to be consumed semantically.
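One useful invariant for the tests in steps 4-5: the canonical image_url data URI must round-trip the original MCP image payload byte-for-byte. A self-contained sketch of that assertion (the PNG header bytes are illustrative):

```python
import base64

# The canonical data URI should round-trip the original image bytes exactly.
raw = b"\x89PNG\r\n\x1a\n"  # fake PNG magic bytes, for illustration only
b64 = base64.b64encode(raw).decode("ascii")
uri = f"data:image/png;base64,{b64}"

header, payload = uri.split(",", 1)
assert header == "data:image/png;base64"
assert base64.b64decode(payload) == raw
```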

Dependencies

None known.
