Priority Level
Medium
Task Summary
DataDesigner should preserve multimodal content returned from MCP tool calls and pass it back to the model in a provider-appropriate shape. Today, MCP tool results are flattened into strings before they re-enter the generation loop, so image content returned by a tool is either dropped or serialized as JSON/text, and never reaches the model as a visible image.
This matters for VLM-backed generation flows where an MCP server returns an image, screenshot, rendered page, chart, document page, or other visual result and the model should inspect it on the next turn without any user-side special handling.
Technical Details & Implementation Plan
The current path collapses tool results too early:
- MCPToolResult.content is typed as str in packages/data-designer-engine/src/data_designer/engine/mcp/registry.py.
- MCPIOService._call_tool_async() calls _serialize_tool_result_content() in packages/data-designer-engine/src/data_designer/engine/mcp/io.py, which converts lists of MCP content blocks into text/JSON strings.
- MCPFacade._execute_tool_calls_from_canonical() wraps the result with ChatMessage.as_tool(content=result.content, ...) in packages/data-designer-engine/src/data_designer/engine/mcp/facade.py.
- ChatMessage.content already supports str | list[dict[str, Any]], but ChatMessage.as_tool() only accepts str in packages/data-designer-engine/src/data_designer/engine/models/utils.py.
Proposed shape changes:
- Change MCPToolResult.content to support typed content: content: str | list[dict[str, Any]]
- Replace the MCP result serializer with a coercer that preserves MCP content blocks:
  - MCP TextContent(type="text", text=...) -> {"type": "text", "text": ...}
  - MCP ImageContent(type="image", data=..., mimeType=...) -> OpenAI-style canonical image block:

        {
            "type": "image_url",
            "image_url": {
                "url": f"data:{mime_type};base64,{data}",
            },
        }

  - Structured/non-visual content can continue to degrade to text or JSON text, but image blocks should never be stringified.
  - Preserve ordering across mixed tool results such as [text, image, text, image].
- Update ChatMessage.as_tool() to accept str | list[dict[str, Any]] so the generation trace can carry multimodal tool results unchanged.
- Keep the model facade/generation loop generic. It should append the assistant tool-call message and the corresponding tool result message as it does today, but the tool result content may now be a list of content blocks.
- Add tests at the MCP and facade boundaries:
  - MCP ImageContent becomes an image_url data-URI block, not JSON text.
  - Mixed text/image tool results preserve order.
  - MCPFacade.process_completion_response() returns a role="tool" ChatMessage whose content is a block list when the tool returned multimodal content.
  - Existing text-only behavior remains unchanged.
Provider API Handling
Provider adapters should lower the canonical internal block list at the edge, instead of forcing MCP/generation code to know provider-specific message grammars.
OpenAI-compatible Chat Completions:
- DataDesigner currently sends raw HTTP chat-completions payloads, and VLM endpoints already accept image_url blocks in messages.
- For this path, preserve the OpenAI-style block list on the role="tool" message and pass it through to compatible VLM backends.
- Do not stringify image data. If a strict official OpenAI Chat Completions target rejects image parts on tool messages, handle that as a provider/adapter compatibility issue rather than weakening the generic MCP result representation. A future Responses API path can map tool outputs to function_call_output.output with input_image blocks.
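Concretely, the passthrough shape for the OpenAI-compatible path would resemble the message below; the tool_call_id and image data are placeholders:

```python
# Illustrative role="tool" message carrying a mixed text/image tool result
# in OpenAI-style content blocks. Values are placeholders, not real data.
tool_result_message = {
    "role": "tool",
    "tool_call_id": "call_abc123",
    "content": [
        {"type": "text", "text": "Rendered page screenshot:"},
        {
            "type": "image_url",
            "image_url": {"url": "data:image/png;base64,iVBORw0KGgo..."},
        },
    ],
}
```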
Anthropic Messages:
- The Anthropic adapter already has translate_tool_result_content() and translate_image_url_block() support for non-string tool result content in packages/data-designer-engine/src/data_designer/engine/models/clients/adapters/anthropic_translation.py.
- Once MCP tool results preserve image_url blocks, Anthropic can translate them into native tool_result.content image blocks with base64/url source metadata.
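The lowering step can be sketched as follows; the function name here is hypothetical and not the actual helper in anthropic_translation.py, but the output shape matches Anthropic's native image block with a base64 or url source:

```python
from typing import Any


def image_url_block_to_anthropic(block: dict[str, Any]) -> dict[str, Any]:
    """Lower a canonical image_url block into an Anthropic image block (sketch)."""
    url: str = block["image_url"]["url"]
    if url.startswith("data:"):
        # Split "data:<media_type>;base64,<data>" into its parts.
        header, data = url.split(",", 1)
        media_type = header[len("data:"):].split(";", 1)[0]
        return {
            "type": "image",
            "source": {"type": "base64", "media_type": media_type, "data": data},
        }
    # Plain HTTP(S) URLs can use Anthropic's URL source instead.
    return {"type": "image", "source": {"type": "url", "url": url}}
```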
Future provider adapters:
- Keep the image_url data URI as the internal canonical representation.
- Convert at the provider boundary into the provider's native image/document/tool-result block format.
- If a provider has no valid multimodal tool-result representation, fail clearly or use an explicit provider-level fallback. Do not silently convert images to opaque text.
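One way to make that failure explicit at the provider boundary; the exception and function names below are hypothetical, sketched only to show the fail-loudly behavior:

```python
from typing import Any


class UnsupportedMultimodalToolResultError(RuntimeError):
    """Raised when a provider cannot represent image tool results natively."""


def lower_tool_result_blocks(
    blocks: list[dict[str, Any]], provider_supports_images: bool
) -> list[dict[str, Any]]:
    """Pass blocks through, or fail clearly if the provider cannot carry images."""
    has_image = any(b.get("type") == "image_url" for b in blocks)
    if has_image and not provider_supports_images:
        # Fail loudly instead of silently converting images to opaque text.
        raise UnsupportedMultimodalToolResultError(
            "provider has no multimodal tool-result representation"
        )
    return blocks
```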
Investigation / Context
This follows MCP's standard CallToolResult.content model, where tool results are a list of content blocks and ImageContent carries base64 data plus mimeType.
LiteLLM has already hit this same issue. A prior bug reported that image content in tool messages was not passed through to Anthropic: BerriAI/litellm#6953. Current LiteLLM converts Chat Completions-style role="tool" messages containing image_url blocks into Anthropic tool_result content containing native image/document blocks: https://github.com/BerriAI/litellm/blob/e75c7a312a7a3bf9a19904557c485ac820f09d24/litellm/litellm_core_utils/prompt_templates/factory.py#L1678-L1837. LiteLLM also has regression tests for image content inside tool results: https://github.com/BerriAI/litellm/blob/e75c7a312a7a3bf9a19904557c485ac820f09d24/tests/test_litellm/litellm_core_utils/prompt_templates/test_litellm_core_utils_prompt_templates_factory.py#L1414-L1577.
OpenAI's strict SDK types currently define Chat Completions tool message content as text-only parts, while user messages support image parts. However, OpenAI-compatible VLM APIs and adapter libraries commonly use image_url content blocks as the interchange shape. Since DataDesigner already targets OpenAI-compatible VLM chat completions, the engine should preserve the multimodal content generically and let provider adapters decide the exact wire format.
Agent Plan / Findings
Recommended implementation order:
- Introduce a content-block-preserving MCP result coercer in engine/mcp/io.py.
- Widen MCPToolResult.content and ChatMessage.as_tool() types.
- Update MCPFacade to pass tool result content through without string assumptions.
- Add focused MCP/facade tests for text-only, image-only, and mixed text/image results.
- Add provider translation tests for Anthropic tool results and OpenAI-compatible passthrough.
- Document that MCP image results require a VLM-capable model/provider to be consumed semantically.
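A pytest-style sketch of the MCP-boundary tests listed above; the inlined _coerce stand-in mimics the intended coercer so the sketch is self-contained, and is not engine code:

```python
from typing import Any


def _coerce(blocks: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Stand-in for the proposed coercer, used only to make these tests runnable."""
    out: list[dict[str, Any]] = []
    for b in blocks:
        if b["type"] == "image":
            out.append({
                "type": "image_url",
                "image_url": {"url": f"data:{b['mimeType']};base64,{b['data']}"},
            })
        else:
            out.append({"type": "text", "text": b["text"]})
    return out


def test_image_becomes_image_url_block() -> None:
    [block] = _coerce([{"type": "image", "data": "AAAA", "mimeType": "image/png"}])
    assert block["type"] == "image_url"
    assert block["image_url"]["url"].startswith("data:image/png;base64,")


def test_mixed_results_preserve_order() -> None:
    blocks = _coerce([
        {"type": "text", "text": "a"},
        {"type": "image", "data": "AAAA", "mimeType": "image/png"},
        {"type": "text", "text": "b"},
    ])
    assert [b["type"] for b in blocks] == ["text", "image_url", "text"]
```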
Dependencies
None known.