Preserve multimodal MCP tool-call results through model provider adapters #607

@eric-tramel

Description

Priority Level

Medium

Task Summary

DataDesigner should preserve multimodal content returned from MCP tool calls and pass it back to the model in a provider-appropriate shape. Today, MCP tool results are flattened into strings before they re-enter the generation loop, so image content returned by a tool is either dropped before the model can see it or serialized into JSON/text that the model cannot interpret as an image.

This matters for VLM-backed generation flows where an MCP server returns an image, screenshot, rendered page, chart, document page, or other visual result and the model should inspect it on the next turn without any user-side special handling.

Technical Details & Implementation Plan

The current path collapses tool results too early:

  • MCPToolResult.content is typed as str in packages/data-designer-engine/src/data_designer/engine/mcp/registry.py.
  • MCPIOService._call_tool_async() calls _serialize_tool_result_content() in packages/data-designer-engine/src/data_designer/engine/mcp/io.py, which converts lists of MCP content blocks into text/JSON strings.
  • MCPFacade._execute_tool_calls_from_canonical() wraps the result with ChatMessage.as_tool(content=result.content, ...) in packages/data-designer-engine/src/data_designer/engine/mcp/facade.py.
  • ChatMessage.content already supports str | list[dict[str, Any]], but ChatMessage.as_tool() only accepts str in packages/data-designer-engine/src/data_designer/engine/models/utils.py.

Proposed shape changes:

  1. Change MCPToolResult.content to support typed content:
content: str | list[dict[str, Any]]
  2. Replace the MCP result serializer with a coercer that preserves MCP content blocks:
  • MCP TextContent(type="text", text=...) -> {"type": "text", "text": ...}
  • MCP ImageContent(type="image", data=..., mimeType=...) -> OpenAI-style canonical image block:
{
    "type": "image_url",
    "image_url": {
        "url": f"data:{mime_type};base64,{data}",
    },
}
  • Structured/non-visual content can continue to degrade to text or JSON text, but image blocks should never be stringified.
  • Preserve ordering across mixed tool results such as [text, image, text, image].
  3. Update ChatMessage.as_tool() to accept str | list[dict[str, Any]] so the generation trace can carry multimodal tool results unchanged.

  4. Keep the model facade/generation loop generic. It should append the assistant tool-call message and the corresponding tool result message as it does today, but the tool result content may now be a list of content blocks.

  5. Add tests at the MCP and facade boundaries:

  • MCP ImageContent becomes image_url data URI metadata, not JSON text.
  • Mixed text/image tool results preserve order.
  • MCPFacade.process_completion_response() returns a role="tool" ChatMessage whose content is a block list when the tool returned multimodal content.
  • Existing text-only behavior remains unchanged.

Provider API Handling

Provider adapters should lower the canonical internal block list at the edge, instead of forcing MCP/generation code to know provider-specific message grammars.

OpenAI-compatible Chat Completions:

  • DataDesigner currently sends raw HTTP chat-completions payloads, and VLM endpoints already accept image_url blocks in messages.
  • For this path, preserve the OpenAI-style block list on the role="tool" message and pass it through to compatible VLM backends.
  • Do not stringify image data. If a strict official OpenAI Chat Completions target rejects image parts on tool messages, handle that as a provider/adapter compatibility issue rather than weakening the generic MCP result representation. A future Responses API path can map tool outputs to function_call_output.output with input_image blocks.
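For this path, the role="tool" message simply carries the block list unchanged. A sketch of the intended wire shape (the tool_call_id and payload values are illustrative, not taken from the codebase):

```python
# Illustrative role="tool" message as sent to an OpenAI-compatible VLM
# backend: the canonical block list is passed through unchanged.
tool_message = {
    "role": "tool",
    "tool_call_id": "call_abc123",  # hypothetical id from the assistant turn
    "content": [
        {"type": "text", "text": "Screenshot of the rendered page:"},
        {
            "type": "image_url",
            # Truncated base64 payload, for illustration only.
            "image_url": {"url": "data:image/png;base64,iVBORw0KGgo..."},
        },
    ],
}
```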

Anthropic Messages:

  • The Anthropic adapter already has translate_tool_result_content() and translate_image_url_block() support for non-string tool result content in packages/data-designer-engine/src/data_designer/engine/models/clients/adapters/anthropic_translation.py.
  • Once MCP tool results preserve image_url blocks, Anthropic can translate them into native tool_result.content image blocks with base64/url source metadata.
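That translation is mechanical. A hedged sketch (the function name is hypothetical; the output follows the Anthropic Messages API image-block shape, which accepts both base64 and URL sources):

```python
from typing import Any


def image_url_block_to_anthropic(block: dict[str, Any]) -> dict[str, Any]:
    """Translate a canonical image_url block into an Anthropic image block
    suitable for tool_result.content."""
    url = block["image_url"]["url"]
    if url.startswith("data:"):
        # "data:image/png;base64,AAAA..." -> media type + raw base64 payload
        header, data = url.split(",", 1)
        media_type = header[len("data:"):].split(";", 1)[0]
        return {
            "type": "image",
            "source": {"type": "base64", "media_type": media_type, "data": data},
        }
    # Plain https URL: Anthropic also accepts a URL source.
    return {"type": "image", "source": {"type": "url", "url": url}}
```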

Future provider adapters:

  • Keep image_url data URI as the internal canonical representation.
  • Convert at the provider boundary into the provider's native image/document/tool-result block format.
  • If a provider has no valid multimodal tool-result representation, fail clearly or use an explicit provider-level fallback. Do not silently convert images to opaque text.

Investigation / Context

This follows MCP's standard CallToolResult.content model, where tool results are a list of content blocks and ImageContent carries base64 data plus mimeType.

LiteLLM has already hit this same issue. A prior bug reported that image content in tool messages was not passed through to Anthropic: BerriAI/litellm#6953. Current LiteLLM converts Chat Completions-style role="tool" messages containing image_url blocks into Anthropic tool_result content containing native image/document blocks: https://github.com/BerriAI/litellm/blob/e75c7a312a7a3bf9a19904557c485ac820f09d24/litellm/litellm_core_utils/prompt_templates/factory.py#L1678-L1837. LiteLLM also has regression tests for image content inside tool results: https://github.com/BerriAI/litellm/blob/e75c7a312a7a3bf9a19904557c485ac820f09d24/tests/test_litellm/litellm_core_utils/prompt_templates/test_litellm_core_utils_prompt_templates_factory.py#L1414-L1577.

OpenAI's strict SDK types currently define Chat Completions tool message content as text-only parts, while user messages support image parts. However, OpenAI-compatible VLM APIs and adapter libraries commonly use image_url content blocks as the interchange shape. Since DataDesigner already targets OpenAI-compatible VLM chat completions, the engine should preserve the multimodal content generically and let provider adapters decide the exact wire format.

Agent Plan / Findings

Recommended implementation order:

  1. Introduce a content-block-preserving MCP result coercer in engine/mcp/io.py.
  2. Widen MCPToolResult.content and ChatMessage.as_tool() types.
  3. Update MCPFacade to pass tool result content through without string assumptions.
  4. Add focused MCP/facade tests for text-only, image-only, and mixed text/image results.
  5. Add provider translation tests for Anthropic tool results and OpenAI-compatible passthrough.
  6. Document that MCP image results require a VLM-capable model/provider to be consumed semantically.
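One useful invariant for the tests in steps 4-5: the canonical image_url data URI must round-trip the original MCP image payload byte-for-byte. A self-contained sketch of that assertion (the PNG header bytes are illustrative):

```python
import base64

# The canonical data URI should round-trip the original image bytes exactly.
raw = b"\x89PNG\r\n\x1a\n"  # fake PNG magic bytes, for illustration only
b64 = base64.b64encode(raw).decode("ascii")
uri = f"data:image/png;base64,{b64}"

header, payload = uri.split(",", 1)
assert header == "data:image/png;base64"
assert base64.b64decode(payload) == raw
```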

Dependencies

None known.
