Copilot AI commented Dec 5, 2025

Binary data properties (BlobResourceContents.Blob, ImageContentBlock.Data, AudioContentBlock.Data) were exposed as string, so the UTF-8 wire payload was transcoded to UTF-16 during deserialization and then had to be base64-decoded by the caller on every use.

Changes

Protocol types now store base64-encoded binary data as ReadOnlyMemory<byte>:

  • BlobResourceContents.Blob: Zero-copy UTF-8 representation with lazy-decoded Data property
  • ImageContentBlock.Data: Zero-copy UTF-8 representation with lazy-decoded DecodedData property
  • AudioContentBlock.Data: Zero-copy UTF-8 representation with lazy-decoded DecodedData property
  • Setting any of these properties invalidates the cached decoded value
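
The lazy-decode-and-cache pattern described in the list above can be sketched roughly as follows. This is a minimal illustration, not the SDK's actual implementation: `Base64Payload` is a hypothetical stand-in for types such as ImageContentBlock, and the sketch assumes single-threaded access.

```csharp
using System;
using System.Buffers;
using System.Buffers.Text;

// Hypothetical stand-in for a protocol type; not the SDK's actual code.
public sealed class Base64Payload
{
    private ReadOnlyMemory<byte> _data; // base64 text as UTF-8 bytes (the wire form)
    private byte[]? _decoded;           // cached decoded bytes; null until first access

    public ReadOnlyMemory<byte> Data
    {
        get => _data;
        set
        {
            _data = value;
            _decoded = null; // changing the wire form invalidates the cache
        }
    }

    public ReadOnlyMemory<byte> DecodedData
    {
        get
        {
            if (_decoded is null)
            {
                // Decode straight from UTF-8 bytes; no intermediate string is created.
                byte[] buffer = new byte[Base64.GetMaxDecodedFromUtf8Length(_data.Length)];
                OperationStatus status =
                    Base64.DecodeFromUtf8(_data.Span, buffer, out _, out int written);
                if (status != OperationStatus.Done)
                {
                    throw new FormatException("Data is not valid base64.");
                }
                // Trim to the number of bytes actually written (padding shortens the output).
                _decoded = buffer.AsSpan(0, written).ToArray();
            }
            return _decoded;
        }
    }
}
```

With this shape, repeated reads of DecodedData return the same cached array until Data is assigned again.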

JSON converters read UTF-8 bytes directly from Utf8JsonReader.ValueSpan/ValueSequence and write them back out as UTF-8, without intermediate string allocation.
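
As a rough sketch of the read side, a converter could pull the string token's UTF-8 bytes out of the reader as shown here. The helper name `Base64Utf8Reader.ReadUtf8Value` is hypothetical, and the fallback for escaped values is an assumption about how such a converter might handle them, not a description of the PR's code. The bytes are copied once because the reader's buffer is not owned by the caller.

```csharp
using System;
using System.Buffers;
using System.Text;
using System.Text.Json;

// Hypothetical helper; not part of the SDK.
public static class Base64Utf8Reader
{
    public static ReadOnlyMemory<byte> ReadUtf8Value(ref Utf8JsonReader reader)
    {
        if (reader.TokenType != JsonTokenType.String)
        {
            throw new JsonException("Expected a JSON string.");
        }

        if (reader.ValueIsEscaped)
        {
            // Rare for base64, but '/' may arrive as "\/"; fall back to unescaping via string.
            return Encoding.UTF8.GetBytes(reader.GetString()!);
        }

        // Copy the raw UTF-8 bytes; the value may span multiple buffer segments.
        return reader.HasValueSequence
            ? reader.ValueSequence.ToArray()
            : reader.ValueSpan.ToArray();
    }
}
```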

API Impact

Before:

var image = new ImageContentBlock 
{ 
    Data = Convert.ToBase64String(imageBytes), // string
    MimeType = "image/png" 
};
byte[] decoded = Convert.FromBase64String(image.Data); // manual decode

After:

var image = new ImageContentBlock 
{ 
    Data = Encoding.UTF8.GetBytes(Convert.ToBase64String(imageBytes)), // UTF-8 bytes
    MimeType = "image/png" 
};
byte[] decoded = image.DecodedData.ToArray(); // lazy decode + cache
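
The "After" sample still creates a temporary string through Convert.ToBase64String. If that allocation matters, the raw bytes can be encoded straight to UTF-8 with System.Buffers.Text.Base64. The helper below is a hypothetical sketch (`Base64Utf8.Encode` is not an SDK API):

```csharp
using System;
using System.Buffers;
using System.Buffers.Text;

// Hypothetical helper; not part of the SDK.
public static class Base64Utf8
{
    public static byte[] Encode(ReadOnlySpan<byte> raw)
    {
        // For padded base64 this length is the exact output size.
        byte[] utf8 = new byte[Base64.GetMaxEncodedToUtf8Length(raw.Length)];
        OperationStatus status = Base64.EncodeToUtf8(raw, utf8, out _, out int written);
        if (status != OperationStatus.Done)
        {
            throw new InvalidOperationException("Base64 encoding did not complete.");
        }
        return written == utf8.Length ? utf8 : utf8.AsSpan(0, written).ToArray();
    }
}
```

Assuming the Data property accepts ReadOnlyMemory<byte> as described above, the result could be assigned directly, for example `Data = Base64Utf8.Encode(imageBytes)`.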

Wire format unchanged. Converters handle UTF-8 directly during JSON serialization/deserialization.

Original prompt

This section details the original issue you should resolve

<issue_title>Content types that represent binary data are exposing UTF16 encoded strings</issue_title>
<issue_description>

/// <summary>
/// Gets or sets the base64-encoded string representing the binary data of the item.
/// </summary>
[JsonPropertyName("blob")]
public required string Blob { get; set; }

/// <summary>
/// Gets or sets the base64-encoded image data.
/// </summary>
[JsonPropertyName("data")]
public required string Data { get; set; }

/// <summary>
/// Gets or sets the base64-encoded audio data.
/// </summary>
[JsonPropertyName("data")]
public required string Data { get; set; }

Potentially more -- https://github.com/search?q=repo%3Amodelcontextprotocol%2Fcsharp-sdk%20%22base64-encoded%22&type=code

Doing this means that the UTF8 data that comes across the wire is transcoded to UTF16 just to be then base64 decoded to bytes when anyone uses it. The intermediate step to UTF16 (.NET string) is unnecessary, wastes cycles, creates garbage, and makes the API harder to use.

Instead these types should keep the data in UTF8 and then have lazy getters that will do the base64-decoding. We can tell folks to use those decoded getter properties.

https://github.com/dotnet/extensions/blob/main/src/Libraries/Microsoft.Extensions.AI.Abstractions/Contents/DataContent.cs follows a similar pattern, though it uses the Uri as its serialized representation. It seems these MCP types should use the UTF8 encoded data as their serialized representation.

The OpenAI SDK exposes a BinaryData as its representation to deal with this problem (https://github.com/openai/openai-dotnet/blob/636075222205d626a770acb1af1e5cc060abd517/src/Custom/Images/GeneratedImage.cs#L23); however, that's slightly less usable since it forces the caller to decode.

The proposal here would be to have these be instead:

/// <summary> 
/// Gets or sets the base64-encoded UTF-8 bytes representing the binary data of the item.
/// </summary> 
/// <remarks>
/// This is a zero-copy representation of the wire payload of this item.  Setting this value will invalidate any cached value of <see cref="Data" />.
/// </remarks>
[JsonPropertyName("blob")] 
public required ReadOnlyMemory<byte> Blob { get; set; } 

/// <summary> 
/// Gets the decoded data represented by <see cref="Blob" />.
/// </summary>
/// <remarks>
/// Accessing this member will decode the value in <see cref="Blob" /> and cache the result.
/// </remarks>
[JsonIgnore]
public byte[] Data { get; }

</issue_description>

<agent_instructions>Please find any protocol types that expose binary data as string (typically base64 encoded) and follow the cited recommendations for how to expose them differently.</agent_instructions>

Comments on the Issue (you are @copilot in this section)

@stephentoub > The proposal here would be to have these be instead:

Should Data also be a ReadOnlyMemory<byte>?

Is Data going to cache the decoded data lazily, or is it going to produce a new array on each access? If the latter, it shouldn't be a property.

what do you think about the name Blob

I'm torn. On the one hand, I agree with you. On the other hand, it's nice to keep it the same or super close to the spec. Offline you'd proposed something like Base64Blob; is that still your first choice?</comment_new>


Copilot AI and others added 3 commits December 5, 2025 23:00
Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix binary data types exposing UTF16 encoded strings" to "Use ReadOnlyMemory<byte> for binary data to eliminate UTF-16 transcoding" Dec 5, 2025
Copilot AI requested a review from stephentoub December 5, 2025 23:28
Copilot finished work on behalf of stephentoub December 5, 2025 23:28
Development

Successfully merging this pull request may close these issues.

Content types that represent binary data are exposing UTF16 encoded strings

2 participants