Windows + MCP protocol + format handling: 5 bugs blocking real-world use

Hi — first, thanks for maintaining this fork. I spent some time getting it running on Windows with Claude Code and hit a cluster of bugs that interact. Filing as one umbrella because they overlap (all affect the same startup/IO path or the same dispatch function), with a follow-up PR providing fixes. Happy to split into separate issues/PRs if you prefer.

Environment: Windows 11, Python 3.13, Claude Code as MCP client, markitdown-mcp `main` (commit as of 2026-04-19), `markitdown[all]` installed.

---

### Bug 1 — stdio defaults break the MCP protocol on Windows

Python's text-mode stdio defaults on Windows break line-delimited JSON-RPC in two ways:
- **stdout**: CRLF translation (`\n` → `\r\n`) corrupts the framing
- **stdin**: cp1252 encoding corrupts non-ASCII bytes (e.g. a path containing `ä` arrives as `Ã¤` and the subsequent file operation fails)

Repro:
1. Run as an MCP server under any Windows client (Claude Code, Claude Desktop)
2. Send a `convert_file` request with a file path containing a non-ASCII character (`Jäger`, `Müller`, …)
3. Observe: `Security violation: invalid path` (the decoded path no longer matches the filesystem)

Root cause: `main()` relies on interpreter defaults. On Unix these are typically UTF-8 + LF; on Windows, when stdio is a pipe, they are the ANSI code page (cp1252 on most Western installs) + CRLF.

Proposed fix: reconfigure stdio at the top of `main()`:
```python
sys.stdout.reconfigure(encoding="utf-8", newline="\n")
sys.stdin.reconfigure(encoding="utf-8")
```
No-op on platforms that already default to UTF-8/LF.
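
Both failure modes can be reproduced on any platform by wrapping an in-memory byte stream the way the interpreter wraps the real pipes. This is a minimal sketch, not code from the server: `decode_as` and `write_frame` are hypothetical helpers, and `cp1252` and `newline="\r\n"` are passed explicitly to stand in for the Windows defaults.

```python
import io


def decode_as(raw: bytes, encoding: str) -> str:
    """Decode bytes the way a text-mode stdin with this encoding would."""
    # newline="" disables newline translation so only the encoding matters.
    return io.TextIOWrapper(io.BytesIO(raw), encoding=encoding, newline="").read()


def write_frame(text: str, newline: str) -> bytes:
    """Return the bytes a text-mode stdout with this newline setting emits."""
    buf = io.BytesIO()
    writer = io.TextIOWrapper(buf, encoding="utf-8", newline=newline)
    writer.write(text)
    writer.flush()
    data = buf.getvalue()
    writer.detach()  # keep the wrapper from closing buf when it is collected
    return data


# The client sends the path as UTF-8 bytes.
sent = "C:/Users/Jäger/report.docx".encode("utf-8")

mangled = decode_as(sent, "cp1252")  # simulated Windows default
intact = decode_as(sent, "utf-8")    # after stdin.reconfigure(encoding="utf-8")

frame = '{"jsonrpc":"2.0","id":1,"result":{}}\n'
crlf_bytes = write_frame(frame, "\r\n")  # simulated Windows default
lf_bytes = write_frame(frame, "\n")      # after stdout.reconfigure(newline="\n")
```

Here `mangled` comes out as `C:/Users/JÃ¤ger/report.docx`, a path that does not exist on disk, while `crlf_bytes` ends in `\r\n` where `lf_bytes` ends in a bare `\n`.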

---

### Bug 2 — `Path.home()` can crash server init

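`get_safe_working_directories()` calls `Path.home()` unconditionally. When the home directory cannot be resolved (on Windows: `USERPROFILE` unset and no `HOMEDRIVE`/`HOMEPATH` fallback; on Unix: `HOME` unset and no password-database entry for the user), `Path.home()` raises `RuntimeError`, and the whole `MarkItDownMCPServer.__init__` aborts with an opaque traceback before the server ever processes a request.

Repro: launch the server with a cleared environment. On Windows, Claude Code currently spawns stdio MCP servers with an effectively empty `env`, which is enough to trigger it. On Unix, `env -i python -m markitdown_mcp.server` only reproduces it when the user also has no password-database entry, because Python falls back to `pwd` when `HOME` is unset.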

Proposed fix: wrap the call in try/except, log a warning, skip the home-subdir additions:
```python
try:
    home = Path.home()
except RuntimeError:
    logger.warning("Could not determine user home; skipping home subdirs")
    home = None
```
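
To show how the rest of the function might consume that fallback, here is a self-contained sketch. The helper name `home_subdirs`, the `home_getter` parameter, and the `Documents`/`Downloads` subdirectory names are assumptions made for illustration; the real `get_safe_working_directories()` may be structured differently.

```python
import logging
from pathlib import Path
from typing import Callable

logger = logging.getLogger("markitdown_mcp")


def home_subdirs(
    names: tuple[str, ...] = ("Documents", "Downloads"),  # hypothetical subdirs
    home_getter: Callable[[], Path] = Path.home,
) -> list[Path]:
    """Return home-relative safe directories, or [] when home is unknown."""
    try:
        home = home_getter()
    except RuntimeError:
        logger.warning("Could not determine user home; skipping home subdirs")
        return []
    return [home / name for name in names]


def _no_home() -> Path:
    # Simulates Path.home() failing in a stripped environment.
    raise RuntimeError("Could not determine home directory.")


without_home = home_subdirs(home_getter=_no_home)
with_home = home_subdirs(home_getter=lambda: Path("/home/alice"))
```

Injecting the getter keeps the failure path testable without touching the process environment.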

---

### Bug 3 — Server replies to notifications ([JSON-RPC 2.0 §4.1](https://www.jsonrpc.org/specification#notification) violation)

The message dispatch in `MarkItDownMCPServer.run()` builds a response for every incoming message, including notifications (messages without an `id`). [JSON-RPC 2.0 §4.1 "Notification"](https://www.jsonrpc.org/specification#notification) states: *"The Server MUST NOT reply to a Notification, including those that are within a batch request."* MCP uses `notifications/initialized` during the handshake, so this breaks any strict MCP client.

Repro:
1. Send `{"jsonrpc":"2.0","method":"notifications/initialized","params":{}}` (no `id`)
2. Observe a response with fabricated `id: "unknown"`

Current code:
```python
request = MCPRequest(
    id=message.get("id", "unknown"),  # fake id invented here
    ...
)
response = await self.handle_request(request)
# ... always writes a response
```

Proposed fix:
```python
is_notification = "id" not in message
request = MCPRequest(id=message.get("id"), ...)
response = await self.handle_request(request)
if is_notification:
    continue
# else write response
```

(`MCPRequest.id` / `MCPResponse.id` type needs to become `str | int | None` — JSON-RPC allows numeric and null ids too.)
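
A standalone sketch of the same rule, assuming a hypothetical `handle_message` function in place of the server's `run()` loop and a stand-in result in place of the real handler. It also illustrates why the test is `"id" not in message` and not `message.get("id") is None`: an explicit `"id": null` is still a request.

```python
from typing import Any, Optional, Union

JsonRpcId = Union[str, int, None]


def handle_message(message: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return the response to write, or None when nothing must be sent."""
    # A notification is a message with no "id" member at all.
    # An explicit "id": null is still a request and still gets a reply.
    is_notification = "id" not in message

    # Stand-in for the real handler; it runs for notifications too.
    result = {"handled": message.get("method")}

    if is_notification:
        return None
    request_id: JsonRpcId = message["id"]
    return {"jsonrpc": "2.0", "id": request_id, "result": result}


notification = {"jsonrpc": "2.0", "method": "notifications/initialized", "params": {}}
request = {"jsonrpc": "2.0", "id": 7, "method": "tools/list"}

no_reply = handle_message(notification)  # nothing is written for this one
reply = handle_message(request)          # echoes the numeric id back
```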

---

### Bug 4 — `anyOf` at top of `inputSchema` rejects the `convert_file` tool on the Anthropic API

The `convert_file` tool schema uses `anyOf` at the top level of `inputSchema` to express "either `file_path` OR `file_content`+`filename`". The Anthropic Messages API (and thus Claude Code / Claude Desktop) rejects this with:

> `input_schema does not support oneOf, allOf, or anyOf at the top level`

As a result, the tool silently fails to load for any Anthropic-based client.

Proposed fix: drop `anyOf`, leave required-field enforcement to the runtime (the handler already validates), and clarify the either/or rule in the tool description:

```python
"description": (
    "Convert a file to Markdown using MarkItDown. "
    "Provide either 'file_path' OR both 'file_content' (base64) and 'filename'."
),
"inputSchema": {
    "type": "object",
    "properties": { "file_path": {...}, "file_content": {...}, "filename": {...} },
    # no anyOf, no required
},
```

Note: the same schema is duplicated in `get_tools()` and inline in `handle_request()`'s `tools/list` branch — both need the fix. (Separately: consider having the inline branch call `self.get_tools()` so the duplication goes away.)
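
With `anyOf` gone, the either/or rule lives only in the handler. The sketch below shows one way to express it, plus a small guard that could run in a test to catch top-level combinators creeping back in. Both function names, the shape of `old_schema`, and the `"type": "string"` property definitions are assumptions for illustration, not copied from the repository.

```python
from typing import Any, Optional

COMBINATORS = ("anyOf", "oneOf", "allOf")


def top_level_combinators(schema: dict[str, Any]) -> list[str]:
    """List the combinator keywords used at the top level of a schema."""
    return [key for key in COMBINATORS if key in schema]


def check_convert_file_args(args: dict[str, Any]) -> Optional[str]:
    """Return an error message, or None when the arguments are acceptable."""
    has_path = bool(args.get("file_path"))
    has_inline = bool(args.get("file_content")) and bool(args.get("filename"))
    if has_path or has_inline:
        return None
    return "Provide either 'file_path' OR both 'file_content' and 'filename'."


# Assumed shape of the schema before the fix.
old_schema = {
    "type": "object",
    "properties": {"file_path": {"type": "string"}},
    "anyOf": [{"required": ["file_path"]}, {"required": ["file_content", "filename"]}],
}

# Flat schema after the fix: no anyOf, no required.
new_schema = {
    "type": "object",
    "properties": {
        "file_path": {"type": "string"},
        "file_content": {"type": "string"},
        "filename": {"type": "string"},
    },
}

missing_name = check_convert_file_args({"file_content": "aGVsbG8="})
path_only = check_convert_file_args({"file_path": "report.docx"})
```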

---

### Bug 5 — `"xml" in mime_type` falsely matches `openxmlformats` → docx/xlsx/pptx broken

`validate_file_content_security()` dispatches to `validate_xml_security()` based on:
```python
if (mime_type and "xml" in mime_type) or file_ext in [".xml", ".xhtml"]:
```

The MIME type for `.docx` is `application/vnd.openxmlformats-officedocument.wordprocessingml.document`, which contains the substring `"xml"`. The file is then opened in text mode with `errors="ignore"`, scanned for XML entity patterns, and written back as a "sanitized" `.xml` file. MarkItDown therefore receives a lossy text rendering of what started life as a ZIP container, and the result is ~400 KB of garbled ZIP bytes instead of Markdown.

Same failure mode applies to `.xlsx` (`…spreadsheetml.sheet`) and `.pptx` (`…presentationml.presentation`) — every Office OpenXML format.
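
The damage from the text-mode round trip can be shown with a tiny in-memory archive. This is a sketch under assumptions: it decodes as UTF-8 (the issue only confirms `errors="ignore"`), and the member name and payload are made up so that the archive is guaranteed to contain bytes that are invalid in UTF-8.

```python
import io
import zipfile

# Build a minimal ZIP container, standing in for a .docx file.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as archive:
    # 0xFF and 0xFE can never appear in valid UTF-8.
    archive.writestr("word/document.xml", b"<w:document/>\xff\xfe")
raw = buf.getvalue()

# What reading in text mode with errors="ignore" and writing back amounts to.
as_text = raw.decode("utf-8", errors="ignore")
written_back = as_text.encode("utf-8")

lost = len(raw) - len(written_back)  # bytes silently dropped
```

The ASCII `PK` signature survives, so the output still looks like a ZIP file at first glance, but every byte that was not valid UTF-8 has been dropped and the container no longer matches its own offsets and checksums.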

The `json`/`csv` branches below have the same substring anti-pattern; less explosive in practice but worth fixing for consistency.

Proposed fix: exact MIME matching via module-level sets:
```python
_XML_MIME_TYPES = {"text/xml", "application/xml"}
_JSON_MIME_TYPES = {"application/json", "text/json"}
_CSV_MIME_TYPES = {"text/csv", "application/csv"}
# ...
if (mime_type in _XML_MIME_TYPES) or file_ext in [".xml", ".xhtml"]:
    return validate_xml_security(file_path)
# same for json/csv
```
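
A quick comparison of the two predicates on the `.docx` MIME type. The function names are hypothetical, and stripping parameters such as `; charset=utf-8` before the lookup is an extra assumption on top of the proposed fix, in case the detected type ever carries them.

```python
from typing import Optional

DOCX_MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
_XML_MIME_TYPES = frozenset({"text/xml", "application/xml"})


def looks_like_xml_substring(mime_type: Optional[str]) -> bool:
    """Current behaviour: any MIME type containing 'xml' is treated as XML."""
    return bool(mime_type) and "xml" in mime_type


def is_xml_mime(mime_type: Optional[str]) -> bool:
    """Proposed behaviour: exact match against a fixed set."""
    if not mime_type:
        return False
    # Drop parameters and normalise case before the set lookup.
    essence = mime_type.split(";", 1)[0].strip().lower()
    return essence in _XML_MIME_TYPES


false_positive = looks_like_xml_substring(DOCX_MIME)  # docx misrouted to XML path
fixed = is_xml_mime(DOCX_MIME)                        # docx left alone
```

Exact matching also stops matching types such as `application/xhtml+xml` and `image/svg+xml`. `.xhtml` is still caught by the extension check; whether the other `+xml` types belong in the set is a call for the maintainers.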

---

### PR

All five fixes are implemented locally and verified on Windows 11 (Claude Code, Python 3.13, `markitdown[all]`) with test files covering docx/xlsx/pdf, non-ASCII paths, and safe-dir-rejected paths. I'll open a PR referencing this issue with one commit per bug so each change can be reviewed in isolation. A separate feature-request issue will cover a configurable safe-directory env var (not filed here because it's not a bug).

Happy to split this into five issues if you'd rather have them tracked individually.

