feat(mcp): add ooxml_package_part for OPC part metadata by caio-pizzol · Pull Request #7 · superdoc-dev/ooxml-dev

caio-pizzol · 2026-05-12T11:40:20Z

The XSD schema graph and the prose corpus don't answer "what kind of OPC part is /customXml/item1.xml?" That's package metadata: content type, source relationship type, root namespace, typical path. Agents working with .docx / .xlsx / .pptx packages need it constantly and currently have to reconstruct it from prose search.

Adds ooxml_package_part backed by a curated static dataset of 25 OPC part types in apps/mcp-server/src/opc-parts.ts. Covers Word (document, styles, settings, numbering, comments, footnotes, endnotes, header, footer), Excel (workbook, worksheet, shared strings), PowerPoint (presentation, slide, slide layout, slide master), and cross-cutting (core / extended / custom properties, theme, image, custom XML data storage and its properties part).

Four lookup modes: exact content_type, exact relationship_type, query substring, or no args → list-all. Where the spec prose and XSD target namespace disagree (custom XML data storage properties part is named .../customXmlDataProps in §15.2.6 but the XSD targets .../customXml), rootNamespace pins the XSD URI so the value composes cleanly with ooxml_element.
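A minimal sketch of how the four lookup modes could dispatch. Only the argument names (content_type, relationship_type, query) and the rootNamespace field come from this PR. The `OpcPart` shape, the `lookupParts` helper, the three-entry dataset, and the precedence between arguments are assumptions for illustration, not the shipped code.

```typescript
// Hypothetical record shape; the real one lives in apps/mcp-server/src/opc-parts.ts.
interface OpcPart {
  name: string;
  contentType: string;
  relationshipType: string;
  rootNamespace: string;
}

interface LookupArgs {
  content_type?: string;
  relationship_type?: string;
  query?: string;
}

const REL = "http://schemas.openxmlformats.org/officeDocument/2006/relationships";

// Three illustrative entries, not the curated set of 25.
const PARTS: OpcPart[] = [
  {
    name: "Styles Part",
    contentType: "application/vnd.openxmlformats-officedocument.wordprocessingml.styles+xml",
    relationshipType: `${REL}/styles`,
    rootNamespace: "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
  },
  {
    name: "Worksheet Part",
    contentType: "application/vnd.openxmlformats-officedocument.spreadsheetml.worksheet+xml",
    relationshipType: `${REL}/worksheet`,
    rootNamespace: "http://schemas.openxmlformats.org/spreadsheetml/2006/main",
  },
  {
    name: "Slide Part",
    contentType: "application/vnd.openxmlformats-officedocument.presentationml.slide+xml",
    relationshipType: `${REL}/slide`,
    rootNamespace: "http://schemas.openxmlformats.org/presentationml/2006/main",
  },
];

// Assumed precedence: content_type, then relationship_type, then query, then list-all.
function lookupParts(args: LookupArgs, parts: OpcPart[] = PARTS): OpcPart[] {
  if (args.content_type !== undefined) {
    return parts.filter((p) => p.contentType === args.content_type);
  }
  if (args.relationship_type !== undefined) {
    return parts.filter((p) => p.relationshipType === args.relationship_type);
  }
  if (args.query !== undefined) {
    // Case-insensitive substring match over the human-facing fields.
    const q = args.query.toLowerCase();
    return parts.filter((p) =>
      [p.name, p.contentType, p.relationshipType].some((s) => s.toLowerCase().includes(q)),
    );
  }
  return parts; // no args: list everything
}
```

Returning an array from every mode keeps the caller's handling uniform whether a lookup matches zero, one, or several parts.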

Static typed data, no DB. The set is small, static across ECMA editions, and curated; the PR diff is the audit primitive. Adding a new entry is appending to OPC_PARTS — the lookup index rebuilds lazily on first access.
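The "static array plus lazily built index" pattern described above can be sketched roughly as follows. `OPC_PARTS` is named in this PR; the trimmed `OpcPart` shape, the `contentTypeIndex` and `findByContentType` helpers, and the `buildCount` counter (present only to make the laziness observable) are hypothetical.

```typescript
// Trimmed, hypothetical record shape for illustration.
interface OpcPart {
  name: string;
  contentType: string;
}

// Adding a part type is just appending another literal here.
const OPC_PARTS: readonly OpcPart[] = [
  { name: "Core Properties Part", contentType: "application/vnd.openxmlformats-package.core-properties+xml" },
  { name: "Theme Part", contentType: "application/vnd.openxmlformats-officedocument.theme+xml" },
];

let byContentType: Map<string, OpcPart> | undefined;
let buildCount = 0; // illustration only: counts how often the index is built

// Builds the index on first call and reuses it afterwards.
function contentTypeIndex(): Map<string, OpcPart> {
  if (byContentType === undefined) {
    buildCount += 1;
    byContentType = new Map(OPC_PARTS.map((p): [string, OpcPart] => [p.contentType, p]));
  }
  return byContentType;
}

function findByContentType(contentType: string): OpcPart | undefined {
  return contentTypeIndex().get(contentType);
}
```

Because the data is a module-level constant, nothing is indexed at import time; the cost is paid once, on the first lookup.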

Hyperlinks are intentionally out of scope: they're a relationship type, not a package part. If needed later they'd warrant a different model.

Review: confirm the curated set covers your common cases; flag any wrong content type / relationship URI / namespace pins (these were transcribed from Part 1 §11.3.x / §12.3.x / §13.3.x / §15.x). Ignore the rest of ooxml-tools.ts — additive only.

Verified: 71 pass / 3 skip / 0 fail. Format / lint / typecheck / build all clean. (The 3 skips are the xsd-cache-gated smoke tests in tests/ingest-xsd/, unrelated to this PR.)

The XSD schema graph answers "what's legal inside this XML body?" The prose corpus answers "what does this spec section say?" Neither answers "what kind of OPC part is /customXml/item1.xml?" That's a package-level concern: content type, source relationship type, root namespace, typical path. Agents working with .docx / .xlsx / .pptx packages reach for this constantly and have nowhere structural to land.

Adds `ooxml_package_part` backed by a curated static dataset of 25 OPC part types from ECMA-376 Part 1 §11.3.x (WML), §12.3.x (SML), §13.3.x (PML), §14.2.7.10 (theme), and §15.x (cross-cutting). Word covers document, styles, settings, numbering, comments, footnotes, endnotes, header, footer; Excel covers workbook, worksheet, shared strings; PowerPoint covers presentation, slide, slide layout, slide master; cross-cutting covers core / extended / custom properties, theme, image, custom XML data storage, custom XML data storage properties.

Four lookup modes: exact content_type, exact relationship_type, query substring, or no args → list-all. Where the spec prose and the XSD target namespace disagree (the custom XML data storage properties part is named .../customXmlDataProps in §15.2.6 but the shipped XSD targets .../customXml), rootNamespace pins the XSD URI so the value composes cleanly with ooxml_element.

Static typed data in apps/mcp-server/src/opc-parts.ts, no DB. The set is small, static across ECMA editions, and curated; the PR diff is the audit primitive. Add a new entry by appending to OPC_PARTS; the lookup index rebuilds lazily.

Tests cover dataset consistency (unique keys, non-empty required fields, every family represented), exact and substring lookups, and the four tool dispatch modes. No DB needed for any of them.

Three issues from PR review:

- relationship_type lookup collapsed shared rels. The .../relationships/officeDocument URI points at the main part for WML, SML, and PML, but the Map<string, OpcPart> index let later entries overwrite earlier ones, so a lookup returned only the Presentation part. Index is now Map<string, OpcPart[]>; the dispatcher renders multi-match as a list with a note that the relationship is shared across families and the caller has to disambiguate by the source part.
- Image content type was a wildcard display string. Real [Content_Types].xml entries record a specific media type per image (image/png, image/jpeg, ...) so an exact lookup against the display string never matched. contentType is now `string | string[]`; the Image Part enumerates the spec-§15.2.13 set (png, jpeg, gif, tiff, x-emf, x-wmf, bmp). Each entry is indexed; the formatter renders multi-content-type records under a plural label with a "+N more" indicator in the list view.
- The initialize handler and apps/mcp-server/README.md still advertised two tool families and omitted ooxml_package_part, hurting agent discoverability. Both updated to list three tool families and describe the package-metadata corpus.

New tests cover (a) every enumerated image media type resolving exactly, (b) the shared officeDocument relationship returning all three main parts, and (c) the tool's multi-match rendering for shared rels. Existing tests updated for the new helper name / array contract.
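The two index fixes above can be sketched together: every key maps to an array of parts, and a part with several content types is indexed once per type. The `Map<string, OpcPart[]>` and `string | string[]` types are from this commit; `buildIndexes`, `pushInto`, the part names, and the four-entry dataset are hypothetical stand-ins.

```typescript
// Trimmed, hypothetical record shape; contentType follows the commit's `string | string[]`.
interface OpcPart {
  name: string;
  contentType: string | string[];
  relationshipType: string;
}

const REL = "http://schemas.openxmlformats.org/officeDocument/2006/relationships";

const SAMPLE_PARTS: readonly OpcPart[] = [
  {
    name: "Main Document Part",
    contentType: "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml",
    relationshipType: `${REL}/officeDocument`,
  },
  {
    name: "Workbook Part",
    contentType: "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml",
    relationshipType: `${REL}/officeDocument`,
  },
  {
    name: "Presentation Part",
    contentType: "application/vnd.openxmlformats-officedocument.presentationml.presentation.main+xml",
    relationshipType: `${REL}/officeDocument`,
  },
  {
    name: "Image Part",
    contentType: ["image/png", "image/jpeg", "image/gif", "image/tiff", "image/x-emf", "image/x-wmf", "image/bmp"],
    relationshipType: `${REL}/image`,
  },
];

// Append to the bucket for `key` instead of overwriting it.
function pushInto(index: Map<string, OpcPart[]>, key: string, part: OpcPart): void {
  const bucket = index.get(key);
  if (bucket) {
    bucket.push(part);
  } else {
    index.set(key, [part]);
  }
}

function buildIndexes(parts: readonly OpcPart[]) {
  const byContentType = new Map<string, OpcPart[]>();
  const byRelationshipType = new Map<string, OpcPart[]>();
  for (const part of parts) {
    // Normalize the single-string case so each media type gets its own key.
    const types = Array.isArray(part.contentType) ? part.contentType : [part.contentType];
    for (const t of types) pushInto(byContentType, t, part);
    pushInto(byRelationshipType, part.relationshipType, part);
  }
  return { byContentType, byRelationshipType };
}
```

With this shape, a lookup on the shared officeDocument relationship yields all three main parts in dataset order, and the caller disambiguates by the source part.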

The previous fix updated the server README and initialize text but missed the web-facing surfaces, which still advertised two tool families. Bringing every surface in sync:

- apps/web/src/pages/Mcp.tsx: hero copy updated, added a Package metadata section, refreshed the trailing "what is MCP" paragraph.
- apps/web/public/llms.txt: feeds llms.txt/llms-full.txt that AI crawlers and the build-time SEO pipeline consume.
- apps/mcp-server/src/index.ts: header comment in the worker entry.
- README.md + CLAUDE.md: project-level docs.
- brand.md: brand-voice copy that lists the MCP as an AI-native differentiator.

No behavior change; everything in this commit is documentation / agent-discoverability surface.

caiopizzol added 3 commits May 12, 2026 08:39

caio-pizzol merged commit 47b61c9 into main May 12, 2026
1 check passed

caio-pizzol merged 3 commits into main from caio/ooxml-package-part
