
Escape bare & in Bibcollection#to_xml to prevent invalid XML: https://github.com/metanorma/isodoc/issues/785 #128

Merged
andrew2net merged 2 commits into main from fix-html-escaping-collection-title
May 28, 2026

Conversation

@opoudjis
Contributor

Problem

Bibcollection#to_xml writes the collection title and author directly into XML without escaping. When these values come from YAML (e.g. name: "A test & playground ..."), the output is a bare & character, which is invalid XML.

libxml2 responds with a FATAL xmlParseEntityRef: no name error and enters recovery mode. In recovery mode it silently drops all subsequent &amp; entity references in the same document, so every individual bibitem title that contains &amp; also loses its ampersand in the generated HTML output. This is the bug reported in metanorma/isodoc#785.
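
As a quick illustration of why the bare & matters, the sketch below checks well-formedness with REXML, a strict parser from the Ruby distribution, used here only as a stand-in for Nokogiri/libxml2. REXML has no recovery mode, so it does not reproduce the silent dropping of later entities; it only shows that the unescaped title is rejected while the escaped one is accepted. The helper name well_formed? is hypothetical.

```ruby
require "rexml/document"

# Hypothetical helper: true if the string parses as well-formed XML
# under a strict parser (REXML, standing in for libxml2 without recovery).
def well_formed?(xml)
  REXML::Document.new(xml)
  true
rescue REXML::ParseException
  false
end

unescaped = "<title>A test & playground</title>"      # bare &, as written by the old to_xml
escaped   = "<title>A test &amp; playground</title>"  # what a valid document needs

puts "unescaped well-formed? #{well_formed?(unescaped)}"
puts "escaped well-formed?   #{well_formed?(escaped)}"
```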

Root cause trace

YAML: name: "A test & playground …"
  ↓ to_xml (unescaped)
documents.xml: <title>A test & playground …</title>   ← invalid XML
  ↓ Nokogiri::XML parse (libxml2 recovery mode)
  FATAL: xmlParseEntityRef: no name
  → bare & dropped from collection title
  → ALL subsequent &amp; in document dropped as collateral damage
    → individual doc <title>…<em>Test</em> &amp; <strong>Play</strong>…</title>
       becomes  <em>Test</em>  <strong>Play</strong>  (entity gone, two spaces left)
  ↓ find_html / to_h
  ↓ Liquid template
index.html: "A test  playground …"  and  "<em>Test</em>  <strong>Playground</strong>"

Fix

Add a private xml_escape helper that escapes only unencoded & — not already-encoded entity references (&amp;, &#123;, &#x1f;) — and leaves inline markup tags (<em>, <strong>, etc.) untouched. This means values from YAML are safe, and HTML fragments round-tripped via find_html pass through unchanged.
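
A minimal sketch of what such a helper could look like, assuming a negative-lookahead regex is an acceptable way to tell bare ampersands from existing entity references. The constant name and the exact pattern are assumptions for illustration, not the code merged in this PR.

```ruby
# Assumed pattern: an & that does NOT begin a named (&amp;), decimal (&#123;)
# or hexadecimal (&#x1f;) entity reference.
BARE_AMPERSAND = /&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)/

# Escapes only bare ampersands; tags and existing entities pass through as-is.
def xml_escape(str)
  str.to_s.gsub(BARE_AMPERSAND, "&amp;")
end

puts xml_escape("A test & playground")                        # value straight from YAML
puts xml_escape("<em>Test</em> &amp; <strong>Play</strong>")  # fragment from find_html
puts xml_escape("&#123; &#x1f; &")                            # numeric references plus a bare &
```

Because already-encoded references are skipped, applying the helper twice gives the same result as applying it once, which is what lets round-tripped fragments pass through unchanged. One limitation of this particular pattern: text that merely looks like an entity reference (for example "AT&T;") is left alone.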

Relationship to the earlier find_text → find_html commit

The preceding commit on this branch (8ed16f5) fixed the read path (use inner_html to preserve markup when re-reading the collection title from XML). This commit fixes the write path (escape before inserting into XML). Both are needed: without the write-path fix the XML is never valid, so the read-path fix cannot help.
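
To make the read-path half concrete, the sketch below contrasts reading an element as plain text with reading it as an inner XML fragment. It uses REXML as a stand-in for Nokogiri; plain_text and inner_markup are hypothetical helpers that only approximate what .text and inner_html return.

```ruby
require "rexml/document"

# Hypothetical analogue of Nokogiri's .text: concatenated, unescaped text content.
def plain_text(node)
  node.children.map do |child|
    case child
    when REXML::Text    then child.value        # entities decoded
    when REXML::Element then plain_text(child)  # descend, dropping the tag itself
    else ""
    end
  end.join
end

# Hypothetical analogue of inner_html: child nodes serialized back to XML.
def inner_markup(node)
  node.children.map(&:to_s).join
end

root = REXML::Document.new("<title><em>Test</em> &amp; <strong>Play</strong></title>").root

puts plain_text(root)    # markup lost, entity decoded
puts inner_markup(root)  # markup and entity intact, safe to write back into XML
```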

Verification

After applying the fix, regenerating the metanorma-pdfa test collection produces:

<!-- documents.xml — collection title now valid XML -->
<title>A test &amp; playground PDFa document from YAML</title>
<!-- index.html — collection title -->
<span class="title-first">A test &amp; playground PDFa document from YAML</span>

<!-- index.html — individual doc title, &amp; now preserved -->
<a href="./documents/test-pdfa.html"><em>Test</em> &amp; <strong>Playground</strong> PDFa Document from ADOC <tt>H1</tt></a>

Refs: metanorma/isodoc#785

🤖 Generated with Claude Code

opoudjis and others added 2 commits May 13, 2026 00:05
Switch Bibcollection.from_xml to read the collection title and author
via inner_html instead of Nokogiri's .text, so the in-memory strings
keep their XML-fragment form (markup + entities intact). Apply the
strip_html Liquid filter on the HTML <title> tag position so the
browser tab title stays plain text. Adds find_html to ElementFinder
alongside find_text. Adds a regression spec with markup and &amp; in
both the collection title and the author name.

Refs metanorma/isodoc#785.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

bibcollection.rb to_xml was writing the collection title and author
directly into XML without escaping, producing bare & in the output
when the values came from YAML (e.g. name: "A test & playground ...").
A bare & is invalid XML; libxml2 in recovery mode emits FATAL
"xmlParseEntityRef: no name" and then silently drops all subsequent
&amp; entities in the same document — corrupting every individual
document title's & in the collection index HTML output.

Add a private xml_escape helper that escapes only unencoded & (not
already-encoded &amp;, &#nnn;, &#xhh;) and leaves inline markup tags
(<em>, <strong>, etc.) untouched, so valid HTML fragments round-tripped
via find_html pass through unchanged.

Fixes metanorma/isodoc#785

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@andrew2net
Contributor

@opoudjis, this PR is against Relaton v1. Do you still use v1? Should we reimplement this in Relaton v2?

@andrew2net andrew2net merged commit 4564a10 into main May 28, 2026
14 checks passed
@opoudjis
Contributor Author

@andrew2net Yes, sorry, I am surprised it is against v1. My debugging was against the running relaton-cli in v2, so I don't understand why the PR has landed against v1. Is it because the relaton-cli code is no longer maintained here?

@andrew2net
Contributor

I think it's because the branch was created on top of main. Relaton v2 lives in the lutaml-integration branch.
I'll port the update to v2.
Currently, I'm working on the Relaton monorepo. It will use the main branch for v2.
FYI, I'm going to integrate Pubid v2 in the next minor v2.2 release.

andrew2net added a commit that referenced this pull request May 29, 2026
Port the write-path fix from #128 to v2/lutaml-integration. The read-path
half of #128 (find_html + strip_html) was already ported via #127; this
commit ports the remaining to_xml escaping.

bibcollection.rb to_xml was writing the collection title and author
directly into XML without escaping, producing bare & in the output
when the values came from YAML (e.g. name: "A test & playground ...").
A bare & is invalid XML; libxml2 in recovery mode emits FATAL
"xmlParseEntityRef: no name" and then silently drops all subsequent
&amp; entities in the same document — corrupting every individual
document title's & in the collection index HTML output.

Add a private xml_escape helper that escapes only unencoded & (not
already-encoded &amp;, &#nnn;, &#xhh;) and leaves inline markup tags
(<em>, <strong>, etc.) untouched, so valid HTML fragments round-tripped
via find_html pass through unchanged.

Fixes metanorma/isodoc#785

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@andrew2net andrew2net deleted the fix-html-escaping-collection-title branch May 29, 2026 21:44
@andrew2net
Contributor

Ported the update into v2
