Escape bare & in Bibcollection#to_xml to prevent invalid XML: https://github.com/metanorma/isodoc/issues/785 by opoudjis · Pull Request #128 · relaton/relaton-cli

opoudjis · 2026-05-28T15:06:27Z

Problem

Bibcollection#to_xml writes the collection title and author directly into XML without escaping. When these values come from YAML (e.g. name: "A test & playground ..."), the output is a bare & character, which is invalid XML.

libxml2 responds with a FATAL xmlParseEntityRef: no name error and enters recovery mode. In recovery mode it silently drops all subsequent &amp; entity references in the same document, so every individual bibitem title that contains &amp; also loses it in the generated HTML output. This is what surfaces as the bug reported in metanorma/isodoc#785.

Root cause trace

YAML: name: "A test & playground …"
  ↓ to_xml (unescaped)
documents.xml: <title>A test & playground …</title>   ← invalid XML
  ↓ Nokogiri::XML parse (libxml2 recovery mode)
  FATAL: xmlParseEntityRef: no name
  → bare & dropped from collection title
  → ALL subsequent &amp; in document dropped as collateral damage
    → individual doc <title>…<em>Test</em> &amp; <strong>Play</strong>…</title>
       becomes  <em>Test</em>  <strong>Play</strong>  (entity gone, two spaces left)
  ↓ find_html / to_h
  ↓ Liquid template
index.html: "A test  playground …"  and  "<em>Test</em>  <strong>Playground</strong>"

Fix

Add a private xml_escape helper that escapes only unencoded & and skips already-encoded entity references (&amp;, &#nnn;, &#xhh;). It leaves inline markup tags (<em>, <strong>, etc.) untouched. This means values from YAML are safe, and HTML fragments round-tripped via find_html pass through unchanged.
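A minimal sketch of what such a helper could look like, assuming a single regex with a negative lookahead is enough. This is an illustration only, not the code merged in this PR; the constant name BARE_AMPERSAND and the exact pattern are assumptions.

```ruby
# Hypothetical sketch, not the merged implementation.
# Matches an "&" that does NOT already start an entity reference:
#   named (&amp;), decimal (&#123;) or hexadecimal (&#x7B;).
BARE_AMPERSAND = /&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)/

def xml_escape(text)
  # Only "&" is rewritten; "<" and ">" are left alone so inline markup survives.
  text.to_s.gsub(BARE_AMPERSAND, "&amp;")
end

puts xml_escape("A test & playground")
puts xml_escape("<em>Test</em> &amp; <strong>Play</strong>")
```

Under these assumptions the helper is idempotent, so a value that has already been escaped (or re-read through find_html) is not double-escaped. One limitation of this sketch: anything that merely looks like a named entity, such as the "&T;" in "AT&T;", is left untouched even though it is not a defined entity.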

Relationship to the earlier `find_text → find_html` commit

The preceding commit on this branch (8ed16f5) fixed the read path (use inner_html to preserve markup when re-reading the collection title from XML). This commit fixes the write path (escape before inserting into XML). Both are needed: without the write-path fix the XML is never valid, so the read-path fix cannot help.

Verification

After applying the fix, regenerating the metanorma-pdfa test collection produces:

<!-- documents.xml — collection title now valid XML -->
<title>A test &amp; playground PDFa document from YAML</title>

<!-- index.html — collection title -->
<span class="title-first">A test &amp; playground PDFa document from YAML</span>

<!-- index.html — individual doc title, &amp; now preserved -->
<a href="./documents/test-pdfa.html"><em>Test</em> &amp; <strong>Playground</strong> PDFa Document from ADOC <tt>H1</tt></a>
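The same check can be made mechanically, for example in a regression spec. The sketch below is an assumption-laden illustration: the helper name bare_ampersand_offsets is hypothetical and is not part of relaton-cli. It scans a serialized string and reports the position of every & that does not start an entity reference.

```ruby
# Hypothetical verification helper, not part of relaton-cli.
# An "&" counts as bare when no entity reference (named, decimal or hex) follows it.
BARE_AMP_PATTERN = /&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)/

def bare_ampersand_offsets(xml)
  offsets = []
  # String#scan with a block sets the last match, which gives us the offset.
  xml.scan(BARE_AMP_PATTERN) { offsets << Regexp.last_match.begin(0) }
  offsets
end

before = "<title>A test & playground PDFa document from YAML</title>"
after  = "<title>A test &amp; playground PDFa document from YAML</title>"

p bare_ampersand_offsets(before)
p bare_ampersand_offsets(after)
```

An empty result for the regenerated documents.xml is a necessary condition for well-formed XML, though not a sufficient one; parsing the file with a strict XML parser remains the stronger check.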

Refs: metanorma/isodoc#785

🤖 Generated with Claude Code

Switch Bibcollection.from_xml to read the collection title and author via inner_html instead of Nokogiri's .text, so the in-memory strings keep their XML-fragment form (markup + entities intact). Apply the strip_html Liquid filter at the HTML <title> tag position so the browser tab title stays plain text.

Adds find_html to ElementFinder alongside find_text. Adds a regression spec with markup and & in both the collection title and the author name.

Refs metanorma/isodoc#785.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

bibcollection.rb to_xml was writing the collection title and author directly into XML without escaping, producing bare & in the output when the values came from YAML (e.g. name: "A test & playground ..."). A bare & is invalid XML; libxml2 in recovery mode emits FATAL "xmlParseEntityRef: no name" and then silently drops all subsequent &amp; entities in the same document, corrupting the &amp; in every individual document title in the collection index HTML output.

Add a private xml_escape helper that escapes only unencoded & (not already-encoded &amp;, &#nnn;, &#xhh;) and leaves inline markup tags (<em>, <strong>, etc.) untouched, so valid HTML fragments round-tripped via find_html pass through unchanged.

Fixes metanorma/isodoc#785

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

andrew2net · 2026-05-28T17:33:39Z

@opoudjis, this PR is against Relaton v1. Do you still use v1? Should we reimplement this in Relaton v2?

opoudjis · 2026-05-29T11:36:55Z

@andrew2net Yes, sorry, I am surprised it is against v1. My debugging was against the running relaton-cli in v2, so I don't understand why the PR has landed against v1. Is it because the relaton-cli code is no longer maintained here?

andrew2net · 2026-05-29T18:29:51Z

I think it's because the branch was created on top of the main. Relaton v2 lives in the lutaml-integration branch.
I'll port the update to v2.
Currently, I'm working on the Relaton monorepo. It will use the main branch for v2.
FYI, I'm going to integrate Pubid v2 in the next minor v2.2 release.

Port the write-path fix from #128 to v2/lutaml-integration. The read-path half of #128 (find_html + strip_html) was already ported via #127; this commit ports the remaining to_xml escaping.

bibcollection.rb to_xml was writing the collection title and author directly into XML without escaping, producing bare & in the output when the values came from YAML (e.g. name: "A test & playground ..."). A bare & is invalid XML; libxml2 in recovery mode emits FATAL "xmlParseEntityRef: no name" and then silently drops all subsequent &amp; entities in the same document, corrupting the &amp; in every individual document title in the collection index HTML output.

Add a private xml_escape helper that escapes only unencoded & (not already-encoded &amp;, &#nnn;, &#xhh;) and leaves inline markup tags (<em>, <strong>, etc.) untouched, so valid HTML fragments round-tripped via find_html pass through unchanged.

Fixes metanorma/isodoc#785

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

andrew2net · 2026-05-29T21:45:24Z

Ported the update into v2

opoudjis and others added 2 commits May 13, 2026 00:05

opoudjis assigned andrew2net May 28, 2026

opoudjis requested a review from andrew2net May 28, 2026 15:06

andrew2net approved these changes May 28, 2026

andrew2net merged commit 4564a10 into main May 28, 2026
14 checks passed

andrew2net deleted the fix-html-escaping-collection-title branch May 29, 2026 21:44
