Escape bare & in Bibcollection#to_xml to prevent invalid XML: https://github.com/metanorma/isodoc/issues/785 #128
Merged
Switch Bibcollection.from_xml to read the collection title and author via inner_html instead of Nokogiri's .text, so the in-memory strings keep their XML-fragment form (markup + entities intact). Apply the strip_html Liquid filter on the HTML <title> tag position so the browser tab title stays plain text. Adds find_html to ElementFinder alongside find_text. Adds a regression spec with markup and & in both the collection title and the author name. Refs metanorma/isodoc#785. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
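To illustrate why the title needs a strip step once it is kept as an XML fragment, here is a minimal Ruby sketch. It assumes a hypothetical helper, strip_markup, that only approximates a strip_html-style filter with a single regex. It is not Liquid's implementation and not code from this PR.

```ruby
# Hypothetical approximation of a strip_html-style filter:
# remove anything that looks like a tag and keep the text between tags.
# Entities are deliberately left alone, since &amp; is still valid inside
# an HTML <title> element and renders as a plain ampersand.
def strip_markup(fragment)
  fragment.to_s.gsub(/<[^>]*>/, "")
end

# A collection title as it might look after being read in fragment form
# (markup and entities intact).
title_fragment = "A <em>test</em> &amp; playground"

puts strip_markup(title_fragment)
# prints: A test &amp; playground
```

The fragment form is what the body of the index page wants (emphasis preserved), while the stripped form is what belongs in the <title> position, where tags would otherwise show up literally in the browser tab.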
bibcollection.rb to_xml was writing the collection title and author directly into XML without escaping, producing bare & in the output when the values came from YAML (e.g. name: "A test & playground ..."). A bare & is invalid XML; libxml2 in recovery mode emits FATAL "xmlParseEntityRef: no name" and then silently drops all subsequent &amp; entities in the same document, corrupting every individual document title's & in the collection index HTML output. Add a private xml_escape helper that escapes only unencoded & (not already-encoded &amp;, &#nnn;, &#xhh;) and leaves inline markup tags (<em>, <strong>, etc.) untouched, so valid HTML fragments round-tripped via find_html pass through unchanged. Fixes metanorma/isodoc#785 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
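The escaping rule described in this commit message can be sketched in a few lines of Ruby. This is an illustration under assumptions, not the code merged in the PR: the constant name BARE_AMPERSAND and the exact regex are hypothetical, and only the method name xml_escape comes from the commit message.

```ruby
# Hypothetical sketch of the escaping rule; the merged helper may differ.
# Matches an ampersand that does NOT begin a syntactically complete
# named (&name;), decimal (&#nnn;), or hexadecimal (&#xhh;) reference.
BARE_AMPERSAND = /&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)/

# Escape only bare ampersands; leave existing references and inline
# markup tags such as <em> untouched.
def xml_escape(value)
  value.to_s.gsub(BARE_AMPERSAND, "&amp;")
end

# Value as it might arrive from YAML: the bare & gets escaped.
puts xml_escape("A test & playground")
# prints: A test &amp; playground

# Fragment that is already encoded: passes through unchanged.
puts xml_escape("R&amp;D &#38; <em>more</em> &#x26; co")
# prints: R&amp;D &#38; <em>more</em> &#x26; co
```

One limitation of a lookahead like this is that it judges references by syntax only. A string such as "AT&T;" or a named reference that XML does not predefine (for example &nbsp;) would be left as-is, so this sketch guards against bare ampersands but does not guarantee that every remaining reference is resolvable.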
andrew2net approved these changes on May 28, 2026
Contributor
@opoudjis, this PR is against Relaton v1. Do you still use v1? Should we reimplement this in Relaton v2?
Contributor
Author
@andrew2net Yes, sorry, I am surprised it is against v1. My debugging was against the running relaton-cli in v2, so I don't understand why the PR has landed against v1. Is it because the relaton-cli code is no longer maintained here?
Contributor
I think it's because the branch was created on top of main. Relaton v2 lives in the lutaml-integration branch.
andrew2net added a commit that referenced this pull request on May 29, 2026
Port the write-path fix from #128 to v2/lutaml-integration. The read-path half of #128 (find_html + strip_html) was already ported via #127; this commit ports the remaining to_xml escaping. bibcollection.rb to_xml was writing the collection title and author directly into XML without escaping, producing bare & in the output when the values came from YAML (e.g. name: "A test & playground ..."). A bare & is invalid XML; libxml2 in recovery mode emits FATAL "xmlParseEntityRef: no name" and then silently drops all subsequent &amp; entities in the same document, corrupting every individual document title's & in the collection index HTML output. Add a private xml_escape helper that escapes only unencoded & (not already-encoded &amp;, &#nnn;, &#xhh;) and leaves inline markup tags (<em>, <strong>, etc.) untouched, so valid HTML fragments round-tripped via find_html pass through unchanged. Fixes metanorma/isodoc#785 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Contributor
Ported the update into v2.
Problem
Bibcollection#to_xml writes the collection title and author directly into XML without escaping. When these values come from YAML (e.g. name: "A test & playground ..."), the output contains a bare & character, which is invalid XML.
libxml2 responds with a FATAL xmlParseEntityRef: no name error and enters recovery mode. In recovery mode it silently drops all subsequent &amp; entity references in the same document, so every individual bibitem title that contains & also loses its & in the generated HTML output. This is what surfaces as the bug reported in metanorma/isodoc#785.
Root cause trace
Fix
Add a private xml_escape helper that escapes only unencoded &, not already-encoded entity references (&amp;, &#nnn;, &#xhh;), and leaves inline markup tags (<em>, <strong>, etc.) untouched. This means values from YAML are safe, and HTML fragments round-tripped via find_html pass through unchanged.
Relationship to the earlier find_text → find_html commit
The preceding commit on this branch (8ed16f5) fixed the read path (use inner_html to preserve markup when re-reading the collection title from XML). This commit fixes the write path (escape before inserting into XML). Both are needed: without the write-path fix, a value containing a bare & still produces invalid XML, so the read-path fix cannot help.
Verification
After applying the fix, regenerating the metanorma-pdfa test collection produces:
Refs: metanorma/isodoc#785
🤖 Generated with Claude Code