Summary
The CrossRef API returns abstracts that contain HTML/JATS XML tags, which need to be cleaned before display.
Problem
CrossRef abstracts often contain markup like:
<jats:p>Objective. Hippocampal ripples are high-frequency...</jats:p>
<jats:italic>in vitro</jats:italic>
When displayed on websites or in citations, these tags appear as raw text or cause rendering issues.
Proposed Solution
Add abstract cleaning functionality to scitex.scholar:
from scitex.scholar import utils
from scitex.scholar.local_dbs import crossref_scitex

# Clean an abstract taken from a CrossRef response
clean_abstract = utils.clean_abstract(raw_abstract)

# Or integrated into the Work object
work = crossref_scitex.get("10.1088/1741-2552/ac3266")
work.abstract      # Already cleaned
work.abstract_raw  # Original with tags (if needed)
Tags to Handle
- JATS XML tags: <jats:p>, <jats:italic>, <jats:bold>, <jats:sup>, <jats:sub>
- HTML tags: <p>, <i>, <b>, <em>, <strong>, <sup>, <sub>
- Preserve meaningful whitespace and paragraph breaks
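To make the requirements above concrete, here is a minimal sketch of a tag-stripping cleaner that keeps paragraph breaks. It is an illustration only, not the scitex.scholar implementation: the function name mirrors the proposed utils.clean_abstract, and the regular expressions are assumptions that cover the tags listed above rather than the full JATS vocabulary.

```python
import html
import re

# Hypothetical helper: the name mirrors the proposed utils.clean_abstract,
# but this is only a sketch, not the library's code.
_PARAGRAPH_CLOSE = re.compile(r"</(?:jats:)?p\s*>", re.IGNORECASE)
_ANY_TAG = re.compile(r"</?(?:jats:)?[A-Za-z][\w.-]*(?:\s[^<>]*)?/?>")


def clean_abstract(raw: str) -> str:
    """Strip JATS/HTML tags, keeping paragraph breaks as blank lines."""
    # Turn each closing paragraph tag into a paragraph separator.
    text = _PARAGRAPH_CLOSE.sub("\n\n", raw)
    # Drop every remaining tag (opening, closing, or self-closing).
    text = _ANY_TAG.sub("", text)
    # Decode entities such as &amp; only after the tags are gone,
    # so an escaped "&lt;" is never mistaken for markup.
    text = html.unescape(text)
    # Normalise whitespace inside each paragraph and drop empty ones.
    paragraphs = [
        re.sub(r"\s+", " ", p).strip() for p in re.split(r"\n\s*\n", text)
    ]
    return "\n\n".join(p for p in paragraphs if p)
```

A regex is enough for the small, flat tag set listed here; a real parser such as BeautifulSoup would be the safer choice if abstracts turn out to contain nested or malformed markup.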
Implementation Options
- Strip all tags - Simple regex/BeautifulSoup approach
- Convert to plain text - Preserve formatting intent (italic → text)
- Convert to Markdown - <jats:italic> → _text_
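The Markdown option could look roughly like the sketch below. Only the italic → _text_ mapping comes from this issue; the bold → **text** mapping and the function name inline_tags_to_markdown are assumptions added for illustration. Tags other than the inline emphasis tags are left untouched so that a separate stripping step can handle them.

```python
import re

# Assumed mapping: only italic -> _text_ is specified in this issue;
# the bold/strong -> **text** entries are an illustrative guess.
_MARKDOWN_WRAPPERS = {
    "italic": "_", "i": "_", "em": "_",
    "bold": "**", "b": "**", "strong": "**",
}

# Matches opening and closing inline emphasis tags without attributes,
# with or without the "jats:" prefix.
_INLINE_TAG = re.compile(
    r"</?(?:jats:)?(italic|bold|strong|em|i|b)\s*>", re.IGNORECASE
)


def inline_tags_to_markdown(raw: str) -> str:
    """Replace inline emphasis tags with Markdown delimiters (hypothetical helper)."""
    return _INLINE_TAG.sub(
        lambda match: _MARKDOWN_WRAPPERS[match.group(1).lower()], raw
    )
```

Because opening and closing tags map to the same delimiter, one substitution pass covers both; superscript and subscript have no standard Markdown form and would need a separate decision.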
Use Cases
- Publications page display (scitex-cloud)
- Citation generation
- Paper metadata export
Related
scitex.scholar.local_dbs.crossref_scitex