fix: migrate Doc2PDF from jtidy to Jsoup#2238
Conversation
- Replace jtidy 1.0.5 (unmaintained since ~2010) with Jsoup 1.17.2 - Migrate parseJSP2PDF(), parseString2PDF(), and parseString2Bin() methods - Configure Jsoup with XML syntax and XHTML entities for iText compatibility - Set prettyPrint(false) to prevent whitespace issues in iText XML parser - Fix character encoding by using UTF-8 explicitly (was platform-dependent) - Remove jtidy dependency from pom.xml - Add comprehensive integration tests with medical terminology and special characters Benefits: - Removes 15-year-old unmaintained library - Fixes character encoding bug (critical for French Canadian patient names) - Better HTML5 support with actively maintained library - Improved security with modern, maintained dependency Fixes #2154 Co-authored-by: Michael Yingbull <yingbull@users.noreply.github.com>
|
Important Review skippedBot user detected. To trigger a single review, invoke the You can disable this status message by setting the
📝 WalkthroughWalkthroughThis PR removes the legacy jtidy dependency from pom.xml and migrates HTML parsing and cleaning logic in Doc2PDF.java from Tidy to Jsoup. A new integration test suite validates PDF conversion across various HTML scenarios. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~15 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@src/main/java/ca/openosp/openo/util/Doc2PDF.java`:
- Around line 74-83: The debug log in Doc2PDF (the call
MiscUtils.getLogger().debug(cleanHtml)) exposes cleaned HTML (cleanHtml) which
may contain PHI; remove that debug statement and instead log non-PHI metadata
(e.g., cleanHtml.length(), a parse success boolean, or a one-way hash/ID) before
calling AddAbsoluteTag(request, cleanHtml, uri), ensuring no
patient-identifiable content is written to logs.
- Around line 190-199: The code uses platform-default encoding when converting
strings to bytes causing corruption; update all getBytes() calls in Doc2PDF to
explicitly use StandardCharsets.UTF_8: inside GetPDFBin (replace
docText.getBytes() and docBin.getBytes() usages) and inside
PrintPDFFromHTMLString (replace docText.getBytes()) so the ByteArrayInputStream
and Base64.decodeBase64 calls receive UTF-8 bytes consistently; locate these
calls in the Doc2PDF class (methods GetPDFBin and PrintPDFFromHTMLString) and
import/use java.nio.charset.StandardCharsets.
🧹 Nitpick comments (4)
src/main/java/ca/openosp/openo/util/Doc2PDF.java (1)
76-79: Extract duplicated Jsoup configuration to a helper method.The Jsoup output settings configuration is repeated identically in three methods. Extract to a private helper for maintainability.
Proposed refactor
Add this helper method:
private static void configureJsoupForXhtml(org.jsoup.nodes.Document doc) { doc.outputSettings() .syntax(org.jsoup.nodes.Document.OutputSettings.Syntax.xml) .escapeMode(Entities.EscapeMode.xhtml) .prettyPrint(false); }Then replace each configuration block with:
org.jsoup.nodes.Document doc = Jsoup.parse(docText); - doc.outputSettings() - .syntax(org.jsoup.nodes.Document.OutputSettings.Syntax.xml) - .escapeMode(Entities.EscapeMode.xhtml) - .prettyPrint(false); + configureJsoupForXhtml(doc);Also applies to: 168-171, 192-195
src/test-modern/java/ca/openosp/openo/util/Doc2PDFIntegrationTest.java (3)
49-70: Test method naming should follow BDD convention with underscore separator.Per coding guidelines, modern test methods must use
should<Action>_when<Condition>()format with exactly one underscore separator. Current names likeshouldParseSimpleHtmlToPdf()lack the condition clause.Suggested naming improvements
- void shouldParseSimpleHtmlToPdf() { + void shouldProducePdf_whenSimpleHtmlProvided() {Apply similar pattern to other methods:
shouldHandleMalformedHtml()→shouldProducePdf_whenMalformedHtmlProvided()shouldHandleFrenchCanadianCharacters()→shouldPreserveEncoding_whenFrenchCanadianCharactersPresent()shouldHandleMedicalSymbols()→shouldPreserveSymbols_whenMedicalCharactersPresent()shouldHandleEmptyHtml()→shouldProducePdf_whenEmptyHtmlProvided()shouldConvertHtmlToBinary()→shouldReturnBase64_whenConvertingToBinary()shouldHandleRealisticMedicalContent()→shouldProducePdf_whenRealisticMedicalContentProvided()shouldHandleComplexHtmlStructure()→shouldProducePdf_whenComplexHtmlStructureProvided()Based on learnings: "Modern JUnit 5 tests must use BDD-style naming:
shouldReturnTicklerWhenValidIdProvided()orshould<Action>_when<Condition>()format with exactly ONE underscore separator."
52-70: Extract common request setup to reduce duplication.The MockHttpServletRequest configuration is duplicated across all 8 test methods. Consider extracting to a
@BeforeEachmethod or helper.Proposed refactor
private MockHttpServletRequest request; private MockHttpServletResponse response; `@BeforeEach` void setUp() { request = new MockHttpServletRequest(); request.setProtocol("HTTP/1.1"); request.setRemoteHost("localhost"); request.setServerPort(8080); request.setContextPath("/openo"); response = new MockHttpServletResponse(); }Then simplify each test:
void shouldParseSimpleHtmlToPdf() { // Given - MockHttpServletRequest request = new MockHttpServletRequest(); - request.setProtocol("HTTP/1.1"); - request.setRemoteHost("localhost"); - request.setServerPort(8080); - request.setContextPath("/openo"); - - MockHttpServletResponse response = new MockHttpServletResponse(); - String simpleHtml = "<html><body><p>Hello World</p></body></html>";Also applies to: 75-93, 99-122, 128-152, 157-174, 180-200, 206-242, 247-275
206-242: Consider adding content verification for realistic medical content test.While the test validates PDF generation succeeds, it doesn't verify the content is correctly rendered. Consider using a PDF parsing library (e.g., Apache PDFBox) to extract text and verify key content is present, especially for encoding-sensitive characters like
°CandFrançois.This would strengthen confidence in the Jsoup migration:
// Optional: Extract and verify PDF content PDDocument pdfDoc = PDDocument.load(response.getContentAsByteArray()); PDFTextStripper stripper = new PDFTextStripper(); String text = stripper.getText(pdfDoc); assertThat(text).contains("François Côté"); assertThat(text).contains("38.5°C"); pdfDoc.close();
|
@claude review |
|
Claude finished @yingbull's task in 2m 31s —— View job Code Review Complete ✅I've completed a comprehensive security-focused review of this PR migrating Doc2PDF from jtidy to Jsoup. The migration is well-implemented and tests are thorough, but there are 2 critical security issues and several improvement opportunities. 🔴 Critical Issues (Must Fix Before Merge)1. PHI Exposure in Debug Logging (Doc2PDF.java:82)Severity: CRITICAL | File: String cleanHtml = doc.html();
MiscUtils.getLogger().debug(cleanHtml); // ⚠️ EXPOSES PHI
String documentTxt = AddAbsoluteTag(request, cleanHtml, uri);Issue: The Fix: Remove the debug statement or log only non-PHI metadata: String cleanHtml = doc.html();
MiscUtils.getLogger().debug("Parsed HTML content, length: {} bytes", cleanHtml.length());
String documentTxt = AddAbsoluteTag(request, cleanHtml, uri);2. Platform-Dependent Character Encoding (Multiple Locations)Severity: CRITICAL | Files: Issue: The code uses Lines affected:
Fix: Add explicit UTF-8 charset (already imported at line 50): // Line 253
InputStream is = new ByteArrayInputStream(docText.getBytes(StandardCharsets.UTF_8));
// Line 270
byte[] binDecodedArray = Base64.decodeBase64(docBin.getBytes(StandardCharsets.UTF_8));
// Line 319
InputStream is = new ByteArrayInputStream(docText.getBytes(StandardCharsets.UTF_8));🟡 Code Quality Improvements (Recommended)3. Duplicated Jsoup Configuration PatternSeverity: MEDIUM | Files: The Jsoup configuration is copy-pasted identically in three methods. This violates DRY principle and makes maintenance harder. Current pattern (repeated 3x): doc.outputSettings()
.syntax(org.jsoup.nodes.Document.OutputSettings.Syntax.xml)
.escapeMode(Entities.EscapeMode.xhtml)
.prettyPrint(false);Suggested refactoring: /**
* Configure Jsoup document for XHTML output compatible with iText XMLWorkerHelper.
*
* @param doc The Jsoup document to configure
* @since 2026-01-29
*/
private static void configureJsoupForXhtml(org.jsoup.nodes.Document doc) {
doc.outputSettings()
.syntax(org.jsoup.nodes.Document.OutputSettings.Syntax.xml)
.escapeMode(Entities.EscadeMode.xhtml)
.prettyPrint(false); // Critical: prevents whitespace issues in iText XML parser
}Then replace each configuration block with: 4. Missing JavaDoc for Public MethodsSeverity: MEDIUM | File: Per CLAUDE.md: "JavaDoc Required: All public classes and methods MUST have comprehensive JavaDoc". The migrated methods lack JavaDoc explaining parameters, return values, and the encoding behavior change. Example for /**
* Converts an HTML string to PDF and writes it to the HTTP response.
* Uses Jsoup for HTML parsing with UTF-8 encoding and XHTML output.
*
* @param request HttpServletRequest containing request context (protocol, host, port)
* @param response HttpServletResponse to write the PDF to (sets Content-Type: application/pdf)
* @param docText String containing the HTML content to convert
* @throws RuntimeException if PDF conversion fails
* @since 2026-01-29
*/
public static void parseString2PDF(HttpServletRequest request, HttpServletResponse response, String docText) {Similar JavaDoc should be added for 5. Test Naming Convention ViolationsSeverity: LOW | File: Per CLAUDE.md modern test framework guidelines: "BDD test naming must use Current names (missing condition clause):
Suggested renames:
6. Duplicated Test Setup CodeSeverity: LOW | File: All 8 test methods duplicate the identical MockHttpServletRequest setup (lines 54-58, repeated). Extract to private MockHttpServletRequest request;
private MockHttpServletResponse response;
@BeforeEach
void setUp() {
request = new MockHttpServletRequest();
request.setProtocol("HTTP/1.1");
request.setRemoteHost("localhost");
request.setServerPort(8080);
request.setContextPath("/openo");
response = new MockHttpServletResponse();
}7. Missing Content Verification in Medical TestSeverity: LOW | File: The // Optional: Extract and verify PDF content
PDDocument pdfDoc = PDDocument.load(response.getContentAsByteArray());
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(pdfDoc);
assertThat(text).contains("François Côté");
assertThat(text).contains("38.5°C");
pdfDoc.close();This would strengthen confidence in the UTF-8 migration. ✅ Positive Findings
📋 Summary
Recommendation: Fix the 2 critical security issues before merging. The medium/low improvements can be addressed in follow-up PRs if needed. 🧪 Testing RecommendationsBefore merging, manually test the messenger PDF features:
Run automated tests: mvn test -Dtest=Doc2PDFIntegrationTest |
|
@claude make a commit to this branch resolving all these issues |
|
Claude finished @yingbull's task in 4m 28s —— View job Code Review Issues Resolved ✅All identified issues from the code review have been fixed and committed. Critical Security Issues
Code Quality Improvements
Test Framework Improvements
Changes SummaryFiles Modified:
Commit: df4dd81b |
- Fix PHI exposure in debug logging (now logs only content length) - Add explicit UTF-8 encoding to all .getBytes() calls (lines 253, 270, 319) - Extract duplicated Jsoup configuration to configureJsoupForXhtml() helper - Add comprehensive JavaDoc to parseJSP2PDF, parseString2PDF, parseString2Bin - Fix BDD test naming: all tests now follow should<Action>_when<Condition> pattern - Extract common test setup to @beforeeach to reduce duplication Co-authored-by: Michael Yingbull <yingbull@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Migrates the Doc2PDF HTML-cleaning step from the unmaintained JTidy library to Jsoup, and removes the JTidy dependency from the build, with added integration coverage for HTML→PDF conversion scenarios.
Changes:
- Replaced JTidy-based cleaning with Jsoup parsing + XHTML-oriented output settings in
Doc2PDF. - Standardized UTF-8 usage when converting HTML/text to byte streams for iText XMLWorker.
- Removed
com.github.jtidy:jtidyfrompom.xmland added a modern integration test suite forDoc2PDF.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/main/java/ca/openosp/openo/util/Doc2PDF.java | Switches HTML parsing/cleanup to Jsoup and enforces UTF-8 byte conversion for PDF generation. |
| src/test-modern/java/ca/openosp/openo/util/Doc2PDFIntegrationTest.java | Adds integration tests covering basic/malformed HTML, encoding/symbols, and binary PDF output. |
| pom.xml | Removes JTidy dependency (Jsoup already present). |
|
@claude review |
|
Claude finished @yingbull's task in 2m 25s —— View job Code Review Complete ✅I've completed a comprehensive security-focused review of this PR migrating Doc2PDF from jtidy to Jsoup. The recent commit (df4dd81) successfully addressed all critical security issues identified in previous reviews. Here's my detailed assessment: ✅ Previous Critical Issues - All Resolved1. PHI Exposure in Debug Logging ✅ FIXEDPrevious issue: Line 82 logged full HTML content containing PHI 2. Platform-Dependent Character Encoding ✅ FIXEDPrevious issue: Three
3. Code Duplication ✅ FIXEDPrevious issue: Jsoup configuration duplicated 3 times 4. Missing JavaDoc ✅ FIXEDPrevious issue: Public methods lacked comprehensive JavaDoc
5. Test Naming Convention ✅ FIXEDPrevious issue: Tests didn't follow BDD
6. Test Setup Duplication ✅ FIXEDPrevious issue: MockHttpServletRequest setup repeated in all 8 tests
|
|
@claude make a detailed separate issue ticket about the resource leak, what is affected and where the affected code is used/called from, and the details needed to correct it. |
|
Claude finished @yingbull's task in 4m 2s —— View job Resource Leak Issue Created ✅I've created a comprehensive issue ticket documenting the resource leak in Issue #2244: fix: resource leak in Doc2PDF.parseJSP2PDF - HttpURLConnection never closed SummaryThe Problem:
Current Impact: LOW (Dormant)
Affected Code:
Recommendation: Issue DetailsThe issue ticket includes:
Labels Applied:
View issue #2244 | | Branch: claude/issue-2154-20260129-2321 |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Dependency Review✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.Snapshot WarningsEnsure that dependencies are being submitted on PR branches and consider enabling retry-on-snapshot-warnings. See the documentation for more information and troubleshooting advice. Scanned Files
|
Summary
This PR migrates the Doc2PDF utility class from the unmaintained jtidy library (last updated ~2010) to the modern Jsoup library (1.17.2), which is already a dependency in the project.
Changes
Code Changes
parseJSP2PDF(): Parse JSP output to PDFparseString2PDF(): Parse HTML string to PDFparseString2Bin(): Parse HTML string to binary PDFJsoup Configuration
All three methods now use the same Jsoup configuration:
The
prettyPrint(false)setting is critical because iText's XMLWorkerHelper is whitespace-sensitive.Tests
Benefits
docText.getBytes(), Jsoup uses UTF-8 by defaultCritical for Healthcare Context
docText.getBytes()bug could corrupt French Canadian patient names and medical terminology - this is fixed with Jsoup's UTF-8 handlingTesting
Manual Testing Required
Automated Testing
Run the integration tests:
mvn test -Dtest=Doc2PDFIntegrationTestOr run all tests:
Migration Notes
ConvertToEdoc.java(line 358-378), which already uses Jsoup for HTML-to-PDF conversion with Flying SaucerMsgDoc2PDF2Action,MsgAttachPDF2Action, orMsgViewPDF2Action- they just call Doc2PDF methodsFixes #2154
Generated with Claude Code
Summary by cubic
Migrated Doc2PDF from jtidy to Jsoup to modernize HTML parsing, fix UTF-8 in generated PDFs, and reduce PHI risk in logs. Fixes #2154.
Bug Fixes
Refactors
Written for commit 94184b3. Summary will update on new commits.
Summary by CodeRabbit
Improvements
Tests
✏️ Tip: You can customize this high-level summary in your review settings.