Feature/unicode support#76

Open
m-messer wants to merge 9 commits into main from feature/unicode-support

m-messer commented Jun 2, 2026

Problem: The pdf-generator had no automated tests and no support for non-Latin scripts. Nothing verified that the markdown→PDF pipeline was working correctly, and documents containing Chinese, Japanese, or Korean (CJK) characters would silently fail to render. CJK scripts require fonts with tens of thousands of glyphs and a dedicated TeX package. Standard Latin fonts such as lmodern lack these glyphs, so without explicit CJK support the characters are either dropped or cause a compilation error. The API also had no way to pass language configuration to Pandoc, so CJK rendering could not be enabled even when the fonts were present.

Solution: Added a four-layer test suite covering the full pipeline, from pure function logic through to real PDF compilation and content verification. Added a general variables API field that forwards arbitrary Pandoc template variables (such as lang, CJKmainfont, and mainfont). Switched the default document font to Noto Sans, which satisfies the existing sans-serif accessibility requirement while providing broad Unicode coverage. The template now loads xeCJK unconditionally, so CJK characters render without any caller configuration. Updated the Docker image to install the required TeX packages and fonts.
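
As a rough illustration of the variables pass-through, the mapping from request entries to Pandoc arguments could look something like the sketch below. The helper name `buildVariableArgs` and the type alias are hypothetical and not taken from this PR. Only the `--variable=key:value` form and the `Record<string, string>` shape come from the description.

```typescript
// Hypothetical helper: name and structure are assumptions for illustration,
// not the code added in index.ts.
type PandocVariables = Record<string, string>;

function buildVariableArgs(variables: PandocVariables = {}): string[] {
  // Each entry becomes one --variable=key:value argument, in insertion order.
  return Object.entries(variables).map(
    ([key, value]) => `--variable=${key}:${value}`,
  );
}

// Example: a caller asking for Korean with an explicit CJK font.
const exampleArgs = buildVariableArgs({
  lang: "ko",
  CJKmainfont: "Noto Sans CJK KR",
});
console.log(exampleArgs);
```

Because the field is optional, the sketch treats a missing value as an empty object, so requests that omit it produce no extra arguments.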

An example output including Korean text can be found here: korean_test_pdf-1.pdf

Changes:

  • src/utils.test.ts — unit tests for fixInlineLatex, errorRefiner, and deleteFile
  • index.test.ts — unit tests for Zod schema validation and Lambda handler routing (Pandoc mocked); includes tests for the new variables field
  • src/pandoc.test.ts — integration tests calling real Pandoc: markdown→LaTeX conversion, math handling, implicit_figures toggle, Unicode characters, and Pandoc variable pass-through
  • src/compile.test.ts — end-to-end tests running the full Pandoc + template.latex + xelatex pipeline, with pdftotext content verification to confirm text and Unicode characters actually appear in the rendered PDF
  • index.ts — added optional variables: Record<string, string> to the request schema; the handler maps each entry to a --variable=key:value Pandoc argument for both the PDF and TEX paths
  • src/template.latex — default mainfont changed to Noto Sans; xeCJK now loaded unconditionally with Noto Sans CJK KR as the default CJK font; removed lmodern override
  • Dockerfile — added texlive-xecjk, texlive-ctex, google-noto-sans-fonts, google-noto-sans-cjk-ttc-fonts
  • package.json — added vitest and typescript as dev dependencies; added test:unit script
  • README.md — added Testing section documenting dependencies, commands, and the purpose of each test file
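
The content verification step in the end-to-end tests can be pictured with a small sketch like the one below. It assumes the text has already been extracted from the PDF (for example by pdftotext) and only shows the comparison. The helper name `missingSnippets` and the sample strings are hypothetical and do not come from src/compile.test.ts.

```typescript
// Hypothetical helper for illustration; not the PR's actual test code.
function missingSnippets(extracted: string, expected: string[]): string[] {
  // Normalise the Unicode form and collapse whitespace, since extracted text
  // may be wrapped differently from the source markdown.
  const normalise = (s: string): string =>
    s.normalize("NFC").replace(/\s+/g, " ").trim();
  const haystack = normalise(extracted);
  // Return every expected snippet that does not appear in the extracted text.
  return expected.filter((snippet) => !haystack.includes(normalise(snippet)));
}

// Made-up sample standing in for text extracted from a rendered PDF.
const sampleText = "Hello world\n안녕하세요\n";
console.log(missingSnippets(sampleText, ["Hello world", "안녕하세요", "こんにちは"]));
```

A test would then assert that the returned list is empty. Returning the missing snippets rather than a boolean makes a failure message say which characters did not survive rendering.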

2 participants