| title | PDF Generation |
|---|---|
| parent | Documentation Development |
| nav_order | 8 |
| permalink | /Documentation/Development/PDF-Generation |
Internals of the two-stage PDF pipeline: tbdocs Phase 8 assembles a sparse _site-pdf/ source tree, then book/render-book.mjs renders it into _pdf/twinBASIC Book.pdf via headless Chromium + paged.js + pdf-lib. Read this when modifying the renderer, the print stylesheet, or the paged.js bundle.
The two stages are decoupled: tbdocs builds _site-pdf/ as part of its normal run; render-book.mjs runs only when book.bat calls it explicitly. This keeps puppeteer and pdf-lib --- both large --- out of the site generator's dependency tree.
node book/render-book.mjs <input.html> -o <output.pdf>
[--outline-tags h1,h2,h3,h4]
[-t <timeout-ms>]
[--additional-script <path>]...
| Flag | Default | Description |
|---|---|---|
| `<input.html>` | required | Path to the assembled HTML file (usually `_site-pdf/book.html`). |
| `-o` / `--output` | required | Destination PDF path. |
| `--outline-tags` | `h1,h2,h3,h4` | Comma-separated heading tags to include in the PDF bookmark tree. |
| `-t` / `--timeout` | `0` (disabled) | Per-operation puppeteer timeout in milliseconds. |
| `--additional-script` | (none) | Inject an extra in-page script after the paged.js bundle. Repeatable. |
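
The flag semantics above can be illustrated with a small parser. This is only a sketch: `parseArgs` is a hypothetical helper written for this page, and the real `render-book.mjs` may parse its arguments differently (for example with a CLI library).

```javascript
// Hypothetical helper: illustrates the documented flags and defaults only.
function parseArgs(argv) {
  const opts = {
    input: null,
    output: null,
    outlineTags: ['h1', 'h2', 'h3', 'h4'], // documented default
    timeout: 0,                            // 0 disables the timeout
    additionalScripts: [],                 // --additional-script is repeatable
  };
  // Read the value that follows a flag, failing loudly if it is missing.
  const valueOf = (flag, i) => {
    if (i >= argv.length) throw new Error(`Missing value for ${flag}`);
    return argv[i];
  };
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === '-o' || arg === '--output') {
      opts.output = valueOf(arg, ++i);
    } else if (arg === '--outline-tags') {
      opts.outlineTags = valueOf(arg, ++i).split(',').map((t) => t.trim()).filter(Boolean);
    } else if (arg === '-t' || arg === '--timeout') {
      opts.timeout = Number(valueOf(arg, ++i));
    } else if (arg === '--additional-script') {
      opts.additionalScripts.push(valueOf(arg, ++i));
    } else if (arg.startsWith('-')) {
      throw new Error(`Unknown flag: ${arg}`);
    } else if (opts.input === null) {
      opts.input = arg; // first positional argument is the input HTML
    } else {
      throw new Error(`Unexpected argument: ${arg}`);
    }
  }
  if (!opts.input || !opts.output) {
    throw new Error('Usage: render-book.mjs <input.html> -o <output.pdf>');
  }
  return opts;
}
```

Collecting `--additional-script` into an array keeps the injection order identical to the order on the command line, which matters because the scripts are loaded sequentially after the paged.js bundle.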
book.bat runs the standard production invocation:
node ..\book\render-book.mjs _site-pdf\book.html -o "_pdf\twinBASIC Book.pdf" ^
--outline-tags h1,h2,h3,h4 ^
--additional-script ..\perf\detach-pages.js

Always run build.bat first to populate _site-pdf/.
book/render-book.mjs runs the three phases. Its helpers live in book/lib/.
Opens a headless Chromium instance, loads book.html under file://, and calls PagedPolyfill.preview() to run the CSS Paged Media layout engine. When it returns, the DOM contains one .pagedjs_page element per output PDF page.
Chromium launch flags:
| Flag | Why |
|---|---|
| `--allow-file-access-from-files` | paged.js fetches print.css via XHR from a file:// URL. Without this flag Chrome rejects the request. |
| `--disable-gpu` + `--disable-software-rasterizer` | Shrinks the GPU process from ~100 MB to ~16 MB and cuts ~5 s off the generate phase by letting Skia skip a GPU init path. |
After page.goto() and before loading any scripts, the driver injects:
window.PagedConfig = { auto: false };

This prevents paged.js from running automatically when its bundle loads. Then it injects scripts in order via page.addScriptTag():

- `lib/paged.browser.js` --- the paged.js CSS Paged Media polyfill.
- `lib/progress-handler.js` --- registers a handler that logs `[render-progress] page=N elapsed=Xs` to the browser console after each page is laid out.
- Any `--additional-script` paths (production adds `perf/detach-pages.js`).
PagedPolyfill.preview() is called next via page.evaluate(). In the vendored bundle the call is fully synchronous; the await on page.evaluate() is just the CDP round-trip puppeteer needs to bring the result back to Node.
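
The injection order described above can be sketched as a single driver function. `runLayout` and its parameter names are hypothetical; only `page.goto()`, `page.evaluate()` and `page.addScriptTag()` are real puppeteer `Page` methods, and the actual driver code may be organised differently.

```javascript
// Sketch only: `page` is expected to be a puppeteer Page.
// `runLayout`, `inputUrl` and `scriptPaths` are illustrative names.
async function runLayout(page, inputUrl, scriptPaths) {
  await page.goto(inputUrl);

  // Must be set before the bundle loads, otherwise paged.js starts on its own.
  await page.evaluate(() => {
    window.PagedConfig = { auto: false };
  });

  // Order matters: the polyfill first, then handlers that register against it.
  for (const path of scriptPaths) {
    await page.addScriptTag({ path });
  }

  // Awaiting inside the page is harmless if preview() is synchronous and
  // still correct if a bundle returns a promise.
  return page.evaluate(async () => {
    await window.PagedPolyfill.preview();
    return document.querySelectorAll('.pagedjs_pages > .pagedjs_page').length;
  });
}
```

Passing the script list as an ordered array makes the dependency explicit: a handler script that calls `Paged.registerHandlers()` can only run after the bundle that defines `Paged`.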
perf/detach-pages.js implements the aggressive-detach optimisation: it physically removes each finalised page from the DOM immediately after layout, then restores all pages in order at afterRendered. This keeps getBoundingClientRect (which paged.js calls per page) at ~0.7 ms/page flat instead of growing at ~8 ms/page on a 1638-page book. CSS counters break across detached pages, so print.css uses var(--page-num) (a custom property paged.js writes per page) rather than counter(page) for running page numbers.
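
A minimal sketch of the detach-and-restore idea follows. It is not the contents of `perf/detach-pages.js`: `PageDetacher` is a hypothetical plain class so that the logic can be read (and run) outside a browser. In the real pipeline the two methods would be hooks on a `Paged.Handler` subclass.

```javascript
// Sketch only: uses the standard DOM methods parentNode, removeChild
// and insertBefore, so any DOM-like object works.
class PageDetacher {
  constructor() {
    this.container = null; // parent of the page elements
    this.previous = null;  // most recently finalised page, still attached
    this.detached = [];    // removed pages, in layout order
  }

  // Call from the finalizePage hook: detach the page finalised before this one.
  finalizePage(pageElement) {
    if (this.previous) {
      this.container = this.previous.parentNode;
      this.container.removeChild(this.previous);
      this.detached.push(this.previous);
    }
    this.previous = pageElement;
  }

  // Call from the afterRendered hook: put every page back in order.
  afterRendered() {
    if (!this.container) return;
    for (const page of this.detached) {
      // Inserting before the last (still attached) page preserves layout order.
      this.container.insertBefore(page, this.previous);
    }
    this.detached = [];
  }
}
```

Keeping exactly one page attached at a time is what holds the per-page layout cost flat: the live DOM never grows with the page count.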
Extracts document metadata and builds the outline tree, then calls page.pdf() to generate the raw PDF from Chromium's internal writer.
Meta extraction via page.evaluate() returns:
{
title: string, // <title> text content
lang: string, // <html lang="..."> value
[name]: string, // one entry per <meta name="..."> tag
}

Outline extraction via parseOutline(page, outlineTags) (see outline.mjs) returns a nested OutlineNode[] tree.
PDF generation via page.pdf():
page.pdf({
printBackground: true,
displayHeaderFooter: false,
preferCSSPageSize: true, // use the A4 size from print.css @page rules
margin: { top: 0, right: 0, bottom: 0, left: 0 },
})

preferCSSPageSize: true makes Chromium use the dimensions declared in print.css rather than a hardcoded default. The call buffers the entire document internally before returning --- there is no intermediate progress signal. A 500 ms heartbeat writes an elapsed counter to stdout on TTYs while the ~50 s call runs.
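
The heartbeat can be sketched as a wrapper around any long-running promise. `withHeartbeat`, its option names and the message text are assumptions made for illustration; the driver's own implementation is not shown in this page.

```javascript
// Sketch only: prints an elapsed counter while `task` is pending.
// `out` defaults to process.stdout and is injectable for testing.
async function withHeartbeat(task, options = {}) {
  const { intervalMs = 500, out = process.stdout, label = 'generating PDF' } = options;
  const started = Date.now();
  const timer = out.isTTY
    ? setInterval(() => {
        const elapsed = ((Date.now() - started) / 1000).toFixed(1);
        out.write(`\r${label}... ${elapsed}s`); // \r overwrites the same line
      }, intervalMs)
    : null; // stay quiet when stdout is piped
  try {
    return await task;
  } finally {
    if (timer) {
      clearInterval(timer);
      out.write('\n'); // leave the cursor on a fresh line
    }
  }
}

// Usage (assuming `page` is a puppeteer Page):
//   const rawPdf = await withHeartbeat(
//     page.pdf({ printBackground: true, preferCSSPageSize: true })
//   );
```

The `finally` block guarantees the interval is cleared even when the wrapped call rejects, so a failed render does not leave a timer keeping the Node process alive.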
Augments the raw PDF from Chromium with a bookmark tree and document metadata, then saves the final output.
The raw buffer from page.pdf() is a valid but minimal PDF: it has no /Outlines entry and contains Chromium's default metadata. The process phase runs four operations in sequence:
- `measureRawPdf(rawPdf)` --- traverses the raw bytes without allocating any objects. Returns `dictSlots` and `arraySlots` counts used to pre-size two shim backing arrays before the load (see `measure-pass.mjs`).
- `PDFDocument.load(rawPdf)` --- parses the raw PDF into pdf-lib's in-memory model. The fast-* shims (see pdf-lib Patches) are already active from the import block; this call uses their optimised data structures.
- `setMetadata(pdfDoc, meta)` and `setOutline(pdfDoc, outline)` --- write the `/Info` dict and the `/Outlines` tree into the document (see `postprocesser.mjs` and `outline.mjs`).
- `parallelSave(pdfDoc, { objectsPerStream: 500 })` --- serialises the modified document to bytes, running deflate concurrently on libuv's thread pool (see `parallel-deflate.mjs`).
Two exports: parseOutline runs inside the browser via puppeteer; setOutline runs in Node against a pdf-lib document.
parseOutline(page, tags) --- queries document.querySelectorAll(tags.join(',')), traverses the results in document order, and builds a nested tree. Each node:
// OutlineNode
{
title: string, // heading innerText, HTML-stripped
destination: string, // percent-encoded heading id (# → #25)
children: OutlineNode[],
closed?: true, // present when the heading or its ancestor
// article carries data-pdf-bookmark-closed
}

The function also injects a hidden <div> of <a href="#id"> links before <body> for every heading. Without these, Chromium's PDF writer does not register named destinations, so the /Dest entries in the outline would resolve nowhere.
closed nodes produce a negative /Count in the PDF /Outlines tree, which PDF readers use to display the bookmark collapsed.
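
The nesting step can be sketched on plain data. This is an illustration of the general stack-based technique, not the code in `outline.mjs`: `buildOutlineTree` is a hypothetical name, `headings` stands in for the `querySelectorAll` result in document order, and the destination encoding is deliberately omitted.

```javascript
// Sketch only: turns a flat, document-ordered heading list into a nested tree.
// Each heading is { tag, title, id, closed? }; `tags` is e.g. ['h1','h2','h3','h4'].
function buildOutlineTree(headings, tags) {
  const root = { children: [] };
  const stack = [{ level: -1, node: root }]; // open ancestors, deepest last
  for (const h of headings) {
    const level = tags.indexOf(h.tag);
    if (level === -1) continue; // not an outline tag
    // Pop until the top of the stack is a strictly shallower heading.
    while (stack[stack.length - 1].level >= level) stack.pop();
    const node = { title: h.title, destination: h.id, children: [] };
    if (h.closed) node.closed = true; // mirrors data-pdf-bookmark-closed
    stack[stack.length - 1].node.children.push(node);
    stack.push({ level, node });
  }
  return root.children;
}
```

Because the level is the heading's index in `tags`, a document that skips a level (an `h3` directly under an `h1`) still nests correctly: the `h3` simply becomes a child of the nearest shallower heading.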
setOutline(pdfDoc, outline, enableWarnings?) --- allocates a PDF reference for each outline node via pdfDoc.context.nextRef(), writes a linked PDFDict per node, and sets pdfDoc.catalog.Outlines to the root reference. Each node's Dest is a PDF name that Chromium's /Dests catalog maps to a page number and coordinates.
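
The linked structure can be sketched with plain objects standing in for pdf-lib `PDFDict` and `PDFRef` values. The key names (`Title`, `Parent`, `Prev`, `Next`, `First`, `Last`, `Count`, `Dest`) are the outline item entries defined by the PDF specification; `linkOutline` and `visibleDescendants` are hypothetical helpers, and the `/Count` rule follows the specification's definition, which may differ in detail from what `outline.mjs` computes.

```javascript
// Number of descendants shown when `node` is open (closed children hide theirs).
function visibleDescendants(node) {
  let n = 0;
  for (const child of node.children) {
    n += 1;
    if (!child.closed) n += visibleDescendants(child);
  }
  return n;
}

// Sketch only: emits one entry per outline node into `out` and returns the
// refs of the sibling list it was given. `nextRef` plays the role of
// pdfDoc.context.nextRef().
function linkOutline(nodes, parentRef, nextRef, out) {
  const refs = nodes.map(() => nextRef()); // allocate sibling refs up front
  nodes.forEach((node, i) => {
    const entry = { ref: refs[i], Title: node.title, Dest: node.destination, Parent: parentRef };
    if (i > 0) entry.Prev = refs[i - 1];
    if (i < nodes.length - 1) entry.Next = refs[i + 1];
    out.push(entry);
    if (node.children.length > 0) {
      const childRefs = linkOutline(node.children, refs[i], nextRef, out);
      entry.First = childRefs[0];
      entry.Last = childRefs[childRefs.length - 1];
      const count = visibleDescendants(node);
      entry.Count = node.closed ? -count : count; // negative = shown collapsed
    }
  });
  return refs;
}
```

Allocating all sibling references before writing any entry is what lets each item point forward to a `Next` sibling that has not been emitted yet.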
setMetadata(pdfDoc, meta) --- writes standard /Info dict entries from the meta object collected in Phase 2. Always sets ModDate to the current time. Appends " + Paged.js" to the Creator string inherited from Chromium and retains Chromium's "Skia/PDF mXX" Producer string.
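
A rough sketch of this behaviour, using only documented pdf-lib `PDFDocument` setters, might look as follows. `applyMetadata` is a hypothetical name, and the mapping from `<meta>` names (`author`, `description`, `keywords`) to `/Info` entries is an assumption; the source only states that standard entries are written, `ModDate` is always set, and the Creator string is extended.

```javascript
// Sketch only: `pdfDoc` is expected to be a pdf-lib PDFDocument.
// The Producer entry is deliberately left untouched.
function applyMetadata(pdfDoc, meta, now = new Date()) {
  if (meta.title) pdfDoc.setTitle(meta.title);
  if (meta.author) pdfDoc.setAuthor(meta.author);           // assumed mapping
  if (meta.description) pdfDoc.setSubject(meta.description); // assumed mapping
  if (meta.keywords) {
    // pdf-lib expects an array of keywords.
    pdfDoc.setKeywords(meta.keywords.split(',').map((k) => k.trim()).filter(Boolean));
  }
  const creator = pdfDoc.getCreator();
  pdfDoc.setCreator(creator ? `${creator} + Paged.js` : 'Paged.js');
  pdfDoc.setModificationDate(now); // always refreshed
}
```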
setTrimBoxes(pdfDoc, pages) --- sets per-page /TrimBox entries from the box data PagedPolyfill exposes. Not called in the production pipeline (pages have no bleed), but available for print-ready output with crop marks.
measure(bytes) --- a no-allocate byte walker over a raw PDF buffer. Parses the PDF grammar (indirect objects, dicts, arrays, streams, embedded ObjStms) without instantiating any PDFObject. Returns:
{
indirectObjects: number,
dicts: number,
dictSlots: number, // total key + value slots across all dicts
arrays: number,
arraySlots: number, // total element slots across all arrays
refs: number,
names: number,
numbers: number,
strings: number,
hexStrings: number,
streams: number,
objStms: number,
objStmInner: number,
maxDictSlots: number,
maxArraySlots: number,
maxRecursion: number,
totalStreamBytes: number,
totalInflatedBytes: number,
}

dictSlots and arraySlots are passed to setExpectedDictSlots() and setExpectedArraySlots() on the fast-dict-onebuf and fast-array-onebuf shims. Calling these before PDFDocument.load() lets each shim pre-allocate its backing array to the measured size, eliminating V8 growth resizes during parse.
The internal Measurer class keeps per-dict state (/Length, /Type, /N, /First) on depth-indexed Int32Array / Uint8Array stacks rather than per-object heap records. Stack depth is 64; maximum observed on the book is 4.
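
The depth-indexed stack technique can be shown on a heavily simplified grammar. The sketch below is not `measure-pass.mjs`: `measureSlots` is a hypothetical function that only understands `<<`, `>>`, `[`, `]` and whitespace- or delimiter-terminated scalar tokens. Streams, string literals, hex strings and indirect references (`1 0 R` would count as three slots here) are out of scope.

```javascript
const MAX_DEPTH = 64;
const DICT = 1;
const ARRAY = 2;

const isSpace = (c) => c === 0x20 || c === 0x0a || c === 0x0d || c === 0x09;
// "<", ">", "[", "]" and "/" all terminate the preceding token.
const isDelim = (c) => c === 0x3c || c === 0x3e || c === 0x5b || c === 0x5d || c === 0x2f;

// Sketch only: counts dict and array slots without creating per-object records.
function measureSlots(bytes) {
  const kind = new Uint8Array(MAX_DEPTH);  // container type at each depth
  const slots = new Int32Array(MAX_DEPTH); // slots seen so far at each depth
  let depth = 0;
  const totals = { dicts: 0, dictSlots: 0, arrays: 0, arraySlots: 0, maxRecursion: 0 };

  const open = (k) => {
    if (depth === MAX_DEPTH) throw new Error('nesting too deep');
    kind[depth] = k;
    slots[depth] = 0;
    depth++;
    if (depth > totals.maxRecursion) totals.maxRecursion = depth;
  };
  const close = () => {
    if (depth === 0) throw new Error('unbalanced close');
    depth--;
    if (kind[depth] === DICT) { totals.dicts++; totals.dictSlots += slots[depth]; }
    else { totals.arrays++; totals.arraySlots += slots[depth]; }
    if (depth > 0) slots[depth - 1]++; // the container fills one slot in its parent
  };

  let i = 0;
  while (i < bytes.length) {
    const c = bytes[i];
    if (c === 0x3c && bytes[i + 1] === 0x3c) { open(DICT); i += 2; }       // "<<"
    else if (c === 0x3e && bytes[i + 1] === 0x3e) { close(); i += 2; }     // ">>"
    else if (c === 0x5b) { open(ARRAY); i += 1; }                          // "["
    else if (c === 0x5d) { close(); i += 1; }                              // "]"
    else if (isSpace(c)) { i += 1; }
    else {
      i += 1; // consume the first byte, then the rest of the scalar token
      while (i < bytes.length && !isSpace(bytes[i]) && !isDelim(bytes[i])) i += 1;
      if (depth > 0) slots[depth - 1]++;
    }
  }
  return totals;
}
```

The two typed arrays are allocated once per call and reused at every nesting level, so the number of heap allocations stays constant no matter how many dicts and arrays the input contains.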
parallelSave(pdfDoc, opts?) --- replacement for pdfDoc.save({ useObjectStreams: true }). Runs the same pre-serialize steps as PDFDocument.save() (flush, updateFieldAppearances), then invokes a custom ParallelStreamWriter that splits the save into three phases:
- Classify --- same logic as pdf-lib's `PDFStreamWriter.computeBufferSize`. Partitions indirect objects into `uncompressedObjects` (PDF streams, encrypted refs, gen-number ≠ 0) and `compressedChunks` (everything else, grouped into chunks of `objectsPerStream`).
- Parallel deflate --- instantiates all `PDFObjectStream` objects, then fires `Promise.all(streams.map(s => deflateAsync(s.getUnencodedContents())))`. Each deflate runs on libuv's thread pool. Results are written directly into each stream's `contentsCache.value` so Phase 3 finds only cache hits.
- Size and emit --- same as upstream. Every `computeIndirectObjectSize` call is a Phase 2 cache hit. The xref stream (which depends on byte offsets pinned in Phase 3) is deflated synchronously via `deflateSync` immediately after its content is finalised.
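
The chunk-then-deflate pattern of the first two phases can be sketched with Node's built-in zlib. `chunkObjects` and `deflateChunks` are hypothetical helpers operating on plain strings; the real writer works on pdf-lib objects and stores each result in the stream's cache.

```javascript
// Sketch only: group serialisable objects into fixed-size chunks.
function chunkObjects(objects, objectsPerStream) {
  const chunks = [];
  for (let i = 0; i < objects.length; i += objectsPerStream) {
    chunks.push(objects.slice(i, i + objectsPerStream));
  }
  return chunks;
}

// Sketch only: deflate every chunk concurrently.
// The async zlib API queues each job on libuv's thread pool, and
// Promise.all returns the results in input order.
async function deflateChunks(chunks) {
  const { deflate } = await import('node:zlib');
  const { promisify } = await import('node:util');
  const deflateAsync = promisify(deflate);
  return Promise.all(
    chunks.map((chunk) => deflateAsync(Buffer.from(chunk.join('\n'))))
  );
}

// Usage:
//   const compressed = await deflateChunks(chunkObjects(serialisedObjects, 500));
```

How much real parallelism this yields depends on the size of libuv's thread pool, which defaults to four threads and can be raised with the `UV_THREADPOOL_SIZE` environment variable.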
Default options and their production values:
{
objectsPerStream: 50, // production: 500
encodeStreams: true,
parallel: true,
addDefaultPage: true,
updateFieldAppearances: true,
}

objectsPerStream: 500 (the production value) produces ~5% smaller PDFs than the pdf-lib default of 50, because each object stream then holds more grouped objects and deflate can match more repeated strings within a single stream.
Returns { bytes: Uint8Array, streamCount: number }.
A minimal in-browser script that registers a Paged.Handler subclass with one hook:
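// Layout start time. Not shown in the original excerpt; assumed to be captured when the script loads.
const start = performance.now();

class ProgressHandler extends Paged.Handler {
  constructor(chunker, polisher, caller) {
    super(chunker, polisher, caller);
    this.count = 0; // pages laid out so far
  }
  afterPageLayout(_pageElement, _page, _breakToken) {
    this.count++;
    const elapsed = ((performance.now() - start) / 1000).toFixed(1);
    console.log(`[render-progress] page=${this.count} elapsed=${elapsed}`);
  }
}
Paged.registerHandlers(ProgressHandler);

render-book.mjs intercepts these console messages via page.on('console', ...) and writes a \r-overwriting progress line to stdout on TTYs, or one line per 100 pages when stdout is piped.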
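
The Node side of this exchange can be sketched as a pure formatting function. `formatProgress`, the regular expression and the output wording are assumptions for illustration; only the `[render-progress] page=N elapsed=X` message shape and the TTY versus piped behaviour come from the description above.

```javascript
// Matches the handler's console message, with or without a trailing "s".
const PROGRESS_RE = /^\[render-progress\] page=(\d+) elapsed=([\d.]+)/;

// Sketch only: returns the text to write to stdout, or null to write nothing.
function formatProgress(text, isTTY) {
  const match = PROGRESS_RE.exec(text);
  if (!match) return null; // some other console message
  const pageNumber = Number(match[1]);
  const line = `layout: page ${pageNumber} (${match[2]}s)`;
  if (isTTY) return `\r${line}`;                      // overwrite in place
  return pageNumber % 100 === 0 ? `${line}\n` : null; // piped: every 100 pages
}

// Usage (assuming `page` is a puppeteer Page):
//   page.on('console', (msg) => {
//     const out = formatProgress(msg.text(), process.stdout.isTTY);
//     if (out) process.stdout.write(out);
//   });
```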
book/lib/paged.browser.js is a vendored, lightly patched copy of Paged.js v0.4.3 (MIT). Paged.js is a CSS Paged Media polyfill: it reads @page rules from the linked stylesheet, breaks the document into discrete DOM pages, resolves CSS counters, and copies running headers and footers from string-set declarations into each page's margin boxes. Chromium then renders the resulting DOM into a PDF.
Two globals control the polyfill:
window.PagedConfig --- configuration object read at load time.
| Key | Type | Description |
|---|---|---|
| `auto` | `boolean` | When false, paged.js does not run automatically when the bundle loads. The driver sets this before injecting the bundle. |
window.PagedPolyfill --- the main polyfill object, available after the bundle loads.
| Member | Description |
|---|---|
| `PagedPolyfill.preview()` | Runs the full layout pipeline. In the vendored bundle this is fully synchronous. |
Paged.js provides a plugin API for observing and intercepting the layout process. A handler is a class that extends Paged.Handler and is registered via Paged.registerHandlers() before preview() is called.
class MyHandler extends Paged.Handler {
constructor(chunker, polisher, caller) {
super(chunker, polisher, caller);
}
afterPageLayout(pageElement, page, breakToken) {
// fires after each page is fully laid out
}
}
Paged.registerHandlers(MyHandler);

Key lifecycle hooks (all optional overrides):

| Hook | Signature | When it fires |
|---|---|---|
| `beforeParsed` | `(content)` | Before the source document is processed. |
| `afterParsed` | `(parsed)` | After the source document has been processed, before layout begins. |
| `beforePageLayout` | `(page)` | Before a new page is laid out. |
| `afterPageLayout` | `(pageElement, page, breakToken)` | After each page is fully laid out. pageElement is the .pagedjs_page DOM node; breakToken holds the position where the next page starts. |
| `finalizePage` | `(pageElement, page, breakToken)` | After a page is finalised. Called slightly later than afterPageLayout; used by detach-pages.js to remove the previous page from the DOM. |
| `afterRendered` | `(pages)` | After all pages have been rendered, before page.pdf() runs. Used by detach-pages.js to restore pages in document order. |
After preview() completes, the document contains:
- A `.pagedjs_pages` container added to `<body>`, wrapping all pages.
- One `.pagedjs_page` per output PDF page. Each page contains `.pagedjs_area > .pagedjs_content` with the sliced chapter content.
- Margin boxes rendered from `@page` margin rules (`@top-right`, `@bottom-right`, etc.) carrying `string-set`-tracked running headers and footer page numbers.
render-book.mjs reads the page count after preview():
document.querySelectorAll('.pagedjs_pages > .pagedjs_page').length

In upstream paged.js, the layout process yields to the browser event loop every 100 objects. The vendored bundle removes these yield gates, making preview() a single synchronous call. Since the renderer runs inside headless Chromium where browser responsiveness is irrelevant, this is safe.
The await page.evaluate(...) wrapper in the driver is a puppeteer requirement for the CDP round-trip --- not a sign that preview() is async. The CDP response arrives only after the synchronous execution inside Chromium is fully complete.
Paged.js fetches the linked stylesheet via XHR to extract @page rules. Under file://, Chrome blocks this unless --allow-file-access-from-files is passed to Chromium at launch.
The key @page rules in docs/assets/css/print.css that paged.js acts on:
| Rule | Effect |
|---|---|
| `@page { size: A4; margin: 22mm; }` | Base page size and margins. |
| `@page { @bottom-right { content: string(part-title) " - " var(--page-num); } }` | Footer: part name and page number. |
| `@page { @top-right { content: string(chapter-title); } }` | Running header: current chapter title. |
string(chapter-title) is populated by the hidden .header-string <span> at the start of each <article class="page">, where print.css sets string-set: chapter-title content(text). var(--page-num) is a CSS custom property that paged.js writes to each .pagedjs_page element during layout; counter(page) would be the natural choice but breaks when detach-pages.js removes finalised pages from the DOM, so the custom property is used instead.
- Book Configuration -- the `_book.yml` manifest that controls what goes into `book.html`.
- Pipeline Stages -- the `pdf.mjs` and `book.mjs` interface contracts for Phase 8.
- tbdocs Builder -- design rationale for Phase 8 in the tbdocs pipeline.
- pdf-lib Patches -- detailed description of each `fast-*.mjs` shim: upstream problem, fix, and mechanism.
- Paged.js Patches -- detailed description of every patch to `paged.browser.js`.