Replace pdfcpu with pdfdisassembler for PDF attachment extraction by pgundlach · Pull Request #164 · speedata/einvoice

pgundlach · 2026-06-17T20:47:48Z

Switches the embedded-XML extraction in cmd/einvoice from pdfcpu to the read-only speedata/pdfdisassembler parser. The new code walks the catalog's EmbeddedFiles name tree directly (/Names and /Kids) and decodes the /EF/F stream. The public extractXMLFromPDF signature is unchanged.
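As a rough illustration of the name-tree walk described above, here is a minimal sketch over a hypothetical in-memory model. The `nameTreeNode` and `nameEntry` types and the `walkNameTree` function are invented for this example; they are not the pdfdisassembler API and not the code in this PR. The sketch assumes leaf nodes carry /Names pairs, intermediate nodes carry /Kids, and the /EF/F stream has already been decoded.

```go
package main

import "fmt"

// nameEntry is a hypothetical pair from a /Names array: the name-tree key
// plus the already-decoded /EF/F stream of the file specification.
type nameEntry struct {
	Name   string
	Stream []byte
}

// nameTreeNode is a hypothetical model of a PDF name-tree node.
// Leaf nodes carry Names; intermediate nodes carry Kids.
type nameTreeNode struct {
	Names []nameEntry
	Kids  []*nameTreeNode
}

// walkNameTree visits the tree depth-first and returns every entry it finds.
// The visited set guards against cyclic /Kids references in malformed files.
func walkNameTree(root *nameTreeNode) []nameEntry {
	var out []nameEntry
	visited := map[*nameTreeNode]bool{}
	var walk func(n *nameTreeNode)
	walk = func(n *nameTreeNode) {
		if n == nil || visited[n] {
			return
		}
		visited[n] = true
		out = append(out, n.Names...)
		for _, kid := range n.Kids {
			walk(kid)
		}
	}
	walk(root)
	return out
}

func main() {
	leaf := &nameTreeNode{Names: []nameEntry{{Name: "factur-x.xml", Stream: []byte("<xml/>")}}}
	root := &nameTreeNode{Kids: []*nameTreeNode{leaf}}
	for _, e := range walkNameTree(root) {
		fmt.Printf("%s (%d bytes)\n", e.Name, len(e.Stream))
	}
}
```

The cycle guard is a design choice of this sketch for robustness against malformed input; whether the real parser does the same is not stated in the PR.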

Why

Investigating #162: a user reported only 8/118 ZUGFeRD 2.5 example PDFs validating, with two error classes — AFRelationship: unsupported in version 1.3 and no attachments available. Testing the current code against the official FeRD example set showed the real cause:

- All 52 real _fx.pdf invoices extract fine already; the 59 "failures" are non-invoice PDFs (52 validation reports + 7 supporting attachments) that legitimately contain no embedded invoice XML.
- The AFRelationship: unsupported in version 1.3 errors are not reproducible with pdfcpu 0.13.0 (even without the HeaderVersion workaround); they come from an older pdfcpu in the reporter's prebuilt binary. The PDF-extraction code lives in cmd/einvoice (package main), so a third-party tool that only imports the library never gets it anyway.

pdfdisassembler removes the fragility at the root.

Advantages

- No strict PDF-version validation, so PDF/A-3 features like AFRelationship on files declaring an older header version no longer break extraction. The previous HeaderVersion=1.4 workaround is gone and that whole failure class is structurally eliminated.
- Zero third-party transitive dependencies (pure stdlib) vs. pdfcpu's image codecs (x/image, hhrutter/tiff), x/crypto/pkcs7, cobra, yaml, etc. The cmd/einvoice binary shrinks ~18.9 MB → ~9.8 MB (-48%).
- Smaller supply chain / attack surface; read-only by design.
- It is a speedata library, so version-strictness issues can be fixed at the source instead of waiting on upstream pdfcpu releases.

Verification

- Functional test against the FeRD ZUGFeRD example set: 52/52 real invoices still extract and parse, exactly the same set of files as before. Non-invoice PDFs are rejected with a clearer message (PDF contains no embedded files).
- New pdf_test.go covers extraction against minimal synthetic PDF/A-3 documents built in memory (known filename, nested /Kids name tree, .xml fallback, no attachments, no XML attachment), so the logic runs in CI without shipping licensed binary fixtures.
- go test ./... passes; gofmt clean.
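The test cases listed above (known filename, .xml fallback, no attachments, no XML attachment) suggest a selection step roughly like the following sketch. Everything here is an assumption for illustration: the `attachment` type, the `pickInvoiceXML` function, the list of known filenames, and the second error message are hypothetical and not taken from the PR. Only the message "PDF contains no embedded files" appears in the description.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// attachment is a hypothetical embedded file: its name and decoded bytes.
type attachment struct {
	Name string
	Data []byte
}

// knownInvoiceNames is an assumed list of conventional invoice attachment
// names; the actual list used by cmd/einvoice may differ.
var knownInvoiceNames = []string{"factur-x.xml", "zugferd-invoice.xml", "xrechnung.xml"}

var (
	errNoAttachments = errors.New("PDF contains no embedded files")
	errNoXML         = errors.New("PDF contains no XML attachment") // hypothetical wording
)

// pickInvoiceXML prefers a known invoice filename, then falls back to the
// first attachment whose name ends in .xml (both checks case-insensitive).
func pickInvoiceXML(files []attachment) ([]byte, error) {
	if len(files) == 0 {
		return nil, errNoAttachments
	}
	for _, known := range knownInvoiceNames {
		for _, f := range files {
			if strings.EqualFold(f.Name, known) {
				return f.Data, nil
			}
		}
	}
	for _, f := range files {
		if strings.HasSuffix(strings.ToLower(f.Name), ".xml") {
			return f.Data, nil
		}
	}
	return nil, errNoXML
}

func main() {
	files := []attachment{
		{Name: "terms.pdf", Data: []byte("%PDF")},
		{Name: "factur-x.xml", Data: []byte("<invoice/>")},
	}
	data, err := pickInvoiceXML(files)
	fmt.Println(string(data), err)
}
```

Keeping the two error cases distinct mirrors the PR's goal of rejecting non-invoice PDFs with a clearer message than a generic extraction failure.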

Refs #162.

Switch the embedded-XML extraction in cmd/einvoice from pdfcpu to the read-only speedata/pdfdisassembler parser. The new code walks the catalog's EmbeddedFiles name tree directly (/Names and /Kids) and decodes the /EF/F stream.

Advantages of pdfdisassembler here:

- No strict PDF-version validation, so PDF/A-3 features like AFRelationship on files declaring an older header version no longer cause extraction to fail. This removes the need for the previous HeaderVersion=1.4 workaround and eliminates that whole failure class (see #162).
- Far smaller footprint: zero third-party transitive dependencies (pure stdlib) vs. pdfcpu's image codecs, crypto/pkcs7, cobra, yaml, etc. The cmd/einvoice binary shrinks from ~18.9 MB to ~9.8 MB (-48%).
- Smaller supply chain / attack surface and a read-only-by-design parser.
- It is a speedata library, so version-strictness issues can be fixed at the source instead of waiting on upstream pdfcpu releases.

Functionally equivalent: all 52 real invoices in the FeRD ZUGFeRD example set still extract and parse; only non-invoice PDFs (validation reports, supporting attachments) are rejected, now with a clearer message.

Add pdf_test.go covering extraction against minimal synthetic PDF/A-3 documents built in memory (known filename, nested name tree, .xml fallback, no attachments, no XML attachment) so the logic is exercised in CI without shipping licensed binary fixtures.

pgundlach · 2026-06-18T06:39:41Z

@fank I have replaced pdfcpu in the command line interface. If you object, I'd delete the branch.

fank · 2026-06-18T11:40:06Z

Please keep it open, I would like to inspect it more in depth.

fank · 2026-06-18T12:08:15Z

~~@pgundlach could you give me write access to https://github.com/speedata/pdfdisassembler ?~~
Thx

pgundlach · 2026-06-18T12:19:22Z

Do you have any particular plans for pdfdisassembler that would need write access?

pgundlach added 2 commits June 17, 2026 22:42

Fix staticcheck QF1012 in pdf_test.go (use fmt.Fprintf)

1760195

Update pdfdisassembler to v0.0.3 and use EmbeddedFiles()

8531bc1

v0.0.3 adds Reader.EmbeddedFiles(), which walks the catalog's EmbeddedFiles name tree itself. Replace the manual catalog navigation and recursive collectEmbeddedFiles() helper with a single call.

pgundlach wants to merge 3 commits into main from investigate-zugferd25-pdf-extraction