Replace pdfcpu with pdfdisassembler for PDF attachment extraction #164

Open: pgundlach wants to merge 3 commits into main from investigate-zugferd25-pdf-extraction

@pgundlach

Member

Switches the embedded-XML extraction in cmd/einvoice from pdfcpu to the read-only speedata/pdfdisassembler parser. The new code walks the catalog's EmbeddedFiles name tree directly (/Names and /Kids) and decodes the /EF/F stream. The public extractXMLFromPDF signature is unchanged.
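For readers unfamiliar with PDF name trees, the walk described above can be sketched roughly as follows. This is not the code from this PR and does not use the pdfdisassembler API: `nameTreeNode`, `nameEntry`, and `walkNameTree` are hypothetical names, and the tree is modelled as plain Go structs rather than parsed PDF objects. Leaf nodes hold `/Names` entries, intermediate nodes hold `/Kids`, and the payload stands in for the already decoded `/EF/F` stream.

```go
package main

import "fmt"

// nameEntry is a hypothetical pair of attachment name and decoded payload
// (the payload stands in for the decoded /EF/F stream).
type nameEntry struct {
	Name string
	Data []byte
}

// nameTreeNode is a hypothetical in-memory model of a PDF name-tree node.
// Leaf nodes carry /Names entries, intermediate nodes carry /Kids.
type nameTreeNode struct {
	Names []nameEntry
	Kids  []*nameTreeNode
}

// walkNameTree collects all entries depth-first. The visited set guards
// against cyclic /Kids references, which can occur in malformed files.
func walkNameTree(n *nameTreeNode, visited map[*nameTreeNode]bool) []nameEntry {
	if n == nil || visited[n] {
		return nil
	}
	visited[n] = true
	out := append([]nameEntry(nil), n.Names...)
	for _, kid := range n.Kids {
		out = append(out, walkNameTree(kid, visited)...)
	}
	return out
}

func main() {
	leaf := &nameTreeNode{Names: []nameEntry{{Name: "factur-x.xml", Data: []byte("<xml/>")}}}
	root := &nameTreeNode{Kids: []*nameTreeNode{leaf}}
	for _, e := range walkNameTree(root, map[*nameTreeNode]bool{}) {
		fmt.Printf("%s (%d bytes)\n", e.Name, len(e.Data))
	}
}
```

The cycle guard is a design choice of this sketch; whether the actual implementation needs one depends on how the parser resolves indirect references.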

Why

Investigating #162: a user reported only 8/118 ZUGFeRD 2.5 example PDFs validating, with two error classes — AFRelationship: unsupported in version 1.3 and no attachments available. Testing the current code against the official FeRD example set showed the real cause:

  • All 52 real _fx.pdf invoices extract fine already; the 59 "failures" are non-invoice PDFs (52 validation reports + 7 supporting attachments) that legitimately contain no embedded invoice XML.
  • The AFRelationship: unsupported in version 1.3 errors are not reproducible with pdfcpu 0.13.0 (even without the HeaderVersion workaround) — they come from an older pdfcpu in the reporter's prebuilt binary. The PDF-extraction code lives in cmd/einvoice (package main), so a third-party tool that only imports the library never gets it anyway.

pdfdisassembler removes the fragility at the root.

Advantages

  • No strict PDF-version validation, so PDF/A-3 features like AFRelationship on files declaring an older header version no longer break extraction. The previous HeaderVersion=1.4 workaround is gone and that whole failure class is structurally eliminated.
  • Zero third-party transitive dependencies (pure stdlib) vs. pdfcpu's image codecs (x/image, hhrutter/tiff), x/crypto/pkcs7, cobra, yaml, etc. The cmd/einvoice binary shrinks ~18.9 MB → ~9.8 MB (-48%).
  • Smaller supply chain / attack surface; read-only by design.
  • It is a speedata library, so version-strictness issues can be fixed at the source instead of waiting on upstream pdfcpu releases.

Verification

  • Functional test against the FeRD ZUGFeRD example set: 52/52 real invoices still extract and parse — exactly the same set of files as before. Non-invoice PDFs are rejected with a clearer message (PDF contains no embedded files).
  • New pdf_test.go covers extraction against minimal synthetic PDF/A-3 documents built in memory (known filename, nested /Kids name tree, .xml fallback, no attachments, no XML attachment) — so the logic runs in CI without shipping licensed binary fixtures.
  • go test ./... passes; gofmt clean.
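The test cases listed above (known filename, .xml fallback, no attachments, no XML attachment) suggest a selection step along the following lines. This is a minimal sketch under stated assumptions, not the PR's implementation: `attachment`, `pickInvoiceXML`, the contents of `knownInvoiceNames`, and the wording of the second error are all hypothetical. Only the "PDF contains no embedded files" message is taken from the description.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// attachment is a hypothetical (name, payload) pair for an embedded file.
type attachment struct {
	Name string
	Data []byte
}

// knownInvoiceNames lists attachment names commonly used by ZUGFeRD,
// Factur-X, and XRechnung hybrid PDFs. Assumption: the real tool may
// check a different list.
var knownInvoiceNames = []string{"factur-x.xml", "zugferd-invoice.xml", "xrechnung.xml"}

var (
	errNoEmbeddedFiles = errors.New("PDF contains no embedded files")
	errNoXML           = errors.New("PDF contains no XML attachment") // hypothetical wording
)

// pickInvoiceXML prefers a well-known invoice filename (case-insensitive)
// and falls back to the first attachment whose name ends in .xml.
func pickInvoiceXML(files []attachment) ([]byte, error) {
	if len(files) == 0 {
		return nil, errNoEmbeddedFiles
	}
	for _, known := range knownInvoiceNames {
		for _, f := range files {
			if strings.EqualFold(f.Name, known) {
				return f.Data, nil
			}
		}
	}
	for _, f := range files {
		if strings.HasSuffix(strings.ToLower(f.Name), ".xml") {
			return f.Data, nil
		}
	}
	return nil, errNoXML
}

func main() {
	files := []attachment{
		{Name: "notes.xml", Data: []byte("<notes/>")},
		{Name: "factur-x.xml", Data: []byte("<invoice/>")},
	}
	data, err := pickInvoiceXML(files)
	fmt.Println(string(data), err)
}
```

Keeping the two error cases distinct mirrors the clearer rejection message mentioned for non-invoice PDFs.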

Refs #162.

Switch the embedded-XML extraction in cmd/einvoice from pdfcpu to the
read-only speedata/pdfdisassembler parser. The new code walks the catalog's
EmbeddedFiles name tree directly (/Names and /Kids) and decodes the /EF/F
stream.

Advantages of pdfdisassembler here:
- No strict PDF-version validation, so PDF/A-3 features like AFRelationship
  on files declaring an older header version no longer cause extraction to
  fail. This removes the need for the previous HeaderVersion=1.4 workaround
  and eliminates that whole failure class (see #162).
- Far smaller footprint: zero third-party transitive dependencies (pure
  stdlib) vs. pdfcpu's image codecs, crypto/pkcs7, cobra, yaml, etc. The
  cmd/einvoice binary shrinks from ~18.9 MB to ~9.8 MB (-48%).
- Smaller supply chain / attack surface and a read-only-by-design parser.
- It is a speedata library, so version-strictness issues can be fixed at the
  source instead of waiting on upstream pdfcpu releases.

Functionally equivalent: all 52 real invoices in the FeRD ZUGFeRD example set
still extract and parse; only non-invoice PDFs (validation reports, supporting
attachments) are rejected, now with a clearer message.

Add pdf_test.go covering extraction against minimal synthetic PDF/A-3
documents built in memory (known filename, nested name tree, .xml fallback,
no attachments, no XML attachment) so the logic is exercised in CI without
shipping licensed binary fixtures.
@pgundlach

Member Author

@fank I have replaced pdfcpu in the command-line interface. If you object, I'll delete the branch.

@fank

fank commented Jun 18, 2026

Collaborator

Please keep it open; I would like to inspect it in more depth.

@fank

fank commented Jun 18, 2026

Collaborator

@pgundlach Could you give me write access to https://github.com/speedata/pdfdisassembler?
Thanks.

@pgundlach

Member Author

Do you have any particular plans for pdfdisassembler that need write access?

v0.0.3 adds Reader.EmbeddedFiles(), which walks the catalog's
EmbeddedFiles name tree itself. Replace the manual catalog navigation
and recursive collectEmbeddedFiles() helper with a single call.