Replace pdfcpu with pdfdisassembler for PDF attachment extraction #164
Open
pgundlach wants to merge 3 commits into
Switch the embedded-XML extraction in cmd/einvoice from pdfcpu to the read-only speedata/pdfdisassembler parser. The new code walks the catalog's EmbeddedFiles name tree directly (/Names and /Kids) and decodes the /EF/F stream.

Advantages of pdfdisassembler here:

- No strict PDF-version validation, so PDF/A-3 features like AFRelationship on files declaring an older header version no longer cause extraction to fail. This removes the need for the previous HeaderVersion=1.4 workaround and eliminates that whole failure class (see #162).
- Far smaller footprint: zero third-party transitive dependencies (pure stdlib) vs. pdfcpu's image codecs, crypto/pkcs7, cobra, yaml, etc. The cmd/einvoice binary shrinks from ~18.9 MB to ~9.8 MB (-48%).
- Smaller supply chain / attack surface and a read-only-by-design parser.
- It is a speedata library, so version-strictness issues can be fixed at the source instead of waiting on upstream pdfcpu releases.

Functionally equivalent: all 52 real invoices in the FeRD ZUGFeRD example set still extract and parse; only non-invoice PDFs (validation reports, supporting attachments) are rejected, now with a clearer message.

Add pdf_test.go covering extraction against minimal synthetic PDF/A-3 documents built in memory (known filename, nested name tree, .xml fallback, no attachments, no XML attachment) so the logic is exercised in CI without shipping licensed binary fixtures.
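For readers unfamiliar with PDF name trees, the /Names and /Kids walk mentioned above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `dict` and `collectNames` are hypothetical names, the dictionary is a plain Go map rather than a pdfdisassembler type, and indirect-reference resolution is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// dict is a hypothetical stand-in for a parsed PDF dictionary. It is NOT a
// pdfdisassembler type; a real parser would also resolve indirect references.
type dict map[string]any

// collectNames walks one name tree node and records every name/value pair.
// Leaf nodes carry /Names as a flat array [name1 value1 name2 value2 ...];
// intermediate nodes carry /Kids, an array of child nodes.
func collectNames(node dict, out map[string]any, depth int) error {
	const maxDepth = 32 // guard against cyclic or excessively deep trees
	if depth > maxDepth {
		return fmt.Errorf("name tree deeper than %d levels", maxDepth)
	}
	if names, ok := node["Names"].([]any); ok {
		for i := 0; i+1 < len(names); i += 2 {
			key, ok := names[i].(string)
			if !ok {
				continue // skip malformed pairs instead of failing the whole walk
			}
			out[key] = names[i+1]
		}
	}
	if kids, ok := node["Kids"].([]any); ok {
		for _, k := range kids {
			child, ok := k.(dict)
			if !ok {
				continue // ignore entries that are not dictionaries
			}
			if err := collectNames(child, out, depth+1); err != nil {
				return err
			}
		}
	}
	return nil
}

func main() {
	// A root node with two kids, each a leaf holding one file specification.
	root := dict{
		"Kids": []any{
			dict{"Names": []any{"factur-x.xml", dict{"Type": "Filespec"}}},
			dict{"Names": []any{"notes.txt", dict{"Type": "Filespec"}}},
		},
	}
	found := map[string]any{}
	if err := collectNames(root, found, 0); err != nil {
		fmt.Println("error:", err)
		return
	}
	names := make([]string, 0, len(found))
	for n := range found {
		names = append(names, n)
	}
	sort.Strings(names)
	fmt.Println(names)
}
```

The depth limit is a design choice of this sketch: a read-only parser that handles untrusted PDFs should not recurse without bound on a malformed tree.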
Member
Author
@fank I have replaced pdfcpu in the command-line interface. If you object, I'll delete the branch.
Collaborator
Please keep it open; I would like to inspect it in more depth.
Member
Author
Do you have any special (writing) plans with pdfdisassembler?
v0.0.3 adds Reader.EmbeddedFiles(), which walks the catalog's EmbeddedFiles name tree itself. Replace the manual catalog navigation and recursive collectEmbeddedFiles() helper with a single call.
Switches the embedded-XML extraction in `cmd/einvoice` from pdfcpu to the read-only speedata/pdfdisassembler parser. The new code walks the catalog's `EmbeddedFiles` name tree directly (`/Names` and `/Kids`) and decodes the `/EF/F` stream. The public `extractXMLFromPDF` signature is unchanged.

Why
Investigating #162: a user reported only 8/118 ZUGFeRD 2.5 example PDFs validating, with two error classes: `AFRelationship: unsupported in version 1.3` and `no attachments available`. Testing the current code against the official FeRD example set showed the real cause:

- `_fx.pdf` invoices extract fine already; the 59 "failures" are non-invoice PDFs (52 validation reports + 7 supporting attachments) that legitimately contain no embedded invoice XML.
- `AFRelationship: unsupported in version 1.3` errors are not reproducible with pdfcpu 0.13.0 (even without the HeaderVersion workaround); they come from an older pdfcpu in the reporter's prebuilt binary. The PDF-extraction code lives in `cmd/einvoice` (package `main`), so a third-party tool that only imports the library never gets it anyway.

pdfdisassembler removes the fragility at the root.
Advantages
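- No strict PDF-version validation: the `HeaderVersion=1.4` workaround is gone and that whole failure class is structurally eliminated.
- Far smaller footprint: zero third-party transitive dependencies (pure stdlib) vs. pdfcpu's image codecs (`x/image`, `hhrutter/tiff`), `x/crypto/pkcs7`, `cobra`, `yaml`, etc. The `cmd/einvoice` binary shrinks ~18.9 MB → ~9.8 MB (-48%).

Verification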
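- All 52 real invoices in the FeRD ZUGFeRD example set still extract and parse; only non-invoice PDFs are rejected, now with a clearer message (`PDF contains no embedded files`).
- `pdf_test.go` covers extraction against minimal synthetic PDF/A-3 documents built in memory (known filename, nested `/Kids` name tree, `.xml` fallback, no attachments, no XML attachment), so the logic runs in CI without shipping licensed binary fixtures.
- `go test ./...` passes; `gofmt` clean.

Refs #162.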
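The cases covered by the tests (known filename, `.xml` fallback, no attachments, no XML attachment) suggest a selection step along the following lines. This is a hedged sketch, not the PR's implementation: `attachment`, `pickInvoiceXML`, the list of known filenames, and the wording of the second error are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// attachment is a hypothetical, simplified view of one embedded file.
type attachment struct {
	Name string
	Data []byte
}

var (
	// Message text taken from the PR description.
	errNoAttachments = errors.New("PDF contains no embedded files")
	// Assumed wording; the actual message in the PR may differ.
	errNoXML = errors.New("PDF contains no XML attachment")
)

// knownNames lists conventional invoice attachment names (an assumption of
// this sketch; the PR may check a different set).
var knownNames = []string{"factur-x.xml", "zugferd-invoice.xml", "xrechnung.xml"}

// pickInvoiceXML prefers a well-known invoice filename and otherwise falls
// back to the first attachment whose name ends in .xml.
func pickInvoiceXML(files []attachment) (attachment, error) {
	if len(files) == 0 {
		return attachment{}, errNoAttachments
	}
	for _, want := range knownNames {
		for _, f := range files {
			if strings.EqualFold(f.Name, want) {
				return f, nil
			}
		}
	}
	for _, f := range files {
		if strings.HasSuffix(strings.ToLower(f.Name), ".xml") {
			return f, nil
		}
	}
	return attachment{}, errNoXML
}

func main() {
	files := []attachment{
		{Name: "terms.pdf"},
		{Name: "other.xml"},
		{Name: "factur-x.xml"},
	}
	f, err := pickInvoiceXML(files)
	fmt.Println(f.Name, err)

	_, err = pickInvoiceXML(nil)
	fmt.Println(err)
}
```

Checking known names before the generic `.xml` fallback means a supporting XML attachment listed earlier in the name tree does not shadow the actual invoice.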