ENH: Font: Initialise a Font from an embedded font file #3704
Open
PJBrs wants to merge 6 commits into py-pdf:main
This patch adds a character_map when initialising a Font from an embedded font file. It uses getGlyphOrder to build that character_map, because getGlyphOrder is sure to include all glyphs in a font, not just the ones that are mapped from a Unicode code point. For now, this does not make much difference, but in the future it might make it easier to collect all widths in the font.
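To illustrate the difference between the glyph order and the cmap, here is a minimal sketch. It is not the code from this PR: `split_glyphs` and `FontLike` are hypothetical names, and the only assumption about the font object is that it offers `getGlyphOrder()` and `getBestCmap()`, as a fontTools `TTFont` does.

```python
from __future__ import annotations

from typing import Optional, Protocol


class FontLike(Protocol):
    """The two TTFont methods this sketch relies on (hypothetical helper type)."""

    def getGlyphOrder(self) -> list[str]: ...

    def getBestCmap(self) -> Optional[dict[int, str]]: ...


def split_glyphs(font: FontLike) -> tuple[dict[str, str], list[str]]:
    """Return (glyph name -> character, glyph names without a code point)."""
    # getBestCmap() maps Unicode code points to glyph names; it may be None
    # when the font has no Unicode cmap subtable.
    best_cmap = font.getBestCmap() or {}
    mapped: dict[str, str] = {}
    for code_point, glyph_name in best_cmap.items():
        # Keep the first code point seen if a glyph is mapped more than once.
        mapped.setdefault(glyph_name, chr(code_point))
    # getGlyphOrder() lists every glyph, including ones the cmap never reaches.
    unmapped = [name for name in font.getGlyphOrder() if name not in mapped]
    return mapped, unmapped
```

With fonttools installed, the same function should accept `TTFont(io.BytesIO(font_bytes))`, where `font_bytes` stands for the decoded embedded font stream. Glyphs such as `.notdef` or stylistic alternates would then typically end up in the second return value.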
This patch tries to detect font flags more comprehensively. Furthermore, it adds some checks to deal with missing tables in TrueType fonts. It is an open question what to do when the cmap itself is missing. In this version, we just continue, but perhaps we should raise a warning or even an error, because, in practice, the resulting font would not be usable.
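For readers unfamiliar with font flags: they are a bit field in the PDF font descriptor. The sketch below shows how such a value could be assembled. It is not the detection logic in _font.py: `FontFlag` and `guess_flags` are hypothetical names, and the three inputs stand in for whatever the font tables provide (for example a fixed-pitch indicator and an italic angle). Only the bit positions are taken from the PDF specification.

```python
from enum import IntFlag


class FontFlag(IntFlag):
    """Font descriptor flag bits (hypothetical enum; bit positions per the PDF spec)."""

    FIXED_PITCH = 1 << 0   # bit 1
    SERIF = 1 << 1         # bit 2
    SYMBOLIC = 1 << 2      # bit 3
    SCRIPT = 1 << 3        # bit 4
    NONSYMBOLIC = 1 << 5   # bit 6
    ITALIC = 1 << 6        # bit 7
    ALL_CAP = 1 << 16      # bit 17
    SMALL_CAP = 1 << 17    # bit 18
    FORCE_BOLD = 1 << 18   # bit 19


def guess_flags(is_fixed_pitch: bool, italic_angle: float, is_symbolic: bool) -> FontFlag:
    """Combine a few font properties into a flags value (illustrative only)."""
    flags = FontFlag(0)
    if is_fixed_pitch:
        flags |= FontFlag.FIXED_PITCH
    if italic_angle != 0:
        flags |= FontFlag.ITALIC
    # Symbolic and Nonsymbolic are mutually exclusive, so set exactly one.
    flags |= FontFlag.SYMBOLIC if is_symbolic else FontFlag.NONSYMBOLIC
    return flags
```

Because `IntFlag` members are integers, `int(guess_flags(...))` gives the number that would be stored under /Flags in a font descriptor.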
This patch adds a test and a file with sample font resources that each have specific font flags and/or specific missing tables, to exercise all the if conditions in _font.py. The font resources were added using pypdf itself, and were lifted from PDF files that are already part of the current test suite.
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##             main    #3704      +/-   ##
==========================================
+ Coverage   97.43%   97.45%   +0.02%
==========================================
  Files          55       55
  Lines       10016    10117     +101
  Branches     1841     1855      +14
==========================================
+ Hits         9759     9860     +101
  Misses        149      149
  Partials      108      108
This PR adds the capability to initialise a Font instance from an embedded font file. It adds fonttools as an optional dependency to parse the font file.
I tested this to see if the resulting Font instances can be used for text extraction, and it mostly can, barring a couple of exceptions. Then again, this ultimately isn't intended for text extraction but for creating new appearance streams.
I also added a fontsampler file that I created using pypdf and that contains selected fonts from the existing test files. These embedded fonts exercise all the different if conditions in this PR for dealing with font flags. The file also contains one font that apparently does not include a cmap. This PR raises a KeyError in that case.
This PR is a small part of #3652 and it includes all work from #3602. I created it to make review more manageable. It should be ready as is!