Add test cases for AGL. #101

@dannywinrow

I tried to run pdPageExtractText on the PDF located at:
https://www.gov.im/media/1360682/isle-of-man-inflation-report-november-2021.pdf

However, every character of the text was interpreted as "\0".

After much pain and effort trawling through the PDFIO code, I have identified the problem as being what is returned by the fum function in PDFont. It occurs when the cn"Encoding" object contains a /Differences array with glyph names such as /uni0047, which simply represents the Unicode character U+0047 ('G'). Since the AGL_Glyph_To_Unicode dictionary (not sure where this comes from) doesn't contain these simple uniXXXX mappings, zero(Char) is returned instead.
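
The failure mode described above can be reproduced in miniature. This is only an illustrative sketch in Python, not PDFIO's Julia code: `AGL_SUBSET` and `lookup_glyph` are hypothetical stand-ins for the AGL_Glyph_To_Unicode dictionary and the lookup that falls back to a NUL character.

```python
# Hypothetical, tiny stand-in for a glyph-name table.
# It is NOT PDFIO's actual data; it only holds ordinary AGL-style names.
AGL_SUBSET = {"A": "A", "G": "G", "space": " "}


def lookup_glyph(name: str) -> str:
    """Return the mapped character, or NUL when the glyph name is unknown."""
    return AGL_SUBSET.get(name.lstrip("/"), "\0")


# Glyph names as they might appear in a /Differences array (assumed sample).
differences = ["/uni0047", "/uni006F", "/G"]
decoded = [lookup_glyph(n) for n in differences]

# The uniXXXX names are not in the table, so they collapse to NUL,
# while the plain name "G" resolves normally.
print([hex(ord(c)) for c in decoded])
```

A font whose /Differences array uses only uniXXXX names would therefore decode every character as NUL under this kind of lookup, which matches the symptom reported.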

One solution might be to check a name such as /uni0047 against the base encoding dictionary and, if the 0x0047 code point exists there, add a dictionary entry for it. Another solution would be to add all of the standard Unicode characters that already exist in your base encoding, in their uniXXXX form (such as /uni0047), to the AGL_Glyph_To_Unicode dictionary.
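
A third option, sketched below, is to parse the name itself when the dictionary lookup fails. The Adobe Glyph List Specification defines `uni` followed by one or more groups of four uppercase hex digits, and `u` followed by four to six uppercase hex digits, as direct code point names. The Python below is a hedged sketch of that rule, not PDFIO code; `glyph_to_text` and the `table` argument are hypothetical names, and name components separated by `_` or suffixes after `.` are deliberately not handled.

```python
import re

# "uni" + one or more groups of exactly four uppercase hex digits.
_UNI = re.compile(r"uni((?:[0-9A-F]{4})+)")
# "u" + four to six uppercase hex digits.
_U = re.compile(r"u([0-9A-F]{4,6})")


def glyph_to_text(name: str, table: dict) -> str:
    """Map a glyph name to text: table first, then uniXXXX / uXXXX parsing.

    `table` stands in for a glyph-name dictionary (hypothetical parameter).
    Returns NUL when the name cannot be resolved.
    """
    name = name.lstrip("/")
    if name in table:
        return table[name]

    m = _UNI.fullmatch(name)
    if m:
        hexrun = m.group(1)
        cps = [int(hexrun[i:i + 4], 16) for i in range(0, len(hexrun), 4)]
    else:
        m = _U.fullmatch(name)
        cps = [int(m.group(1), 16)] if m else []

    # Reject unresolved names, surrogates, and out-of-range values.
    if not cps or any(0xD800 <= c <= 0xDFFF or c > 0x10FFFF for c in cps):
        return "\0"
    return "".join(chr(c) for c in cps)


print(glyph_to_text("/uni0047", {}))  # G
```

Parsing on a lookup miss avoids having to enumerate every uniXXXX name in the dictionary, at the cost of one extra pattern match for names that are missing from the table.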

In suggesting these solutions, I have assumed that the cn"Encoding" object is taken directly from the PDF file and not processed further.

If you'd like me to try to create a pull request, I'd be happy to, but I thought I'd ask first in case your more holistic view of the project leads to a more effective solution.
