A document has every space encoded as � #2609

@nikitar

When using page.get_text_blocks on a specific document (attached), every single space comes back as the Unicode replacement character U+FFFD (code point 65533), e.g. "The�Count�of�Monte�Cristo\n". I'm aware that this is how MuPDF/PyMuPDF denotes glyphs it cannot interpret, but it's odd that the same document reads fine in Apple's Preview and Google's Chrome/PDFium.

797The-Count-of-Monte-Cristo.pdf

To Reproduce (mandatory)

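    import fitz  # PyMuPDF

    # Path to the attached document; adjust to wherever it was saved.
    PDF_PATH = "797The-Count-of-Monte-Cristo.pdf"
    flags = fitz.TEXT_DEHYPHENATE | fitz.TEXT_MEDIABOX_CLIP
    with fitz.open(PDF_PATH) as doc:
        page = doc[0]
        blocks = page.get_text_blocks(flags=flags)
        # Each block is a tuple; the extracted text is at index 4.
        for i, block in enumerate(blocks):
            print(f'{i} - {block}')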

Your configuration (mandatory)

  • macOS and Ubuntu
  • Python 3.11
3.11.3 (v3.11.3:f3909b8bc8, Apr  4 2023, 20:12:10) [Clang 13.0.0 (clang-1300.0.29.30)] 
 darwin 
 
PyMuPDF 1.22.5: Python bindings for the MuPDF 1.22.2 library.
Version date: 2023-06-21 00:00:01.
Built for Python 3.11 on darwin (64-bit).

Additional question

Does (py)mupdf have any sort of 'drop invalid characters' option? Ideally we'd drop both �'s and others (e.g. split surrogates from #2608, or Private Use ones such as U+10FC31).

Of course, I can do string.replace('\uFFFD', ' '), but that messes with page.get_text_words result, plus I'd need to compile a list of all 'bad characters', which seems wrong.
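
To make the "list of bad characters" concern concrete, here is a minimal post-processing sketch that filters by Unicode general category instead of enumerating code points. It is not a PyMuPDF option: `clean_text` and `DROP_CATEGORIES` are hypothetical names, and treating U+FFFD as a space is an assumption that only holds for documents like this one.

```python
import unicodedata

# Categories treated as "invalid" here (an assumption, adjust as needed):
# Cs = surrogates, Co = private use, Cn = unassigned/noncharacters.
DROP_CATEGORIES = {"Cs", "Co", "Cn"}
REPLACEMENT_CHAR = "\ufffd"


def clean_text(text, replacement_to=" "):
    """Hypothetical helper: map U+FFFD to `replacement_to` and drop
    every character whose general category is in DROP_CATEGORIES."""
    out = []
    for ch in text:
        if ch == REPLACEMENT_CHAR:
            out.append(replacement_to)
        elif unicodedata.category(ch) in DROP_CATEGORIES:
            continue  # skip surrogates, private-use and unassigned code points
        else:
            out.append(ch)
    return "".join(out)


print(clean_text("The\ufffdCount\ufffdof\ufffdMonte\ufffdCristo\n"))
```

This could be applied to the text element (index 4) of each tuple returned by page.get_text_blocks. It does not address the page.get_text_words problem, since word boundaries are decided before any such clean-up runs.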
