Closed
Description
When using page.get_text_blocks on a specific document (attached), every single space is returned as the replacement character U+FFFD (code point 65533, rendered as a question mark), e.g. "The�Count�of�Monte�Cristo\n". I'm aware that this is how MuPDF/PyMuPDF denotes glyphs it cannot map to Unicode, but it's odd that the same document reads fine in Apple's Preview and Google's Chrome/PDFium.
797The-Count-of-Monte-Cristo.pdf
To Reproduce (mandatory)
import fitz  # PyMuPDF

PDF_PATH = "797The-Count-of-Monte-Cristo.pdf"  # the attached document

flags = fitz.TEXT_DEHYPHENATE | fitz.TEXT_MEDIABOX_CLIP
with fitz.open(PDF_PATH) as doc:
    page = doc[0]
    blocks = page.get_text_blocks(flags=flags)
    for i, block in enumerate(blocks):
        print(f'{i} - {block}')
Your configuration (mandatory)
- macOS and Ubuntu
- Python 3.11
3.11.3 (v3.11.3:f3909b8bc8, Apr 4 2023, 20:12:10) [Clang 13.0.0 (clang-1300.0.29.30)]
darwin
PyMuPDF 1.22.5: Python bindings for the MuPDF 1.22.2 library.
Version date: 2023-06-21 00:00:01.
Built for Python 3.11 on darwin (64-bit).
Additional question
Does (Py)MuPDF have any sort of 'drop invalid characters' option? Ideally it would drop U+FFFD (�) as well as other problem characters (e.g. the split surrogates from #2608, or Private Use code points such as U+10FC31).
Of course, I can do string.replace('\uFFFD', ' '), but that doesn't work well with the page.get_text_words result, and I'd need to compile a list of all 'bad characters', which seems wrong.
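For illustration, here is a minimal sketch of the post-processing workaround that avoids a hand-written list of bad characters. It is not a PyMuPDF feature: the names clean_text and BAD_CATEGORIES are hypothetical, and treating the Unicode general categories Cs (surrogates), Co (Private Use) and Cn (unassigned) as "bad" is an assumption that may not suit every document.

```python
import unicodedata

# Assumption: these Unicode general categories count as "bad" characters.
# Cs = surrogates, Co = private use, Cn = unassigned code points.
BAD_CATEGORIES = {"Cs", "Co", "Cn"}

# U+FFFD has category "So" (Symbol, other), so it is checked separately.
REPLACEMENT_CHAR = "\ufffd"


def clean_text(text: str, substitute: str = " ") -> str:
    """Replace U+FFFD and characters in BAD_CATEGORIES with `substitute`."""
    return "".join(
        substitute
        if ch == REPLACEMENT_CHAR or unicodedata.category(ch) in BAD_CATEGORIES
        else ch
        for ch in text
    )


if __name__ == "__main__":
    # The sample string from the description, with U+FFFD in place of spaces.
    print(repr(clean_text("The\ufffdCount\ufffdof\ufffdMonte\ufffdCristo\n")))
    # Passing an empty substitute drops the characters instead of replacing them.
    print(repr(clean_text("a\U0010FC31b", substitute="")))
```

This only works on strings after extraction. The tuples returned by page.get_text_words would still have been segmented before the cleanup, so it does not replace a proper option in the library.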