A document has every space encoded as �

When using `page.get_text_blocks` on a specific document (attached), every single space becomes the replacement character � (U+FFFD, decimal 65533), e.g. `"The�Count�of�Monte�Cristo\n"`. I'm aware that this is how MuPDF/PyMuPDF denotes glyphs it cannot map to a Unicode character, but it's odd that the same document reads fine in Apple's Preview and Google's Chrome/PDFium.

[797The-Count-of-Monte-Cristo.pdf](https://github.com/pymupdf/PyMuPDF/files/12404408/797The-Count-of-Monte-Cristo.pdf)



## To Reproduce (mandatory)

```
import fitz  # PyMuPDF

PDF_PATH = "797The-Count-of-Monte-Cristo.pdf"  # the attached document

flags = fitz.TEXT_DEHYPHENATE | fitz.TEXT_MEDIABOX_CLIP
with fitz.open(PDF_PATH) as doc:
    page = doc[0]
    blocks = page.get_text_blocks(flags=flags)
    for i, block in enumerate(blocks):
        print(f'{i} - {block}')
```


## Your configuration (mandatory)
 - macOS and Ubuntu
 - Python 3.11

```
3.11.3 (v3.11.3:f3909b8bc8, Apr  4 2023, 20:12:10) [Clang 13.0.0 (clang-1300.0.29.30)] 
 darwin 
 
PyMuPDF 1.22.5: Python bindings for the MuPDF 1.22.2 library.
Version date: 2023-06-21 00:00:01.
Built for Python 3.11 on darwin (64-bit).
```



### Additional question
Does (py)mupdf have any sort of 'drop invalid characters' option? Ideally we'd drop both �'s and others (e.g. split surrogates from #2608, or Private Use ones such as [U+10FC31](https://www.compart.com/en/unicode/U+10FC31)).

Of course, I can do `string.replace('\uFFFD', ' ')`, but that messes with the `page.get_text_words` result, and I'd need to compile a list of all 'bad characters', which seems wrong.
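
For reference, here is a rough sketch of the post-processing I have in mind, using Unicode general categories instead of a hand-made character list. The `clean_text` helper, its `replacement_sub` parameter, and the choice of categories (`Co` private use, `Cs` surrogate, `Cn` unassigned) are my own assumptions and not part of PyMuPDF. It only cleans strings after extraction, so it does not change how `page.get_text_words` splits words.

```python
import unicodedata

REPLACEMENT_CHAR = "\ufffd"

# Assumed set of "bad" general categories:
# Co = private use, Cs = surrogate, Cn = unassigned.
BAD_CATEGORIES = {"Co", "Cs", "Cn"}


def clean_text(text: str, replacement_sub: str = " ") -> str:
    """Swap U+FFFD for `replacement_sub` and drop characters in BAD_CATEGORIES."""
    out = []
    for ch in text:
        if ch == REPLACEMENT_CHAR:
            # In this document U+FFFD stands in for a space; other files may differ.
            out.append(replacement_sub)
        elif unicodedata.category(ch) in BAD_CATEGORIES:
            # Skip private-use, surrogate, and unassigned code points.
            continue
        else:
            out.append(ch)
    return "".join(out)
```

Applied to the text element of each block, this would turn `"The�Count�of�Monte�Cristo\n"` into `"The Count of Monte Cristo\n"`, but a built-in option would still be preferable.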

