Some missing spaces in get_text output #2437

@henrygriffiths

Describe the bug (mandatory)

The output of page.get_text() is missing spaces, seemingly at random, between some words that sit on the same line of text in the PDF.

To Reproduce (mandatory)

import fitz  # PyMuPDF

doc = fitz.open('file.pdf')
for page in doc:
    # Print every block of the structured "dict" output
    for block in page.get_text("dict", flags=31)["blocks"]:
        print(block)

Expected behavior (optional)

The extracted text contains every space that the PDF contains.
For example, "The quick brown fox jumps over the lazy dog"
is instead output as "Thequick brown fox jumps overthe lazy dog" (spaces are dropped between words on the same PDF line).

Screenshots (optional)

N/A

Your configuration (mandatory)

  • Operating system, potentially version and bitness : Linux 6.3.3-arch1-1 x86_64
  • Python version, bitness : Python 3.10.11 (main, May 25 2023, 13:44:59) [GCC 13.1.1 20230429] x86_64
  • PyMuPDF version, installation method (wheel or generated from source) : 1.22.3 installed from wheel (using pip 23.1.2, setuptools 67.8.0 and wheel 0.40.0)

Additional context (optional)

I have reviewed bug reports #456 and #364 and tested with mutool as recommended there. With mutool 1.22.0 (the version used by PyMuPDF 1.22.3), the output for the PDF (generated with mutool draw -o test.html file.pdf 1) contains all of the spaces.
I am unsure whether this is a duplicate of #2400, as I don't have enough information to determine whether it is the same issue (an empty gap between those spaces). I apologize if it is.
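
To help narrow down where the spaces go missing, the "dict" output can be scanned for neighbouring spans on one line that are visibly separated but carry no space character. The following is only a sketch: find_suspect_gaps and min_gap_ratio are hypothetical names, the threshold is a guess, and it assumes horizontal left-to-right text and that the dropped spaces fall between spans rather than inside one, which may not be the case for this file.

```python
def find_suspect_gaps(page_dict, min_gap_ratio=0.15):
    """Return (left_text, right_text, gap) for adjacent spans on the same line
    that are horizontally separated but have no space character between them.

    page_dict is the structure returned by page.get_text("dict").
    min_gap_ratio is a hypothetical threshold: the gap must exceed this
    fraction of the left span's font size to count as a word gap.
    """
    suspects = []
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so they are skipped here.
        for line in block.get("lines", []):
            spans = line["spans"]
            for left, right in zip(spans, spans[1:]):
                # bbox is (x0, y0, x1, y1); compare left.x1 with right.x0.
                gap = right["bbox"][0] - left["bbox"][2]
                has_space = left["text"].endswith(" ") or right["text"].startswith(" ")
                if gap > min_gap_ratio * left["size"] and not has_space:
                    suspects.append((left["text"], right["text"], gap))
    return suspects


# Intended use with the reproduction script (not run here):
#   for page in doc:
#       print(find_suspect_gaps(page.get_text("dict", flags=31)))
```

If this reports no pairs for the affected lines, the missing spaces are probably inside a single span, and the per-character "rawdict" output would be the next thing to inspect.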
