Unexpected new line in the parsed text since v1.24.0 #3353

@Matmaus

Description of the bug

File: Simple PDF 2.0 file.pdf (taken from the PDF Association's GitHub page of example PDFs)

Since v1.24.0, I see an unexpected newline in the extracted text. Here is the text object from the PDF above:

6 0 obj
<< /Length 166 >>
stream
% A text block that shows "Hello World"
% No color is set, so this defaults to black in DeviceGray colorspace
BT
  /F1 24 Tf
  100 100 Td
  (Hello World) Tj
ET
endstream
endobj

How to reproduce the bug

import fitz as pymupdf

doc = pymupdf.open('Simple PDF 2.0 file.pdf')  # see section above

Version 1.23.26:

>>> doc.load_page(0).get_text('text')
'Hello World\n'

Version 1.24.0:

>>> doc.load_page(0).get_text('text')
'Hello \nWorld\n'
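
To narrow down where the extra break comes from, it may help to look at the lines reported by `get_text('dict')` instead of the joined text. The following is only a sketch: the helper names `lines_from_text_dict` and `dump_lines` are made up for illustration, and it assumes the usual blocks → lines → spans layout of the dict output.

```python
def lines_from_text_dict(text_dict):
    """Return the text of every line found in a page.get_text('dict') result."""
    lines = []
    for block in text_dict.get("blocks", []):
        # Image blocks have no "lines" key, so they contribute nothing here.
        for line in block.get("lines", []):
            # A line is a list of spans; concatenate their text in order.
            lines.append("".join(span.get("text", "") for span in line.get("spans", [])))
    return lines


def dump_lines(path, page_number=0):
    """Print each extracted line with repr() so trailing spaces stay visible.

    Hypothetical helper: requires PyMuPDF, imported lazily so the function
    above can be used on an already extracted dict without it.
    """
    import fitz as pymupdf

    doc = pymupdf.open(path)
    page = doc.load_page(page_number)
    for text in lines_from_text_dict(page.get_text("dict")):
        print(repr(text))
```

Running `dump_lines('Simple PDF 2.0 file.pdf')` under both versions should show whether the single `Tj` string is already reported as two separate lines, or whether the break only appears when the plain text is assembled.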

Expected behaviour

The additional newline should not be there. "Hello World" is drawn by a single Tj operator, so I would expect it to be extracted as one line, as in 1.23.26.

PyMuPDF version

1.24.1

Operating system

Linux

Python version

3.10
