get_text('blocks') is much slower on jp2 images #3078

@VeryLazyBoy

Description of the bug

Given a PDF consisting of a single JP2 (JPEG 2000) image, get_text('blocks') is much slower than on PDFs made from PNG or JPEG images.

  • jp2 speed: 0.6 seconds
  • png speed: 0.00019 seconds
  • jpeg speed: 0.00012 seconds

How to reproduce the bug

import fitz
import time

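def image_pdf_speed(image_file):
    # Wrap the image in a one-page PDF, then time text extraction on that page.
    img_doc = fitz.Document(image_file)
    pdf_bytes = img_doc.convert_to_pdf()
    pdf_doc = fitz.Document(stream=pdf_bytes)
    st = time.perf_counter()  # monotonic, high-resolution timer
    pdf_doc[0].get_text('blocks')
    print(time.perf_counter() - st)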


jp2_file = 'debug.jp2'
print('JPEG 2000 speed:')
image_pdf_speed(jp2_file) # 0.6 seconds

png_file = 'debug.png'
print('PNG speed:')
image_pdf_speed(png_file) # 0.00019 seconds

jpeg_file = 'debug.jpeg'
print('JPEG speed:')
image_pdf_speed(jpeg_file) # 0.00012 seconds

Here are the images I used to test
images.zip

PyMuPDF version

1.23.8 or earlier.

I noticed that since 1.23.9, get_text('blocks') no longer returns image blocks. On newer versions, the speed difference should be checked with get_text('dict') instead.
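For newer versions, the same comparison could be sketched with get_text('dict') along the following lines. This is only an illustration: `time_call` and `image_pdf_dict_speed` are hypothetical helper names, the file names are the ones from the reproduction script above, and each call is timed separately on the assumption that the first (cold) call is the one of interest, since later calls may behave differently if anything is cached.

```python
import time


def time_call(func, repeats=3):
    """Call func `repeats` times and return the duration of each call in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


def image_pdf_dict_speed(image_file, repeats=3):
    """Wrap an image in a one-page PDF and time get_text('dict') on that page."""
    import fitz  # PyMuPDF; imported here so time_call works without it installed

    img_doc = fitz.open(image_file)
    pdf_bytes = img_doc.convert_to_pdf()
    pdf_doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    page = pdf_doc[0]
    return time_call(lambda: page.get_text("dict"), repeats)


if __name__ == "__main__":
    # File names are taken from the reproduction script; adjust as needed.
    for name in ("debug.jp2", "debug.png", "debug.jpeg"):
        print(name, image_pdf_dict_speed(name))
```

The first entry of each returned list is the cold-call timing, which is the figure comparable to the numbers reported above.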

Operating system

MacOS

Python version

3.8

Labels

not a bug (not a bug / user error / unable to reproduce)
