Closed
Labels
not a bug / user error / unable to reproduce
Description of the bug
Given a PDF made from a single JP2 (JPEG 2000) image, get_text('blocks') is much slower than for PDFs made from a PNG or JPEG image:
- JP2 speed: 0.6 seconds
- PNG speed: 0.00019 seconds
- JPEG speed: 0.00012 seconds
How to reproduce the bug
import time

import fitz  # PyMuPDF


def image_pdf_speed(image_file):
    # Wrap the image in a one-page PDF, then time text extraction on that page.
    img_doc = fitz.Document(image_file)
    pdf_bytes = img_doc.convert_to_pdf()
    pdf_doc = fitz.Document(stream=pdf_bytes)
    st = time.time()
    pdf_doc[0].get_text('blocks')
    print(time.time() - st)


jp2_file = 'debug.jp2'
print('JPEG 2000 speed:')
image_pdf_speed(jp2_file)  # 0.6 seconds

png_file = 'debug.png'
print('PNG speed:')
image_pdf_speed(png_file)  # 0.00019 seconds

jpeg_file = 'debug.jpeg'
print('JPEG speed:')
image_pdf_speed(jpeg_file)  # 0.00012 seconds

Here are the images I used to test:
images.zip
PyMuPDF version
1.23.8 or earlier.
I noticed that since 1.23.9, get_text('blocks') no longer returns image blocks. On newer versions, the speed difference should be checked with get_text('dict') instead.
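For anyone re-checking this on 1.23.9 or later, here is a rough sketch of the same measurement using get_text('dict'). The helper names time_calls and image_pdf_dict_speed are made up for illustration, and fitz.open(stream=..., filetype='pdf') is used here in place of fitz.Document(stream=...). Each call is timed separately, on the assumption (not verified here) that the first call may behave differently from repeats if anything is cached.

```python
import time


def time_calls(func, *args, repeat=3, **kwargs):
    """Call func(*args, **kwargs) `repeat` times; return each wall-clock duration."""
    durations = []
    for _ in range(repeat):
        start = time.perf_counter()
        func(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    return durations


def image_pdf_dict_speed(image_file, repeat=3):
    """Time page.get_text('dict') on a one-page PDF built from image_file."""
    import fitz  # PyMuPDF, imported here so time_calls works without it

    img_doc = fitz.open(image_file)
    pdf_bytes = img_doc.convert_to_pdf()
    pdf_doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    page = pdf_doc[0]
    # Keep every timing: the first call is the one comparable to the report above.
    return time_calls(page.get_text, "dict", repeat=repeat)
```

Calling image_pdf_dict_speed('debug.jp2'), and likewise for the PNG and JPEG files, gives one list of timings per format to compare.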
Operating system
macOS
Python version
3.8