Closed
Labels: not a bug / user error / unable to reproduce
Description of the bug
As far as I know, the page.get_text("dict") API gives access to the bounding boxes of spans (runs of text that share the same font style). However, some PDFs are encoded in an unusual way, and it seems that PyMuPDF cannot detect spans correctly for these files.
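For readers unfamiliar with the layout of that result, here is a minimal sketch of walking it from blocks to lines to spans. The helper name `iter_spans` and the hand-written `sample_page` dictionary are illustrative assumptions, not part of PyMuPDF; on a real page you would pass `page.get_text("dict")` instead.

```python
from typing import Any, Dict, Iterator


def iter_spans(page_dict: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield every span dictionary found in a page.get_text("dict") result."""
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so they contribute no spans.
        for line in block.get("lines", []):
            yield from line.get("spans", [])


# Hypothetical stand-in for page.get_text("dict"); the values are made up.
sample_page = {
    "blocks": [
        {"type": 1, "bbox": (0.0, 0.0, 50.0, 50.0)},  # image block, no lines
        {
            "type": 0,
            "lines": [
                {
                    "spans": [
                        {"text": "Hello ", "font": "Helvetica", "size": 11.0,
                         "flags": 0, "bbox": (72.0, 72.0, 100.0, 84.0)},
                        {"text": "world", "font": "Helvetica-Bold", "size": 11.0,
                         "flags": 16, "bbox": (100.0, 72.0, 130.0, 84.0)},
                    ]
                }
            ],
        },
    ]
}

for span in iter_spans(sample_page):
    print(span["text"], span["font"], span["bbox"])
```

Using `.get("lines", [])` avoids a KeyError on image blocks without needing a try/except around the whole loop.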
How to reproduce the bug
When I run span detection with the code below on the no_bug.pdf file, it works fine and the spans are detected fairly accurately (as shown in the attached image):
import sys

import fitz  # PyMuPDF
from PIL import Image, ImageDraw


def visualize_span_bbox(page, img_path, dpi=200):
    """Visualize the span bounding boxes on the image."""
    # PDF user space has 72 units per inch, so dpi / 72 is the zoom factor
    zoom = dpi / 72.0
    mat = fitz.Matrix(zoom, zoom)  # Transformation matrix for DPI scaling
    # Render the page at the requested DPI (alpha=False gives plain RGB samples)
    pix = page.get_pixmap(matrix=mat, alpha=False)
    img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
    draw = ImageDraw.Draw(img)
    blocks = page.get_text("dict")["blocks"]
    for block in blocks:
        # Image blocks have no "lines" key; skip them instead of catching errors
        for line in block.get("lines", []):
            for span in line["spans"]:
                # Scale the bbox from PDF points to image pixels
                span_bbox = [int(coord * zoom) for coord in span["bbox"]]
                text = span["text"]
                fontname = span["font"]
                font_flags = span["flags"]  # style bit field; not used by the checks below
                size = span["size"]
                # Style is guessed from the font name only
                is_bold = "Bold" in fontname
                is_italic = "Italic" in fontname or "Oblique" in fontname
                print(f"{text} : {fontname}: {size} : ({is_bold}, {is_italic})")
                draw.rectangle(span_bbox, outline="blue", width=2)
    # Save the image with span bounding boxes
    img.save(img_path)


if __name__ == "__main__":
    doc = fitz.open(sys.argv[1])
    for page in doc:
        print("Text from page %i:" % page.number)
        # One output file per page, so later pages do not overwrite earlier ones
        visualize_span_bbox(page, f"result_{page.number}.png", dpi=200)
But when I switch to this PDF file (bug.pdf), it breaks: the bounding boxes are split up in a strange way, and the detected font styles (bold/italic) are also inaccurate.
Can someone tell me why this happens?
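As a possible workaround, and not an explanation of the cause, the sketch below reads bold and italic from the span's `flags` bit field instead of the font name, and merges neighbouring spans on a line that share font, size, and flags. The bit values follow the PyMuPDF documentation for span flags (bit 1 = italic, bit 4 = bold). The helper names `span_style` and `merge_line_spans` and the sample data are hypothetical. If the file's fragments do not share identical style values, merging will not help.

```python
from typing import Any, Dict, List, Tuple

ITALIC_BIT = 2   # bit 1 of the span "flags" field
BOLD_BIT = 16    # bit 4 of the span "flags" field


def span_style(span: Dict[str, Any]) -> Tuple[bool, bool]:
    """Return (is_bold, is_italic) decoded from the span's flags."""
    flags = span["flags"]
    return bool(flags & BOLD_BIT), bool(flags & ITALIC_BIT)


def merge_line_spans(spans: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Merge consecutive spans of one line that share font, size, and flags."""
    merged: List[Dict[str, Any]] = []
    for span in spans:
        prev = merged[-1] if merged else None
        same_style = prev is not None and all(
            prev[key] == span[key] for key in ("font", "size", "flags")
        )
        if same_style:
            # Extend the previous span: join the text and take the union bbox.
            x0, y0, x1, y1 = prev["bbox"]
            a0, b0, a1, b1 = span["bbox"]
            prev["bbox"] = (min(x0, a0), min(y0, b0), max(x1, a1), max(y1, b1))
            prev["text"] += span["text"]
        else:
            # Copy so the caller's span dictionaries are left untouched.
            merged.append(dict(span, bbox=tuple(span["bbox"])))
    return merged


# Hypothetical fragments, as they might appear within one line["spans"] list.
fragments = [
    {"text": "He", "font": "F1", "size": 10.0, "flags": 0,
     "bbox": (10.0, 10.0, 20.0, 22.0)},
    {"text": "llo", "font": "F1", "size": 10.0, "flags": 0,
     "bbox": (20.0, 10.0, 40.0, 22.0)},
    {"text": " bold", "font": "F1", "size": 10.0, "flags": 16,
     "bbox": (40.0, 10.0, 70.0, 22.0)},
]

for span in merge_line_spans(fragments):
    print(repr(span["text"]), span["bbox"], span_style(span))
```

In the script above, this would mean calling `merge_line_spans(line["spans"])` before drawing rectangles, so that each merged run gets one box.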
PyMuPDF version
1.24.9
Operating system
Linux
Python version
3.10