Spans detected in page.get_text("dict") fails in a weird pdf format  #3783

@lnmduc2

Description of the bug

As far as I know, page.get_text("dict") exposes the bounding boxes of spans (runs of text that share the same font style). However, some PDFs are encoded in unusual ways, and PyMuPDF does not seem to detect spans correctly for these files.
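For reference, here is a minimal sketch of how the "dict" result is nested (blocks, then lines, then spans). The helper name `iter_spans` and the `sample` data are made up for illustration; the sample is a hand-built stand-in, not real output from either PDF.

```python
def iter_spans(page_dict):
    """Yield (block_no, line_no, span) for every text span in a "dict" result."""
    for block_no, block in enumerate(page_dict.get("blocks", [])):
        # Text blocks have type 0; image blocks (type 1) carry no "lines".
        if block.get("type") != 0:
            continue
        for line_no, line in enumerate(block.get("lines", [])):
            for span in line.get("spans", []):
                yield block_no, line_no, span


# Hypothetical stand-in for page.get_text("dict"); values are illustrative only.
sample = {
    "blocks": [
        {"type": 1, "bbox": (0, 0, 10, 10)},  # an image block
        {
            "type": 0,
            "lines": [
                {
                    "spans": [
                        {"text": "Hello ", "font": "Helvetica", "size": 11.0,
                         "flags": 0, "bbox": (72, 72, 100, 84)},
                        {"text": "world", "font": "Helvetica-Bold", "size": 11.0,
                         "flags": 16, "bbox": (100, 72, 130, 84)},
                    ]
                }
            ],
        },
    ]
}

for block_no, line_no, span in iter_spans(sample):
    print(block_no, line_no, repr(span["text"]), span["font"], span["bbox"])
```

With a real document, the same helper would be called as `iter_spans(page.get_text("dict"))`.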

How to reproduce the bug

When I run span detection with the code below on the no_bug.pdf file, it works fine and the spans are detected fairly accurately (screenshot: no_bug):

import sys

import fitz  # PyMuPDF
from PIL import Image, ImageDraw


def visualize_span_bbox(page, img_path, dpi=200):
    """Draw every span's bounding box on a rendering of the page and save it."""
    # PDF user space is 72 points per inch, so this factor maps points to pixels.
    zoom = dpi / 72.0
    mat = fitz.Matrix(zoom, zoom)

    # Render the page without an alpha channel so the samples are plain RGB.
    pix = page.get_pixmap(matrix=mat, alpha=False)
    img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
    draw = ImageDraw.Draw(img)

    for block in page.get_text("dict")["blocks"]:
        # Image blocks have no "lines" key; only text blocks carry spans.
        for line in block.get("lines", []):
            for span in line["spans"]:
                span_bbox = [int(coord * zoom) for coord in span["bbox"]]
                text = span["text"]
                fontname = span["font"]
                size = span["size"]

                # Style is guessed from the font name only.
                is_bold = "Bold" in fontname
                is_italic = "Italic" in fontname or "Oblique" in fontname
                print(f"{text} : {fontname}: {size} : ({is_bold}, {is_italic})")
                draw.rectangle(span_bbox, outline="blue", width=2)

    # Save the image with span bounding boxes
    img.save(img_path)


if __name__ == "__main__":
    doc = fitz.open(sys.argv[1])
    for page in doc:
        print("Text from page %i:" % page.number)
        # One output file per page, so later pages do not overwrite earlier ones.
        visualize_span_bbox(page, f"result_{page.number}.png", dpi=200)

But when I switch to this PDF file (bug.pdf), it breaks: the bounding boxes are split in odd ways, and the bold/italic font styles are also detected incorrectly (screenshot: bug).

Can someone tell me why this happens?
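One thing that may be worth comparing: the bold/italic check here looks only at the font name, while each span also carries an integer `flags` field. The sketch below assumes the bit layout described in the PyMuPDF documentation for span flags (bit 0 superscript, bit 1 italic, bit 2 serifed, bit 3 monospaced, bit 4 bold). The helper names `decode_span_flags` and `style_from_name` are hypothetical, and this is not a confirmed explanation of what happens with bug.pdf.

```python
# Assumed bit layout of span["flags"] (taken from the PyMuPDF docs, not from this report).
SUPERSCRIPT = 1 << 0
ITALIC = 1 << 1
SERIFED = 1 << 2
MONOSPACED = 1 << 3
BOLD = 1 << 4


def decode_span_flags(flags):
    """Turn the integer span flags into a dict of booleans."""
    return {
        "superscript": bool(flags & SUPERSCRIPT),
        "italic": bool(flags & ITALIC),
        "serifed": bool(flags & SERIFED),
        "monospaced": bool(flags & MONOSPACED),
        "bold": bool(flags & BOLD),
    }


def style_from_name(fontname):
    """Name-based guess, the same heuristic as in the reproduction script."""
    return {
        "bold": "Bold" in fontname,
        "italic": "Italic" in fontname or "Oblique" in fontname,
    }


# Hypothetical span: the font name says nothing about style, but the flags do.
span = {"font": "ABCDEF+Font1", "flags": BOLD | ITALIC}
by_flags = decode_span_flags(span["flags"])
by_name = style_from_name(span["font"])
print("flags:", by_flags["bold"], by_flags["italic"])
print("name: ", by_name["bold"], by_name["italic"])
```

If the two methods disagree for the spans in bug.pdf, that would suggest the font names in that file do not encode the style, so a name-based check cannot see it.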

PyMuPDF version

1.24.9

Operating system

Linux

Python version

3.10

Labels

not a bug (not a bug / user error / unable to reproduce)
