Exploring Text and Bounding Box Extraction Anomalies in PDFs with PyMuPDF #3248

@satvik-27199

Description

Thank you for the excellent work. I'd like to report an issue with text and bounding box (bbox) extraction. I've been extracting text values and their positions from tables, and in general this works wonderfully. However, some PDFs (sample attached) give unexpected results: although there is significant visible whitespace between the words "GP" and "Unreserved", PyMuPDF groups them into one block.

To understand the root cause, I ran a word-level bbox extraction and discovered that the gap between "GP" and "Unreserved" is only 2-3 points along the x-axis, which does not match what is visible on the page. For comparison, the word "Reserved" alone spans approximately 30 points (274.56 - 244.95). So why is the gap between "GP" and "Unreserved" only around 3 points (289.20 - 286.53)? The block-level tuple and the word-level tuples for this line are:

(242.27, 422.79, 339.06, 429.30, ' Reserved GP Unreserved GP\n')

(244.95, 422.79, 274.56, 429.30, 'Reserved', 9, 0, 0)
(277.23, 422.79, 286.53, 429.30, 'GP', 9, 0, 1)
(289.20, 422.79, 327.09, 429.30, 'Unreserved', 9, 0, 2)
(329.76, 422.79, 339.06, 429.30, 'GP', 9, 0, 3)
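The arithmetic above can be checked directly from the word tuples. The following is a minimal sketch that only reuses the four tuples printed above; the helper names `word_widths` and `word_gaps` are hypothetical and not part of PyMuPDF. It assumes the usual layout of `get_text("words")` items: `(x0, y0, x1, y1, word, block_no, line_no, word_no)`.

```python
# Word tuples copied from the output above:
# (x0, y0, x1, y1, word, block_no, line_no, word_no)
words = [
    (244.95, 422.79, 274.56, 429.30, "Reserved", 9, 0, 0),
    (277.23, 422.79, 286.53, 429.30, "GP", 9, 0, 1),
    (289.20, 422.79, 327.09, 429.30, "Unreserved", 9, 0, 2),
    (329.76, 422.79, 339.06, 429.30, "GP", 9, 0, 3),
]


def word_widths(word_tuples):
    """Width of each word box (x1 - x0), rounded to 2 decimals."""
    return [round(w[2] - w[0], 2) for w in word_tuples]


def word_gaps(word_tuples):
    """Horizontal gap between each word and the next (next x0 - current x1)."""
    return [
        round(nxt[0] - cur[2], 2)
        for cur, nxt in zip(word_tuples, word_tuples[1:])
    ]


print(word_widths(words))  # [29.61, 9.3, 37.89, 9.3]
print(word_gaps(words))    # [2.67, 2.67, 2.67]
```

In these extracted coordinates, all three gaps on the line come out identical (2.67 points), including the one between "GP" and "Unreserved". The extraction therefore sees uniform spacing, which is the mismatch with the rendered page that this report is about.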

7.pdf

Screen Shot 2024-03-09 at 12 42 38 PM

How to reproduce the bug

import fitz  # PyMuPDF


def extract_text_from_pdf(pdf_path):
    """Print block- and word-level tuples for every page; return all blocks."""
    blocks = []
    # Open the provided PDF file; the context manager closes it on exit
    with fitz.open(pdf_path) as pdf_document:
        # Iterate through each page of the PDF
        for page in pdf_document:
            # Block-level extraction, in the order stored in the file
            page_blocks = page.get_text("blocks", sort=False)
            print(page_blocks)
            blocks.extend(page_blocks)
            # Word-level extraction:
            # (x0, y0, x1, y1, word, block_no, line_no, word_no)
            for word in page.get_text("words", sort=False):
                print(word)
    return blocks


# Specify the path to your PDF file
pdf_path = '/content/7.pdf'

# Extract text
extracted_text = extract_text_from_pdf(pdf_path)
print(extracted_text)
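As a possible follow-up to the word-level output, character-level boxes can show where each glyph sits on the line. This is only a sketch: `char_boxes` and `load_raw_dict` are hypothetical helper names, and the code assumes the `"rawdict"` output of `page.get_text()` has the blocks → lines → spans → chars nesting, with each character carrying `"c"` and `"bbox"` keys.

```python
def char_boxes(raw_page_dict):
    """Yield (char, x0, x1) for every character in a rawdict-style page dict."""
    for block in raw_page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so they are skipped here
        for line in block.get("lines", []):
            for span in line["spans"]:
                for ch in span["chars"]:
                    x0, _, x1, _ = ch["bbox"]
                    yield ch["c"], x0, x1


def load_raw_dict(pdf_path, page_number=0):
    """Load one page's character-level dictionary (requires PyMuPDF)."""
    import fitz  # imported lazily so char_boxes can be used without PyMuPDF

    with fitz.open(pdf_path) as doc:
        return doc[page_number].get_text("rawdict")
```

For example, `for c, x0, x1 in char_boxes(load_raw_dict('/content/7.pdf')): print(repr(c), x0, x1)` lists every character with its horizontal extent, which makes it easier to see how the space between "GP" and "Unreserved" is represented in the file.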

PyMuPDF version

1.23.26

Operating system

MacOS

Python version

3.10

Labels

not a bug: not a bug / user error / unable to reproduce
