Description
Exploring Text and Bounding Box Extraction Anomalies in PDFs with PyMuPDF
Thank you for the excellent work. I'd like to report an issue with text and bounding box (bbox) extraction. I've been extracting text values and their corresponding position vectors from tables, and in most cases this works wonderfully. However, I've run into a problem with some PDFs (sample attached). When I extract text with PyMuPDF, "GP" and "Unreserved" are grouped into one block, even though there is significant visible whitespace between them. To find the root cause, I ran a word-level bbox extraction and discovered that the reported gap between "GP" and "Unreserved" is only 2-3 points along the x-axis, which does not match what I see on the page. For comparison, the word "Reserved" alone spans approximately 30 points (274.56 - 244.95). So why is the gap between "GP" and "Unreserved" only around 3 points (289.20 - 286.53)? The block tuple is listed first below, followed by the word tuples:
(242.27, 422.79, 339.06, 429.30, ' Reserved GP Unreserved GP\n')
(244.95, 422.79, 274.56, 429.30, 'Reserved', 9, 0, 0)
(277.23, 422.79, 286.53, 429.30, 'GP', 9, 0, 1)
(289.20, 422.79, 327.09, 429.30, 'Unreserved', 9, 0, 2)
(329.76, 422.79, 339.06, 429.30, 'GP', 9, 0, 3)
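To make the numbers above easier to compare, here is a minimal sketch that computes each word's width and the horizontal gap to the next word. It uses only the four word tuples printed above and does not need PyMuPDF; the helper names `word_widths` and `horizontal_gaps` are my own, not part of the library.

```python
# Word tuples copied from the output above:
# (x0, y0, x1, y1, word, block_no, line_no, word_no)
words = [
    (244.95, 422.79, 274.56, 429.30, "Reserved", 9, 0, 0),
    (277.23, 422.79, 286.53, 429.30, "GP", 9, 0, 1),
    (289.20, 422.79, 327.09, 429.30, "Unreserved", 9, 0, 2),
    (329.76, 422.79, 339.06, 429.30, "GP", 9, 0, 3),
]


def word_widths(words):
    """Return (word, width) pairs, where width is x1 - x0 in points."""
    return [(w[4], round(w[2] - w[0], 2)) for w in words]


def horizontal_gaps(words):
    """Return (left_word, right_word, gap) for each pair of neighbours.

    The gap is the next word's x0 minus the current word's x1.
    """
    return [
        (a[4], b[4], round(b[0] - a[2], 2))
        for a, b in zip(words, words[1:])
    ]


for word, width in word_widths(words):
    print(f"{word!r} width: {width}")

for left, right, gap in horizontal_gaps(words):
    print(f"gap {left!r} -> {right!r}: {gap}")
```

With these values, every gap on the line comes out the same (2.67 points), including the one between "GP" and "Unreserved", so the reported coordinates show no wider column gap at that position.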
How to reproduce the bug
import fitz  # PyMuPDF


def extract_text_from_pdf(pdf_path):
    # Open the provided PDF file
    pdf_document = fitz.open(pdf_path)
    text = ""
    blocks = []
    # Iterate through each page of the PDF
    for page_num in range(len(pdf_document)):
        # Get the page
        page = pdf_document.load_page(page_num)
        # Extract the plain text from the page
        text += page.get_text()
        # Extract block tuples: (x0, y0, x1, y1, text, block_no, block_type)
        page_blocks = page.get_text("blocks", sort=False)
        print(page_blocks)
        # Collect the blocks of every page, not only the last one
        blocks.extend(page_blocks)
        # Extract word tuples: (x0, y0, x1, y1, word, block_no, line_no, word_no)
        for word in page.get_text("words", sort=False):
            print(word)
    # Close the document
    pdf_document.close()
    return blocks


# Specify the path to your PDF file
pdf_path = '/content/7.pdf'
# Extract the text blocks
extracted_blocks = extract_text_from_pdf(pdf_path)
print(extracted_blocks)
PyMuPDF version
1.23.26
Operating system
MacOS
Python version
3.10