Exploring Text and Bounding Box Extraction Anomalies in PDFs with PyMuPDF

Thank you for the excellent work. I'd like to report an issue with text and bounding box (bbox) extraction. I've been extracting text values and their positions from tables, and in general the solution works wonderfully.

With some PDFs (sample attached), however, PyMuPDF groups the words "GP" and "Unreserved" into one block even though there is significant visual whitespace between them. To find the root cause, I ran a word-level bbox extraction and discovered that the gap between "GP" and "Unreserved" is only 2-3 points along the x-axis, which does not match what the page looks like. For comparison, the word "Reserved" alone spans approximately 30 points (274.56 - 244.95). So why is the gap between "GP" and "Unreserved" only about 3 points (289.20 - 286.53)?

The block-level tuple is shown first, followed by the word-level tuples:

(242.27, 422.79, 339.06, 429.30, ' Reserved GP Unreserved GP\n')

(**244.95**, 422.79, **274.56**, 429.30, '**Reserved**', 9, 0, 0)
(277.23, 422.79, **286.53**, 429.30, '**GP**', 9, 0, 1)
(**289.20**, 422.79, 327.09, 429.30, '**Unreserved**', 9, 0, 2)
(329.76, 422.79, 339.06, 429.30, 'GP', 9, 0, 3)
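To make the arithmetic above easier to check, here is a minimal sketch that computes each word's width and the horizontal gap to the next word from the four word tuples quoted above. The helper name `widths_and_gaps` is made up for illustration; it is not part of PyMuPDF.

```python
# Word tuples copied from the report above:
# (x0, y0, x1, y1, word, block_no, line_no, word_no)
words = [
    (244.95, 422.79, 274.56, 429.30, "Reserved", 9, 0, 0),
    (277.23, 422.79, 286.53, 429.30, "GP", 9, 0, 1),
    (289.20, 422.79, 327.09, 429.30, "Unreserved", 9, 0, 2),
    (329.76, 422.79, 339.06, 429.30, "GP", 9, 0, 3),
]


def widths_and_gaps(words):
    """Return (widths, gaps) in points for words on one line, left to right."""
    # Width of a word is x1 - x0 of its bbox.
    widths = [round(w[2] - w[0], 2) for w in words]
    # Gap is the next word's x0 minus the current word's x1.
    gaps = [round(b[0] - a[2], 2) for a, b in zip(words, words[1:])]
    return widths, gaps


widths, gaps = widths_and_gaps(words)
print("widths:", widths)
print("gaps:  ", gaps)
```

For these four tuples all three gaps come out at 2.67 points, so the reported coordinates place every word on this line the same distance apart, including the pair that looks widely separated on the page.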


[7.pdf](https://github.com/pymupdf/PyMuPDF/files/14547241/7.pdf)

<img width="564" alt="Screen Shot 2024-03-09 at 12 42 38 PM" src="https://github.com/pymupdf/PyMuPDF/assets/83714258/bbd1a6ac-f19b-4e59-b32a-dfda960cc4e4">

### How to reproduce the bug

import fitz  # PyMuPDF

def extract_blocks_from_pdf(pdf_path):
    """Print block- and word-level tuples for every page and return all blocks."""
    all_blocks = []

    # Open the PDF; the context manager closes the document on exit
    with fitz.open(pdf_path) as pdf_document:
        # Iterate through each page of the PDF
        for page in pdf_document:
            # Block-level extraction, in the order stored in the file
            blocks = page.get_text("blocks", sort=False)
            print(blocks)
            all_blocks.extend(blocks)
            # Word-level extraction:
            # (x0, y0, x1, y1, word, block_no, line_no, word_no)
            for word in page.get_text("words", sort=False):
                print(word)

    return all_blocks

# Specify the path to your PDF file
pdf_path = '/content/7.pdf'
# Extract and print the blocks of all pages
extracted_blocks = extract_blocks_from_pdf(pdf_path)
print(extracted_blocks)
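One possible way to dig further, assuming character-level geometry is informative for this file, is to look at the per-character boxes that `page.get_text("rawdict")` returns and measure the gap between consecutive characters in each span. The sketch below is only that: `char_gaps` is a hypothetical helper, and the `sample` dict is a synthetic stand-in with made-up coordinates (not taken from 7.pdf) so the snippet runs without a PDF.

```python
def char_gaps(raw_page):
    """Return (prev_char, next_char, gap) for consecutive characters in each span.

    `raw_page` is expected to have the shape of page.get_text("rawdict"):
    blocks -> lines -> spans -> chars, each char with "c" and "bbox".
    """
    results = []
    for block in raw_page.get("blocks", []):
        for line in block.get("lines", []):  # image blocks carry no "lines"
            for span in line.get("spans", []):
                chars = span.get("chars", [])
                for a, b in zip(chars, chars[1:]):
                    # Horizontal distance from the end of one char to the start of the next
                    gap = round(b["bbox"][0] - a["bbox"][2], 2)
                    results.append((a["c"], b["c"], gap))
    return results


# Synthetic stand-in for one page; the coordinates are invented for illustration.
sample = {"blocks": [{"lines": [{"spans": [{"chars": [
    {"c": "G", "bbox": (0.0, 0.0, 5.0, 10.0)},
    {"c": "P", "bbox": (5.0, 0.0, 10.0, 10.0)},
    {"c": " ", "bbox": (10.0, 0.0, 13.0, 10.0)},
    {"c": "U", "bbox": (40.0, 0.0, 46.0, 10.0)},
]}]}]}]}

for prev_char, next_char, gap in char_gaps(sample):
    print(repr(prev_char), repr(next_char), gap)
```

With a real document the call would be `char_gaps(page.get_text("rawdict"))` inside the page loop; a large gap after a space character would suggest the visual whitespace is not reflected in the word boxes.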

### PyMuPDF version

1.23.26

### Operating system

macOS

### Python version

3.10


Exploring Text and Bounding Box Extraction Anomalies in PDFs with PyMuPDF #3248