Closed
Description of the bug
Goal: use PyMuPDF (version 1.22.1) to highlight the differences between two PDF pages.
I am trying to compare two PDF pages, p1 and p2, and highlight the differences in p1.
Algorithm:
1. Get text_blocks with bounding_box from each_page
2. Compare text_blocks of p1 with p2
3. For every text_block that differs, use its bounding_box to highlight the difference
Code:
import fitz  # PyMuPDF

def get_text_blocks(page):
    texts = []
    bboxes = []
    # each block is a tuple: (x0, y0, x1, y1, text, block_no, block_type)
    for block in page.get_text_blocks():
        # appending the bounding box of the block
        bboxes.append(block[0:4])
        # appending the text from the block
        texts.append(block[4])
    return texts, bboxes
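Difference pseudo-code:
p1_texts, p1_bboxes = get_text_blocks(p1)
p2_texts, _ = get_text_blocks(p2)
for text, bbox in zip(p1_texts, p1_bboxes):
    # keep only text blocks that are in p1 and not in p2
    if text not in p2_texts:
        # highlight the bounding box of the differing block
        rect = fitz.Rect(bbox)
        annot = p1.add_highlight_annot(rect)
        annot.update()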
This works. In certain cases, however, identical content is grouped into different text blocks on the two pages, so the comparison highlights blocks that have not actually changed.
Example:
p1:
block_1: line1, line2
block_2: line3
p2:
block_1: line1, line2, line3
Although the same three consecutive lines (line1, line2, line3) are present on both p1 and p2, they are flagged as different because the blocks are grouped differently.
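To make the grouping problem concrete, here is a minimal sketch that reproduces it with plain lists. The data is hypothetical and simply mirrors the example above; no PDF is involved, and block_texts and flatten are illustrative helper names, not PyMuPDF functions.

```python
# Hypothetical block layout mirroring the example above; no PDF needed.
p1_blocks = [["line1", "line2"], ["line3"]]
p2_blocks = [["line1", "line2", "line3"]]

def block_texts(blocks):
    # Join each block's lines into one string, roughly what block-level
    # text extraction hands back for a block.
    return ["\n".join(lines) for lines in blocks]

def flatten(blocks):
    # Ignore block boundaries and keep only the sequence of lines.
    return [line for lines in blocks for line in lines]

# Block-level comparison: every p1 block is reported, although no text changed.
p2_block_texts = block_texts(p2_blocks)
block_diff = [t for t in block_texts(p1_blocks) if t not in p2_block_texts]

# Line-level comparison: nothing is reported.
p2_lines = flatten(p2_blocks)
line_diff = [line for line in flatten(p1_blocks) if line not in p2_lines]

print(block_diff)
print(line_diff)
```

The block-level result contains both p1 blocks, while the line-level result is empty, which suggests the false positives come from the grouping rather than from the text itself.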
I also tried extracting the text with get_text and comparing it line by line, but that did not work either.
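For reference, one way a line-level comparison could be set up is sketched below. It is only a sketch under assumptions: it takes the dictionaries returned by page.get_text("dict") (blocks containing lines, lines containing spans), and the helper names page_lines and changed_line_bboxes are made up for illustration. It uses difflib from the standard library to align the two line sequences, so block boundaries play no role.

```python
import difflib

def page_lines(page_dict):
    """Return a list of (text, bbox) for every text line in a
    page.get_text("dict") result, ignoring block boundaries."""
    lines = []
    for block in page_dict["blocks"]:
        if block.get("type") != 0:  # type 0 is a text block; skip images
            continue
        for line in block["lines"]:
            text = "".join(span["text"] for span in line["spans"]).strip()
            if text:
                lines.append((text, tuple(line["bbox"])))
    return lines

def changed_line_bboxes(p1_dict, p2_dict):
    """Bounding boxes of p1 lines that have no counterpart in p2
    (hypothetical helper, not part of PyMuPDF)."""
    l1 = page_lines(p1_dict)
    l2 = page_lines(p2_dict)
    matcher = difflib.SequenceMatcher(
        None, [t for t, _ in l1], [t for t, _ in l2], autojunk=False
    )
    bboxes = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "replace" and "delete" mark p1 lines that are changed or missing in p2
        if tag in ("replace", "delete"):
            bboxes.extend(bbox for _, bbox in l1[i1:i2])
    return bboxes
```

With real pages this would be called as changed_line_bboxes(p1.get_text("dict"), p2.get_text("dict")), and each returned box could be passed to p1.add_highlight_annot(fitz.Rect(bbox)). Lines that wrap differently or differ only in internal whitespace would still be flagged, so some text normalization may be needed.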
Any suggestions on how to fix this would be helpful.
How to reproduce the bug
explained above
PyMuPDF version
1.23.5 or earlier
Operating system
Windows
Python version
3.8