I'm trying to convert a non-searchable PDF into a searchable PDF by superimposing an invisible text layer, using the WORD blocks from a Textract response and PyMuPDF:

# Parse blocks received from the Textract response
blocks = response['Blocks']
for block in blocks:
    if block['BlockType'] == 'WORD':
        page_number = block['Page'] - 1  # pages in PyMuPDF start from 0
        pdf_page = doc.load_page(page_number)
        bbox = block['Geometry']['BoundingBox']
        bbox_mupdf = fitz.Rect(
            bbox['Left'] * pdf_page.rect.width,
            bbox['Top'] * pdf_page.rect.height,
            (bbox['Left'] + bbox['Width']) * pdf_page.rect.width,
            (bbox['Top'] + bbox['Height']) * pdf_page.rect.height
        )
        pdf_page.insert_textbox(
            rect=bbox_mupdf,
            buffer=block['Text'],
            color=None,  # Invisible color
            overlay=True  # Overlay the text on top of existing content
        )

I want the invisible bboxes to align exactly with the original bboxes (extracted by Textract), but with this code the resulting bboxes are off (smaller).
Is insert_textbox the right way to do this?
I'd rather not hard-code a font and font size, because then the bboxes wouldn't align perfectly.
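
To help rule out the coordinate conversion as the source of the mismatch, here is a minimal sketch of that step on its own, with no PyMuPDF dependency. It assumes Textract's BoundingBox values are ratios of the page width and height with a top-left origin. `textract_bbox_to_points` is a hypothetical helper name, and the 612 x 792 pt page size and the example box are made-up values, not taken from my document.

```python
def textract_bbox_to_points(bbox, page_width, page_height):
    """Scale a Textract BoundingBox (ratios of the page size) to absolute points.

    Returns (x0, y0, x1, y1), the same argument order that fitz.Rect
    receives in the loop above.
    """
    x0 = bbox['Left'] * page_width
    y0 = bbox['Top'] * page_height
    x1 = (bbox['Left'] + bbox['Width']) * page_width
    y1 = (bbox['Top'] + bbox['Height']) * page_height
    return (x0, y0, x1, y1)


# Hypothetical word box on an assumed 612 x 792 pt (US Letter) page.
example_bbox = {'Left': 0.25, 'Top': 0.5, 'Width': 0.125, 'Height': 0.03125}
rect = textract_bbox_to_points(example_bbox, 612, 792)
print(rect)
```

If the rectangles computed this way match Textract's boxes when drawn on the page, the conversion is probably not the issue, and the size difference would more likely come from how the text itself is laid out inside each rectangle.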