Description
Is your feature request related to a problem? Please describe.
I'm testing text and table extraction with PyMuPDF. Some pages in the PDF contain both text and a table.
Example (tables starting from page 3):
https://www.aetnamedicare.com/documents/individual/2024/summaryofbenefits/Y0001_H5521_127_PQ05_SB24_M.pdf
PyMuPDF works great for extracting the tables with Page.find_tables(), and it correctly identifies rows and columns. However, I haven't found a good way to extract both the text outside of tables and the tables themselves from the same page.
Ideally, I would expect a function, something like get_text_and_tables(), that returns a list of text and table elements in natural reading order. Then, based on the type of each element, I can decide what to do with the text or the table.
The closest thing I can think of for now is the following, but it's probably going to be error-prone.
- Call Page.get_text() to extract all the text (which will include the text from the tables on the page)
- Call Page.find_tables() to extract the tables
- Figure out the first cell and the last cell of each table, and delete the corresponding text from the Page.get_text() output. Then try to combine the text and the tables together.
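For illustration, here is a minimal sketch of that workaround. It swaps the first-cell/last-cell text matching for a bounding-box test: any text block whose rectangle overlaps a table's bbox is dropped, and the remaining blocks and the tables are sorted top to bottom. The helper names (overlaps, merge_text_and_tables, page_elements) are hypothetical and not part of PyMuPDF. The sketch assumes a single-column layout, where sorting by the top-left corner approximates reading order, and it discards a text block entirely even if it only partly overlaps a table.

```python
def overlaps(a, b):
    """True if rectangles a and b, each (x0, y0, x1, y1), share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def merge_text_and_tables(text_blocks, tables):
    """Combine text blocks and tables into one top-to-bottom list.

    text_blocks: list of (bbox, text)
    tables:      list of (bbox, rows), rows being a list of lists of cell text
    Returns a list of ("text" | "table", bbox, content) tuples.
    """
    elements = []
    for bbox, text in text_blocks:
        # Skip text that lies inside (or partly inside) any table area.
        if any(overlaps(bbox, table_bbox) for table_bbox, _ in tables):
            continue
        elements.append(("text", bbox, text))
    for bbox, rows in tables:
        elements.append(("table", bbox, rows))
    # Assumption: single-column page, so (top, left) approximates reading order.
    elements.sort(key=lambda e: (e[1][1], e[1][0]))
    return elements


def page_elements(page):
    """Hypothetical helper: apply the merge to a PyMuPDF Page object."""
    tables = [(tuple(t.bbox), t.extract()) for t in page.find_tables().tables]
    # "blocks" items are (x0, y0, x1, y1, text, block_no, block_type);
    # block_type 0 marks a text block.
    blocks = [
        (tuple(b[:4]), b[4]) for b in page.get_text("blocks") if b[6] == 0
    ]
    return merge_text_and_tables(blocks, tables)
```

Given a page loaded from an opened document, page_elements(page) would return the elements in order, and the first item of each tuple tells whether it is text or a table. Multi-column pages would need a smarter ordering than the simple sort used here.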