Feature request: Extract both texts and tables on the same page #3093

@jimmyzzxhlh

Is your feature request related to a problem? Please describe.
I'm testing text and table extraction with PyMuPDF. Some pages in the PDF contain both regular text and a table.
Example (tables start on page 3):
https://www.aetnamedicare.com/documents/individual/2024/summaryofbenefits/Y0001_H5521_127_PQ05_SB24_M.pdf

PyMuPDF works great for extracting the tables with Page.find_tables(), and it correctly identifies rows and columns. However, I haven't found a good way to extract both the text outside the tables and the tables themselves on the same page.

Ideally, I would like a function such as get_text_and_tables() that returns a list of elements, each either a text block or a table, in natural reading order. I could then decide what to do with each element based on its type.

The closest approach I can think of for now is the following, but it is probably going to be error-prone.

  • Call Page.get_text() to extract all the text (which includes the text inside any tables on the page).
  • Call Page.find_tables() to extract the tables.
  • Find the first and last cell of each table, delete the corresponding text from the Page.get_text() output, then combine the remaining text with the tables.
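
A variant of the steps above that avoids string matching on cell text could compare bounding boxes instead: drop every text block whose center falls inside a table's bounding box, then sort what remains together with the tables by position. The following is only a rough sketch under that assumption. The names center_inside, merge_elements and text_and_tables are hypothetical helpers, not part of PyMuPDF; the only library calls relied on are Page.find_tables(), Table.bbox, Table.extract() and Page.get_text("blocks"). The "reading order" here is a naive top-to-bottom, left-to-right sort, so multi-column pages would need something smarter.

```python
from typing import Any, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def center_inside(inner: BBox, outer: BBox) -> bool:
    """Return True if the center point of `inner` lies within `outer`."""
    cx = (inner[0] + inner[2]) / 2
    cy = (inner[1] + inner[3]) / 2
    return outer[0] <= cx <= outer[2] and outer[1] <= cy <= outer[3]


def merge_elements(
    text_blocks: List[Tuple[BBox, str]],
    tables: List[Tuple[BBox, Any]],
) -> List[Tuple[str, BBox, Any]]:
    """Combine text blocks and tables into one list of (kind, bbox, content).

    Hypothetical helper: text blocks that sit inside a table are dropped,
    because the table already carries that text in its cells.
    """
    table_boxes = [bbox for bbox, _ in tables]
    elements: List[Tuple[str, BBox, Any]] = [
        ("text", bbox, text)
        for bbox, text in text_blocks
        if not any(center_inside(bbox, tb) for tb in table_boxes)
    ]
    elements += [("table", bbox, rows) for bbox, rows in tables]
    # Naive reading order (an assumption): top edge first, then left edge.
    elements.sort(key=lambda e: (e[1][1], e[1][0]))
    return elements


def text_and_tables(page: Any) -> List[Tuple[str, BBox, Any]]:
    """Hypothetical stand-in for the requested get_text_and_tables().

    `page` is expected to be a PyMuPDF Page object.
    """
    tables = [(tuple(t.bbox), t.extract()) for t in page.find_tables().tables]
    text_blocks = [
        (tuple(b[:4]), b[4].strip())
        for b in page.get_text("blocks")  # (x0, y0, x1, y1, text, no, type)
        if b[6] == 0  # keep text blocks only; type 1 is an image block
    ]
    return merge_elements(text_blocks, tables)
```

With a document opened through PyMuPDF, each page could then be processed by looping over text_and_tables(page) and branching on the first tuple item, "text" or "table". A block that only partly overlaps a table is kept or dropped depending on where its center falls, which is a simplification that may misclassify captions placed very close to a table.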
