Description
Description of the bug
Hi Team,
I am using PyMuPDF to parse a PDF that contains text, tables, and images.
When I use the code below to parse only the text, the text comes out in the right sequence:
def extract_text_from_pdf(pdf_path):
    import fitz
    doc = fitz.open(pdf_path)
    text = ''
    for page_number in range(doc.page_count):
        page = doc[page_number]
        text += page.get_text()
    doc.close()
    return text
However, when I alter the code as below, the table content is listed twice (once by get_text() and once by find_tables()). Also, the text and tables do not come out in the correct sequence. Is there any way to parse the table data just once?
import fitz  # PyMuPDF
import matplotlib.pyplot as plt
import pandas as pd

def parse_pdf(pdf_path):
    doc = fitz.open(pdf_path)
    # Initialize variables to store extracted data
    parsed_data = []
    for page_num in range(doc.page_count):
        page = doc[page_num]
        # Extract text
        text = page.get_text()
        if text:
            parsed_data.append({'type': 'text', 'content': text})
        # Find tables
        tabs = page.find_tables()
        # print(tabs)
        if tabs:
            for tab in tabs:
                table = []
                for line in tab.extract():
                    table.append(line)
                parsed_data.append({'type': 'table', 'content': table})
    doc.close()
    return parsed_data
Calling the function:
pdf_path = "EOS-User-Manual.pdf"
parse_data=parse_pdf(pdf_path)
#calling sub-set
parsed_data=parse_data[0:100000]
Access the parsed data & display it
for entry in parsed_data:
    if entry['type'] == 'text':
        print(entry['content'])
    elif entry['type'] == 'table':
        data = entry['content']
        df = pd.DataFrame(data)
        print(df)
    print()
Can you please advise how I can parse text, tables, and images in the correct sequence using PyMuPDF?
Thank you
Reema Jain
How to reproduce the bug
Complete Code:
# Note: the PyPI package named "fitz" is unrelated to PyMuPDF; installing PyMuPDF alone provides the fitz module
!pip install PyMuPDF Pillow
import fitz  # PyMuPDF
import matplotlib.pyplot as plt
import pandas as pd
from PIL import Image
import io
from io import BytesIO

def parse_pdf(pdf_path):
    doc = fitz.open(pdf_path)
    # Initialize variables to store extracted data
    parsed_data = []
    for page_num in range(doc.page_count):
        page = doc[page_num]
        # Extract text
        text = page.get_text()
        if text:
            parsed_data.append({'type': 'text', 'content': text})
        # Find tables
        tabs = page.find_tables()
        # print(tabs)
        if tabs:
            for tab in tabs:
                table = []
                for line in tab.extract():
                    table.append(line)
                parsed_data.append({'type': 'table', 'content': table})
    doc.close()
    return parsed_data
Calling the function:
pdf_path = "EOS-User-Manual.pdf"
parse_data=parse_pdf(pdf_path)
#calling sub-set
parsed_data=parse_data[0:100000]
Access the parsed data & display it
for entry in parsed_data:
    if entry['type'] == 'text':
        print(entry['content'])
    elif entry['type'] == 'table':
        data = entry['content']
        df = pd.DataFrame(data)
        print(df)
    print()
PyMuPDF version
1.23.7
Operating system
Windows
Python version
3.10
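On the question of getting each table only once and in page order, one possible approach is sketched below. It is not a confirmed PyMuPDF recipe, only an illustration under stated assumptions: text is read per block with page.get_text("blocks"), any text block whose centre lies inside a table's bounding box is skipped, and the remaining text blocks and the tables are sorted by their top edge. The helper names merge_in_reading_order and parse_page are hypothetical. Sorting by top edge is a simplification that will not handle multi-column layouts, and images are not covered here.

```python
def merge_in_reading_order(text_blocks, tables):
    """Merge text blocks and tables into one top-to-bottom list.

    text_blocks: list of (x0, y0, x1, y1, text) tuples
    tables:      list of ((x0, y0, x1, y1), rows) tuples
    Both shapes are assumptions made for this sketch.
    """
    def centre_inside(block_box, table_box):
        # Treat a block as table content if its centre falls in the table bbox.
        cx = (block_box[0] + block_box[2]) / 2
        cy = (block_box[1] + block_box[3]) / 2
        return (table_box[0] <= cx <= table_box[2]
                and table_box[1] <= cy <= table_box[3])

    items = []
    for x0, y0, x1, y1, text in text_blocks:
        if any(centre_inside((x0, y0, x1, y1), bbox) for bbox, _ in tables):
            continue  # already covered by a table, so skip the duplicate text
        items.append((y0, x0, {'type': 'text', 'content': text}))
    for bbox, rows in tables:
        items.append((bbox[1], bbox[0], {'type': 'table', 'content': rows}))

    # Naive reading order: top edge first, then left edge.
    items.sort(key=lambda item: (item[0], item[1]))
    return [item[2] for item in items]


def parse_page(page):
    """Hypothetical wrapper: collect blocks and tables from a PyMuPDF page."""
    # Each block is (x0, y0, x1, y1, text, block_no, block_type); type 0 is text.
    blocks = [b[:5] for b in page.get_text("blocks") if b[6] == 0]
    tables = [(tuple(t.bbox), t.extract()) for t in page.find_tables().tables]
    return merge_in_reading_order(blocks, tables)
```

In parse_pdf, the per-page body could then become parsed_data.extend(parse_page(page)), so each table appears once and sits between the text blocks above and below it.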