Description of the bug
I want to identify the different sections in a PDF for later use.
Consider the following content, which has two sections:
**1.1 Uitgangspunten
Deze service, EftOphfCrOvkORCA mag enkel vanuit UPF aangeroepen worden. Er is daarom
ook geen publieke service omschrijving beschikbaar.
1.2 Controles
Alle beschreven meldingen in dit document worden in de monitor-tabel gezet.
De volgende (standaard) controles worden altijd uitgevoerd.
• Elk gegeven moet het juiste formaat hebben.
Melding: ‘Ongeldige invoer: heeft niet het juiste formaat.’**
"1.1 Uitgangspunten" and the two lines after it form the first section. "1.2 Controles" and the four lines after it form the second section.
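To make the grouping I'm after concrete, here is a minimal sketch that works on plain text lines only. It assumes section headings can be recognized by a dotted number such as "1.1" at the start of a line, which is my own assumption about this document and not something PyMuPDF reports. `HEADING_RE` and `group_into_sections` are hypothetical names made up for this illustration.

```python
import re

# Assumption: a heading starts with a dotted number ("1.1", "2.3.4") and a title.
HEADING_RE = re.compile(r"^\d+(\.\d+)+\s+\S")


def group_into_sections(lines):
    """Group text lines into sections, opening a new one at each numbered heading."""
    sections = []
    for line in lines:
        if HEADING_RE.match(line):
            sections.append({"heading": line, "body": []})
        elif sections:
            sections[-1]["body"].append(line)
        # Lines that appear before the first heading are ignored in this sketch.
    return sections


# The sample content from this issue, one entry per extracted line.
lines = [
    "1.1 Uitgangspunten",
    "Deze service, EftOphfCrOvkORCA mag enkel vanuit UPF aangeroepen worden. Er is daarom",
    "ook geen publieke service omschrijving beschikbaar.",
    "1.2 Controles",
    "Alle beschreven meldingen in dit document worden in de monitor-tabel gezet.",
    "De volgende (standaard) controles worden altijd uitgevoerd.",
    "• Elk gegeven moet het juiste formaat hebben.",
    "Melding: ‘Ongeldige invoer: heeft niet het juiste formaat.’",
]

sections = group_into_sections(lines)
for section in sections:
    print(section["heading"], len(section["body"]))
```

This relies purely on the text, so it would break on body lines that happen to start with a dotted number. That is why I'm asking whether the extraction output carries anything more reliable.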
Does PyMuPDF set any unique value for each section when it reads the page? I can't find one among the properties.
I'm using the following code:
page_blocks = page.get_text("dict")["blocks"]
for block in page_blocks:
    # Only text blocks carry a "lines" key.
    if "lines" in block:
        for line in block["lines"]:
            for span in line["spans"]:
                text = span["text"]
@JorjMcKie Please help with this.
How to reproduce the bug
NA
PyMuPDF version
1.23.25
Operating system
Windows
Python version
3.8