Identify different sections in a PDF #3311

@santanuOUP

Description of the bug

I want to identify the different sections in a PDF for later use.
Consider the following content, which has two sections:

1.1 Uitgangspunten
Deze service, EftOphfCrOvkORCA mag enkel vanuit UPF aangeroepen worden. Er is daarom
ook geen publieke service omschrijving beschikbaar.

1.2 Controles
Alle beschreven meldingen in dit document worden in de monitor-tabel gezet.
De volgende (standaard) controles worden altijd uitgevoerd.
• Elk gegeven moet het juiste formaat hebben.
Melding: ‘Ongeldige invoer: heeft niet het juiste formaat.’

The first section is "1.1 Uitgangspunten" plus the two lines after it. The second section is "1.2 Controles" plus the four lines after it.
Does PyMuPDF set any unique value for each section when reading the page? I can't find one among the properties.

I'm using the following code:

page_blocks = page.get_text("dict")["blocks"]
for block in page_blocks:
    # Only text blocks carry a "lines" key; image blocks do not.
    if "lines" in block:
        for line in block["lines"]:
            for span in line["spans"]:
                text = span["text"]

@JorjMcKie Please help with this.
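
For illustration, here is a minimal sketch of one way the grouping could be done on top of the "dict" output. As far as I can tell, the output carries no section identifier, so the sketch infers headings from the text itself. It assumes every section heading starts with a dotted number such as "1.1" or "1.2", which is an assumption drawn from the sample above and not something PyMuPDF provides. The names `HEADING_RE`, `iter_lines`, `split_sections` and `sample` are hypothetical, and `sample` is a hand-built stand-in for a real `page.get_text("dict")` result.

```python
import re

# Assumption: a heading is a line that starts with a dotted number ("1.1 ", "1.2.3 ").
HEADING_RE = re.compile(r"^\d+(\.\d+)+\s+\S")


def iter_lines(page_dict):
    """Yield the joined text of every non-empty line in a "dict" extraction."""
    for block in page_dict["blocks"]:
        for line in block.get("lines", []):  # image blocks have no "lines"
            text = "".join(span["text"] for span in line["spans"]).strip()
            if text:
                yield text


def split_sections(page_dict):
    """Group lines into (heading, body_lines) pairs, in reading order."""
    sections = []
    for text in iter_lines(page_dict):
        if HEADING_RE.match(text):
            sections.append((text, []))      # start a new section
        elif sections:
            sections[-1][1].append(text)     # attach to the current section
        # lines that appear before the first heading are skipped
    return sections


def _line(text):
    # Hand-built stand-in for one line holding a single span.
    return {"spans": [{"text": text}]}


# Stand-in for page.get_text("dict"), using the content quoted above.
sample = {"blocks": [{"lines": [
    _line("1.1 Uitgangspunten"),
    _line("Deze service, EftOphfCrOvkORCA mag enkel vanuit UPF aangeroepen worden. Er is daarom"),
    _line("ook geen publieke service omschrijving beschikbaar."),
    _line("1.2 Controles"),
    _line("Alle beschreven meldingen in dit document worden in de monitor-tabel gezet."),
    _line("De volgende (standaard) controles worden altijd uitgevoerd."),
    _line("• Elk gegeven moet het juiste formaat hebben."),
    _line("Melding: ‘Ongeldige invoer: heeft niet het juiste formaat.’"),
]}]}

for heading, body in split_sections(sample):
    print(heading, len(body))
```

With a real page, the call would be `split_sections(page.get_text("dict"))`. The pattern would also match a body line that happens to start with a dotted number, so checking the span's font size or bold flag as well may be needed for other documents.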

How to reproduce the bug

NA

PyMuPDF version

1.23.25

Operating system

Windows

Python version

3.8
