Description
First of all, thank you very much for this great work. I particularly appreciate your layout-preserving text extraction method.
My question is: does PyMuPDF support TOC creation for a PDF document?
The get_toc() method, used like this:
import fitz
pdf_filename = 'my.pdf'
with fitz.open(pdf_filename) as doc:
    print(doc.get_toc())
seems to give results only if the TOC is already present at the beginning of the document.
For this PDF, for example: https://github.com/pymupdf/PyMuPDF-Utilities/blob/master/text-extraction/Dart.pdf, is there a method in PyMuPDF to generate the PDF outline? In this case it would contain only the numbered titles of the paragraphs.
The same question applies to PDFs with more complex formatting, such as https://blog.xpgreat.com/file/lstm.pdf, which has numbered parts and sub-parts.
I also tested mupdf with the command: mutool show my.pdf outline but it returns nothing with or without TOC inside the pdf file in my case.
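To illustrate the kind of outline generation meant here, below is a rough sketch, not an established PyMuPDF feature. It assumes headings are single text lines starting with a number such as "1", "2.3" or "4.1.2.", and the helper names toc_from_lines and add_outline are hypothetical. Only page.get_text(), doc.set_toc() and doc.save() come from PyMuPDF. Body lines that happen to start with a short number would be picked up by mistake, so the result would need checking.

```python
import re

# Heuristic (an assumption): a heading is a line that starts with a short
# dotted number, e.g. "1 Introduction", "2.3 Training", "4.1.2. Details".
HEADING_RE = re.compile(r"^(\d{1,2}(?:\.\d{1,2})*)\.?\s+(\S.*)$")


def toc_from_lines(lines):
    """Build [level, title, page] entries from (page_number, text) pairs.

    page_number is 1-based. The level is the count of dot-separated numbers.
    """
    toc = []
    prev_level = 0
    for page_number, text in lines:
        text = text.strip()
        match = HEADING_RE.match(text)
        if not match:
            continue
        level = match.group(1).count(".") + 1
        # An outline starts at level 1 and may deepen only one level at a time.
        level = min(level, prev_level + 1)
        toc.append([level, text, page_number])
        prev_level = level
    return toc


def add_outline(src_path, dst_path):
    """Read text lines from src_path, write a copy with an outline to dst_path."""
    import fitz  # PyMuPDF; imported here so the parsing above has no dependency

    with fitz.open(src_path) as doc:
        lines = []
        for page in doc:
            for line in page.get_text("text").splitlines():
                lines.append((page.number + 1, line))
        doc.set_toc(toc_from_lines(lines))
        doc.save(dst_path)
```

Calling add_outline('my.pdf', 'my_with_toc.pdf') would then write a copy whose get_toc() returns the detected headings. A more robust variant would probably also look at font sizes from page.get_text("dict") instead of relying on numbering alone.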
Your configuration (mandatory)
In my case, I installed on macOS arm64 (M2, not Intel).
I created a conda osx-64 environment, inspired by this amazing solution, and it works for me.
conda create -n pymupdf
conda activate pymupdf
conda config --env --set subdir osx-64
conda install python=3.9
python -m pip install --upgrade pymupdf
Running print(sys.version, "\n", sys.platform, "\n", fitz.__doc__) gives me:
3.9.13 (main, Oct 13 2022, 16:12:30)
[Clang 12.0.0 ]
darwin
PyMuPDF 1.20.2: Python bindings for the MuPDF 1.20.3 library.
Version date: 2022-08-13 00:00:01.
Built for Python 3.9 on darwin (64-bit).
Please feel free to update the README.md to let macOS users with an Apple chip know that it also works if they follow these steps. I'm sure it will be useful for some :).