Refactoring of `pdf_extract.py` script by AdevGarcia · Pull Request #114 · opendatalab/PDF-Extract-Kit

AdevGarcia · 2024-09-04T10:19:29Z

Description:
This PR refactors the pdf_extract.py script to improve readability and maintainability of the code.
In order not to affect the current code, the app.py script and the app_tools library have been created.
app.py performs the same process as pdf_extract.py.
The app_tools library incorporates the refactorings of the different steps.

If you find it useful, you can replace pdf_extract.py with app.py.

Motivation:
I love the project and would like to thank you for the great work done.
This refactoring is a first step toward building an API with FastAPI and Docker.

Main changes:

The script app.py has been created; it runs the same pipeline as pdf_extract.py.
The library app_tools has been created; it contains the classes and methods that perform each step of the pipeline:
pdf.py: Provides a set of tools for working with PDF files.
layout_analysis.py: Detects the layout of each page image in a document.
formula_analysis.py: Handles formula detection and recognition in images.
ocr_analysis.py: OCR processor, responsible for performing OCR recognition.
table_analysis.py: Table processor, used for table recognition in documents.
visualize.py: Generates visualizations of the document layout.
config.py: Configures model parameters and logging.
utils.py: Saves results as JSON.
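
As an illustration of the last item, saving per-page results as JSON could look roughly like the sketch below. The function name `save_results`, its arguments, and the output file layout are assumptions made for this example and are not taken from the actual utils.py.

```python
import json
from pathlib import Path


def save_results(results, output_dir, basename):
    """Write a list of per-page result dicts to <output_dir>/<basename>.json.

    Hypothetical helper: the name and arguments are assumptions for this sketch.
    """
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    out_path = out_dir / f"{basename}.json"
    with out_path.open("w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII OCR text readable in the file
        json.dump(results, f, ensure_ascii=False, indent=2)
    return out_path
```

Returning the path lets the caller log where the results were written.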

Functionality impact: No change to existing functionality is expected, as the refactoring does not introduce new features or modify existing ones.

Instructions for Reviewers:

Review the app.py script and the app_tools library to ensure that the logic has been ported correctly.
Verify that there are no observable changes in the system's behavior when running the tests.

Example of Use:

python app.py --pdf 1706.03762.pdf

Added detailed logging configurations to improve visibility and debugging. Refactored PDF handling and processing into separate utility functions for better code organization and maintainability.

Relocate logging configuration into utils/config.py and move model initialization functions to utils/model_tools.py. Additionally, separate detection and recognition functionalities into distinct modules to enhance code readability and modularity.
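
A minimal sketch of a centralized logging setup of the kind this commit describes might look like the following. The function name `setup_logging`, the logger name, and the format string are assumptions for illustration; the actual configuration module may differ.

```python
import logging
import sys


def setup_logging(name="pdf_extract", level=logging.INFO):
    """Return a logger with a single stream handler (hypothetical helper)."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger


logger = setup_logging()
logger.info("logging configured")
```

Keeping the setup in one function means every module can request the same logger without configuring handlers itself.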

Separated OCR recognition and table recognition into distinct functions. This improves code readability and maintainability by isolating each recognition task, enabling easier debugging and future enhancements.

Replaced standalone functions in `pdf_tools.py` with a new `PDFProcessor` class to encapsulate PDF processing logic. Adjusted `app.py` to use the new `PDFProcessor` class methods, improving code organization and maintainability.

Deleted redundant utility files and integrated functionality into new, focused modules under `app_tools`. Introduced `TableProcessor`, `LayoutAnalyzer`, `FormulaProcessor`, and `OCRProcessor` classes to handle specific operations. Updated `app.py` to reflect these changes and streamline the process flow.
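
To make the resulting structure concrete, here is a rough sketch of how a script could chain such classes. Only the class names come from this commit; the method names (`analyze`, `process`), the order of the steps, the result fields, and the stub bodies are assumptions, and the real classes would wrap the actual models.

```python
class LayoutAnalyzer:
    def analyze(self, page):
        # Stub: a real implementation would run a layout detection model.
        return {"page": page, "regions": []}


class FormulaProcessor:
    def process(self, result):
        result["formulas"] = []  # stub for formula detection and recognition
        return result


class OCRProcessor:
    def process(self, result):
        result["text"] = []  # stub for OCR recognition
        return result


class TableProcessor:
    def process(self, result):
        result["tables"] = []  # stub for table recognition
        return result


def run_pipeline(pages):
    """Run every step on every page, in a fixed (assumed) order."""
    layout = LayoutAnalyzer()
    steps = [FormulaProcessor(), OCRProcessor(), TableProcessor()]
    results = []
    for page in pages:
        result = layout.analyze(page)
        for step in steps:
            result = step.process(result)
        results.append(result)
    return results
```

With one class per step, each stage can be tested or replaced on its own without touching the main script.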

Refactored several Python modules to simplify documentation strings and improve readability. Added argparse to app.py for better handling of command line arguments. Improved error handling and logging in several files. Revised documentation.
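
The command-line handling could be sketched with argparse as below. The `--pdf` flag matches the usage example in the description; the `--output` option, its default, and the function name `parse_args` are hypothetical additions for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line arguments; argv=None falls back to sys.argv."""
    parser = argparse.ArgumentParser(description="Run the PDF extraction pipeline.")
    # --pdf matches the flag in the usage example; everything else is assumed.
    parser.add_argument("--pdf", required=True, help="path to the input PDF file")
    parser.add_argument("--output", default="output", help="hypothetical output folder")
    return parser.parse_args(argv)


args = parse_args(["--pdf", "1706.03762.pdf"])
print(args.pdf)
```

Accepting an explicit `argv` list makes the parser easy to exercise in tests without touching the real command line.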

Updated library versions for consistency and reproducibility. Added new dependencies: torch, torchvision, numpy, opencv-python, Pillow, PyYAML, and pytz.

AdevGarcia added 8 commits September 2, 2024 08:50

Add initial version of PDF processing script

8f7b5ed

Introduce logging and refactor PDF processing

ca84c57

Refactor OCR and table recognition logic

cbaf25d

Refactor PDF tools to use a class-based structure

bea8f11

Add specific library versions to requirements.txt

945d16b

AdevGarcia wants to merge 8 commits into opendatalab:main from AdevGarcia:main
