Refactoring of pdf_extract.py script #114
Open
AdevGarcia wants to merge 8 commits into opendatalab:main
Added detailed logging configurations to improve visibility and debugging. Refactored PDF handling and processing into separate utility functions for better code organization and maintainability.
Relocate logging configuration into utils/config.py and move model initialization functions to utils/model_tools.py. Additionally, separate detection and recognition functionalities into distinct modules to enhance code readability and modularity.
Separated OCR recognition and table recognition into distinct functions. This improves code readability and maintainability by isolating each recognition task, enabling easier debugging and future enhancements.
Replaced standalone functions in `pdf_tools.py` with a new `PDFProcessor` class to encapsulate PDF processing logic. Adjusted `app.py` to use the new `PDFProcessor` class methods, improving code organization and maintainability.
Deleted redundant utility files and integrated functionality into new, focused modules under `app_tools`. Introduced `TableProcessor`, `LayoutAnalyzer`, `FormulaProcessor`, and `OCRProcessor` classes to handle specific operations. Updated `app.py` to reflect these changes and streamline the process flow.
Refactored several Python modules to simplify documentation strings and improve readability. Added argparse to app.py for better handling of command line arguments. Improved error handling and logging in several files. Revised documentation.
Updated library versions for consistency and reproducibility. Added new dependencies: torch, torchvision, numpy, opencv-python, Pillow, PyYAML, and pytz.
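The commits above mention adding argparse to `app.py` and a dedicated logging configuration. As a rough illustration only, the two could be wired together as sketched below. The flag names (`--pdf`, `--output`, `--log-level`), the helper names `build_parser` and `configure_logging`, and the logger name `app` are assumptions made for this sketch, not taken from the PR.

```python
import argparse
import logging


def build_parser() -> argparse.ArgumentParser:
    """Build the command-line parser (all flag names are hypothetical)."""
    parser = argparse.ArgumentParser(description="Run the PDF extraction pipeline.")
    parser.add_argument("--pdf", required=True, help="Path to a PDF file or directory.")
    parser.add_argument("--output", default="output", help="Directory for results.")
    parser.add_argument(
        "--log-level",
        default="INFO",
        choices=["DEBUG", "INFO", "WARNING", "ERROR"],
        help="Verbosity of the log output.",
    )
    return parser


def configure_logging(level_name: str) -> logging.Logger:
    """Configure root logging and return a named logger (name is hypothetical)."""
    logging.basicConfig(
        level=getattr(logging, level_name),
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
    return logging.getLogger("app")


# Parse an explicit argument list here so the sketch runs without a real CLI call.
args = build_parser().parse_args(["--pdf", "sample.pdf", "--log-level", "DEBUG"])
logger = configure_logging(args.log_level)
logger.info("Processing %s, writing results to %s", args.pdf, args.output)
```

Keeping the parser and the logging setup in separate functions makes each one easy to test without invoking the whole pipeline.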
Description:
This PR refactors the `pdf_extract.py` script to improve the readability and maintainability of the code.
In order not to affect the current code, the `app.py` script and the `app_tools` library have been created. `app.py` performs the same process as `pdf_extract.py`.
The `app_tools` library incorporates the refactorings of the different steps.
app_tools
|- pdf.py
|- layout_analysis.py
|- formula_analysis.py
|- ocr_analysis.py
|- table_analysis.py
|- visualize.py
|- config.py
|- utils.py
If you find it interesting, you can replace `app.py` with `pdf_extract.py`.
Motivation:
I love the project and would like to thank you for the great work done.
The refactoring is a first step toward building an API with FastAPI and Docker.
Main changes:
- `app.py` has been created with the pipeline of `pdf_extract.py`.
- `app_tools` has been created; it contains the classes and methods that perform each step of the pipeline.
  - `pdf.py`: Provides a set of tools for working with PDF files.
  - `layout_analysis.py`: Detects the layout of each page in a document image.
  - `formula_analysis.py`: Handles formula detection and recognition in images.
  - `ocr_analysis.py`: OCR processor, responsible for performing OCR recognition.
  - `table_analysis.py`: Table processor, used for table recognition in documents.
  - `visualize.py`: Generates visualizations of the document layout.
  - `config.py`: Configures model parameters and logs.
  - `utils.py`: Saves results as JSON.
Functionality impact: No change to existing functionality is expected, as the refactoring does not introduce new features or modify existing ones.
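To make the module split listed under Main changes more concrete, here is a minimal sketch of how per-step classes might be chained and the results written to JSON. The class names `LayoutAnalyzer` and `OCRProcessor` come from the PR's commit messages; everything else (the `PageResult` dataclass, the `analyze` and `recognize` methods, `run_pipeline`, `save_results`, and the stubbed return values) is hypothetical and stands in for the real model calls.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Any, Dict, List


@dataclass
class PageResult:
    """Results collected for one page (field names are illustrative)."""
    page_index: int
    layout: List[Dict[str, Any]] = field(default_factory=list)


class LayoutAnalyzer:
    """Stub standing in for the layout-detection step."""

    def analyze(self, page_index: int) -> PageResult:
        # A real implementation would run a detection model on the page image.
        region = {"category": "text", "bbox": [0, 0, 100, 20]}
        return PageResult(page_index=page_index, layout=[region])


class OCRProcessor:
    """Stub standing in for the OCR step."""

    def recognize(self, result: PageResult) -> PageResult:
        # Attach placeholder text to every region classified as text.
        for region in result.layout:
            if region["category"] == "text":
                region["text"] = "<recognized text>"
        return result


def run_pipeline(num_pages: int) -> List[Dict[str, Any]]:
    """Run each page through layout analysis, then OCR."""
    analyzer, ocr = LayoutAnalyzer(), OCRProcessor()
    return [asdict(ocr.recognize(analyzer.analyze(i))) for i in range(num_pages)]


def save_results(results: List[Dict[str, Any]], path: Path) -> None:
    """Write results as UTF-8 JSON, creating parent directories if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(results, indent=2, ensure_ascii=False), encoding="utf-8")
```

Because each step only consumes and returns a plain result object, a step can be swapped or tested in isolation, which is the kind of separation the refactoring aims for.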
Instructions for Reviewers:
Review the `app.py` and `app_tools` scripts to ensure that the logic has been ported correctly.
Example of Use: