PaddleOCRX is an enhanced library built upon PaddleOCR to improve OCR accuracy while preserving the structure of detected text, especially in unstructured documents.
- ✅ Enhances PaddleOCR's accuracy for unstructured document processing.
- ✅ Maintains the document structure while extracting text.
- ✅ Handles multi-page PDFs efficiently.
The class DEFAULT (a dataclass) manages all file paths used in the project. Update the following paths before running the script:

    from dataclasses import dataclass

    @dataclass
    class DEFAULT:
        single_page_pdf: str = "single_page_pdf"  # 📁 Create this folder and set its absolute path here.
        input_reports: str = "input_reports"      # 📁 Create this folder and set its absolute path here.
        output_reports: str = "output_reports"    # 📁 Set this to your desired output folder path.
        input: str = "input"                      # 📁 Set this to the folder containing all input PDFs.

Once the paths are set, run the script with:

    python ocr_tool.py

or by any other method you prefer.
The script also accepts command-line arguments:

    python ocr_tool.py --input_folder path/to/input --output_folder path/to/output --sort_along row --lang en --save_as csv --clean_text --input_type pdf

| Argument | Type | Default | 📝 Description |
|---|---|---|---|
| `--input_folder` | str | Required | 📂 Path to input folder containing images or PDFs |
| `--output_folder` | str | `output_reports` | 📂 Path to output folder |
| `--sort_along` | str | `row` | 📊 Sort extracted text along rows or columns (`row` or `col`) |
| `--lang` | str | `en` | 🌍 OCR language (use `devanagari` for Indic languages) |
| `--save_as` | str | `csv` | 💾 Output format (`csv`, `json`, or `text`) |
| `--clean_text` | flag | `False` | 🧹 Enable text cleaning (optional) |
| `--input_type` | str | `pdf` | 📄 Type of input file (`image` or `pdf`) |
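To make the argument table concrete, here is a rough sketch of how such options could be declared with `argparse`. The parser inside `ocr_tool.py` is not shown in this README, so `build_parser` is a hypothetical name, and restricting values with `choices` is an assumption based on the descriptions in the table.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented arguments and defaults."""
    parser = argparse.ArgumentParser(description="Run OCR over a folder of images or PDFs.")
    parser.add_argument("--input_folder", required=True,
                        help="Path to input folder containing images or PDFs")
    parser.add_argument("--output_folder", default="output_reports",
                        help="Path to output folder")
    # The allowed values below are assumed from the table, not taken from the script.
    parser.add_argument("--sort_along", choices=["row", "col"], default="row",
                        help="Sort extracted text along rows or columns")
    parser.add_argument("--lang", default="en", help="OCR language")
    parser.add_argument("--save_as", choices=["csv", "json", "text"], default="csv",
                        help="Output format")
    # A flag: absent means False, present means True.
    parser.add_argument("--clean_text", action="store_true", help="Enable text cleaning")
    parser.add_argument("--input_type", choices=["image", "pdf"], default="pdf",
                        help="Type of input file")
    return parser


args = build_parser().parse_args(
    ["--input_folder", "path/to/input", "--save_as", "json", "--clean_text"]
)
print(vars(args))
```

Arguments left off the command line fall back to the defaults listed in the table, so only `--input_folder` has to be supplied.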
You can also import the function in a Python script:

    from ocr_tool import run_ocr

    run_ocr(
        input_folder="sample_pdfs",
        output_folder="results",
        sort_along="row",
        lang="en",
        save_as="json",
        clean_text=True,
        input_type="pdf",
    )

If you encounter the following error:
Error processing Report1_page_1.pdf: Unable to get page count. Is poppler installed and in PATH?
🔹 Poppler is required for PDF processing. Follow these steps to install it:
On Windows:

- Download Poppler for Windows.
- Extract it to a known location, e.g., `C:\Program Files\poppler-xx\`.
- Add `C:\Program Files\poppler-xx\bin` to your system `PATH`:
  - Open System Properties → Advanced → Environment Variables.
  - Under System Variables, find `Path`, click Edit, and add `C:\Program Files\poppler-xx\bin`.

On Debian/Ubuntu, run:

    sudo apt update
    sudo apt install poppler-utils

On macOS (with Homebrew), run:

    brew install poppler

To verify the installation, run:

    pdfinfo -v

If Poppler is installed correctly, this displays version details.
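The same check can be run from Python before starting a long OCR job. This is a minimal sketch using only the standard library; `poppler_status` is a hypothetical helper, not a function provided by this project.

```python
import shutil
import subprocess


def poppler_status() -> str:
    """Hypothetical helper: report whether Poppler's pdfinfo is reachable on PATH."""
    pdfinfo = shutil.which("pdfinfo")
    if pdfinfo is None:
        return "pdfinfo not found on PATH; install Poppler or fix PATH"
    result = subprocess.run([pdfinfo, "-v"], capture_output=True, text=True, check=False)
    # Many builds print the version banner to stderr, so fall back across both streams.
    banner = (result.stderr or result.stdout).strip()
    first_line = banner.splitlines()[0] if banner else "unknown version"
    return f"Poppler found at {pdfinfo}: {first_line}"


print(poppler_status())
```

If the message says `pdfinfo` was not found, the `PATH` change has not taken effect in the current shell; opening a new terminal usually picks it up.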
💡 Still facing issues? Check out this guide.
📝 This project is licensed under the MIT License.
💡 Contributions are welcome! Feel free to submit issues and pull requests.
📧 For issues, open a GitHub issue or reach out to the maintainers. 🚀