Aleptonic/paddleOCRX

🏆 PaddleOCRX

PaddleOCRX is an enhanced library built upon PaddleOCR to improve OCR accuracy while preserving the structure of detected text, especially in unstructured documents.

🚀 Features

✅ Enhances PaddleOCR’s accuracy for unstructured document processing.
✅ Maintains the document structure while extracting text.
✅ Handles multi-page PDFs efficiently.


📂 Setup & Usage

🔧 1. Configure File Paths

The class DEFAULT (a dataclass) manages all file paths used in the project. Update the following paths before running the script:

from dataclasses import dataclass

@dataclass
class DEFAULT:
    single_page_pdf: str = "single_page_pdf"  # 📁 Create this folder and set its absolute path here.
    input_reports: str = "input_reports"  # 📁 Create this folder and set its absolute path here.
    output_reports: str = "output_reports"  # 📁 Set this to your desired output folder path.
    input: str = "input"  # 📁 Set this to the folder containing all input PDFs.

⚠️ Ensure all specified directories exist before execution.
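
If you would rather create the folders from Python than by hand, something like the following could work. This is only a sketch: `ensure_directories` is a hypothetical helper that is not part of the project, and the `DEFAULT` class here simply repeats the field names shown above with placeholder values.

```python
from dataclasses import dataclass, fields
from pathlib import Path
from typing import List


@dataclass
class DEFAULT:
    # Same field names as above; replace the values with your own absolute paths.
    single_page_pdf: str = "single_page_pdf"
    input_reports: str = "input_reports"
    output_reports: str = "output_reports"
    input: str = "input"


def ensure_directories(config: DEFAULT) -> List[Path]:
    """Create every directory named in the config if it does not exist yet.

    Hypothetical helper, not part of ocr_tool.py.
    """
    created = []
    for field in fields(config):
        path = Path(getattr(config, field.name))
        # parents=True builds missing parent folders; exist_ok=True makes reruns safe.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `ensure_directories(DEFAULT())` once before running the script should leave all four folders in place.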


▶️ 2. Run the Script

Once paths are set, run the script using:

python ocr_tool.py

or launch it any other way you prefer, such as from an IDE.

📌 Command-line usage:

python ocr_tool.py --input_folder path/to/input --output_folder path/to/output --sort_along row --lang en --save_as csv --clean_text --input_type pdf

📝 Arguments Explained

| Argument | Type | Default | 📝 Description |
|---|---|---|---|
| --input_folder | str | Required | 📂 Path to input folder containing images or PDFs |
| --output_folder | str | output_reports | 📂 Path to output folder |
| --sort_along | str | row | 📊 Sort extracted text along rows or columns (row or col) |
| --lang | str | en | 🌍 OCR language (use devanagari for Indic languages) |
| --save_as | str | csv | 💾 Output format (csv, json, or text) |
| --clean_text | flag | False | 🧹 Enable text cleaning (optional) |
| --input_type | str | pdf | 📄 Type of input file (image or pdf) |
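
To make the defaults and allowed values concrete, here is a rough `argparse` sketch that mirrors the arguments listed above. It is an illustration only: `build_parser` is a hypothetical name, and the actual parser inside `ocr_tool.py` may be written differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented arguments and defaults."""
    parser = argparse.ArgumentParser(description="Run OCR on a folder of images or PDFs.")
    parser.add_argument("--input_folder", required=True,
                        help="Path to input folder containing images or PDFs")
    parser.add_argument("--output_folder", default="output_reports",
                        help="Path to output folder")
    parser.add_argument("--sort_along", choices=["row", "col"], default="row",
                        help="Sort extracted text along rows or columns")
    parser.add_argument("--lang", default="en", help="OCR language")
    parser.add_argument("--save_as", choices=["csv", "json", "text"], default="csv",
                        help="Output format")
    # A flag takes no value: it is False unless --clean_text is present.
    parser.add_argument("--clean_text", action="store_true", help="Enable text cleaning")
    parser.add_argument("--input_type", choices=["image", "pdf"], default="pdf",
                        help="Type of input file")
    return parser


# Only --input_folder must be given; everything else falls back to its default.
args = build_parser().parse_args(
    ["--input_folder", "sample_pdfs", "--save_as", "json", "--clean_text"]
)
print(args.output_folder, args.save_as, args.clean_text)  # output_reports json True
```

Restricting `--sort_along`, `--save_as`, and `--input_type` with `choices` means a typo is rejected at parse time rather than partway through a run.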

📥 Import as a Python Module

You can also import the function in a Python script:

from ocr_tool import run_ocr

run_ocr(
    input_folder="sample_pdfs",
    output_folder="results",
    sort_along="row",
    lang="en",
    save_as="json",
    clean_text=True,
    input_type="pdf"
)

🛠 Troubleshooting

Poppler Path Error

If you encounter:

Error processing Report1_page_1.pdf: Unable to get page count. Is poppler installed and in PATH?

🔹 Poppler is required for PDF processing. Follow these steps to install it:

🖥️ Windows

  1. Download Poppler for Windows.
  2. Extract it to a known location, e.g., C:\Program Files\poppler-xx\.
  3. Add C:\Program Files\poppler-xx\bin to your system PATH:
    • Open System Properties → Advanced → Environment Variables.
    • Under System Variables, find Path, click Edit, and add C:\Program Files\poppler-xx\bin.

🐧 Linux (Debian/Ubuntu)

Run:

sudo apt update
sudo apt install poppler-utils

🍏 macOS

Run:

brew install poppler

🔍 Verify Installation

Run:

pdfinfo -v

If Poppler is installed correctly, it will display version details.
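
The same check can be done from Python, which helps when the script runs in an environment whose PATH differs from your terminal's. The sketch below uses only the standard library; `find_poppler` and `poppler_version` are hypothetical helper names, not part of the project.

```python
import shutil
import subprocess
from typing import Optional


def find_poppler() -> Optional[str]:
    """Return the full path of the pdfinfo executable, or None if it is not on PATH."""
    return shutil.which("pdfinfo")


def poppler_version() -> Optional[str]:
    """Return the first line of pdfinfo's version banner, or None if Poppler is missing."""
    pdfinfo = find_poppler()
    if pdfinfo is None:
        return None
    result = subprocess.run([pdfinfo, "-v"], capture_output=True, text=True)
    # The banner usually goes to stderr, so fall back to stdout only if stderr is empty.
    banner = (result.stderr or result.stdout).strip()
    return banner.splitlines()[0] if banner else "unknown"


version = poppler_version()
if version is None:
    print("pdfinfo not found on PATH; install Poppler or fix PATH.")
else:
    print("Poppler found:", version)
```

If this reports that `pdfinfo` is missing while `pdfinfo -v` works in your terminal, the Python process is probably running with a different PATH (for example, an IDE started before PATH was updated).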

💡 Still facing issues? Check out this guide.


📜 License

📝 This project is licensed under the MIT License.


🤝 Contributing

💡 Contributions are welcome! Feel free to submit issues and pull requests.


📞 Support

📧 For issues, open a GitHub issue or reach out to the maintainers. 🚀
