🏆 PaddleOCRX

PaddleOCRX is an enhanced library built upon PaddleOCR to improve OCR accuracy while preserving the structure of detected text, especially in unstructured documents.

🚀 Features

✅ Enhances PaddleOCR’s accuracy for unstructured document processing.
✅ Maintains the document structure while extracting text.
✅ Handles multi-page PDFs efficiently.
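To make "maintaining structure" concrete, here is a minimal sketch of one way OCR boxes can be ordered row by row. It is not PaddleOCRX's actual implementation: the `Box` type, the `sort_along_rows` name, and the `tolerance` parameter are all assumptions made for illustration.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical OCR result: recognized text plus its position on the page."""
    text: str
    x: float       # left edge
    y: float       # top edge
    height: float  # box height


def sort_along_rows(boxes: list[Box], tolerance: float = 0.5) -> list[list[Box]]:
    """Group boxes into rows by vertical position, then order each row left to right."""
    rows: list[list[Box]] = []
    for box in sorted(boxes, key=lambda b: b.y):
        # A box joins the current row when its top edge lies within a fraction
        # of the height of that row's first box; otherwise it starts a new row.
        if rows and abs(box.y - rows[-1][0].y) <= tolerance * rows[-1][0].height:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [sorted(row, key=lambda b: b.x) for row in rows]


boxes = [
    Box("Total", 10, 102, 20),
    Box("Name", 10, 50, 20),
    Box("42", 200, 100, 20),
    Box("Alice", 200, 52, 20),
]
print([[b.text for b in row] for row in sort_along_rows(boxes)])
# → [['Name', 'Alice'], ['Total', '42']]
```

Grouping by the top edge with a height-based tolerance keeps slightly misaligned boxes on the same row, which is the kind of layout a plain top-to-bottom sort would scramble.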

📂 Setup & Usage

🔧 1. Configure File Paths

The `DEFAULT` dataclass holds all file paths used in the project. Update the following paths before running the script:

from dataclasses import dataclass

@dataclass
class DEFAULT:
    single_page_pdf: str = "single_page_pdf"  # 📁 Create this folder and set its absolute path here.
    input_reports: str = "input_reports"  # 📁 Create this folder and set its absolute path here.
    output_reports: str = "output_reports"  # 📁 Set this to your desired output folder path.
    input: str = "input"  # 📁 Set this to the folder containing all input PDFs.

⚠️ Ensure all specified directories exist before execution.
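If you would rather create the folders programmatically than by hand, a small helper along these lines could do it. This is a sketch only: `ensure_directories` is a hypothetical name that is not part of `ocr_tool.py`, and the `DEFAULT` class is repeated here with relative placeholder values so the example is self-contained.

```python
from dataclasses import dataclass, fields
from pathlib import Path


@dataclass
class DEFAULT:
    # Same field names as in the README; replace the values with absolute paths.
    single_page_pdf: str = "single_page_pdf"
    input_reports: str = "input_reports"
    output_reports: str = "output_reports"
    input: str = "input"


def ensure_directories(config: DEFAULT) -> list[Path]:
    """Create every directory named in the config if it does not exist yet."""
    ready = []
    for field in fields(config):
        path = Path(getattr(config, field.name))
        # parents=True builds intermediate folders; exist_ok=True makes reruns harmless.
        path.mkdir(parents=True, exist_ok=True)
        ready.append(path)
    return ready
```

Calling `ensure_directories(DEFAULT())` once before running the script would leave all four folders in place.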

▶️ 2. Run the Script

Once paths are set, run the script using:

python ocr_tool.py

or by any other method you prefer, for example from an IDE.

📌 Command-line usage:

python ocr_tool.py --input_folder path/to/input --output_folder path/to/output --sort_along row --lang en --save_as csv --clean_text --input_type pdf

📝 Arguments Explained

| Argument | Type | Default | 📝 Description |
|---|---|---|---|
| `--input_folder` | str | Required | 📂 Path to the input folder containing images or PDFs |
| `--output_folder` | str | `output_reports` | 📂 Path to the output folder |
| `--sort_along` | str | `row` | 📊 Sort extracted text along rows or columns (`row` or `col`) |
| `--lang` | str | `en` | 🌍 OCR language (use `devanagari` for Devanagari-script Indic languages such as Hindi) |
| `--save_as` | str | `csv` | 💾 Output format (`csv`, `json`, or `text`) |
| `--clean_text` | flag | `False` | 🧹 Enable text cleaning (optional) |
| `--input_type` | str | `pdf` | 📄 Type of input file (`image` or `pdf`) |
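For readers who want to see how these flags and defaults fit together, the table can be expressed as an `argparse` parser. This is a sketch based only on the table above, not a copy of the parser in `ocr_tool.py`. The `build_parser` name and the help strings are assumptions, and the real script may not restrict values with `choices`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose flags and defaults mirror the arguments table."""
    parser = argparse.ArgumentParser(description="Run OCR over a folder of PDFs or images.")
    parser.add_argument("--input_folder", required=True,
                        help="Path to input folder containing images or PDFs")
    parser.add_argument("--output_folder", default="output_reports",
                        help="Path to output folder")
    parser.add_argument("--sort_along", choices=["row", "col"], default="row",
                        help="Sort extracted text along rows or columns")
    parser.add_argument("--lang", default="en", help="OCR language")
    parser.add_argument("--save_as", choices=["csv", "json", "text"], default="csv",
                        help="Output format")
    # store_true makes this a flag: absent means False, present means True.
    parser.add_argument("--clean_text", action="store_true",
                        help="Enable text cleaning")
    parser.add_argument("--input_type", choices=["image", "pdf"], default="pdf",
                        help="Type of input file")
    return parser


# Parse an explicit argument list instead of sys.argv so the example is reproducible.
args = build_parser().parse_args(["--input_folder", "path/to/input", "--clean_text"])
print(vars(args))
```

Only `--input_folder` has to be supplied; every other option falls back to the default listed in the table.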

📥 Import as a Python Module

You can also import `run_ocr` and call it from your own Python script:

from ocr_tool import run_ocr

run_ocr(
    input_folder="sample_pdfs",
    output_folder="results",
    sort_along="row",
    lang="en",
    save_as="json",
    clean_text=True,
    input_type="pdf"
)

🛠 Troubleshooting

❌ Poppler Path Error

If you encounter:

Error processing Report1_page_1.pdf: Unable to get page count. Is poppler installed and in PATH?

🔹 Poppler is required for PDF processing. Follow these steps to install it:

🖥️ Windows

1. Download Poppler for Windows.
2. Extract it to a known location, e.g., `C:\Program Files\poppler-xx\`.
3. Add `C:\Program Files\poppler-xx\bin` to your system PATH:
   - Open System Properties → Advanced → Environment Variables.
   - Under System Variables, find Path, click Edit, and add `C:\Program Files\poppler-xx\bin`.
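If you cannot edit the system PATH (for example, on a locked-down machine), one alternative is to extend PATH for the current Python process only, before any PDF is processed. The sketch below uses only the standard library. `prepend_to_path` is a hypothetical helper, not part of this project, and the change disappears when the process exits.

```python
import os


def prepend_to_path(bin_dir: str) -> str:
    """Add bin_dir to the front of PATH for this process, unless it is already listed."""
    current = os.environ.get("PATH", "")
    entries = [entry for entry in current.split(os.pathsep) if entry]
    if bin_dir not in entries:
        entries.insert(0, bin_dir)
    # Child processes started from this script inherit the updated PATH.
    os.environ["PATH"] = os.pathsep.join(entries)
    return os.environ["PATH"]
```

A call such as `prepend_to_path(r"C:\Program Files\poppler-xx\bin")`, with `poppler-xx` replaced by your actual folder name, would need to run before `run_ocr` in the same script.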

🐧 Linux (Debian/Ubuntu)

Run:

sudo apt update
sudo apt install poppler-utils

🍏 macOS

Run:

brew install poppler

🔍 Verify Installation

Run:

pdfinfo -v

If Poppler is installed correctly, it will display version details.
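The same check can be done from Python, which helps when the script runs in a different environment from your terminal. This is a minimal sketch using only the standard library. `poppler_status` is a hypothetical helper name, not part of this project.

```python
import shutil
import subprocess


def poppler_status() -> str:
    """Return a short description of whether Poppler's pdfinfo is reachable."""
    executable = shutil.which("pdfinfo")
    if executable is None:
        return "pdfinfo not found on PATH"
    # Capture both streams, because the version banner may be written to stderr.
    result = subprocess.run([executable, "-v"], capture_output=True, text=True)
    banner = (result.stdout or result.stderr).strip()
    first_line = banner.splitlines()[0] if banner else "no version output"
    return f"found {executable}: {first_line}"


print(poppler_status())
```

If this reports that `pdfinfo` is not found while `pdfinfo -v` works in your terminal, the Python process is most likely running with a different PATH.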

💡 Still facing issues? Check out this guide.

📜 License

📝 This project is licensed under the MIT License.

🤝 Contributing

💡 Contributions are welcome! Feel free to submit issues and pull requests.

📞 Support

📧 For issues, open a GitHub issue or reach out to the maintainers. 🚀
