This project implements an Optical Character Recognition (OCR) system that extracts both printed and handwritten text from images. It uses OpenCV for image preprocessing and Tesseract OCR (LSTM engine) for text recognition. The system is tested on the IAM Handwriting Dataset, a widely used dataset for handwriting research.
Key features:
- Extracts text from printed and handwritten images
- Uses a full OpenCV preprocessing pipeline for improved accuracy
- Utilizes Tesseract OCR (LSTM neural engine) for recognition
- Automatically downloads dataset using KaggleHub
- Works with common image formats (PNG, JPG), including scanned documents
- Displays original and processed images for comparison
- Suitable for real-world OCR tasks like document scanning and handwriting digitization
The project uses the IAM Handwriting Dataset, which contains thousands of real handwritten English text samples. This dataset is widely used in academic and industrial handwriting recognition research due to its quality and variety.
The IAM dataset is downloaded automatically from Kaggle using KaggleHub, so no manual download step is needed.
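
The download step could look roughly like the sketch below. It assumes the `kagglehub` package is installed and Kaggle credentials are configured where required. `DATASET_HANDLE` is a hypothetical placeholder, since the exact Kaggle handle is not stated here, and `list_images` is an illustrative helper rather than part of KaggleHub.

```python
from pathlib import Path

# Hypothetical placeholder: replace with the actual Kaggle dataset handle.
DATASET_HANDLE = "<owner>/<iam-dataset-name>"

# File extensions treated as images (an assumption for this sketch).
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def download_dataset(handle: str = DATASET_HANDLE) -> Path:
    """Download a Kaggle dataset (or reuse the cached copy) and return its local path."""
    import kagglehub  # imported lazily so list_images works without it

    return Path(kagglehub.dataset_download(handle))


def list_images(root: Path) -> list[Path]:
    """Return all image files under root, sorted for a reproducible order."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
```

A typical call would be `images = list_images(download_dataset())`, after which each path can be loaded with OpenCV.
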
To improve recognition accuracy, images undergo:
- Grayscale conversion
- Noise removal
- Blurring
- Thresholding (Otsu or adaptive)
- Contrast enhancement
These steps create cleaner, OCR-ready images.
The processed images are passed to the Tesseract OCR engine, configured to read lines of printed or handwritten text. Tesseract's LSTM-based engine is generally more accurate than its legacy engine, which helps with irregular handwritten characters.
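
The recognition call might be wired up as in the sketch below. It assumes `pytesseract` and the Tesseract binary are installed. The helper names `build_tesseract_config` and `recognize` are hypothetical, and the exact page segmentation mode used by the project is an assumption: `--psm 7` treats the image as a single text line, while `--oem 1` selects the LSTM engine.

```python
def build_tesseract_config(psm: int = 7, oem: int = 1) -> str:
    """Build a Tesseract config string.

    oem=1 selects the LSTM engine; psm=7 treats the image as a single text
    line (psm=6 would assume a single uniform block of text instead).
    """
    return f"--oem {oem} --psm {psm}"


def recognize(image, lang: str = "eng", psm: int = 7) -> str:
    """Run Tesseract on a preprocessed image and return the stripped text."""
    import pytesseract  # imported lazily; needs the Tesseract binary installed

    text = pytesseract.image_to_string(
        image, lang=lang, config=build_tesseract_config(psm)
    )
    return text.strip()
```

For a full page or paragraph rather than a cropped line, passing `psm=6` is usually the more appropriate choice.
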
The system displays:
- The original input image
- The preprocessed image
- The final extracted text
This helps users compare and understand the OCR pipeline.
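
A side-by-side view along these lines can be produced with Matplotlib. The sketch below is one possible layout, and the function name `show_comparison` is an assumption for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np


def show_comparison(original: np.ndarray, processed: np.ndarray, text: str):
    """Plot the original and preprocessed images side by side, with the OCR text as the figure title."""
    fig, (ax_orig, ax_proc) = plt.subplots(1, 2, figsize=(10, 4))
    # Assumes grayscale input; a BGR image loaded by OpenCV should be
    # converted to RGB first, otherwise its colors appear swapped.
    ax_orig.imshow(original, cmap="gray")
    ax_orig.set_title("Original")
    ax_proc.imshow(processed, cmap="gray")
    ax_proc.set_title("Preprocessed")
    for ax in (ax_orig, ax_proc):
        ax.axis("off")
    fig.suptitle(f"Extracted text: {text}")
    return fig
```

Calling `plt.show()` after `show_comparison(...)` displays the figure in a script; in a notebook the returned figure renders automatically.
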
The OCR system provides clear and readable text output for:
- IAM handwritten samples
- Scanned documents
- Printed text

Accuracy varies depending on handwriting clarity, but preprocessing greatly improves recognition quality.
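
One common way to quantify how accuracy varies between samples is the character error rate (CER): the edit distance between the OCR output and a ground-truth transcription, divided by the length of the transcription. The sketch below is a plain-Python illustration; this document does not state which metric, if any, the project uses, so treat the choice of CER as an assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the reference length (0.0 means a perfect match)."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)
```

Comparing the CER of the same image with and without preprocessing gives a concrete measure of how much the preprocessing steps help.
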
The project is built with:
- OpenCV – Image preprocessing
- Tesseract OCR – Text recognition engine
- pytesseract – Tesseract interface
- KaggleHub – Dataset download
- Matplotlib – Visualization
- Python – Implementation

Typical applications include:
- Handwriting digitization
- Document scanning systems
- Automated form processing
- Archiving handwritten notes
- Real-time OCR systems
- Text extraction for AI and NLP pipelines

Contributions are welcome! Feel free to submit issues, improvements, or pull requests.
This project is licensed under the MIT License.